Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 9 months ago

Guide to Self Hosting LLMs Faster/Better than Ollama

thirdBreakfast@lemmy.world · 9 months ago

Guide to Self Hosting LLMs with Ollama.

Download and run Ollama
Open a terminal, type ollama run llama3.2

AliasAKA@lemmy.world · 9 months ago

Bookmarked and will come back to this. One thing that may be if interest to add is for AMD cards with 20gb of ram. I’d suppose that it would be Qwen 2.5 34B with maybe less strict quant or something.

Also, it may be interesting to look at the AllenAI molmo related models. I’m kind of planning to do this myself but haven’t had time as yet.

brucethemoose@lemmy.world · edit-2 9 months ago

Yep. 20GB is basically 24GB, though its too tight for 70B models.

One quirk for 7900 owners is that installing flash attention for long context usage can be a pain. Apparently it is doable now, I need to dig up the link, but it might just be easier to use kobold.cpp rocm with its native flash attention.

As for vision models, that is a whole different can of worms. Exllama does not support this, so you’d need a framework that does.

If you are looking for niche models, check out MiniG (which is a continued pretrain of the already very excellent GLM4-9B): https://huggingface.co/bartowski/miniG-GGUF

Llama.cpp support is recent, though I’m not 100% sure its completely fixed. It should work in Aphrodite as well.

Possibly linux@lemmy.zip · 9 months ago

Or we could all just use ollama. It is way simpler and works fine without a GPU even. I don’t really understand the problem with it.

brucethemoose@lemmy.world · edit-2 9 months ago

It’s less optimal.

On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. Its slow even at lower context, the lack of DRY sampling and some other things majorly hit quality.

Ollama is meant to be turnkey, and thats fine, but LLMs are extremely resource intense. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.

Possibly linux@lemmy.zip · 9 months ago

I’m not going to lie I don’t really see evidence supporting you claims. What evidence do you have?

Ollama is llama.cpp with a web wrapper and some configs to make sure it works.

brucethemoose@lemmy.world · edit-2 9 months ago

To go into more detail:

Exllama is faster than llama.cpp with all other things being equal.
exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4 while llama.cpp is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)
With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference and such.

It’s not evidence or options… it’s missing features, thats my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than tabbyAPI/EXUI on the same hardware, and there’s no way around it.

Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how llama 3.1 (for instance) was bugged past 8K at launch because it doesn’t properly support its rope scaling. Ollama inherits all these quirks.

I don’t want to go into the issues I have with the ollama devs behavior though, as that’s way more subjective.

Konraddo@lemmy.world · 9 months ago

I know this is not the theme of this post, but I wonder if there’s an LLM that doesn’t hallucinate when asked to summarize information of a group of documents. I tried Gpt4all for simple queries like finding out which documents mentioned a certain phrase. It often gave me filenames that didn’t actually exist. Hallucinating contents is one thing but making up data source is just horrible.

brucethemoose@lemmy.world · 9 months ago

That’s absolutely on topic, check out https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard

Command R is built for this if you have the vram to swing it, otherwise GLM4 (or MiniG as linked below) is great. The later, unfortunately, doesn’t work with TabbyAPI, so you have to use something like Kobold.cpp.

You also have to use very low (basically zero) temperature and be careful with other sampling settings, and watch your context length.

There are more sophisticated RAG setups some of these UIs (like open Web UI) integrate, and sometimes you’ll need to host an embeddings model alongside the llm for that to work.

sturlabragason@lemmy.world · 9 months ago

Frontendwise; Librechat is pretty cool.

vividspecter@lemm.ee · 9 months ago

Do you have any recommendations for a Perplexity.ai type setup? It’s one of the few recent innovations I’ve found useful. I’ve heard of Perplexica and a few others, but not sure what is the best approach.

projectmoon@lemm.ee · 9 months ago

Perplexica works. It can understand ollama and custom OpenAI providers.

LiveLM@lemmy.zip · edit-2 9 months ago

What does Perplexity do different than other AI solutions?
Heard about it but haven’t tried yet

Caboose12000@lemmy.world · 9 months ago

I haven’t heard about it before today but I tried asking it what separates it from other LLMs and apparently the answer is just that it does a google search and shows you the source its summarizing, which if true is not very compelling, and if a hallucination or missing details then its at least not very compelling as a search replacement

BaroqueInMind@lemmy.one · 9 months ago

It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

OP, do you have any telemetry you can show us comparing the performance difference between what you setup on this guide and an Ollama setup? Otherwise, at face value, I’m going to assume this is another thing on the internet i have to assume is uncorroborated bullshit. Apologies for sounding rude.

I don’t like some things about the devs. I won’t rant, but I especially don’t like the hint they’re cooking up something commercial.

This concerns me. Please provide links for us to read here about this. I would like any excuse to uninstall Ollama. Thank you!

sntx@lemm.ee · 9 months ago

Thanks for the writeup! So far I’ve been using ollama, but I’m always open for trying out alternatives. To be honest, it seems I was oblivious to the existence of alternatives.

Your post is suggesting that the same models with the same parameters generate different result when run on different backends?

I can see how the backend would have an influence hanfling concurrent api calls, ram/vram efficiency, supported hardware/drivers and general speed.

But going as far as having different context windows and quality degrading issues is news to me.

brucethemoose@lemmy.world · 9 months ago

Your post is suggesting that the same models with the same parameters generate different result when run on different backends

Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

There are even more exotic weight quantization schemes (ALQM, VPTQ) that are much more VRAM efficient than llama.cpp or exllama, but I skipped mentioning them (unless somedone asked) because they’re so clunky to setup.

Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation or grammar.

WolfLink@sh.itjust.works · 9 months ago

Could I run larger LLMs with multiple GPUs? E.g. would 2x3090 be able to run the 48GB models? Would I need NVLink to make it work?

brucethemoose@lemmy.world · edit-2 9 months ago

Absolutely.

Only aphrodite (and other enterprise backends like vllm/sglang) can make use of NVLink, but even exllama or mlc-llm split across GPUs nicely over PCIe, no NVLink needed.

2x 3090s or P40s is indeed a popular config among local runners, and is the perfect size for a 70B model. Some try to squeeze Mistral-Large in, but IMO its too tight a fit.

sntx@lemm.ee · 9 months ago

Is there an inherent benefit for using NVLINK? Should I specifically try out Aprodite over the other recommendations when having 2x 3090 with NVLINK available?

brucethemoose@lemmy.world · 9 months ago

So there are multiple ways to split models across GPUs, (layer splitting, which uses one GPU then another, expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”

This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

But if you simply want to stuff the best/highest quality model you can at VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.

brucethemoose@lemmy.world · edit-2 9 months ago

Also, AMD is not off the table for multi-gpu. I know some LLM runners are buying used 32GB MI100s.

shaserlark@sh.itjust.works · 9 months ago

I run a Mac Mini as a home server because it’s great for hardware transcoding, I was wondering if I could host an LLM locally. I work with python so that wouldn’t be an issue but I have no idea how to do CUDA or work on low level code. Is there anything I need to consider? Would probably start with a really small model.

thirdBreakfast@lemmy.world · 9 months ago

If it’s an M1, you def can and it will work great. With Ollama.

shaserlark@sh.itjust.works · 9 months ago

Yeah it’s an M1 16GB, sounds awesome I’ll try, thanks a lot for the guide it’s super helpful. I just got the Mac Mini for jellyfin but this is an unexpected use case where the server comes in very handy.

brucethemoose@lemmy.world · 9 months ago

For that you probably want the llama.cpp server and a Qwen2 14B IQ3 quantization.

16GB is kinda tight though, especially if you’re running other stuff in the background.

Scrubbles@poptalk.scrubbles.tech · 9 months ago

I like the look of exui, but is there a way to run it without torch or needing a GPU? I have tabby running on a separate computer, like SillyTavern I just want to connect to the API, not host it locally

brucethemoose@lemmy.world · 9 months ago

Nah, I should have mentioned it but exui is it’s own “server” like TabbyAPI.

Just run exui on the host that would normally serve tabby, and access the web ui through a browser.

If you need an API server, TabbyAPI fills that role.

Eskuero@lemmy.fromshado.ws · 9 months ago

Ollama has had for a while an issue opened abou the vulkan backend but sadly it doesn’t seem to be going anywhere.

brucethemoose@lemmy.world · 9 months ago

Thats because llama.cpp’s vulkan backend is kinda slow and funky, unfortunately.

Eskuero@lemmy.fromshado.ws · 9 months ago

Better than anything. I run through vulkan on lm studio because rocm on my rx 5600xt is a heavy pain

brucethemoose@lemmy.world · edit-2 9 months ago

The best hope for you is ZLUDA’s revival. It’s explicitly targeting LLM runtimes now, and RDNA1 (aka your 5600XT) is the oldest supported generation.

https://www.phoronix.com/news/ZLUDA-Third-Life

TBH you should consider using free llama/qwen APIs as well, when appropriate.

Grimy@lemmy.world · 9 months ago

vLLM can only run on linux but it’s my personal favorite because of the speed gain when doing batch inference.

brucethemoose@lemmy.world · edit-2 9 months ago

Aphrodite is a fork of vllm. You should check it out!

If you are looking for raw batched speed, especially with some redundant context, I would actually recommend sglang instead. Check out its experimental flags too.

sleep_deprived@lemmy.world · 9 months ago

I’d be interested in setting up the highest quality models to run locally, and I don’t have the budget for a GPU with anywhere near enough VRAM, but my main server PC has a 7900x and I could afford to upgrade its RAM - is it possible, and if so how difficult, to get this stuff running on CPU? Inference speed isn’t a sticking point as long as it’s not unusably slow, but I do have access to an OpenAI subscription so there just wouldn’t be much point with lower quality models except as a toy.

brucethemoose@lemmy.world · edit-2 9 months ago

CPU inference is, unfortunately, slow, even on my 7800X3D.

The one that might be interesting is deepseek code v2 lite, as its a very fast MoE model. IIRC microsoft also released a Phi MoE thats good for CPU.

Keep an eye out for upcoming bitnet models.

Dont bother upgrading RAM though. You will be bandwidth limited anyway, and it doesn’t make a huge difference.

kwa@lemmy.zip · 9 months ago

Thanks!

For people on MacOS, is there a better alternative than croco.cpp?

brucethemoose@lemmy.world · 9 months ago

If you download the source, you should be able to build it for metal? Croco.cpp is just a fork of kobold.cpp

I think lmstudio added MLX support, but otherwise you are stuck with anything llama.cpp based. I’d probably download llama.cpp directly and use the llama server first.

kwa@lemmy.zip · 9 months ago

I tried llama.cpp with llama-server and Qwen2.5 Coder 1.5B. Higher parameters just output garbage and I can see an OutOfMemory error in the logs. When trying the 1.5B model, I have an issue where the model will just stop outputting the answer, it will stop mid sentence or in the middle of a class. Is it an issue with my hardware not being performant enough or is it something I can tweak with some parameters?

brucethemoose@lemmy.world · edit-2 9 months ago

You can only allocate so much to metal backends, and if you are on (say) an 8GB Mac there won’t be much RAM left for the LLM itself.

But still, use a tighter quantization (like an IQ4 or IQ3_KM) of Qwen Coder 7B, and close as many background programs as you can. It should be small enough to fit.

kwa@lemmy.zip · 9 months ago

I have a MacBook Pro M1 Pro with 16GB RAM. I closed a lot of things and managed to have 10GB free, but that seems to still not be enough to run the 7B model. For the answer being truncated, it seems to be a frontend issue. I tried open-webui connected to llama-server and it seems to be working great, thank you!

brucethemoose@lemmy.world · 9 months ago

Try reducing the context size, and make sure Q8/Q8 flash attention is enabled with flags.

I’d link a specific GGUF quantization, but huggingface seems to be down for me!

brucethemoose@lemmy.world · 9 months ago

Try this one at least, it should still leave plenty of RAM free: https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-IQ4_XS.gguf

kwa@lemmy.zip · 9 months ago

Indeed, this model is working on my machine. Can you explain the difference with the one I tried before?

brucethemoose@lemmy.world · edit-2 9 months ago

It’s probably much smaller than whatever other GGUF you got, aka more tightly quantized.

Look at the filesize, thats basically how much RAM it takes.

brucethemoose@lemmy.world · 9 months ago

deleted by creator