I’m looking to build a low-end Ollama LLM server to improve Home Assistant voice control, Immich image recognition, and a few other services. With the current cost of hardware components like memory, I’m looking to build something small but somewhat expandable.

I have an old micro-ATX form factor computer that I’m thinking will be a good option to upgrade. I’d love recommendations on motherboard, processor, and video card combos that would likely be compatible and sufficient to run a decent server while keeping costs low: basically, the best bang for the buck. I have a couple of M.2 SSDs I can repurpose. I’d prefer a motherboard with 2.5Gbit Ethernet, but otherwise I’m open.

Also, recommendations on sites to purchase good-quality memory at reasonable prices that ship to the US would be welcome. I’d be willing to look at lightly used components, too.

Any advice on any of these topics would be greatly appreciated. The advice I’ve found has all been out of date, especially now that crypto mining has faded (so video cards are not as expensive) while LLM data centers are buying up and reserving memory before it’s even manufactured.

  • vegetaaaaaaa@lemmy.world

    I suggest using llama.cpp instead of Ollama; you can easily squeeze +10% inference speed and other memory optimizations out of llama.cpp. With hardware prices nowadays, I think every % saved on resources matters. Here is a simple Ansible role to set up llama.cpp; it should give you a good idea of how to deploy it.
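
    To give a concrete sense of what that deployment looks like in use: once llama-server (the HTTP server that ships with llama.cpp) is running, anything on your LAN, including Home Assistant, can hit its OpenAI-compatible API. Here is a minimal stdlib-only sketch; the LAN address, port, and launch flags are illustrative assumptions, not something from the Ansible role:

    ```python
    # Assumes llama-server is already running, started with something like:
    #   llama-server -m model.gguf --port 8080 -ngl 99
    # (-ngl offloads layers to the GPU). Address below is a hypothetical
    # LAN IP; adjust for your setup.
    import json
    import urllib.request

    LLAMA_URL = "http://192.168.1.50:8080/v1/chat/completions"

    def ask(prompt: str) -> str:
        # llama-server exposes an OpenAI-compatible chat completions endpoint.
        payload = {
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        }
        req = urllib.request.Request(
            LLAMA_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        print(ask("Turn off the living room lights."))
    ```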

    A dedicated inference rig is not gonna be cheap. What I did, since I needed a gaming rig anyway, was get 32GB of DDR5 (this was before the current RAMpocalypse; if I had known, I would have bought 64) and an AMD 9070 (16GB VRAM; again, if I had known how crazy prices would get, I’d probably have bought a 24GB VRAM card). The home server runs the usual/non-AI stuff, and llama.cpp runs on the gaming desktop (the home server just has a proxy to it). Yeah, the gaming desktop has to be powered up when I want to run inference, but it’s my main desktop so it’s powered on most of the time; no big deal.
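
    For the proxy part, in practice you’d likely use nginx, Caddy, or Traefik on the home server; I don’t know what the commenter actually runs, so here is just a stdlib sketch of the shape of it, forwarding non-streaming POST requests to the desktop and failing gracefully when it’s powered off. Addresses and ports are hypothetical:

    ```python
    # Minimal non-streaming reverse proxy: home server listens on :8081 and
    # forwards to llama-server on the gaming desktop. Illustrative only.
    import urllib.error
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    UPSTREAM = "http://192.168.1.50:8080"  # hypothetical desktop address

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the incoming request body and replay it upstream.
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            req = urllib.request.Request(
                UPSTREAM + self.path,
                data=body,
                headers={"Content-Type": self.headers.get(
                    "Content-Type", "application/json")},
                method="POST",
            )
            try:
                with urllib.request.urlopen(req) as resp:
                    data = resp.read()  # buffers; streaming not supported here
                self.send_response(resp.status)
                self.send_header("Content-Type", resp.headers.get(
                    "Content-Type", "application/json"))
                self.send_header("Content-Length", str(len(data)))
                self.end_headers()
                self.wfile.write(data)
            except urllib.error.URLError:
                # Desktop powered off: return 502 so clients fail cleanly.
                self.send_response(502)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8081), ProxyHandler).serve_forever()
    ```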