Which Llama Server Hardware do you use?
from bazinga@discuss.tchncs.de to selfhosted@lemmy.world on 03 Apr 07:42
https://discuss.tchncs.de/post/57758962
I realize I need to upgrade my little NUC to something bigger for faster inference of larger Llama models. I want something you can still keep on your living room’s TV bench, so no monster rack please, but that also has the necessary muscle for Llama when needed. Budget doesn’t matter right now; I want to understand what’s good and what’s out there. Thanks
#selfhosted
From what I have observed, the Fediverse is against “AI”. I doubt you will find your answer here. “AI” uses too much water and electricity and is putting people out of work.
Not the whole fediverse.
I get a good efficiency boost thanks to LLMs. They are not perfect; they lie and everything. But they write simple, good bash scripts. They know cron and regex. Stuff that I could do but don’t want to.
Creating videos is costly. My LLM usage compared to creating videos is a joke. People playing a bus simulator use much more than me asking an LLM what the function to calculate a mean is called.
AI is not the problem. People using it are.
That entirely depends on the specific models and use case.
I use a 128GB Framework Desktop. Back when I got it, it was $2,500 with 8TB of SSD storage, but the RAM shortage has driven prices up to substantially more. That system’s interesting in that you can tell Linux to use essentially all of the memory as video memory; it has an APU with unified memory, so the GPU can access all that memory.
That’ll get you 70B models, like Llama 3-based stuff, at Q6_K with a 128K context window, which is the model max. That’s okay for chatbot-like operation, but you won’t want to run code generation with that.
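To sanity-check why a 70B quant fits in 128GB of unified memory, here’s a back-of-envelope sketch (my own numbers, not from this thread; ~6.56 bits per weight is the nominal Q6_K figure, and real usage adds KV cache and runtime overhead on top):

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB."""
    # params_billion * 1e9 weights, bits_per_weight / 8 bytes each, / 1e9 for GB
    return params_billion * bits_per_weight / 8

# Q6_K quantization is roughly 6.56 bits per weight
print(round(weights_gb(70, 6.56), 1))  # ~57 GB: fits in 128 GB unified memory
print(round(weights_gb(70, 16), 1))    # FP16 would need ~140 GB and would not
```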
For some tasks, you may be better-off using a higher-bandwidth-but-less-memory video card and an MoE model; this doesn’t keep all of the model active and in video memory, only loading relevant expert models. I can’t suggest much there, as I’ve spent less time with that.
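On the bandwidth point: token generation is largely memory-bound, so a crude upper bound on decode speed is memory bandwidth divided by the bytes streamed per token (the active weights). A sketch with assumed figures (the ~256 GB/s for the Framework Desktop’s unified memory and the ~960 GB/s discrete-GPU number are my assumptions, not from this thread):

```python
def max_tokens_per_sec(bandwidth_gb_s, active_weights_gb):
    """Upper bound on decode speed: each token streams the active weights once."""
    return bandwidth_gb_s / active_weights_gb

# Dense 70B at Q6_K (~57 GB of weights) on ~256 GB/s unified memory
print(round(max_tokens_per_sec(256, 57.4), 1))  # ~4.5 tokens/s at best

# An MoE model with only ~13 GB of active experts per token on a ~960 GB/s GPU
print(round(max_tokens_per_sec(960, 13), 1))    # ~73.8 tokens/s at best
```

This is why a smaller, faster card plus an MoE model can beat a big-memory APU for interactive use: less data moves per token, and it moves faster.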
If you don’t care about speed (you probably do), you can run just about anything with llama.cpp using the CPU and main memory, as long as you have enough memory. That might be useful if you just want to evaluate the quality of a given model’s output and get a feel for what you can get out of it before buying hardware.
You might want to ask on !localllama@sh.itjust.works, as there’ll be more people familiar there (though I’m not on there myself).
EDIT: I also have a 24GB Radeon 7900 XTX, but for LLM stuff like llama.cpp, I found the lack of memory too constraining. It does have higher memory bandwidth, so for models that fit, it’s faster than the Framework Desktop. In my experience, GPUs were more interesting for image diffusion models like Stable Diffusion (most open-weight image diffusion models are less memory-hungry than LLMs). Though if you want to do Flux v2, I wasn’t able to fit it on that card. I could run it on the Framework Desktop, but at the resolutions I wanted, the poor ol’ Framework took about 6 or 7 minutes to generate an image.
EDIT2: I use all AMD hardware, though I agree with @anamethatisnt@sopuli.xyz that Nvidia hardware is going to be easier to get working; a lot of the AMD software is much more bleeding-edge, as Nvidia got into all this earlier. That being said, Nvidia also charges a premium because of that. I understand that the DGX Spark is something of an Nvidia analog to the Framework Desktop and similar AI Max-based systems; it has unified memory, but you’ll pay for it, something like $4k.
When it comes to Nvidia GPUs, VRAM is the main thing to look for.
For consumer cards it is:
Entry level - RTX 5060 Ti 16GB VRAM, price point around 500-550 euro
Mid - a used RTX 3090 24GB VRAM, around 830 euro when I look at Swedish second-hand markets
High - RTX 5090 32GB VRAM, around 3500 euro
After that you end up looking at the RTX Pro Blackwell cards:
Entry - RTX PRO 5000 Blackwell 48GB VRAM, ~5300 euro
Mid - RTX PRO 6000 Blackwell 96GB VRAM, ~10100 euro
It all depends on which models you want to run, you can definitely start playing around with Llama 3 8B and similar models with a 5060 Ti 16GB.
If you’re looking at 24B-30B models, you need the 24GB of VRAM the RTX 3090 offers; going for the RTX 5090 also gets you a larger context window.
If you’re looking to run Llama 3 70B, then you need to go to the RTX Pro level of VRAM.
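A quick fit check against the tiers above (my own helper, not from this thread; the ~20% headroom for KV cache and overhead is a rough assumption, and actual quantized sizes vary by quant):

```python
# VRAM of the cards discussed above, in GB
CARDS = {
    "RTX 5060 Ti": 16,
    "RTX 3090": 24,
    "RTX 5090": 32,
    "RTX PRO 5000 Blackwell": 48,
    "RTX PRO 6000 Blackwell": 96,
}

def cards_that_fit(weights_gb, headroom=1.2):
    """Cards whose VRAM covers the quantized weights plus ~20% for KV cache/overhead."""
    return [name for name, vram in CARDS.items() if vram >= weights_gb * headroom]

print(cards_that_fit(6.6))    # Llama 3 8B quant (~6.6 GB): every card listed
print(cards_that_fit(24.0))   # a ~30B quant (~24 GB): 32 GB cards and up
print(cards_that_fit(57.4))   # 70B at Q6_K (~57 GB): only the 96 GB RTX PRO 6000
```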
All of this is based on running with Nvidia cards; there are also other setups, such as Mac Studios with huge amounts of RAM. They’re slower but allow for much larger models at the same price point.
You could also run with AMD/Intel GPUs, but much software is built primarily for CUDA (and Nvidia) GPUs, so it’s more work and not always compatible.
I know you said no “monster rack” but I don’t really know what you classify as a monster. :)
An ordinary gaming PC is also a good starter AI PC, so something like this lets you do both:
pcpartpicker.com/list/sFp4qd
You might want to also ask at @localllama@sh.itjust.works
!localllama@sh.itjust.works.
@localllama@sh.itjust.works would be a user named “localllama” rather than a community named “localllama”.