Hardware for local inference?
from droopy4096@lemmy.ca to selfhosted@lemmy.world on 29 May 03:26
https://lemmy.ca/post/65589067
I want to host some LLMs locally and use more advanced models. Since new hardware is out of the question, I think I should be able to pull something off buying some yesteryear equipment on eBay etc. Has anybody attempted such a project? Does it scale horizontally? (I.e. can I connect two boxes to overcome single-box slowness?)
#selfhosted
RAM is a big driver of what models you can run, with VRAM at a premium. Equipping 2 separate boxes with enough RAM to load advanced models may be more expensive than just equipping one faster machine.
On the larger models even with ssd swap I can’t even get them to fully load on my 16gb of ram.
well, I intend on scavenging for parts as I can’t really afford today’s prices. And since I don’t really know what I should grab as minimum specs, I don’t even know what to look for. I could try to look for old(er) gaming rigs people sell or maybe there are some business workstations that may be sold in bulk. Either way, knowing what’s the minimum viable set of specs for running qwen or claude locally would be helpful
It can vary a lot based on what qwen model you want to run, but generally the 27b dense or 35b MoE are currently the best balance between size and capability afaik.
If you can run two 16GB cards you can pretty much max out the context on the 27b model, but a single card like the 3060 12gb could still work well on the 35b MoE model with the excess spilling into system memory.
I saw in another comment you have cards from the 2010’s but if they don’t have at least 8gb I wouldn’t even bother
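As a rough way to sanity-check the card sizes mentioned above, weight memory can be estimated from parameter count and quantization width. This is only a minimal sketch, not a measurement: the function name `weights_gib` is made up for illustration, and it ignores the KV cache (context), activations, and runtime overhead, all of which add to the real footprint.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width.
    Real quantized files carry some extra metadata, so treat this as a floor.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Parameter counts taken from the model sizes discussed in the thread.
    for name, params in [("27B dense", 27), ("35B MoE", 35)]:
        for bits in (4, 8, 16):
            print(f"{name} @ {bits}-bit: ~{weights_gib(params, bits):.1f} GiB")
```

By this estimate a 27B model at 4 bits needs roughly 12.6 GiB for the weights alone, so whatever is left on the card after that is what you have for context.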
Word of advice - don’t scavenge old /server/ hardware if you plan to put GPUs in - unless you really like heat and noise. Those machines from the likes of HP will take one look at a consumer graphics card in a PCI slot and decide they need to run all the fans at 100% to ward off evil spirits. Not to mention just the general ballache of proprietary PCI riser cards, PCI power cables, etc. etc.
You’re definitely better off with taking someone’s old gaming rig off their hands.
In terms of specs - value VRAM above everything else. A slow, old 3000 series card with 24GB of VRAM is much more useful than a brand new 5000 with 16GB. If you can find old RTX3090 24GB, they’re kinda ideal.
The one thing I will say for modern cards though is that they’re much better for power efficiency - and in particular idle power (which is important if you’re running the thing always on.) For my main LLM machine I have two RTX5060Ti (32GB total), which at the time was the sweet spot for price/performance/power, and it’s very nice that they idle around 3 or 4 watts. I bought them before the world went crazy and prices went mad though, so they may not be the sweet spot any more.
Once you’re in 32GB VRAM type territory, you can run really really good dense models like Qwen-3.6-27b at a decent quant, decent context size, and good performance for things like coding, or bigger MoE models for more general use (particularly if you have a good CPU and regular RAM for offloading to CPU). For use as an assistant (i.e. not an OpenClaw fully automated slop machine), I use 3.6-27b as a daily driver in Claude Code, and basically never use Sonnet.
How old we talking? I personally wouldn’t go further back than 2000 series rtx. A friend has had good luck with Intel GPUs for ‘cheap’.
No, you absolutely cannot scale horizontally for speed. VRAM is king, with system RAM being swappable for it at a major speed penalty. SSD is even slower than that, and all of those are orders of magnitude faster than any Ethernet you’ll be connecting boxes together with. That’s not to say clustering isn’t an option, just that speed is going to be worse the more you scale out like that.
I’ve got some circa 2010 cards lying about with a 32GB server that already has 8GB carved out for TrueNAS, so essentially I could squeeze 16-24GB out of it, but it’s an older Intel i5 CPU
If you can constrain yourself to MoE-based LLMs, they’ll generally cope better, performance-wise, with not entirely fitting in VRAM than non-MoE LLMs do, as some experts may not get loaded into VRAM at all.
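A back-of-the-envelope way to see why MoE models tolerate spilling out of VRAM: each token only touches the active parameters, not the full set. The sketch below assumes that "35B A3B" means about 35B total and about 3B active parameters per token, and assumes 4-bit weights; the helper name `gb_read_per_token` is hypothetical.

```python
def gb_read_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight data read to generate one token, in GB (10**9 bytes)."""
    return active_params_billion * bits_per_weight / 8


# Dense model: every weight is used for every token.
dense_27b = gb_read_per_token(27, 4)
# MoE model (assumed ~3B active of ~35B total): only the routed experts are used.
moe_a3b = gb_read_per_token(3, 4)

if __name__ == "__main__":
    print(f"27B dense: ~{dense_27b} GB read per token")
    print(f"35B A3B:   ~{moe_a3b} GB read per token")
    print(f"ratio:     ~{dense_27b / moe_a3b:.0f}x less traffic for the MoE model")
```

So even when part of the MoE model sits in slower system RAM, far less data has to move per token than with a dense model of similar total size.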
Honestly, you’re a few months late to the whole buying GPUs for local llms party, so expect exorbitant prices even for older cards
The name of the game is vram. For the most part, more is better. If you can get your hands on multiple matching (same model) 24gb or higher cards (within price range), you’re golden.
Going for more than 2 gpus can become challenging with motherboard pcie slot heights, so make sure either your cards aren’t too tall or you have widely spaced out pcie slots.
For inference, speed (tokens/second) is mostly limited by memory bandwidth. Go for cards with higher memory bandwidth if you can afford it (e.g. GDDR6 will generally be faster than GDDR5).
Also with multi gpus you will need an adequate power supply, and a large enough case.
If you want to be a bit eccentric and load huge models, you can also go the CPU route and fill up a motherboard with 256 GB ram, because then you’re in the several hundred B param model territory, which could, depending on your use case, be better than having faster inference on smaller/quantized models. Even then, DDR5 with high MHz is still way slower than gpus.
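To put the bandwidth point in numbers: if every generated token has to stream the active weights from memory once, then memory bandwidth divided by that size gives a ceiling on tokens per second. A rough sketch under that assumption follows. The bandwidth figures are illustrative spec-sheet values I am assuming (about 936 GB/s for an RTX 3090, about 89.6 GB/s for dual-channel DDR5-5600), and real throughput will come in below these ceilings.

```python
def max_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all active weights once."""
    return bandwidth_gb_s / weights_gb


# Assumed workload: a 27B dense model at 4 bits per weight (27 * 4 / 8 GB).
WEIGHTS_GB = 13.5

# Assumed, illustrative peak bandwidths in GB/s; check the specs of your own parts.
ASSUMED_BANDWIDTH_GB_S = {
    "RTX 3090 (GDDR6X)": 936.0,
    "dual-channel DDR5-5600": 89.6,
}

if __name__ == "__main__":
    for name, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_s(bandwidth, WEIGHTS_GB)
        print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

The roughly tenfold gap between the two ceilings is the "way slower than GPUs" part: a big pile of system RAM lets you load the model, but not run it quickly.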
I’m running Qwen 3.6 35B A3B (the MoE model) on an 8GB Vram Nvidia GPU with 32 GB of ram, with tweaking (and Turboquant) I’ve got it up to 30-40 Tokens per second and a 260k Context. It’s very usable. I’ve seen people report success with Dual 3060 Cards, but you’re still talking $1000-1500 for that kind of setup even if you have parts of it already.
I’m running Qwen 3.6 35B A3B on my Ryzen 8700g and it runs pretty well, but the bigger problem there is probably the cost of RAM
I use Instinct MI60 GPUs. They give pretty decent performance for local LLMs. Connecting multiple computers is going to be impractical because of the severe bandwidth bottleneck.
If you’re willing to wait until 2028, when memory prices are expected to drop, and you’re open to buying new hardware once they do, I’d give real consideration to waiting until then. There’ll also probably be better hardware and better models by then.
Unless you’re going to really run a lot, this is an area where vast.ai is probably more affordable than mucking with hardware.
thank you folks. Your input gives me a decent starting point. I’ll start digging based on info/experiences shared, maybe I can find someone locally selling old GPU with enough ram for cheap
For really lightweight models, Qwen3:8b is pretty good for an 8gb graphics card.
Gemma 4:e4b is pretty good too. I usually run that on my 16gb gpu.
Obviously the little ones aren’t as good as big ones, but you can always rely on real intelligence to fill in the gaps
What size model? I can run 8 billion parameter models on my Geforce 3070 with 8gb of vram. Bigger models need more memory. For $1-2k you can upgrade to a 16 or 32 gb video card. For $3k you can get a Framework Desktop with 128 gb unified memory. For $6k you can get a DGX Spark with a blackwell chip and 128 gb unified memory. Mac mini or Mac studio are also good choices in this price range.
The 64GB Framework Desktop runs just barely over 2k configured minimally, I went that route because I thought it was a better option than the discrete 32GB video card, but there are tradeoffs with compatibility. Something to think about at the 2k, but not quite 3k range though.
The trend I see is Mac Minis with a lot of unified memory. These are typically very well off people though. Prices for even old GPUs like 3090s are ridiculous now. I don’t think connecting 2 machines over Ethernet would work well, but putting 2 GPUs in a single machine does.
With older hardware, once you accumulate enough vram to run it, your problem is going to shift to memory bandwidth and your question is going to shift from ‘Can I run this Model’ to ‘Can I run this Model at an acceptable speed’.