> That's small enough to run well on ~$5,000 of hardware...
Honestly curious where you got this number, unless you're talking about extremely aggressive quants. Even just a Q4 GGUF is ~130 GB. Am I missing out on a relatively cheap way to run models this large well?
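For reference, file size at a given quant is just arithmetic: roughly parameters × average bits-per-weight ÷ 8. A quick sketch (the ~4.5 bits/weight average for Q4_K-style quants and the 235B parameter count are my assumptions for illustration, not exact figures):

```python
# Back-of-envelope GGUF size: parameters * average bits per weight / 8.
# Q4_K-style quants mix tensor types, so ~4.5 bits/weight is only a rough average.
def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # billions of params * bits / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_weight / 8

# A ~235B-parameter model at ~4.5 bits/weight lands right around the ~130 GB figure.
print(f"{gguf_size_gb(235, 4.5):.0f} GB")  # -> 132 GB
```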
I suppose you might be referring to a Mac Studio, but (while I don't own one, so I can't speak firsthand) there seems to be some real debate about whether they run models this size "well"?
Admittedly I haven't run on system RAM often, but every time I've tried with something like KoboldCPP or Ollama it's been abysmally slow (<1 T/s). Is there some particular trick to running them faster, or is it just "get faster RAM"? I fully admit my DDR3 system has quite slow RAM...
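As far as I understand, single-stream decode on a dense model is basically memory-bandwidth-bound: every generated token has to stream all the weights once, so tokens/sec tops out around bandwidth ÷ model size. A rough sketch (the bandwidth numbers are approximate peak specs I've seen quoted, not measurements):

```python
# Crude upper bound on decode speed for a dense model: each token must
# read every weight once, so t/s <= memory bandwidth / model size.
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

MODEL_GB = 130  # the Q4 quant size from above
for name, bw in [
    ("DDR3-1600 dual-channel", 26),    # approx. peak spec
    ("DDR5-6000 dual-channel", 96),    # approx. peak spec
    ("M3 Ultra unified memory", 819),  # Apple's quoted figure
    ("RTX 5090 GDDR7", 1792),          # NVIDIA's quoted figure
]:
    print(f"{name}: <= {max_tokens_per_sec(bw, MODEL_GB):.1f} t/s")
```

That lines up with the sub-1 T/s you're seeing on DDR3, and it suggests "get faster RAM" really is most of the answer for dense models; offloading layers to a GPU only speeds up the fraction of the model that actually fits in VRAM.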
Considering there were two generations (around 4.5 years) of top-tier consumer GPUs (3090/4090) stuck at 24 GB of VRAM max, and the current one (5090) "only" bumped that to 32 GB, I think you'll be waiting more than 5 years before 128 GB of VRAM reaches the mid-tier. 12-16 GB is mid-tier today and has been since LLMs became "a thing".
I hope I'm wrong, though, and we see a large bump soon. Even just 32 GB in the mid-tier would be huge.
I'm really tempted to try out a Mac Studio with 256+ GB of unified memory (~192 GB of it usable as VRAM), but it's sadly out of my budget at the moment. I know there's a bandwidth hit compared to a discrete GPU, but being able to run huge models with huge contexts locally would be quite nice.
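For the fit math: as I understand it, macOS by default caps GPU-wired memory at roughly 75% of unified RAM, which is where the ~192 GB figure comes from. A quick sketch of the headroom (the 75% default and the 132 GB model size are assumptions carried over from above):

```python
# Sketch of fit on a 256 GB Mac Studio: unified RAM minus the default
# GPU-wired cap (~75%, an assumption), minus the model, leaves room
# for KV cache / context.
ram_gb = 256
gpu_usable_gb = ram_gb * 0.75   # ~192 GB, matching the figure above
model_gb = 132                  # the Q4 quant estimate from earlier
print(f"~{gpu_usable_gb - model_gb:.0f} GB left for KV cache and context")
```

Roughly 60 GB of headroom for context, which is the part that makes it tempting despite the bandwidth hit.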