Great post as expected (I'm a really big fan of Simon), but a quick note lest others get prematurely excited: this is on an Apple laptop. Personally I run these models because I care about openness, so an Apple laptop is the last thing I'd buy. Good post though, and worth a read, as there's plenty of good stuff here for a non-Apple user.
Yeah, I don't have a great perspective right now on what this looks like in non-Apple world.
The problem is always RAM. On Apple Silicon RAM is shared between CPU and GPU, which means a machine with 64GB of RAM can run really big models with GPU acceleration.
On Windows/Linux machines you usually need to consider the VRAM available to your GPU. I haven't got a great feel for what kind of VRAM numbers are common for different consumer setups there, so I don't tend to write much about those.
Of course llama.cpp will happily run models on CPU as well, but again I don't have a great feel for how usable that is.
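If anyone wants to try that, here's roughly what CPU-only inference looks like through the llama-cpp-python bindings. This is a minimal sketch, not something from the post: the GGUF filename is a placeholder, and you'd swap in whatever quantized file you actually downloaded.

```python
# CPU-only inference sketch using the llama-cpp-python bindings.
# The model filename is a placeholder; n_gpu_layers=0 keeps everything on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,   # 0 = no layers offloaded to a GPU, pure CPU inference
    n_ctx=4096,       # context window; larger values cost more RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```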
On non-Mac laptops with integrated graphics it's the same: up to 50% of the RAM can be allocated to the GPU (both AMD and Intel work this way). Not sure if Windows allows it, but on Linux you can definitely see it. So if you have a 96GB RAM Framework laptop, that's 48GB for the GPU...
I think the main impediment is Intel and SYCL not working on all their cards... AMD is good to go, I think.
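If you want to check what the kernel will actually let an AMD iGPU use on Linux, the amdgpu driver exposes it through sysfs. A small sketch, with the caveat that the card index varies per machine, so the path here is just an example:

```python
# Check how much VRAM and GTT (GPU-accessible system RAM) the amdgpu driver reports.
# The card index varies between machines; adjust /sys/class/drm/card0 as needed.
from pathlib import Path

device = Path("/sys/class/drm/card0/device")  # example path; may be card1 etc.

for name in ("mem_info_vram_total", "mem_info_gtt_total"):
    f = device / name
    if f.exists():
        print(f"{name}: {int(f.read_text()) / 2**30:.1f} GiB")
    else:
        print(f"{name}: not exposed (not an amdgpu device?)")
```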
Interesting! I'll have to give this a try. On my Framework 16 with the GPU (running Fedora), ollama runs super well and blazing fast, though only with smaller models of course. I have 128GB of system memory though, so I would love to try a bigger model on the integrated GPU.
Btw, this is my favorite laptop of all time. Highly recommend it.
It's not the VRAM number that's the problem on the Windows side.
It's the throughput. For whatever reason it's a good 5-10x lower.
Even if it's a large amount and GPU-accessible, if the throughput is that bad then 70B models aren't usable.
What puzzles me is that even the newer AMD APUs have this issue. The upcoming Strix Halo is rumoured to be around 140 GB/s or so, which puts it at the very bottom end of Apple devices.
Apple silicon RAM is basically all VRAM, with 150-400 GB/s bandwidth.
Windows machines with dual channel DDR5 are only around 80 GB/s bandwidth.
Nvidia cards are around 1 TB/s, but cards with 40+GB of VRAM are more expensive than Macs with equivalent RAM capacity.
I would expect that an Nvidia card on Windows would be around twice as fast as Apple silicon if you have a model with 40+GB, and from some quick research this seems to be true.
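A rough way to see why bandwidth dominates: generating each token has to stream more or less the full set of weights from memory, so tokens/second is roughly capped at bandwidth divided by model size. Back-of-the-envelope, with ballpark numbers rather than benchmarks:

```python
# Back-of-the-envelope: tokens/sec ~= memory bandwidth / bytes read per token.
# For dense models, each generated token reads roughly the whole weight file.
model_size_gb = 40  # e.g. a ~4-bit quant of a 70B model

for name, bandwidth_gbps in [
    ("Dual-channel DDR5", 80),
    ("Apple M-series (Pro/Max)", 400),
    ("High-end Nvidia card", 1000),
]:
    print(f"{name}: ~{bandwidth_gbps / model_size_gb:.1f} tokens/sec upper bound")
```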
The actual RAM is LPDDR5, miles slower than what you'd find in even very old dedicated GPUs.
It's as you say though: for tasks like LLMs that doesn't matter if you've got enough aggregate bandwidth to that slow RAM… it ends up being almost the same thing as having actually fast RAM.
It's a neat trick, and I'm a little frustrated that nobody is copying it. Using dirt-cheap LPDDR5 (about a dollar per gig) while charging Apple's comical RAM upgrade prices feels like a cruel joke.
In my experience with a 4070 Ti (12GB VRAM), a 7800X3D and 64GB of DDR5 it will run… but seemingly at about 0.1-0.5 tokens per second with partial GPU offload.
It’s painful, but the results are night and day compared to the models that will fit entirely in vram. Of course perhaps I’m doing something wrong so if others have advice it would be great to hear!
Is it any faster with no GPU offload, given how little would fit in VRAM? Can you try giving a few prompts and getting a precise average token speed? Curious if it would be worth upgrading my RAM to at least run it on CPU.
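Something like this rough sketch is what I'd use to compare offload settings, if it helps. It uses the llama-cpp-python bindings; the model path is a placeholder and the layer counts are just examples to adjust to whatever fits in your VRAM:

```python
# Timing sketch: measure generation speed at different GPU offload levels.
# Note: each call reloads the full model, so this is slow for a ~40GB file.
import time
from llama_cpp import Llama

def tokens_per_second(n_gpu_layers: int, prompt: str = "Explain RAG briefly.") -> float:
    llm = Llama(
        model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=n_gpu_layers,  # 0 = CPU only; higher = more layers on the GPU
        n_ctx=2048,
        verbose=False,
    )
    start = time.time()
    n_tokens = 0
    for _ in llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,  # count chunks as they arrive (roughly one per token)
    ):
        n_tokens += 1
    return n_tokens / (time.time() - start)

for layers in (0, 10, 20):  # adjust to what fits in your VRAM
    print(layers, "GPU layers:", round(tokens_per_second(layers), 2), "tok/s")
```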
I have a 9800X3D / 48GB DDR5 / 3090, and this model still being ~40GB when quantized is a challenge. Sure would be nice to have a 30B version that can quantize down to ~16GB. Or I guess I could upgrade my PSU, get another 3090 and use NVLink...
Not just "an" apple laptop but one with 64GB of ram. From a quick look at Apple's website the cheapest option among the current models is 4600€ for the 14 inch macbook pro.
So, neat, I guess, but not convinced I would pay four macbook airs for that.
Why would running the model with Firefox crash his Mac? At worst, it should start swapping:
> The first time I tried it consumed every remaining bit of available memory and hard-crashed my Mac
Also, hallucinations are so easily dismissed these days:
> (Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017—so a hint of some quite finely grained world knowledge there, even if it’s not relevant for the present day.)
I imagine it crashed because it wasn't just using the RAM for the CPU, it was using it for the GPU which might not be able to swap in the same way.
I don't think listing an outdated mayor of a tiny town really counts as a hallucination, especially since Half Moon Bay appears to change mayors about once a year.
It told me I used to work for GitHub which is definitely a hallucination, that's never been true. Weirdly a bunch of models make that mistake.
It's great that this can run on a laptop, but FWIW the Llama 70B model is nowhere near "GPT-4 class" in my own use cases. 405B might be, though I haven't tested it.
When I say GPT-4 class I'm talking about being comparable to the GPT-4 that was released in March 2023.
The Llama 3.3 70B model is clearly nowhere near as good as today's GPT-4o family of models, or the other top-ranking models today like Gemini 1.5 Pro and Claude 3.5 Sonnet.
To my surprise, Llama 3.3 70B is ranking higher than Claude 3 Opus on https://livebench.ai/ - I'm suspicious of that result, personally. I think Opus was the best available model for a few months earlier this year.
I guess it's because it has the highest score of all models in instruction following, 20 points higher than Opus, which compensates for shortcomings elsewhere (e.g. in language), and which wouldn't necessarily translate to human evaluation of usefulness.
The model you are running isn't the one used in the benchmarks you link.
The default llama3.3 model in Ollama is heavily quantized (~4-bit). Running the full fp16 model, or even an 8-bit quant, wouldn't be possible on your laptop with 64GB of RAM.
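Rough weight-size math, ignoring KV cache and other overhead (the bits-per-weight figures are approximate, so treat these as lower bounds):

```python
# Approximate weight sizes for a 70B-parameter model at common precisions.
# Real GGUF files are a bit larger (mixed quant levels, metadata), and the
# KV cache for your context window comes on top of this.
params_billion = 70

for name, bits_per_weight in [("fp16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    size_gb = params_billion * bits_per_weight / 8
    print(f"{name}: ~{size_gb:.0f} GB of weights")
```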
Vibes, based on what I can remember using that model for.
There's still a gpt-4 model available via the OpenAI API, but it's gpt-4-0613 from June 2023 - the March 2023 snapshot gpt-4-0314 is no longer available.
I'm not going to try for an extensive evaluation comparing it with Llama 3.3 though, life's too short and that's already been done better than I could by https://livebench.ai/
I am not particularly interested in those benchmarks that deliberately expose weaknesses in models: I know that models have weaknesses already!
What I care about is the things that they're proven to be good at - can I do those kinds of things (RAG, summarization, code generation, language translation) directly on my laptop?
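For what it's worth, here's roughly what one of those tasks (summarization) looks like against a local model, sketched with the ollama Python client. The model tag and the document text are placeholders, and this assumes the Ollama server is running with the model already pulled:

```python
# Summarization against a locally-running model via the ollama Python client.
# Assumes `ollama serve` is running and the model tag has already been pulled.
import ollama

document = "…paste the text you want summarized here…"  # placeholder text

response = ollama.chat(
    model="llama3.3",  # placeholder tag; use whatever local model you have
    messages=[
        {
            "role": "user",
            "content": f"Summarize the following in three bullet points:\n\n{document}",
        },
    ],
)
print(response["message"]["content"])
```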