Great post as expected (I'm a really big fan of Simon), but a quick note lest others get prematurely excited: this is on an Apple laptop. Personally I run these models because I care about openness, so an Apple laptop is the last thing I'd buy. Good post though, and worth a read, as there's plenty of good stuff here for a non-Apple user.
Yeah, I don't have a great perspective right now on what this looks like in non-Apple world.
The problem is always RAM. On Apple Silicon RAM is shared between CPU and GPU, which means a machine with 64GB of RAM can run really big models with GPU acceleration.
On Windows/Linux machines you usually need to consider the VRAM available to your GPU. I haven't got a great feel for what kind of VRAM numbers are common for different consumer setups there, so I don't tend to write much about those.
Of course llama.cpp will happily run models on CPU as well, but again I don't have a great feel for how usable that is.
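If anyone wants to try that, here's roughly what CPU-only inference looks like through the llama-cpp-python bindings. This is a minimal sketch, not something from the post: the GGUF filename is a placeholder, and you'd swap in whatever quantized file you actually downloaded.

```python
# CPU-only inference sketch using the llama-cpp-python bindings.
# The model filename is a placeholder; n_gpu_layers=0 keeps everything on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,   # 0 = no layers offloaded to a GPU, pure CPU inference
    n_ctx=4096,       # context window; larger values cost more RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```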
On non-Mac laptops with integrated graphics it's the same: up to 50% of the RAM can be allocated to the GPU (both AMD and Intel work this way). Not sure if Windows allows it, but on Linux you can definitely see it. So if you have a 96GB RAM Framework laptop, that's 48GB for the GPU...
I think the main impediment is Intel and SYCL not working on all their cards... AMD is good to go, I think.
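If you want to check what the kernel will actually let an AMD iGPU use on Linux, the amdgpu driver exposes it through sysfs. A small sketch, with the caveat that the card index varies per machine, so the path here is just an example:

```python
# Check how much VRAM and GTT (GPU-accessible system RAM) the amdgpu driver reports.
# The card index varies between machines; adjust /sys/class/drm/card0 as needed.
from pathlib import Path

device = Path("/sys/class/drm/card0/device")  # example path; may be card1 etc.

for name in ("mem_info_vram_total", "mem_info_gtt_total"):
    f = device / name
    if f.exists():
        print(f"{name}: {int(f.read_text()) / 2**30:.1f} GiB")
    else:
        print(f"{name}: not exposed (not an amdgpu device?)")
```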
Interesting! I'll have to give this a try. On my Framework 16 with the GPU (running Fedora), ollama runs super well and blazing fast, though only with smaller models of course. I have 128GB of system memory though, so I would love to try a bigger model on the integrated GPU.
Btw, this is my favorite laptop of all time. Highly recommend it.
It's not the VRAM number that's the problem on the Windows side.
It's the throughput. For whatever reason it's a good 5-10x lower.
Even if it's a large amount and GPU-accessible, if the throughput is that bad then 70B models aren't usable.
What puzzles me is that even the newer AMD APUs have this issue. The upcoming Strix Halo is rumoured to be around 140 GB/s or so, which puts it at the very bottom end of Apple devices.
Apple silicon RAM is basically all VRAM, with 150-400 GB/s bandwidth.
Windows machines with dual channel DDR5 are only around 80 GB/s bandwidth.
Nvidia cards are around 1 TB/s, but cards with 40+GB of VRAM are more expensive than Macs with equivalent RAM capacity.
I would expect that an Nvidia card on Windows would be around twice as fast as Apple silicon if you have a model with 40+GB, and from some quick research this seems to be true.
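A rough way to see why bandwidth dominates: generating each token has to stream more or less the full set of weights from memory, so tokens/second is roughly capped at bandwidth divided by model size. Back-of-the-envelope, with ballpark numbers rather than benchmarks:

```python
# Back-of-the-envelope: tokens/sec ~= memory bandwidth / bytes read per token.
# For dense models, each generated token reads roughly the whole weight file.
model_size_gb = 40  # e.g. a ~4-bit quant of a 70B model

for name, bandwidth_gbps in [
    ("Dual-channel DDR5", 80),
    ("Apple M-series (Pro/Max)", 400),
    ("High-end Nvidia card", 1000),
]:
    print(f"{name}: ~{bandwidth_gbps / model_size_gb:.1f} tokens/sec upper bound")
```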
The actual RAM is LPDDR5, miles slower than what you'd find in even very old dedicated GPUs.
It's as you say though: for tasks like LLMs that doesn't matter if you've got enough aggregate bandwidth to that slow RAM… it ends up being almost the same thing as having actually fast RAM.
It's a neat trick, and I'm a little frustrated that nobody is copying it. Using dirt-cheap LPDDR5 (about a dollar per gig) while charging Apple's comical RAM upgrade prices feels like a cruel joke.
In my experience with a 4070 Ti (12GB VRAM), a 7800X3D and 64GB of DDR5 it will run… but seemingly at about 0.1-0.5 tokens per second with partial GPU offload.
It’s painful, but the results are night and day compared to the models that will fit entirely in vram. Of course perhaps I’m doing something wrong so if others have advice it would be great to hear!
Is it any faster with no GPU offload, given how little would fit in VRAM? Can you try giving a few prompts and getting a precise average token speed? Curious if it would be worth upgrading my RAM to at least run it on CPU.
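Something like this rough sketch is what I'd use to compare offload settings, if it helps. It uses the llama-cpp-python bindings; the model path is a placeholder and the layer counts are just examples to adjust to whatever fits in your VRAM:

```python
# Timing sketch: measure generation speed at different GPU offload levels.
# Note: each call reloads the full model, so this is slow for a ~40GB file.
import time
from llama_cpp import Llama

def tokens_per_second(n_gpu_layers: int, prompt: str = "Explain RAG briefly.") -> float:
    llm = Llama(
        model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=n_gpu_layers,  # 0 = CPU only; higher = more layers on the GPU
        n_ctx=2048,
        verbose=False,
    )
    start = time.time()
    n_tokens = 0
    for _ in llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,  # count chunks as they arrive (roughly one per token)
    ):
        n_tokens += 1
    return n_tokens / (time.time() - start)

for layers in (0, 10, 20):  # adjust to what fits in your VRAM
    print(layers, "GPU layers:", round(tokens_per_second(layers), 2), "tok/s")
```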
I have a 9800X3D / 48GB DDR5 / 3090, and this model still being ~40GB when quantized is a challenge. Sure would be nice to have a 30B version that can quantize down to ~16GB. Or I guess I could upgrade my PSU, get another 3090 and use NVLink...
Not just "an" apple laptop but one with 64GB of ram. From a quick look at Apple's website the cheapest option among the current models is 4600€ for the 14 inch macbook pro.
So, neat, I guess, but not convinced I would pay four macbook airs for that.
Why would running the model with Firefox crash his Mac? At worst, it should start swapping:
> The first time I tried it consumed every remaining bit of available memory and hard-crashed my Mac
Also, hallucinations are so easily dismissed these days:
> (Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017—so a hint of some quite finely grained world knowledge there, even if it’s not relevant for the present day.)
I imagine it crashed because it wasn't just using the RAM for the CPU, it was using it for the GPU which might not be able to swap in the same way.
I don't think listing an outdated mayor of a tiny town really counts as a hallucination, especially since Half Moon Bay appears to change mayors about once a year.
It told me I used to work for GitHub which is definitely a hallucination, that's never been true. Weirdly a bunch of models make that mistake.
It's great that this can run on a laptop, but FWIW the Llama 70B model is nowhere near "GPT-4 class" in my own use cases. 405B might be, though I haven't tested it.
When I say GPT-4 class I'm talking about being comparable to the GPT-4 that was released in March 2023.
The Llama 3.3 70B model is clearly nowhere near as good as today's GPT-4o family of models, or the other top-ranking models today like Gemini 1.5 Pro and Claude 3.5 Sonnet.
To my surprise, Llama 3.3 70B is ranking higher than Claude 3 Opus on https://livebench.ai/ - I'm suspicious of that result, personally. I think Opus was the best available model for a few months earlier this year.
I guess it's because it has the highest score of all models in instruction following, 20 points higher than Opus, which compensates for shortcomings elsewhere (e.g. in language), and which wouldn't necessarily translate to human evaluation of usefulness.
The model you are running isn't the one used in the benchmarks you link.
The default llama3.3 model in Ollama is heavily quantized (~4-bit). Running the full fp16 model, or even an 8-bit quant, wouldn't be possible on your laptop with 64GB of RAM.
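Rough weight-size math, ignoring KV cache and other overhead (the bits-per-weight figures are approximate, so treat these as lower bounds):

```python
# Approximate weight sizes for a 70B-parameter model at common precisions.
# Real GGUF files are a bit larger (mixed quant levels, metadata), and the
# KV cache for your context window comes on top of this.
params_billion = 70

for name, bits_per_weight in [("fp16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    size_gb = params_billion * bits_per_weight / 8
    print(f"{name}: ~{size_gb:.0f} GB of weights")
```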
Vibes, based on what I can remember using that model for.
There's still a gpt-4 model available via the OpenAI API, but it's gpt-4-0613 from June 2023 - the March 2023 snapshot gpt-4-0314 is no longer available.
I'm not going to try for an extensive evaluation comparing it with Llama 3.3 though, life's too short and that's already been done better than I could by https://livebench.ai/
I am not particularly interested in those benchmarks that deliberately expose weaknesses in models: I know that models have weaknesses already!
What I care about is the things that they're proven to be good at - can I do those kinds of things (RAG, summarization, code generation, language translation) directly on my laptop?
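For what it's worth, here's roughly what one of those tasks (summarization) looks like against a local model, sketched with the ollama Python client. The model tag and the document text are placeholders, and this assumes the Ollama server is running with the model already pulled:

```python
# Summarization against a locally-running model via the ollama Python client.
# Assumes `ollama serve` is running and the model tag has already been pulled.
import ollama

document = "…paste the text you want summarized here…"  # placeholder text

response = ollama.chat(
    model="llama3.3",  # placeholder tag; use whatever local model you have
    messages=[
        {
            "role": "user",
            "content": f"Summarize the following in three bullet points:\n\n{document}",
        },
    ],
)
print(response["message"]["content"])
```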