I can now run a GPT-4 class model on my laptop (simonwillison.net)
96 points by simonw 15 days ago | 34 comments



Great post as expected (I'm a really big fan of Simon), but a quick note lest others get prematurely excited: this is on an Apple laptop. Personally I run these models because I care about openness, so an Apple laptop is the last one I'd buy. Good post though and worth a read, as there's plenty of good stuff here for a non-Apple user.


Yeah, I don't have a great perspective right now on what this looks like in non-Apple world.

The problem is always RAM. On Apple Silicon, RAM is shared between the CPU and GPU, which means a machine with 64GB of RAM can run really big models with GPU acceleration.

On Windows/Linux machines you usually need to consider the VRAM available to your GPU. I haven't got a great feel for what kind of VRAM numbers are common for different consumer setups there, so I don't tend to write much about those.

Of course llama.cpp will happily run models on CPU as well, but again I don't have a great feel for how usable that is.
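For anyone who wants to experiment with the CPU path, here's a minimal sketch using the llama-cpp-python bindings; the model path and quantization are placeholders, so point it at whatever GGUF file you have locally:

    # Minimal CPU-only inference sketch via the llama-cpp-python bindings.
    # The model path below is a placeholder; use any local GGUF file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder
        n_gpu_layers=0,  # 0 = pure CPU; raise this to offload layers to a GPU
        n_ctx=4096,      # context window to allocate
    )

    result = llm("Explain unified memory in one paragraph.", max_tokens=128)
    print(result["choices"][0]["text"])

Setting n_gpu_layers above zero is how llama.cpp splits a model between GPU VRAM and system RAM when it doesn't fit entirely on the GPU.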


On non-Mac laptops with integrated graphics it's the same: up to 50% of the RAM can be allocated to the GPU (both AMD and Intel work the same way). Not sure if Windows allows it, but on Linux you can definitely see it. So let's say you have a 96GB RAM Framework laptop, that's 48GB for the GPU...

I think the main impediment is Intel and SYCL not working on all their cards... AMD is good to go, I think.


Interesting! I'll have to give this a try. On my Framework 16 with the GPU (running Fedora), ollama runs super well and blazing fast, though only with smaller models of course. I have 128 GB of system memory though, so I would love to try a bigger model on the integrated GPU.

Btw, this is my favorite laptop of all time. Highly recommend it.


The bandwidth is (potentially, depending on the Mac in question) much lower on the Framework, which will limit decode tokens/s.


That's fair but still better than running on the CPU, and still enough to hold the model in VRAM.


It's not the VRAM number that's the problem on the Windows side.

It's the throughput. For whatever reason it's a good 5-10x lower.

Even if it's a large amount and GPU-accessible, if the throughput sucks that badly then 70B models aren't usable.

What puzzles me is that even the newer AMD APUs have this issue. The upcoming Strix Halo is rumoured to be around 140GB/s or so, which puts it at the very bottom end of Apple devices.


Apple silicon RAM is basically all VRAM, with 150-400 GB/s bandwidth.

Windows machines with dual-channel DDR5 only get around 80 GB/s of bandwidth.

Nvidia cards are around 1 TB/s, but cards with 40+GB of VRAM are more expensive than Macs with equivalent RAM capacity.

I would expect an Nvidia card on Windows to be around twice as fast as Apple Silicon for a model that needs 40+GB, and from some quick research this seems to be true.

edit: "Any Macbook 64GB can run it at about 6-7 Tok/s using Q4 quantization. 2x3090 can do it much faster at about 18 tok/s" https://www.reddit.com/r/LocalLLaMA/comments/1h9kci3/comment...


> Apple silicon RAM is basically all VRAM

The actual RAM is LPDDR5, which is miles slower than what you'd find in even very old dedicated GPUs.

It's as you say though: for some tasks like LLMs that doesn't matter. If you've got enough bandwidth to said slow RAM, it ends up being almost the same thing as having actual fast RAM.

It's a neat trick, and I'm a little bit frustrated that nobody is copying it. Because using dirt-cheap LPDDR5 (about a dollar per gig) but paying Apple's comical RAM upgrade prices feels like a cruel joke.


Ah, so do they use 2 channels on standard / 4 on pro / 8 on max?

Rumors are there may be an X3D Threadripper in the works; it would be nice to not have to choose between RAM bandwidth and IPC again.


In my experience with a 4070 Ti (12GB VRAM), a 7800X3D and 64GB DDR5, it will run… but seemingly at about 0.1-0.5 tokens per second with partial GPU offload.

It’s painful, but the results are night and day compared to the models that will fit entirely in vram. Of course perhaps I’m doing something wrong so if others have advice it would be great to hear!


Is it any faster with no GPU offload, given how little would fit in VRAM? Can you try giving a few prompts and getting a precise average token speed? Curious if it would be worth upgrading my RAM to at least run it on CPU.

I have a 9800x3d/48gb ddr5/3090 and this model still being ~40GB when quantized is a challenge. Sure would be nice to have a 30B version that can quantize down to ~16GB. Or I guess I could upgrade my PSU and get another 3090 and use NVlink...
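On the question of a precise average token speed: here is a rough timing sketch, assuming the llama-cpp-python bindings; the path, offload layer count and prompts are placeholders to adjust for your own setup:

    # Rough average generation speed over a few prompts. The model path and
    # n_gpu_layers are placeholders -- tune n_gpu_layers to fit your VRAM.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-3.3-70b-q4_k_m.gguf",
                n_gpu_layers=20, n_ctx=4096)

    prompts = [
        "Summarize the plot of Hamlet in three sentences.",
        "Explain quantization in two sentences.",
        "Write a short note about memory bandwidth.",
    ]

    total_tokens, total_seconds = 0, 0.0
    for p in prompts:
        start = time.time()
        out = llm(p, max_tokens=200)
        total_seconds += time.time() - start
        total_tokens += out["usage"]["completion_tokens"]

    # Note: this lumps prompt processing in with decode, so it slightly
    # understates pure decode speed.
    print(f"average generation speed: {total_tokens / total_seconds:.2f} tok/s")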


Not just "an" apple laptop but one with 64GB of ram. From a quick look at Apple's website the cheapest option among the current models is 4600€ for the 14 inch macbook pro.

So, neat, I guess, but I'm not convinced I would pay the price of four MacBook Airs for that.


An M1 off eBay can be had for less.

The newest Mac mini with an M4 Pro and 64GB of RAM costs ~€2000.


The title says "my laptop". A Mac mini is not a laptop.


Why would running the model while Firefox is open crash his Mac? At worst, it should start swapping:

> The first time I tried it consumed every remaining bit of available memory and hard-crashed my Mac

Also, hallucinations are so easily dismissed these days:

> (Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017—so a hint of some quite finely grained world knowledge there, even if it’s not relevant for the present day.)


I imagine it crashed because it wasn't just using the RAM for the CPU; it was using it for the GPU, which might not be able to swap in the same way.

I don't think listing an outdated mayor of a tiny town really counts as a hallucination, especially since Half Moon Bay appears to change mayors about once a year.

It told me I used to work for GitHub which is definitely a hallucination, that's never been true. Weirdly a bunch of models make that mistake.


But it confidently implied they are the mayor, which would make you look stupid if you sent the letter. That is the Achilles heel of these models :(

Honestly, that mostly only matters if you haven't yet learned how to use this stuff responsibly.

LLMs have incomplete world knowledge and frequently make mistakes. The challenge is learning how to use them anyway.


It's great that this can run on a laptop but FWIW, the Llama 70B model is nowhere near "GPT-4 class" in my own use cases. 405B might be, though I haven't tested it.


Are you sure about that?

When I say GPT-4 class I'm talking about being comparable to the GPT-4 that was released in March 2023.

The Llama 3.3 70B model is clearly nowhere near as good as today's GPT-4o family of models, or the other top-ranking models today like Gemini 1.5 Pro and Claude 3.5 Sonnet.

To my surprise, Llama 3.3 70B is ranking higher than Claude 3 Opus on https://livebench.ai/ - I'm suspicious of that result, personally. I think Opus was the best available model for a few months earlier this year.


I guess it's because it has the highest score of all models in instruction following, 20 points higher than Opus, which compensates for shortcomings elsewhere (e.g. in language), and which wouldn't necessarily translate to human evaluation of usefulness.


Wow, yeah I think you're right - 3.3 somehow gets top position on the entire leaderboard for that category, I bet that skews the average up a lot.


The model you are running isn't the one used in the benchmarks you link.

The default llama3.3 model in ollama is heavily quantized (~4-bit). Running the full fp16 model, or even an 8-bit quant, wouldn't be possible on your laptop with 64GB of RAM.
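The arithmetic behind that: weight storage is roughly parameter count × bits per weight / 8, before any KV cache or OS overhead. A quick sketch:

    # Approximate weight storage for a 70B-parameter model at common
    # precisions, ignoring KV cache, context buffers and OS overhead.
    params = 70e9

    # "4-bit" GGUF quants tend to land closer to ~4.5 bits per weight in practice.
    for label, bits_per_weight in [("fp16", 16), ("8-bit quant", 8), ("~4-bit quant", 4.5)]:
        gb = params * bits_per_weight / 8 / 1e9
        print(f"{label}: ~{gb:.0f} GB")

That works out to roughly 140 GB for fp16 and 70 GB for 8-bit, so only the ~4-bit variant (~40 GB) leaves any headroom on a 64GB machine.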


Thanks - yeah, I should have mentioned that. I just added a note directly above this heading https://simonwillison.net/2024/Dec/9/llama-33-70b/#honorable...


How do you reliably compare it with the GPT-4 released in March 2023?


Vibes, based on what I can remember using that model for.

There's still a gpt-4 model available via the OpenAI API, but it's gpt-4-0613 from June 2023 - the March 2023 snapshot gpt-4-0314 is no longer available.

I ran one of my test prompts against that old June 2023 GPT-4 model here: https://gist.github.com/simonw/de4951452df2677f2a1a3cd415168...

I'm not going to try for an extensive evaluation comparing it with Llama 3.3 though, life's too short and that's already been done better than I could by https://livebench.ai/


Why not ask it to solve math questions?

The bar for GPT-4 was so low that unambiguously clearing that threshold should be pretty easy.


I am not particularly interested in those benchmarks that deliberately expose weaknesses in models: I know that models have weaknesses already!

What I care about is the things that they're proven to be good at - can I do those kinds of things (RAG, summarization, code generation, language translation) directly on my laptop?


The new 3.3 70B model has comparable benchmarks to the 405B model, which is probably what people mean by GPT-4 class.


> when I ran Llama 3.3 70B on the same laptop for the first time.

There is no Llama 3.3 405B to test; 3.3 only comes in 70B. Are you sure you aren't thinking of Llama 3 or 3.1?


No, I meant Llama 3.3 70B.


What can I run with the 128 GB Mac laptop?



