In this case, the 4090 is far more memory efficient thanks to ExLlamav2.
70B in particular is indeed a significant compromise on the 4090, but not as big a compromise as you'd think. For 34B and down though, I think Nvidia is unquestionably king.
Doesn't running 70B in 24GB need 2-bit quantisation?
I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?
2.65bpw, on a totally empty 3090 (and I mean totally empty).
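For anyone wanting to sanity-check the maths: weight memory is roughly parameter count times bits-per-weight divided by 8. A quick Python sketch (the function name and the 4.65bpw figure for 34B are just illustrative, not from this thread):

    # Rough VRAM needed for model weights alone at a given quantisation level.
    # Real usage is higher: KV cache, activations, and the CUDA context all
    # eat into the budget, which is why the 3090 has to be "totally empty".
    def weight_vram_gib(params_billion: float, bits_per_weight: float) -> float:
        total_bytes = params_billion * 1e9 * bits_per_weight / 8
        return total_bytes / 2**30  # bytes -> GiB

    print(f"{weight_vram_gib(70, 2.65):.1f} GiB")  # ~21.6 GiB, barely fits 24 GB
    print(f"{weight_vram_gib(70, 4.00):.1f} GiB")  # ~32.6 GiB, needs two cards
    print(f"{weight_vram_gib(34, 4.65):.1f} GiB")  # ~18.4 GiB, comfortable on 24 GB

So at 2.65bpw the weights alone come to ~21.6 GiB of a 24 GB card, leaving only a couple of GiB for context.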
I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was llamav1, but now we have Yi and CodeLlama (among others).