
OPT-175B weights are already openly available, as I understand it. Hugging Face also has openly available weights for a 176B-parameter LLM called BLOOM. Is LLaMA offering something over and above these?


Yeah, their recent papers show the smaller LLaMA models outperforming the major LLMs today, and they also have bigger models. This isn't just an alternative; it's a multi-order-of-magnitude optimization.

https://aibusiness.com/meta/meta-s-llama-language-model-outp...


Can I spend $5K and run it at home? What GPU(s) do I need?


In principle you can run it on just about any hardware with enough storage space. It's just a question of how fast it will run. This readme has some benchmarks with a similar set of models (and the code has support for even swapping data out to disk if needed): https://github.com/FMInference/FlexGen

And here are some benchmarks running OPT-175B purely on (a very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B: https://github.com/FMInference/FlexGen/issues/24
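This isn't FlexGen (which has its own scheduler and compression tricks), but as a rough sketch of the same "spill weights to CPU RAM and disk" idea, here's what it looks like with Hugging Face transformers + accelerate; the OPT-30B checkpoint is just one example of an openly available model:

    # Sketch: weight offloading to CPU RAM/disk via accelerate's device_map.
    # Not FlexGen itself, just the same general idea with stock tooling.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-30b"  # any causal LM checkpoint you have access to
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.float16,   # 2 bytes/param instead of 4
        device_map="auto",           # fill the GPU first, then CPU RAM, then disk
        offload_folder="./offload",  # spill whatever doesn't fit to this folder
    )
    tok = AutoTokenizer.from_pretrained(name)
    # Generation then works as usual, just slower whenever layers
    # have to stream back in from CPU RAM or disk.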


As the models proliferate, I guess we'll be finding out soon. The torrent has been going pretty slow for me for the past couple hours, but it looks like there are a couple seeders, so eventually it'll hit that inflection point where there are enough seeders to give all the leechers full speed downloads.

Looking forward to the YouTube videos of random tinkerers seeing what sort of performance they can squeeze out of cheaper hardware.


The 7B model runs on a CUDA-compatible card with 16GB of VRAM (assuming your card has 16-bit float support).

I only got the 30B model running on a 4 x Nvidia A40 setup, though.
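For a rough sense of why 7B squeezes onto a 16GB card in fp16 but 30B doesn't, the weights alone cost about 2 bytes per parameter (before the decoding cache and activations); the parameter counts below are the ones from the LLaMA paper:

    # Back-of-the-envelope fp16 weight memory (2 bytes/parameter),
    # ignoring the decoding cache and activations, which add more on top.
    for name, params in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
        print(f"{name}: ~{params * 2 / 1e9:.0f} GB of weights in fp16")
    # 7B ~13GB (fits a 16GB card), 30B ~65GB, 65B ~130GB (multiple cards or offloading)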


The 30B is 64.8GB and the A40s have 48GB of VRAM each, so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

Is there a sub/forum/discord where folks talk about the nitty-gritty?


> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

It's sharded across all 4 GPUs (as per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; people are just going to be throwing pytorch code at the wall and seeing what sticks right now.


> people are just going to be throwing pytorch code at the wall

The pytorch 2.0 nightly has a number of performance enhancements as well as ways to reduce the memory footprint needed.

But also, looking at the README, it appears that the weights alone need 2 bytes per parameter, e.g. 65B needs 130GB of VRAM, PLUS the decoding cache, which stores 2 * 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim bytes = ~17GB for the 7B model (not sure how much it grows for the 65B model), so maybe ~147GB of VRAM total for the 65B model.
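Plugging the 7B architecture values into that formula (n_layers = 32, n_heads = 32, head_dim = 128); the batch size and sequence length below are assumptions, chosen as one combination that reproduces the ~17GB figure:

    # KV/decoding-cache size from the formula above, for the 7B model.
    n_layers, n_heads, head_dim = 32, 32, 128   # LLaMA 7B architecture
    max_batch_size, max_seq_len = 32, 1024      # assumed settings
    bytes_per_elem = 2                          # fp16
    kv = 2                                      # one K cache + one V cache per layer

    cache_bytes = (kv * bytes_per_elem * n_layers * max_batch_size
                   * max_seq_len * n_heads * head_dim)
    print(f"7B decoding cache: {cache_bytes / 1e9:.1f} GB")   # ~17.2 GB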

That should fit on 4 Nvidia A40s (192GB total). Did you get memory errors, or have you just not tried yet?


So since making that comment I managed to get 65B running on 1 x A100 80GB using 8-bit quantization. Though I did need ~130GB of regular RAM on top of it.
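For anyone curious what the 8-bit route looks like: here's a minimal sketch using bitsandbytes via Hugging Face transformers (a different code path from FB's repo; the checkpoint path is a placeholder for an HF-format conversion, not an official repo):

    # Int8 weights via bitsandbytes: ~1 byte/param, so ~65GB for the 65B model,
    # which is why it can fit on a single 80GB A100.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "path/to/llama-65b-hf"  # placeholder: an HF-format conversion of the weights
    model = AutoModelForCausalLM.from_pretrained(
        path,
        load_in_8bit=True,   # requires bitsandbytes
        device_map="auto",   # place what fits on the GPU, the rest on CPU RAM
    )
    tok = AutoTokenizer.from_pretrained(path)
    # The fp16 checkpoint typically passes through system RAM while it's being
    # quantized at load time, which lines up with the ~130GB of regular RAM.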


So is the model any good?


It seems to be about as good as gpt3-davinci. I've had it generate React components and write crappy poetry about arbitrary topics. Though, as expected, it's not very good at instruction-style prompts, since it isn't instruction-tuned.

People are also working on adding extra samplers to FB's inference code; I think a repetition penalty sampler will significantly improve quality.
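For reference, a repetition penalty is usually just a per-token adjustment of the next-token logits before sampling (CTRL-style); a minimal PyTorch sketch, with the penalty value purely illustrative:

    import torch

    def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
        # logits: (batch, vocab_size) next-token logits
        # generated_ids: (batch, seq_len) tokens produced so far
        prev = torch.gather(logits, 1, generated_ids)
        # Positive logits get divided, negative ones multiplied,
        # so the penalty always pushes already-used tokens down.
        prev = torch.where(prev > 0, prev / penalty, prev * penalty)
        return logits.scatter(1, generated_ids, prev)

    # Usage inside the sampling loop, right before softmax / top-p:
    # logits = apply_repetition_penalty(logits, tokens[:, :cur_pos], penalty=1.2)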

The 7B model is also fun to play with; I've had it generate YouTube transcripts for fictional videos, and it stays generally on-topic.


OPT-175B doesn't exist; the largest one is OPT-66B. And, at least in the tests I've run (not with the biggest one, but only up to a dozen billion parameters), all the OPT models severely underperform relative to even much smaller models. To the point that the launch of OPT (before BLOOM) was literally advertised as "the biggest open-source language model released to date", because they couldn't push much else.

BLOOM does indeed go up to 176B parameters, and it is certainly better than OPT. However, at least in my specific tests, it's still significantly inferior to OpenAI's models, and actually on par with a few smaller models. There's also a "newer" fine-tuned model, called BLOOMZ, but at least in my tests it's even worse. Of course, that depends a lot on what you ask the model to do...

If LLaMA can indeed match OpenAI's products, and do so with far fewer parameters, that would be really great, and I'd really like to test it. However, even if the weights are now in the wild, using them would clearly be against the user agreement, and there's no way I'm going to do that on my work time :-) so let's hope Meta comes to its senses and releases them under a friendlier set of terms...


> OPT-175B doesn't exist;

It doesn't exist for practical purposes because it's gatekept behind the same Facebook application process.



Yes, LLaMA is state of the art in several domains. The model was trained on a much larger dataset than most models, which is why it scores higher than other models with similar numbers of parameters. The training represents millions of dollars in compute time alone.

This should lead to quite a lot of innovation, and it's inevitable that someone will get these working (slowly) on your average MacBook.


According to Facebook, LLaMA beats GPT-3 on multiple benchmarks with smaller models that can be fine-tuned on a single A100 GPU.

EDIT: correcting the type of GPU



