Give us an update on the 30B model! I have the 13B model running easily on my M2 Air (24 GB of RAM), and I'm just waiting until I'm on an unmetered connection to download the 30B model and give it a go.
I am running the 30B model on my M1 Mac Studio with 32 GB of RAM.
(venv) bherman@Rattata ~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678666507
llama_model_load: loading model from './models/30B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size = 1560.00 MB, n_mem = 30720
llama_model_load: loading model part 1/4 from './models/30B/ggml-model-q4_0.bin'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 2/4 from './models/30B/ggml-model-q4_0.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 3/4 from './models/30B/ggml-model-q4_0.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 4/4 from './models/30B/ggml-model-q4_0.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
main: prompt: 'When'
main: number of tokens in prompt = 2
1 -> ''
10401 -> 'When'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
When you need the help of an Auto Locksmith Kirtlington look no further than our team of experts who are always on call 24 hours a day 365 days a year.
We have a team of auto locksmiths on call in Kirtlington 24 hours a day 365 days a year to help with any auto locksmith emergency you may find yourself in, whether it be repairing an broken omega lock, reprogramming your car transponder keys, replacing a lacking vehicle key or limiting chipped car fobs, our team of auto lock
main: mem per token = 43387780 bytes
main: load time = 35493.44 ms
main: sample time = 281.98 ms
main: predict time = 34094.89 ms / 264.30 ms per token
main: total time = 74651.21 ms
It definitely runs. It uses almost 20 GB of RAM, so I had to exit my browser and VS Code to keep the memory usage down.
But it produces completely garbled output. Either there's a bug in the program, or the tokens are different from the 13B model, or I performed the conversion wrong, or the 4-bit quantization breaks it.
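If the conversion step is the suspect, re-running it from the original weights before re-quantizing would rule that out. A minimal sketch, assuming the stock convert-pth-to-ggml.py script in the llama.cpp repo (the trailing 1 selects f16 output):

# re-convert the original 30B weights to ggml f16 before quantizing
# (script name and arguments assume the standard llama.cpp conversion step)
python3 convert-pth-to-ggml.py models/30B/ 1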
I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so maybe if you do a 'git pull && make' and rerun the quantization script it will work for you. I'm getting about 350 ms per token with the 30B model.
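For reference, that refresh looks roughly like this for the 4-part 30B model. The quantize invocation and file names here are assumptions based on the early llama.cpp layout shown in the log above, so check against your own checkout:

# pull the latest quantization fixes and rebuild
git pull && make

# re-quantize each of the four 30B parts from the f16 conversion
# (the trailing 2 selects q4_0 in the early quantize tool; treat it as an assumption)
./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin   2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2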