Give us an update on the 30B model! I have the 13B model running easily on my M2 Air (24 GB of RAM), and I'm just waiting until I'm on an unmetered connection to download the 30B model and give it a go.
I am running the 30B model on my M1 Mac Studio with 32 GB of RAM.
(venv) bherman@Rattata ~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678666507
llama_model_load: loading model from './models/30B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size = 1560.00 MB, n_mem = 30720
llama_model_load: loading model part 1/4 from './models/30B/ggml-model-q4_0.bin'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 2/4 from './models/30B/ggml-model-q4_0.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 3/4 from './models/30B/ggml-model-q4_0.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
llama_model_load: loading model part 4/4 from './models/30B/ggml-model-q4_0.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size = 4850.14 MB / num tensors = 543
main: prompt: 'When'
main: number of tokens in prompt = 2
1 -> ''
10401 -> 'When'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
When you need the help of an Auto Locksmith Kirtlington look no further than our team of experts who are always on call 24 hours a day 365 days a year.
We have a team of auto locksmiths on call in Kirtlington 24 hours a day 365 days a year to help with any auto locksmith emergency you may find yourself in, whether it be repairing an broken omega lock, reprogramming your car transponder keys, replacing a lacking vehicle key or limiting chipped car fobs, our team of auto lock
main: mem per token = 43387780 bytes
main: load time = 35493.44 ms
main: sample time = 281.98 ms
main: predict time = 34094.89 ms / 264.30 ms per token
main: total time = 74651.21 ms
It definitely runs. It uses almost 20 GB of RAM, so I had to exit my browser and VS Code to keep the memory usage down.
But it produces completely garbled output. Either there's a bug in the program, or the tokens are different from the 13B model, or I performed the conversion wrong, or the 4-bit quantization breaks it.
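If the conversion step is the suspect, re-running it from the original weights before re-quantizing would rule that out. A minimal sketch, assuming the stock convert-pth-to-ggml.py script in the llama.cpp repo (the trailing 1 selects f16 output):

# re-convert the original 30B weights to ggml f16 before quantizing
# (script name and arguments assume the standard llama.cpp conversion step)
python3 convert-pth-to-ggml.py models/30B/ 1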
I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so maybe if you do a 'git pull && make' and rerun the quantization script it will work for you. I'm getting about 350 ms per token with the 30B model.
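For reference, that refresh looks roughly like this for the 4-part 30B model. The quantize invocation and file names here are assumptions based on the early llama.cpp layout shown in the log above, so check against your own checkout:

# pull the latest quantization fixes and rebuild
git pull && make

# re-quantize each of the four 30B parts from the f16 conversion
# (the trailing 2 selects q4_0 in the early quantize tool; treat it as an assumption)
./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin   2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2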