
Perhaps they’re hoping some enterprises will be willing to pay extra for a 3.5 grade model that can run on prem?

A niche market but I can imagine some demand there.

The biggest challenge would be the Llama models.



Niche market?? You have no idea how big that market is!


Almost no serious user - private individual or company - wants their private data slurped up by cloud providers. Sometimes it is ethically or contractually impossible.


The success of AWS and Gmail and Google Docs and Azure and GitHub and Cloudflare makes me think this is... probably not an up-to-date opinion.

By and large, companies actually seem perfectly happy to hand pretty much all their private data over to cloud providers.


Yet they don't provide access to their children; there may be something in that.


We can't use LLMs at work at all right now because of IP leakage, copyright, and regulatory concerns. Hosting locally would solve one of those issues for us.


Yeah I would venture to say it’s closer to “the majority of the market” than “niche”


And according to the article, this model behaves like a 12B model in terms of speed and cost while matching or outperforming Llama 2 70B in output quality.


In terms of speed per token. What they don't say explicitly is that choosing the expert mix per token means you may need to reload the active experts multiple times in a single sentence. If you don't have enough memory for all the experts at once, that's a lot of time spent swapping memory.
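A toy sketch of that effect (all numbers are illustrative assumptions, not the actual Mixtral architecture or routing): with top-k routing, consecutive tokens can select different experts, so a machine that can only keep a few experts resident keeps reloading weights.

```python
import random

NUM_EXPERTS = 8   # experts per MoE layer (assumed, Mixtral-style)
TOP_K = 2         # experts activated per token (assumed)
TOKENS = 20

random.seed(0)
resident = set()  # experts currently held in GPU memory
swaps = 0

for _ in range(TOKENS):
    # Stand-in for the router: pick TOP_K experts at random.
    chosen = set(random.sample(range(NUM_EXPERTS), TOP_K))
    missing = chosen - resident
    swaps += len(missing)   # each miss means reloading one expert's weights
    # Naive eviction policy: keep only the experts this token needs.
    resident = chosen

print(f"{swaps} expert weight reloads over {TOKENS} tokens")
```

Each reload moves gigabytes of weights over PCIe just to process one token, which is where the swapping time goes.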


Tim Dettmers stated that he thinks this one could be compressed down to a 4GB memory footprint, due to the ability of MoE layers to be sparsified with almost no loss of quality.
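Back-of-envelope arithmetic on how that could work (my illustrative numbers, not Dettmers'), taking the reported ~46.7B total parameter count for Mixtral 8x7B and combining quantization with sparsification:

```python
TOTAL_PARAMS = 46.7e9  # reported Mixtral 8x7B total parameter count
GB = 1e9

def footprint_gb(params, bits_per_weight, density=1.0):
    """Memory in GB for `params` weights at a given bit width,
    keeping only `density` fraction of weights after sparsification."""
    return params * density * bits_per_weight / 8 / GB

print(f"fp16, dense:        {footprint_gb(TOTAL_PARAMS, 16):.1f} GB")
print(f"4-bit, dense:       {footprint_gb(TOTAL_PARAMS, 4):.1f} GB")
print(f"4-bit, 17% density: {footprint_gb(TOTAL_PARAMS, 4, 0.17):.1f} GB")
```

So a 4GB footprint implies something like 4-bit weights with the MoE layers sparsified down to roughly a sixth of their weights, which is the kind of sparsity Dettmers is claiming is possible with almost no quality loss.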


If your motivation is to be able to run the model on-prem, with parallelism for API service throughput (rather than on a single device), you don't need large-memory GPUs or intensive memory swapping.

You can architect it with cheaper, low-memory GPUs, one expert submodel per GPU, transferring state over the network between the GPUs for each token. They run in parallel by overlapping API calls (and, in future, via other model architecture changes).

The MoE model reduces inter-GPU communication requirements for splitting the model, in addition to reducing GPU processing requirements, compared with a non-MoE model with the same number of weights. There are pros and cons to this splitting, but you can see the general trend.
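The splitting described above can be sketched like this (a hypothetical toy, not a real serving stack; `ExpertDevice`, the sizes, and the routing are all assumptions for illustration). The point is that only the small hidden state crosses the network per token, while each expert's weights stay put on its own device:

```python
import math
import random

HIDDEN = 8       # toy hidden-state size (real models use thousands)
NUM_EXPERTS = 4  # one expert per hypothetical low-memory GPU
TOP_K = 2

random.seed(0)

class ExpertDevice:
    """Stands in for one cheap GPU holding a single expert's weights."""
    def __init__(self):
        self.w = [[random.gauss(0, 0.1) for _ in range(HIDDEN)]
                  for _ in range(HIDDEN)]

    def forward(self, x):
        # Only x (the small hidden state, a few KB in practice) would
        # cross the network; self.w (hundreds of MB per expert in a real
        # model) never moves between devices.
        return [max(sum(wi * xi for wi, xi in zip(row, x)), 0.0)
                for row in self.w]

devices = [ExpertDevice() for _ in range(NUM_EXPERTS)]

def moe_layer(x, router_logits):
    # Pick the top-k experts, then combine their (parallelizable)
    # outputs with softmax weights over the selected logits.
    top = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i])[-TOP_K:]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    out = [0.0] * HIDDEN
    for i, e in zip(top, exps):
        y = devices[i].forward(x)
        for j in range(HIDDEN):
            out[j] += (e / total) * y[j]
    return out

x = [random.gauss(0, 1) for _ in range(HIDDEN)]
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
print(len(moe_layer(x, logits)))  # hidden-state size: 8
```

Because only TOP_K devices do work per token, the others are free to serve overlapping requests, which is where the throughput parallelism comes from.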



