
Perhaps they’re hoping some enterprises will be willing to pay extra for a 3.5 grade model that can run on prem?

A niche market but I can imagine some demand there.

The biggest challenge would be the Llama models.



Niche market?? You have no idea how big that market is!


Almost no serious user - private individual or company - wants their private data slurped up by cloud providers. Sometimes it is ethically or contractually impossible.


The success of AWS and Gmail and Google Docs and Azure and GitHub and Cloudflare makes me think this is... probably not an up-to-date opinion.

By and large, companies actually seem perfectly happy to hand pretty much all their private data over to cloud providers.


Yet they don't provide access to their children; there may be something in that.


We can't use LLMs at work at all right now because of IP leakage, copyright, and regulatory concerns. Hosting locally would solve one of those issues for us.


Yeah I would venture to say it’s closer to “the majority of the market” than “niche”


And according to the article, this model behaves like a 12B model in terms of speed and cost while matching or outperforming Llama 2 70B in output quality.


In terms of speed per token. What they don't say explicitly is that choosing the expert mix per token means you may need to reload the active experts multiple times in a single sentence. If you don't have enough memory for all the experts at once, that's a lot of time spent swapping memory.
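A toy sketch of that effect (all numbers are illustrative assumptions, not the actual Mixtral architecture or routing): with top-k routing, consecutive tokens can select different experts, so a machine that can only keep a few experts resident keeps reloading weights.

```python
import random

NUM_EXPERTS = 8   # experts per MoE layer (assumed, Mixtral-style)
TOP_K = 2         # experts activated per token (assumed)
TOKENS = 20

random.seed(0)
resident = set()  # experts currently held in GPU memory
swaps = 0

for _ in range(TOKENS):
    # Stand-in for the router: pick TOP_K experts at random.
    chosen = set(random.sample(range(NUM_EXPERTS), TOP_K))
    missing = chosen - resident
    swaps += len(missing)   # each miss means reloading one expert's weights
    # Naive eviction policy: keep only the experts this token needs.
    resident = chosen

print(f"{swaps} expert weight reloads over {TOKENS} tokens")
```

Each reload moves gigabytes of weights over PCIe just to process one token, which is where the swapping time goes.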


Tim Dettmers stated that he thinks this one could be compressed down to a 4GB memory footprint, due to the ability of MoE layers to be sparsified with almost no loss of quality.
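Back-of-envelope arithmetic on how that could work (my illustrative numbers, not Dettmers'), taking the reported ~46.7B total parameter count for Mixtral 8x7B and combining quantization with sparsification:

```python
TOTAL_PARAMS = 46.7e9  # reported Mixtral 8x7B total parameter count
GB = 1e9

def footprint_gb(params, bits_per_weight, density=1.0):
    """Memory in GB for `params` weights at a given bit width,
    keeping only `density` fraction of weights after sparsification."""
    return params * density * bits_per_weight / 8 / GB

print(f"fp16, dense:        {footprint_gb(TOTAL_PARAMS, 16):.1f} GB")
print(f"4-bit, dense:       {footprint_gb(TOTAL_PARAMS, 4):.1f} GB")
print(f"4-bit, 17% density: {footprint_gb(TOTAL_PARAMS, 4, 0.17):.1f} GB")
```

So a 4GB footprint implies something like 4-bit weights with the MoE layers sparsified down to roughly a sixth of their weights, which is the kind of sparsity Dettmers is claiming is possible with almost no quality loss.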


If your motivation is to be able to run the model on-prem, with parallelism for API service throughput (rather than on a single device), you don't need large-memory GPUs or intensive memory swapping.

You can architect it with cheaper, low-memory GPUs, one expert submodel per GPU, transferring state over the network between the GPUs for each token. They run in parallel by overlapping API calls (and, in future, via other model architecture changes).

The MoE model reduces inter-GPU communication requirements for splitting the model, in addition to reducing GPU processing requirements, compared with a non-MoE model with the same number of weights. There are pros and cons to this splitting, but you can see the general trend.
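The splitting described above can be sketched like this (a hypothetical toy, not a real serving stack; `ExpertDevice`, the sizes, and the routing are all assumptions for illustration). The point is that only the small hidden state crosses the network per token, while each expert's weights stay put on its own device:

```python
import math
import random

HIDDEN = 8       # toy hidden-state size (real models use thousands)
NUM_EXPERTS = 4  # one expert per hypothetical low-memory GPU
TOP_K = 2

random.seed(0)

class ExpertDevice:
    """Stands in for one cheap GPU holding a single expert's weights."""
    def __init__(self):
        self.w = [[random.gauss(0, 0.1) for _ in range(HIDDEN)]
                  for _ in range(HIDDEN)]

    def forward(self, x):
        # Only x (the small hidden state, a few KB in practice) would
        # cross the network; self.w (hundreds of MB per expert in a real
        # model) never moves between devices.
        return [max(sum(wi * xi for wi, xi in zip(row, x)), 0.0)
                for row in self.w]

devices = [ExpertDevice() for _ in range(NUM_EXPERTS)]

def moe_layer(x, router_logits):
    # Pick the top-k experts, then combine their (parallelizable)
    # outputs with softmax weights over the selected logits.
    top = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i])[-TOP_K:]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    out = [0.0] * HIDDEN
    for i, e in zip(top, exps):
        y = devices[i].forward(x)
        for j in range(HIDDEN):
            out[j] += (e / total) * y[j]
    return out

x = [random.gauss(0, 1) for _ in range(HIDDEN)]
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
print(len(moe_layer(x, logits)))  # hidden-state size: 8
```

Because only TOP_K devices do work per token, the others are free to serve overlapping requests, which is where the throughput parallelism comes from.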



