
There is a neat potential speedup here for the case where the bandwidth to your model weights is the limiting factor.

If you have a guess at what the model will output, you can verify that guess very cheaply, since the verification can be done in parallel.

That means you could keep a highly quantized small model in RAM and invoke the big model only occasionally for verification. If the small model agrees with the big one 90% of the time, you might get something approaching a 10x speedup this way.
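A minimal sketch of the draft-and-verify idea (greedy variant). The "models" here are made-up stand-ins, each mapping a token sequence to its next token; in a real transformer the verification loop below would be a single batched forward pass, which is where the speedup comes from.

```python
def speculative_decode(big_model, small_model, prompt, n_tokens, k=4):
    """Generate n_tokens greedily, letting the cheap small_model draft k
    tokens at a time and the big_model verify them in one pass."""
    seq = list(prompt)
    big_calls = 0
    while len(seq) - len(prompt) < n_tokens:
        # Small model drafts k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(small_model(seq + draft))
        # Big model scores all k+1 prefixes; in practice this is ONE
        # parallel forward pass, not k+1 sequential ones.
        big_calls += 1
        verified = [big_model(seq + draft[:i]) for i in range(k + 1)]
        # Accept draft tokens while they match the big model's choice.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        # Keep the accepted prefix plus one guaranteed big-model token,
        # so even a fully rejected draft still makes progress.
        seq += draft[:n_accept] + [verified[n_accept]]
    return seq[len(prompt):len(prompt) + n_tokens], big_calls

# Toy usage: the "big model" emits position mod 5; the "small model"
# agrees except at positions divisible by 6 (hypothetical functions).
big = lambda s: len(s) % 5
small = lambda s: (len(s) + 1) % 5 if len(s) % 6 == 0 else len(s) % 5

out, calls = speculative_decode(big, small, [], 10, k=4)
# The output is identical to decoding with big alone, but needs fewer
# big-model passes than tokens generated.
```

The key property: the output is exactly what the big model alone would produce, so this trades nothing in quality for fewer expensive passes.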



This is an interesting concept; could you share a paper or some writeup about it?


It looks like a description of Speculative Sampling. There's a recent paper from DeepMind about this in the context of LLMs [0], although it's not a completely new idea, of course.

The speedup reported in their paper is closer to 2x than 10x, however.

0: https://arxiv.org/abs/2302.01318
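A back-of-the-envelope check of why the speedup is modest: if each drafted token independently matches the big model with probability a (an i.i.d. simplification, not the paper's exact analysis), then drafting k tokens per big-model pass yields an expected accepted run of:

```python
def expected_tokens_per_pass(a, k):
    """Expected tokens emitted per big-model pass: the accepted prefix
    plus the one token the big model always supplies.
    sum_{i=0}^{k} a^i = (1 - a^(k+1)) / (1 - a), for a < 1."""
    return (1 - a ** (k + 1)) / (1 - a)

# With 90% agreement and 4 drafted tokens, only ~4.1 tokens per pass:
expected_tokens_per_pass(0.9, 4)  # ~4.095
```

So even at 90% agreement the ideal speedup is bounded around 4x, before subtracting the draft model's own cost, which is consistent with the ~2x figures reported in practice.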



