TurboQuant's specific benefit is compressing the KV cache at a negligible cost to quality. That mainly means models can reach longer context lengths in the same amount of memory. However, the KV cache only accounts for something like 20% of the overall memory footprint, so this will not dramatically decrease memory demands in the way that some of the more sensationalist reporting has claimed.
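For a rough sense of the arithmetic, here is a back-of-the-envelope sketch of how KV cache size scales with context length and bit width. The model shape (32 layers, 8 KV heads, head dimension 128) and the 4-bit width are made-up numbers for illustration, not TurboQuant's actual settings or any particular model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_elem: int, batch: int = 1) -> int:
    """Approximate KV cache size in bytes for a standard transformer decoder."""
    # Factor of 2: one cached tensor for keys and one for values per layer.
    total_bits = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits_per_elem
    return total_bits // 8


# Hypothetical model shape, chosen only to make the numbers round.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(**shape, seq_len=8192, bits_per_elem=16)
q4 = kv_cache_bytes(**shape, seq_len=8192, bits_per_elem=4)  # assumed 4-bit cache

print(f"16-bit cache: {fp16 / 2**20:.0f} MiB")
print(f" 4-bit cache: {q4 / 2**20:.0f} MiB")
```

The cache grows linearly with sequence length, so at a fixed memory budget a 4x smaller cache buys roughly 4x the context. The weights are untouched, which is why the total footprint shrinks far less than the headline compression ratio suggests.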
Still active, but with far fewer resources than in the past. Many backends, like CUDA for Windows, have been dropped, and others have been pushed off to partners with varying levels of support. TensorFlow 2.19 is going to release soon without Python 3.13 support, and it's hard not to imagine that resource constraints are at play.
Exciting concept! Note that the LLM-corrected version does drop a full paragraph from the output at the bottom of the second page (starting with an asterisk and "My views regarding inflationary possibilities"). I'm not sure if there is a simple way to mitigate this risk, but it would be nice to fall back on the uncorrected text if the LLM can't produce valid results for some region of the document.
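One possible shape for that fallback, as a minimal sketch: it assumes the correction runs region by region (so each OCR paragraph has its own LLM output to compare against), and the `pick_text` helper and the 0.6 similarity threshold are hypothetical, not anything the project actually does.

```python
from difflib import SequenceMatcher


def pick_text(original: str, corrected: str, min_ratio: float = 0.6) -> str:
    """Keep the LLM output unless it is empty or diverges too far from the OCR text."""
    if not corrected.strip():
        # The model dropped this region entirely: keep the raw OCR text.
        return original
    # ratio() is in [0, 1]; light typo fixes stay near 1, rewrites or
    # truncations fall well below it.
    ratio = SequenceMatcher(None, original, corrected).ratio()
    return corrected if ratio >= min_ratio else original


def merge_regions(ocr_regions: list[str], llm_regions: list[str]) -> list[str]:
    """Apply the fallback to each (OCR, LLM) pair of regions."""
    return [pick_text(o, c) for o, c in zip(ocr_regions, llm_regions)]
```

A character-level similarity check like this would catch dropped or heavily rewritten paragraphs, though it would not catch a correction that is close in spelling but wrong in meaning.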
I recently had occasion to evaluate a database of 1200+ NVIDIA GPUs and can tell you that the only thing consistent about the model numbers is their inconsistency. For example, what is an RTX 4000? It could be the 2018 Quadro RTX 4000, the Quadro RTX 4000 Max-Q, or the Quadro RTX 4000 Mobile (all Turing cards), but it could also be the RTX 4000 Mobile Ada Generation (an Ada Lovelace card released in 2023).