Hacker News | omneity's comments

The Open in OpenRouter is the same as in OpenSea, as it's the same founder. Make of that what you will.

I make of it that they are good at riding hype cycles.

You can increase the context window beyond its max trained context using RoPE scaling[0] which will require more VRAM.

But you can increase your context window for the same VRAM by quantizing the KV cache with FP8 (double the context) or TurboQuant (more than double)[1].

0: https://medium.com/@leannetan/extending-context-length-with-...

1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
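A rough sketch of why halving the bytes per cached value doubles the context that fits in the same VRAM. The model shape and the 4 GiB budget are made-up numbers for illustration, and the small overhead of quantization scales is ignored:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: one key vector and one value vector per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def max_context(budget_bytes, layers, kv_heads, head_dim, bytes_per_value):
    """Largest token count whose KV cache fits in `budget_bytes`."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bytes_per_value)
    return budget_bytes // per_token


# Hypothetical model shape and memory budget, for illustration only.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)
BUDGET = 4 * 1024**3  # 4 GiB reserved for the KV cache

ctx_fp16 = max_context(BUDGET, bytes_per_value=2, **SHAPE)
ctx_fp8 = max_context(BUDGET, bytes_per_value=1, **SHAPE)
print(ctx_fp16, ctx_fp8)  # 32768 65536
```

The cache grows linearly with the number of tokens, so any reduction in bytes per value translates directly into a proportionally longer context for the same budget.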


Funny, I’ve been cracking[0] at this exact problem with a purpose-built model[1]:

0: https://huggingface.co/posts/omarkamali/593639295164067

1: https://omneitylabs.com/models/sawtone


Strong vibes from the novel Manna.

https://marshallbrain.com/manna1


Glad I'm not the only one to immediately think of it. It's a great story, but did feel unlikely when I first read it; should it prove largely true it would be terrifying.


I'm pretty sure it should be possible to distill HS-TasNet into a version approximate and fast enough for the purpose of animating LEDs.

In the end it's "just" chunking streamed audio into windows and predicting which LEDs each window should activate. One can build a complex non-realtime pipeline, use it to generate high-quality training data, and then train a much smaller model (maybe even an MLP) on that data to predict just this task.
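To make the idea concrete, here is a minimal sketch of the inference side only. Everything in it is an assumption for illustration: the window and hop sizes, the two hand-made features, the 4-LED output, and the `TinyMLP` class, whose weights are random placeholders. In the distillation setup they would be trained on labels produced by the slow teacher pipeline.

```python
import math
import random


def windows(samples, size, hop):
    """Yield overlapping fixed-size windows from a stream of audio samples."""
    buf = []
    for s in samples:
        buf.append(s)
        if len(buf) == size:
            yield list(buf)
            del buf[:hop]  # slide forward by `hop` samples


def features(window):
    """Tiny feature vector: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(window) - 1)]


class TinyMLP:
    """One hidden ReLU layer; weights are untrained random placeholders."""

    def __init__(self, n_in, n_hidden, n_leds, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_leds)]

    def predict(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        logits = [sum(w * v for w, v in zip(row, hidden)) for row in self.w2]
        return [z > 0 for z in logits]  # True means "LED on"


# Simulated stream: one second of a 440 Hz tone sampled at 8 kHz.
stream = (math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000))
student = TinyMLP(n_in=2, n_hidden=8, n_leds=4)
frames = [student.predict(features(w)) for w in windows(stream, size=512, hop=256)]
print(len(frames))  # 30
```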


Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].

On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [2].

Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.

0: https://wikilangs.org

1: https://omneitylabs.com

2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...


You both might find it useful - https://hackernews.hn/item?id=44950661

I’ve also recently started in this space: building an agent, for a client, that can communicate in multiple languages.


Excellent, thank you mandeepj! Curious about the language coverage of your agent and if / how you plan to eval your agent, if you're willing to share more.


Regarding language coverage, we will start with the most frequently spoken languages first.

On evaluating the agent: we are still documenting the details, but this should give you some idea of an approach: https://hackernews.hn/item?id=47232903

Also, you might find this useful - https://open.substack.com/pub/bytebytego/p/how-roblox-uses-a...


Or your willingness to put up with power banks.


This is a great project. FYI, all you need is the size of an LLM plus the memory capacity and bandwidth to know whether it fits and roughly how many tok/s you'll get.

It’s a simple formula:

llm_size = number of params * size_of_param

So a 32B model at 4-bit (0.5 bytes per param) needs a minimum of 16 GB of RAM to load.

Then you calculate

tok_per_s = memory_bandwidth / llm_size

An RTX 3090 has about 936 GB/s, so a 32B model (16 GB VRAM) will produce roughly 936/16 ≈ 58 tok/s

For an MoE the speed is mostly determined by the amount of active params not the total LLM size.

Add a 10% margin to those figures to account for a number of details, but that’s roughly it. RAM use also increases with context window size.
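The same rule of thumb as a small sketch. The function names, the 1000 GB/s bandwidth, and the 30B-total / 3B-active MoE are hypothetical values chosen for round numbers. The result is a bandwidth-bound upper estimate that ignores the KV cache and the margin mentioned above.

```python
def model_size_gb(params_billion, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def decode_tok_per_s(bandwidth_gb_s, active_size_gb):
    """Upper estimate: every active weight is read once per generated token."""
    return bandwidth_gb_s / active_size_gb


BANDWIDTH = 1000  # GB/s, hypothetical device

dense = model_size_gb(32, 4)  # 32B dense model at 4 bits
print(dense, decode_tok_per_s(BANDWIDTH, dense))  # 16.0 62.5

# MoE: memory needed is set by total params, speed by active params.
moe_total = model_size_gb(30, 4)   # must fit in memory
moe_active = model_size_gb(3, 4)   # read per generated token
print(moe_total, moe_active)  # 15.0 1.5
```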


> RAM use also increases with context window size.

KV cache is very swappable since it has limited writes per generated token (whereas inference would have to write out as much as llm_active_size per token, which is way too much at scale!), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.

Make sure also that you're using mmap to load model parameters, especially for MoE experts. It has no detrimental effect on performance given that you have enough RAM to begin with, but it allows you to scale up gradually beyond that, at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with much lower storage_bandwidth).
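For anyone unfamiliar with the mechanism, here is a minimal stdlib sketch of memory-mapping a file of float32 values and touching only a small part of it. The file name and contents are stand-ins for a real weights file; inference runtimes do the equivalent internally.

```python
import mmap
import os
import struct
import tempfile
from array import array

N = 1_000_000  # number of float32 "weights" in the stand-in file

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")  # hypothetical file name
    with open(path, "wb") as f:
        array("f", range(N)).tofile(f)  # native-endian float32 values 0..N-1

    with open(path, "rb") as f:
        # Length 0 maps the whole file. Pages are loaded from disk lazily,
        # only when the corresponding bytes are actually accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            mapped_bytes = len(mm)
            offset = 123_456 * 4  # byte offset of element 123456
            (value,) = struct.unpack("f", mm[offset:offset + 4])

print(mapped_bytes, value)  # 4000000 123456.0
```

Because the OS can drop and re-read clean pages at will, the mapped file does not have to fit in RAM, which is what makes the gradual scale-up described above possible.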


Well mmap can still cause issues if you run short on RAM, and the disk access can cause latency and overall performance issues. It's better than nothing though.

Agree that k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Then the Ollama default quantization for k/v cache is fp16, you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)


This is a good rule of thumb. I would also include that in most cases, RAM use exponentially increases with context window size.


There's zero exponential scaling involved. There is quadratic compute and reasonably log-linear storage, though.


Thanks for the formula, I wasn't aware of it.


It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will result in 4GB vram give or take, same as params. At 4 bits ~= 2GB and so on. Kimi is about 512GB at 4 bits.


Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).


Dumb question: Can inference be done in a reverse pass? Outputs predicting inputs?


Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.

The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it makes the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.
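A toy illustration with a one-parameter-pair "model" y = w*x + b and a squared-error loss (the model, the numbers, and the learning rate are all made up for the example). Note that the backward pass returns one gradient per parameter, not a reconstructed input:

```python
def forward(w, b, x):
    """Forward pass: from input to output."""
    return w * x + b


def backward(w, b, x, target):
    """Backward pass: gradients of (y - target)^2 w.r.t. the parameters."""
    y = forward(w, b, x)
    dloss_dy = 2 * (y - target)    # derivative of the loss w.r.t. the output
    return dloss_dy * x, dloss_dy  # chain rule: dy/dw = x, dy/db = 1


w, b, x, target = 0.5, 0.0, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, target)
print(forward(w, b, x))  # 1.0
print(grad_w, grad_b)    # -8.0 -4.0

# One gradient-descent step moves the parameters, not the input.
lr = 0.1
w_new, b_new = w - lr * grad_w, b - lr * grad_b
```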

The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.

And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
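A small sketch of the data side of that idea, with hypothetical helper names: the training pairs for a "reverse" model are ordinary next-token pairs built from the reversed sequence.

```python
def next_token_pairs(tokens):
    """(context, target) pairs for ordinary next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def prefix_prediction_pairs(tokens):
    """Same construction on the reversed sequence: each target is the token
    that comes just before the context in the original order."""
    return next_token_pairs(tokens[::-1])


sentence = ["the", "cat", "sat", "down"]
for context, target in prefix_prediction_pairs(sentence):
    print(context, "->", target)
# ['down'] -> sat
# ['down', 'sat'] -> cat
# ['down', 'sat', 'cat'] -> the
```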


I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.


Not as trivially as in the forward direction, since information is unsurprisingly lost, but better than you might expect. See for example https://arxiv.org/pdf/2405.15012


Sounds like a great premise for a sci-fi short story.


Sci-fi? You mean historical fiction!

