omneity · 2026-05-30T19:43:48 1780170228

The Open in OpenRouter is the same as in OpenSea, as it's the same founder. Make of that what you will.

dnnddidiej · 2026-05-31T01:19:06 1780190346

I make of it they are good at riding hype cycles

omneity · 2026-05-18T19:30:01 1779132601

You can increase the context window beyond its max trained context using RoPE scaling[0] which will require more VRAM.

But you can increase your context window for the same VRAM by quantizing the KV cache with FP8 (double the context) or TurboQuant (more than double)[1].

0: https://medium.com/@leannetan/extending-context-length-with-...

1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
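To make the "double the context" claim concrete, here is a rough sketch of the arithmetic. The model shape and the 4 GiB budget are made-up assumptions, and the function names are only for illustration. It ignores any per-block scaling metadata a real quantized cache stores.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: keys + values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(kv_budget_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in a fixed KV cache budget."""
    return kv_budget_bytes // per_token_bytes


# Hypothetical model shape and memory budget, not tied to any specific model.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 4 * 1024**3  # 4 GiB set aside for the KV cache

fp16_ctx = max_context_tokens(BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2))
fp8_ctx = max_context_tokens(BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1))
print(f"FP16 KV cache: {fp16_ctx} tokens, FP8 KV cache: {fp8_ctx} tokens")
```

Halving the bytes per cached value halves the per-token cost, so the same budget holds twice as many tokens.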

omneity · 2026-05-13T17:43:16 1778694196

Funny, I’ve been cracking[0] at this exact problem with a purpose-built model[1]:

0: https://huggingface.co/posts/omarkamali/593639295164067

1: https://omneitylabs.com/models/sawtone

omneity · 2026-04-16T15:45:43 1776354343

Strong vibes from the novel Manna.

https://marshallbrain.com/manna1

Little_Kitty · 2026-04-16T19:33:04 1776367984

Glad I'm not the only one to immediately think of it. It's a great story, but did feel unlikely when I first read it; should it prove largely true it would be terrifying.

omneity · 2026-04-08T18:01:03 1775671263

I'm pretty sure it should be possible to distill HS-TasNet into a version approximate and fast enough for the purpose of animating LEDs.

In the end it's "just" chunking streamed audio into windows and predicting which LEDs each window should activate. One can build a complex non-realtime pipeline, generate high-quality training data with it, and then train a much smaller model (maybe even an MLP) on that data to predict just this task.
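A minimal sketch of the data-generation half of that idea, in plain Python. Everything here is hypothetical: `toy_teacher` is a stand-in for the slow, high-quality pipeline (it is not HS-TasNet), and training the small student model on the resulting pairs is left out.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Window = Sequence[float]


def windows(samples: Window, size: int, hop: int) -> Iterator[Window]:
    """Yield fixed-size windows over an audio buffer, advancing by `hop` samples."""
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def toy_teacher(window: Window, n_leds: int = 4) -> List[int]:
    """Stand-in for the offline pipeline: light more LEDs for a louder window."""
    peak = max(abs(x) for x in window)
    lit = min(n_leds, int(peak * n_leds))
    return [1 if i < lit else 0 for i in range(n_leds)]


def build_training_pairs(
    samples: Window, size: int, hop: int,
    teacher: Callable[[Window], List[int]],
) -> List[Tuple[Window, List[int]]]:
    """Label every window with the teacher's LED pattern for the student to imitate."""
    return [(w, teacher(w)) for w in windows(samples, size, hop)]


audio = [0.0] * 4 + [0.5] * 4 + [1.0] * 4
pairs = build_training_pairs(audio, size=4, hop=4, teacher=toy_teacher)
print([label for _, label in pairs])
```

The student would then be fit on `pairs`, so at runtime only the cheap model sees each incoming window.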

omneity · 2026-03-21T22:36:31 1774132591

Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].

On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [2].

Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.

0: https://wikilangs.org

1: https://omneitylabs.com

2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...

mandeepj · 2026-03-22T04:30:42 1774153842

You both might find it useful - https://hackernews.hn/item?id=44950661

I’ve also recently started in this space: building an agent for a client that can communicate in multiple languages.

omneity · 2026-03-23T01:21:14 1774228874

Excellent, thank you mandeepj! Curious about the language coverage of your agent and if / how you plan to eval your agent, if you're willing to share more.

mandeepj · 2026-04-01T11:14:58 1775042098

Regarding language coverage, we will start with the most frequently spoken languages first.

Regarding evaluating your agent: we are documenting the details, but this should give you some idea of an approach: https://hackernews.hn/item?id=47232903

Also, you might find this useful - https://open.substack.com/pub/bytebytego/p/how-roblox-uses-a...

omneity · 2026-03-02T10:48:04 1772448484

Or your willingness to put up with power banks.

omneity · 2026-03-02T08:38:33 1772440713

This is a great project. FYI, all you need is the size of an LLM plus the memory amount and bandwidth to know whether it fits and roughly what tok/s to expect.

It’s a simple formula:

llm_size = number of params * size_of_param

So a 32B model in 4bit needs a minimum of 16GB ram to load.

Then you calculate

tok_per_s = memory_bandwidth / llm_size

An RTX 3090 has about 936GB/s, so a 32B model at 4 bits (16GB VRAM) will produce at most about 936/16 ≈ 58 tok/s

For an MoE the speed is mostly determined by the amount of active params not the total LLM size.

Add a 10% margin to those figures to account for a number of details, but that’s roughly it. RAM use also increases with context window size.
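For anyone who wants to plug in their own numbers, the rule of thumb can be sketched like this. The function names are made up, the 1000 GB/s device and the 3B-active MoE are hypothetical, and the results are bandwidth-bound ceilings that ignore KV cache and runtime overhead.

```python
def model_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB: number of params * size of a param."""
    return params_billion * bits_per_param / 8


def max_tokens_per_s(bandwidth_gb_s: float, active_size_gb: float) -> float:
    """Decode ceiling: each generated token reads the active weights once."""
    return bandwidth_gb_s / active_size_gb


dense = model_size_gb(32, 4)       # 32B dense model at 4 bits
moe_active = model_size_gb(3, 4)   # hypothetical MoE with 3B active params

print(f"dense: {dense:.1f} GB, ~{max_tokens_per_s(1000, dense):.1f} tok/s ceiling")
print(f"MoE active set: {moe_active:.1f} GB, ~{max_tokens_per_s(1000, moe_active):.1f} tok/s ceiling")
```

For the MoE case the total size still decides whether the model fits, while only the active size goes into the speed estimate.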

zozbot234 · 2026-03-02T08:54:54 1772441694

> RAM use also increases with context window size.

KV cache is very swappable since it has limited writes per generated token (whereas inference would have to write out as much as llm_active_size per token, which is way too much at scale!), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.

Make sure also that you're using mmap to load model parameters, especially for MoE experts. It has no detrimental effect on performance given that you have enough RAM to begin with, but it allows you to scale up gradually beyond that, at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with much lower storage_bandwidth).

0xbadcafebee · 2026-03-02T15:23:33 1772465013

Well mmap can still cause issues if you run short on RAM, and the disk access can cause latency and overall performance issues. It's better than nothing though.

Agree that k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Then the Ollama default quantization for k/v cache is fp16, you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)

kittikitti · 2026-03-02T16:08:27 1772467707

This is a good rule of thumb. I would also include that in most cases, RAM use exponentially increases with context window size.

namibj · 2026-03-03T09:42:30 1772530950

There's zero exponential scaling involved. There is quadratic compute and reasonably log-linear storage, though.
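A quick way to check the scaling, assuming a plain dense-attention transformer with a standard KV cache (the shape below is hypothetical, and causal masking is ignored since it only changes the constant):

```python
def kv_cache_values(context_len: int, layers: int, kv_heads: int, head_dim: int) -> int:
    """Cached key and value entries: proportional to context length."""
    return 2 * layers * kv_heads * head_dim * context_len


def prefill_score_count(context_len: int, layers: int, heads: int) -> int:
    """Query-key scores computed over a full prompt: proportional to context_len**2."""
    return layers * heads * context_len * context_len


SHAPE = dict(layers=32, kv_heads=8, head_dim=128)  # hypothetical model shape

for n in (4096, 8192):
    print(n, kv_cache_values(n, **SHAPE), prefill_score_count(n, layers=32, heads=32))
```

Under these assumptions, doubling the context doubles the cached values and quadruples the attention scores. Neither grows exponentially.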

escapeteam · 2026-03-02T19:48:59 1772480939

Thanks for the formula, I wasn't aware of it.

omneity · 2026-02-05T05:07:23 1770268043

It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will result in 4GB vram give or take, same as params. At 4 bits ~= 2GB and so on. Kimi is about 512GB at 4 bits.

omneity · 2026-02-04T16:07:42 1770221262

Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).

SubiculumCode · 2026-02-04T16:41:44 1770223304

Dumb question: Can inference be done in a reverse pass? Outputs predicting inputs?

dave_universetf · 2026-02-04T18:30:48 1770229848

Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.

The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it makes the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.

The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.

And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
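To illustrate the "train it on reversed sequences" point, here is a toy sketch of how the training pairs would be built. The helper names are made up and whole words stand in for tokens.

```python
from typing import List, Tuple

Pair = Tuple[List[str], str]


def next_token_pairs(tokens: List[str]) -> List[Pair]:
    """Standard (context, next token) pairs for left-to-right training."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def reversed_pairs(tokens: List[str]) -> List[Pair]:
    """Same objective on the reversed sequence: the target is the preceding token."""
    return next_token_pairs(tokens[::-1])


sentence = ["the", "cat", "sat"]
print(next_token_pairs(sentence))
# [(['the'], 'cat'), (['the', 'cat'], 'sat')]
print(reversed_pairs(sentence))
# [(['sat'], 'cat'), (['sat', 'cat'], 'the')]
```

The training loop itself is unchanged. Only the data is flipped, which is why predicting the prefix is still that model's forward pass.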

direwolf20 · 2026-02-04T23:54:55 1770249295

I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.

gpm · 2026-02-04T17:02:21 1770224541

Not as trivially as the forwards direction, unsurprisingly information is lost, but better than you might expect. See for example https://arxiv.org/pdf/2405.15012

root_axis · 2026-02-04T16:47:40 1770223660

Sounds like a great premise for a sci-fi short story.

anu7df · 2026-02-04T17:14:49 1770225289

Sci-fi? You mean historical fiction!