Hacker News

"unused tokens" are the force driving token cost down. If everyone used all of the tokens they thought they were paying for, prices would explode. People with subscriptions that don't get out everything they can are subsidizing the system.

There are ways to use LLM service providers that leave no tokens unused: simply bill per token. Unsurprisingly, this quickly becomes much more expensive than a subscription.
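As a rough sketch of why metered billing bites heavy users (every price here is a made-up placeholder, not any provider's actual rate):

```python
# Illustrative break-even between a flat subscription and per-token billing.
# All numbers are hypothetical placeholders, not real provider pricing.
SUBSCRIPTION_USD_PER_MONTH = 20.0
PER_TOKEN_USD = 15.0 / 1_000_000  # e.g. $15 per million tokens

def metered_cost(tokens_per_month: int) -> float:
    """Monthly cost if every token is billed individually."""
    return tokens_per_month * PER_TOKEN_USD

# Past this volume, the flat subscription is the cheaper option.
break_even_tokens = SUBSCRIPTION_USD_PER_MONTH / PER_TOKEN_USD
print(f"break-even: {break_even_tokens:,.0f} tokens/month")
print(f"cost at 10M tokens: ${metered_cost(10_000_000):.2f}")
```

Anyone burning well past the break-even volume on a flat plan is exactly the user the provider hopes stays rare.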



And that is why the only winning move is owning a GPU.


With current GPU prices, I find it difficult to find hardware that can run competent models. gemma4's 26B MoE model seems to offer the best performance per gigabyte of RAM, but it's not good enough to use the way one would use cloud models.
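For a rough sense of the memory side of that tradeoff, a back-of-the-envelope weight-size estimate (the 1.2 overhead factor for KV cache and runtime is my own assumption):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough RAM needed to serve a model; overhead covers KV cache etc."""
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

# A ~26B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_memory_gb(26, bits):.0f} GB")
```

Even 4-bit quantization keeps a model this size out of reach of most consumer GPUs' VRAM.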

The big, impressive models all scale well in multi-customer setups because of the efficiency batching provides, but the base cost to run models like that, even as a small business, is incredibly high. If you can't saturate your LLM hardware almost 24/7, the time to earn back your investment is long unless you settle for inferior models that are worse at their job.
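The utilization point can be made concrete with a toy payback model (every figure below is an assumption for illustration; swap in your own hardware cost, throughput, and token price):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

def payback_months(hardware_usd: float, tokens_per_sec: float,
                   utilization: float, usd_per_million_tokens: float) -> float:
    """Months to recoup hardware cost, ignoring power and maintenance."""
    tokens_per_month = tokens_per_sec * utilization * SECONDS_PER_MONTH
    revenue_per_month = tokens_per_month / 1e6 * usd_per_million_tokens
    return hardware_usd / revenue_per_month

# Same rig, different utilization -- saturation is everything:
for u in (1.0, 0.3, 0.05):
    print(f"{u:.0%} busy: {payback_months(10_000, 500, u, 0.50):.0f} months")
```

Payback time scales inversely with utilization, which is why shared cloud fleets beat a single under-used box.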


Assuming one does not value privacy, sovereignty, etc.

But also, the 128 GB Strix Halo is pretty hard to beat.


I think about this sometimes: does it really make sense, financially? These are just my impressions, and I'm glad to be corrected if someone has hard numbers and experience going this route:

At the moment, LLM vendors are in market-grab mode and take a loss on big subscription users. They are starting to move toward profit, but they must move carefully so as not to let a competitor steal their users, so we will still have "cheap" tokens for a while.

Even if prices go up a bit, they have scale in their favor to optimize costs.

If commercial model providers push their prices into "not competitive" territory compared to open models, wouldn't it always be cheaper to use an open-model inference provider? They can take advantage of scale as well, and with no model moat, competition should keep prices honest.

And as a last resort, renting GPU time in the cloud seems like a safer bet to me than buying a GPU.
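A similarly hedged rent-vs-buy comparison (the hourly rate, purchase price, and hardware lifetime are all placeholder assumptions):

```python
def rent_vs_buy(hours_per_month: float, rent_usd_per_hour: float,
                purchase_usd: float, lifetime_months: int = 36) -> str:
    """Compare total rental cost over the hardware's useful life to buying."""
    total_rent = hours_per_month * rent_usd_per_hour * lifetime_months
    return "rent" if total_rent < purchase_usd else "buy"

# Light use favors renting; near-constant use favors buying.
print(rent_vs_buy(hours_per_month=20, rent_usd_per_hour=2.0, purchase_usd=10_000))   # -> rent
print(rent_vs_buy(hours_per_month=600, rent_usd_per_hour=2.0, purchase_usd=10_000))  # -> buy
```

Renting also dodges the depreciation risk of owning hardware that next year's models may outgrow.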



