They're certainly welcome to do whatever they're like, and for a microkernel based OS it might make sense--I think there's probably pretty "Meh" output from a lot of LLMs.
I think part of the battle is actually just getting people to identify which LLM made it to understand if someones contribution is good or not. A javascript project with contributions from Opus 4.6 will probably be pretty good, but if someone is using Mistral small via the chat app, it's probably just a waste of time.
I have a recurring problem where I can't even read one of my favorite recipe websites (seriouseats.com) from my phone because the series of popups completely blocks the page, and can't be dismissed.
But if I ask Claude or Gemini for a nice version of the recipe, it works perfectly. I think there's a lot of own goals out there.
For what it's worth, most people already are doing this! Some of the subagents in Claude Code (Explore, I think even compaction) default to Haiku and then you have to manually overwrite it with an env variable if you want to change it.
Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!
> Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!
I'm awaiting the day the small and fast models come anywhere close to acceptable quality, as of today, neither GPT5.3-codex-spark nor Haiku are very suitable for either compaction or similar tasks, as they'll miss so much considering they're quite a lot dumber.
Personally I do it the other way, the compaction done by the biggest model I can run, the planning as well, but then actually following the step-by-step "implement it" is done by a small model. It seemed to me like letting a smaller model do the compaction or writing overviews just makes things worse, even if they get a lot faster.
I think the fast inference options have historically been only marginally more expensive then their slow cousins. There's a whole set of research about optimal efficiency, speed, and intelligence pareto curves. If you can deliver even an outdated low intelligence/old model at high efficiency, everyone will be interested. If you can deliver a model very fast, everyone will be interested. (If you can deliver a very smart model, everyone is obviously the most interested, but that's the free space.)
But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.
Do you guys all think you'll be able to convert open source models to diffusion models relatively cheaply ala the d1 // LLaDA series of papers? If so, that seems like an extremely powerful story where you get to retool the much, much larger capex of open models into high performance diffusion models.
(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)
I think there's clearly a "Speed is a quality of it's own" axis. When you use Cereberas (or Groq) to develop an API, the turn around speed of iterating on jobs is so much faster (and cheaper!) then using frontier high intelligence labs, it's almost a different product.
Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf
Cerebras requires a $3K/year membership to use APIs.
Groq's been dead for about 6 months, even pre-acquisition.
I hope Inception is going well, it's the only real democratic target at this. Gemini 2.5 Flash Lite was promising but it never really went anywhere, even by the standards of a Google preview
I do wonder if there are tasks where 16k garbage words/s are more useful than 200 good words per second. Does anyone have any ideas? Data extraction perhaps?
I don't think it's a good comparison given Inception work on software and Cerebras/Groq work on hardware. If Inception demonstrate that diffusion LLMs work well at scale (at a reasonable price) then we can probably expect all the other frontier labs to copy them quickly, similarly to OpenAI's reasoning models.
Definitely depends on what you're buying, maybe some of the audience here was buying Groq and Cerebras chips? I don't think they sold them but can't say for sure.
If you're a poor schmoke like me, you'd be thinking of them as API vendors of ~1000 token/s LLMs.
Especially because Inception v1's been out for a while and we haven't seen a follow-the-leader effect.
Coincidentally, that's one of my biggest questions: why not?
No new model since GPT-OSS 120B, er maybe Kimi K2 not-thinking? Basically there were a couple models it normally obviously support, and it didn't.
Something about that Nvidia sale smelled funny to me because the # was yuge, yet, the software side shut down decently before the acquisition.
But that's 100% speculation, wouldn't be shocked if it was:
"We were never looking to become profitable just on API users, but we had to have it to stay visible. So, yeah, once it was clear an Nvidia sale was going through, we stopped working 16 hours a day, and now we're waiting to see what Nvidia wants to do with the API"
The groq purchase was designed to not trigger federal oversight of mergers, so you buy out the ‘interesting’ part, leave a skeleton team and a line of business you don’t care about -> no CFIUS, no mandatory FTC reporting -> smoother process.
Once again, it's a tech that Google created but never turned into a product. AFAIK in their demo last year, Google showed a special version of Gemini that used diffusion. They were so excited about it (on the stage) and I thought that's what they'd use in Google search and Gmail.
Good to know it works for some people! I think it's another issue where they focus too much on MacOS and neglect Windows and Linux releases. I use WSL for Claude Code since the Windows release is far worse and currently unusable do to several neglected issues.
Hoping to see several missing features land in the Linux release soon.
I'm also feeling weak and the pull of getting a Mac is stronger. But I also really don't like the neglect around being cross-platform. It's "cross-platform" except a bunch of crap doesn't work outside MacOS. This applies to Claude Code, Claude Desktop (MacOS and Windows only - no Linux or WSL support), Claude Cowork (MacOS only). OpenAI does the same crap - the new Codex desktop app is MacOS only. And now I'm ranting.
I'm on v2.1.37 and I have it set to auto-update, which it does. I also tend to run `claude update` when I see a new release thread on Twitter, and usually it has already updated itself.
Claude Code CLI 2.1.39 released a few hours ago fixes the problem. They didn't note it in the changelog though. Seems like a significant bug fix. ¯\_(ツ)_/¯
I think the closest that this has come is in the form of GitLab, which pretty famously did a ton of the corporate work in the format of a very open Handbook (https://handbook.gitlab.com/)
In the early years, it was extremely, extremely open and comprehensive. I've definitely looked through it when I wasn't sure how to handle something at work.
That's pretty cool. Wonder if it is deployed and updated religiously still. If they wanted to deploy an 'Agent' worker that source is goldmine for context.
I have a hard time believing that the right move for most organizations that aren't already bought into an OpenAI enterprise plan is going to be building their entire business around something like this. This ties you to one model provider that has been having issues keeping up with the other big labs and provides what looks like superficially some extremely useful tools but with unclear amounts of rigor. I don't think I would want to build my business on this if I was an AI-native company that was just starting right now unless they figure out how to make this much more legible and transparent to people.
I think part of the battle is actually just getting people to identify which LLM made it to understand if someones contribution is good or not. A javascript project with contributions from Opus 4.6 will probably be pretty good, but if someone is using Mistral small via the chat app, it's probably just a waste of time.