Totally agree with you. What a shame. But when I look at the national debt that seems even more out of reach, I do tend to consider that maybe the stars should wait till we have our s..tuff together here on earth. Privately funded, no issues, go for it at warp speed!
I'm currently running a custom Gemma4 26B MoE model on my 24 GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini on 3 different puzzles/code challenges I tested it on. The issue now is the low context: I can only fit 2048 tokens with my VRAM. The gap is slowly closing on the frontier models.
From what I understand you shouldn't wait more than 5 min between prompts without compacting or clearing, or you'll pay for reinitializing the cache. With compaction you still pay, but for fewer input tokens.
(Is compaction itself free?)
Only if you set `ENABLE_PROMPT_CACHING_1H`, which was mentioned in the release notes for a recent Claude Code release but doesn't seem to be in the official docs.
That'd be awesome but it doesn't reflect what I see. Do you have a source for that?
What I see is that if I take a quick break, the session loses ~5% right at the start of the next prompt's processing. (I'm currently on Max 5x.)
Not at my workstation right now, but simply ask Claude to analyze the jsonl transcript of any session: there are two cache keys there, one 5m and one 1h, and only the 1h one gets set. There are also entries that will tell you whether a request was a cache hit or miss, or whether a cache rewrite happened. I've had one Claude test another, and on the Max 5x subscription a cache miss only happened if a message was sent after 1h, or if the session was resumed using /resume or --resume (this is a bug that has existed since January: all session resumes cause a full cache rewrite).
However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.
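If you'd rather not ask Claude to do the digging, you can scan a transcript yourself. A minimal sketch: the field names below follow the public Messages API shape (`cache_control: {"type": "ephemeral", "ttl": ...}`), but the actual Claude Code transcript schema may differ, so treat this as an assumption to verify against your own files:

```python
import json

def cache_ttls(transcript_path):
    """Collect any cache_control TTL values from a Claude Code session
    transcript (JSONL, one entry per line). Field names here follow the
    public Messages API shape; the real transcript schema may differ."""
    ttls = []
    with open(transcript_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # walk the whole entry looking for cache_control blocks
            stack = [json.loads(line)]
            while stack:
                node = stack.pop()
                if isinstance(node, dict):
                    cc = node.get("cache_control")
                    if isinstance(cc, dict) and "ttl" in cc:
                        ttls.append(cc["ttl"])
                    stack.extend(node.values())
                elif isinstance(node, list):
                    stack.extend(node)
    return ttls
```

Counting the collected values tells you which TTL is actually being requested per breakpoint.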
I have heard that if you have telemetry disabled the cache is 5 minutes, otherwise 1h. No clue how true that is, but my experience (with telemetry enabled) has been the 1h cache.
It's true as far as I can tell, just from my own checking using `/status`. You can also tell by when the "clear" reminder hint shows up. And if you look at the leaked Claude Code source, you can see that almost everything in the main thread is cached with a 1h TTL (I believe subagents use a 5-minute TTL).
Isn't that how the KV cache currently works? Of course they could decide to hold on to cache items for longer than an hour, but the storage requirements are pretty significant, while the chance of session resumption shrinks rapidly.
The storage requirements for large-model KV caches are actually comparatively tiny: the per-token size grows far more slowly than the parameter count. Of course, we're talking "tiny" in the sense of stashing them on bulk storage and slowly fetching them back to RAM. But that should still be viable for very long contexts, since the time to rerun prefill is quadratic.
We only have open models to go by, so looking at GLM 5.1 for instance, we're talking about almost 300 GB of kv-cache for a full context window of 200k tokens.
The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.
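That trade-off is easy to sketch as a back-of-envelope calculation. Every number below is an illustrative assumption (a ~37B-active-parameter MoE, ~70 KB of KV per token, one accelerator at 40% utilization, SSD-class reload bandwidth), not a measurement of any real deployment:

```python
# Back-of-envelope: when does reloading a stored KV cache from disk beat
# re-running prefill? All constants are illustrative assumptions.

def prefill_seconds(tokens, active_params=37e9, flops_per_sec=1e15, mfu=0.4):
    """Compute-only prefill estimate: ~2 * active_params FLOPs per token.
    Ignores the quadratic attention term, so it *understates* the real
    cost at long context."""
    return 2 * active_params * tokens / (flops_per_sec * mfu)

def reload_seconds(tokens, kv_bytes_per_token=70e3, bandwidth=3e9):
    """Time to stream the saved KV cache back at SSD-class bandwidth."""
    return tokens * kv_bytes_per_token / bandwidth

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens: prefill ~{prefill_seconds(n):6.1f}s, "
          f"reload ~{reload_seconds(n):6.1f}s")
```

Under these assumptions the reload wins at every length shown, and the gap widens with context, since the omitted attention term makes real prefill grow faster than linearly.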
If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same. That's the whole point of this caching in the first place.
If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use cases), why couldn't they just move it to disk? The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100 GB at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDDs (and it doesn't grow indefinitely, since really old conversations can be discarded).
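Plugging numbers into that formula, with DeepSeek-V3-style values as illustrative assumptions (61 layers, a ~576-wide compressed MLA latent per layer per token, 2 bytes per element at FP16):

```python
def kv_cache_gib(tokens, n_layers, width_per_layer, bytes_per_elem=2):
    """KV-cache size ~= tokens * n_layers * per-layer-per-token width * dtype
    size. For an MLA-style model, width_per_layer is the compressed latent
    width, not 2 * n_kv_heads * head_dim."""
    return tokens * n_layers * width_per_layer * bytes_per_elem / 2**30

# Illustrative: 1M tokens, 61 layers, 576-wide latent, FP16
print(f"{kv_cache_gib(1_000_000, 61, 576):.0f} GiB")  # -> 65 GiB
```

That lands in the same ballpark as the ~100 GB figure above, and a non-MLA model with full per-head K/V would be several times larger.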
> If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same.
Correct. When you’re using the API you can choose between 60-minute or 5-minute cache writes for this reason, but I believe the subscription doesn’t offer this. 60-minute cache writes are priced higher than 5-minute ones (2x base input price vs 1.25x, last I checked).
I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.
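For reference, the TTL is selected per cache breakpoint in the request body. A sketch of what that looks like; the `ttl` field and the `extended-cache-ttl-2025-04-11` beta header come from the extended-TTL beta as I understand it, and the model name is just a placeholder, so double-check the current docs:

```python
# Sketch: a Messages API body requesting a 1-hour cache TTL on the
# system-prompt breakpoint. Without "ttl", the default is the 5m cache.
payload = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a careful coding assistant.",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},  # "5m" or "1h"
        }
    ],
    "messages": [{"role": "user", "content": "hello"}],
}
headers = {"anthropic-beta": "extended-cache-ttl-2025-04-11"}
```

Subsequent requests that share the same prefix up to that breakpoint can then hit the 1h cache instead of paying full prefill.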
Ah, I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other; the timings I mentioned were the time Claude spent working.
Is it 5 mins between constant prompting/work, or 5 mins as in: if I step away from the computer for 5 mins and come back and prompt again, am I then subject to reinit?
If it's the latter, that's crazy. I don't even know what to do there; compactions already feel like a memory wipe.
The challenge is not whether you could do all of it without AI, but whether there's any of it you couldn't have done before.
Not everyone learns at the same pace, and not everyone has the same fault-tolerance threshold. In my experience some people are what I call "Japanese learners", perfecting by watching: they will learn with AI but would never do it themselves, out of fear of getting something wrong even while they understand most of it. Others, whom I call "Western learners", will start right away and "get their hands dirty" without much knowledge, and also get it wrong right away. Both are valid learning strategies fitting different personalities.
Yup, after the token increase in CC from two weeks ago, I'm now consistently filling the 1M context window that never went above 30-40% a few days ago. Did they turn it off? I used to see the "Co-Authored by Opus 4.6 (1M Context Window)" line in git commits; now the advert line is gone. I never turned it on or off. Maybe the defaults changed, but /model doesn't show two different context sizes for Opus 4.6.
I never asked for a 1M context window, then I got it and it was nice, now it's as if it was gone again .. no biggie but if they had advertised it as a free-trial (which it feels like) I wouldn't have opted in.
Anyways, seems I'm just ranting, I still like Claude, yes but nonetheless it still feels like the game you described above.
We defaulted to medium [reasoning] as a result of user feedback about Claude using too many tokens. When we made the change, we (1) included it in the changelog and (2) showed a dialog when you opened Claude Code so you could choose to opt out. Literally nothing sneaky about it — this was us addressing user feedback in an obvious and explicit way.
Off topic, but I found Sonnet useless. It can't do the simplest tasks, like refactoring a method signature consistently across a project or following instructions accurately about what patterns/libraries should be used to solve a problem.
It's crazy because when Sonnet came out it was heralded as the best thing since sliced bread, and now people are literally saying it's "useless". I wonder if this is our collective expectations increasing or the models are getting worse.
New models come out with inflated expectations, then they are adjusted/nerfed/limited for whatever reason. Our expectations remain at previous levels.
New models come out with once again inflated expectations, but now it's double inflation, because we're still on the previous level of expectations. And so on.
I think it's likely to get worse. Providers are running out of training data, and running bigger and bigger models to more and more people is prohibitively expensive. So they will try to keep the hype up while the gains are either very small or non-existent.
I like not running into the mandatory compaction, but I do try to actively keep it under the limit too. From an Anthropic standpoint, with the new(ish) 5-min cache timeout, it's a great way to get people to burn tokens on reinitializing the cache without having them occupy TPU time, especially as the context gets larger.
Hmm, I just reverted to 2.1.98, and now with /model, default has the "(1M context)" and Opus is without (200k). It's totally possible that I just missed the difference between the recommended model (Opus 1M) and plain Opus when I checked, though.
This is what makes HN great: We get to hear from the people and not (only) the media dept. Thanks for your honesty and openness. I trust OpenAI a lot more when I hear balanced accounts like this.
Makes me wonder why solar is not on the list... I thought Al Gore said that was gonna solve all energy problems. (Of course not, he's a politician, but I'd have expected to at least see it with some relevant percentage in the African countries.)
Or could it be that solar is distributed enough to not appear because it's set up directly by/with the consumer rather than the grid producer?
Awesome, I didn't know about the car wash question.
Totally true, and tokens also seem to burn through much faster. More parallelism could explain some of it, but whereas I could work on 3-5 projects at once on the Max plan a month ago, I can't even get one to completion now on the same Opus model before the 5h session locks me out.