As the author of the now (in)famous report in issue https://github.com/anthropics/claude-code/issues/42796 (sorry Stella :) all I can say is... sigh. Reading through the changelog felt as if they codified every bad experiment they ran that hurt Opus 4.6. It makes clear that the degradation was not accidental.
I'm still sad. I had a transformative 6 months with Opus and do not regret it, but I'm also glad that I didn't let hope keep me stuck for another few weeks: had I been waiting for a correction I'd be crushed by this.
Hypothesis: Mythos maintains the behavior of what Opus used to be, via a few tricks now restricted to the hands of a few whom Anthropic deems worthy. Opus is now the consumer line. I'll still use Opus for some code reviews, but it does not seem like it'll ever go back to collaborator status, by design. :(
You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point Claude at that issue and have it look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc.). Unfortunately, the `cleanupPeriodDays` setting defaults to 20, and anyone who hadn't backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.
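For reference, the retention change is a single key in your settings.json (the file otherwise keeps whatever keys you already have; this fragment is just illustrative):

```json
{
  "cleanupPeriodDays": 365
}
```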
The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc. Not all providers redact or limit thinking, but some non-Anthropic ones do (mostly those not on API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions behave dramatically worse than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk, same as you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/
The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, and it's so annoying. Unless it makes a change, immediately runs tests, and it's obvious from the name/contents that the failing test is directly related to the change it just made, it will ignore the failure and not try to fix it.
This should be part of the system prompt. It's absolutely unacceptable to not at least try to investigate failures like this. I absolutely hate when it reaches this conclusion on its own and just continues on as if it's doing valid work.
Based on the recent leaks, their system prompt explicitly nudges the model not to do anything outside of what was asked. That could very well explain why it’s not fixing preexisting broken tests.
“Don't add features, refactor code, or make "improvements" beyond what was asked.”
And it's very valid. Because otherwise you would ask Claude to trim a tree and it would go raze the whole forest and plant new seeds. This was the primary pain point last year, especially with Sonnet.
I will note that this "out" that Claude takes was a) less frequent in Opus 4.5 and that time frame and b) notably not something that Codex does.
I don't trust the code that Claude writes at all. If I have to use it (they gave me a free month recently, so I use it...), I not only review it carefully but also have Codex do a thorough review.
Claude "cheats" and leaves hacks and has Dunning-Kruger.
All of this is very exhausting. I am enjoying writing my own code with these tools (to get long running personal projects out the door) but the effect that these tools are having on teams is terrifyingly corrosive and it's making me want to take an early retirement from the profession.
Yes we can write a lot of code quickly. But at what cost? And what even use is all this code now anyways?
Usually these were the developers who said their code didn’t need tests because it’s obviously correct/too simple to need them. And then their bug causes a crash that needs to be fixed over the weekend :/
I can't believe that's where we're at, as software devs. I miss predictable outputs, state machines. All those LLM (prompt) based rules make no sense to me. Same with AI WAL. All of it, at some point, will fail.
> I can't believe that's where we're at, as software devs
Agree wholeheartedly.
The premise of the bug did not make any sense to me. For instance, "unusable for complex engineering tasks": why would someone who understands these tools use them for complex engineering tasks? Also, this phrase in the bug is too jargon-y: "Extended Thinking Is Load-Bearing for Senior Engineering Workflows" - what does this even mean? Am I the only one looking at this with bewilderment? I think there is a group of folks producing almost-working proof-of-concept code with these tools who will face a reckoning at some point - as the bug illustrates. I see this storm in a teacup with wonder and amusement.
There is also a larger commentary here: when you don't understand why things work (i.e., have no causal model), you won't know why they broke (or find root causes). We are at a point in our craft where we throw magic dust and chant spells at Claude and hope and pray it works.
I've been saying this to many of my friends, but I feel like it's also probably illegal: you paid for a subscription from which you expect X, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, is that not false advertising? Could we not ask for a refund, or even sue?
Probably not. The engineers don't even know how these things work (see: black box), so how could you even prove it's not doing what it's 'supposed' to be doing?
I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription result in significantly different levels of effective thinking effort compared to a Claude Code+API call?
Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.
I wonder if they’ve had so many new signups lately that they just don’t have enough capacity, so they fiddled with the defaults so they could respond to everyone? Could it be as simple as that?
It's gotten very bad. It was degrading since late Feb and since March 8th has become unusable. "Simplest fix" and "You're right, I'm sorry" are strong indicators. It went from senior engineer to entitled intern, and I went from having a team of peers to a lazy jerk who only tries to cut corners. I've got quantitative analytics of it, too. Briefly the other day for about 24 hours it returned to normal, and then someone flipped the switch again mid-session. I was a massive proponent of Claude/Opus, and for the last several weeks have felt rug-pulled. It's such an obvious degradation that even non-technical friends have noticed it. It's optimizing for minimum effort instead of correct and clean solutions. It sucks, because had I experienced it like this from the start I'd have bounced from agentic coding and never looked back - unfortunately, I thought it'd only get better and adjusted my workflow around it. When my Qwen3.5 27B local model gets into fewer reasoning loops than Opus does, it makes me wonder if anyone there cares or if they are just chasing IPO energy from scaling.
I had to build a stop hook to catch its garbage, and even then it's not enough. I used to get 30min-1hr uninterrupted sessions (with some slipstreamed comments), and now I can't get a single diff that I can accept without comment. Half of the work it does is more destructive than helpful (removing comments from existing code, ignoring directives and wandering off into nowhere, etc.).
From 2 weeks after installing the stop hook (around March 8th):
```
Breakdown of the 173 violations:
73x ownership dodging (caught saying variants of "not caused by my changes")
40x unnecessary permission-seeking ("should I continue?", "want me to keep going?")
18x premature stopping ("good stopping point", "natural checkpoint")
14x "known limitation" dodging
14x "future work" / "known issue" labeling
Various: "next session", "pause here", etc.
Peak day: March 18 with 43 violations in a single day.
```
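The breakdown above is the kind of tally you can produce with a simple phrase-matching pass over assistant messages. A minimal sketch (the phrase buckets and message format here are my assumptions, not the author's actual hook):

```python
import re
from collections import Counter

# Hypothetical phrase buckets mirroring the violation categories above;
# a real stop hook would presumably match many more variants.
VIOLATION_PATTERNS = {
    "ownership dodging": [r"not caused by my changes", r"not my code"],
    "permission-seeking": [r"should i continue", r"want me to keep going"],
    "premature stopping": [r"good stopping point", r"natural checkpoint"],
    "limitation dodging": [r"known limitation"],
    "future-work labeling": [r"future work", r"known issue"],
}

def count_violations(messages):
    """Count how many messages match each violation bucket (case-insensitive)."""
    counts = Counter()
    for msg in messages:
        text = msg.lower()
        for bucket, patterns in VIOLATION_PATTERNS.items():
            if any(re.search(p, text) for p in patterns):
                counts[bucket] += 1
    return counts

# Toy transcript excerpts for illustration only.
messages = [
    "This failure is not caused by my changes, so I'll move on.",
    "This looks like a good stopping point - want me to keep going?",
    "I'll mark this as future work.",
]
print(count_violations(messages))
```

A message can land in more than one bucket (the second toy message counts as both premature stopping and permission-seeking), which is probably what you want when tallying behaviors rather than messages.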
Other one is loops in reasoning, which are something I'm familiar with on small local models, not frontier ones:
```
Sessions containing 5+ instances of reasoning-loop phrases ("oh wait", "actually,", "let me reconsider", "I was wrong"):
Period           Sessions with 5+ loops
Before March 8   0
After March 8    7 (up to 23 instances in one session)
```
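That session-level tally is a grep-and-group pass over the logs; a minimal sketch, assuming logs reduce to (session_id, text) pairs (the phrase list is taken from above, the log shape is my assumption):

```python
from collections import Counter

# Loop phrases quoted above; "actually," keeps the comma to avoid
# matching ordinary uses of the word.
LOOP_PHRASES = ("oh wait", "actually,", "let me reconsider", "i was wrong")

def sessions_with_loops(entries, threshold=5):
    """entries: iterable of (session_id, text) pairs. Returns the sessions
    whose total loop-phrase count meets the threshold, with their counts."""
    per_session = Counter()
    for session_id, text in entries:
        t = text.lower()
        per_session[session_id] += sum(t.count(p) for p in LOOP_PHRASES)
    return {s: n for s, n in per_session.items() if n >= threshold}

# Toy log entries for illustration only.
entries = [
    ("s1", "Oh wait, actually, let me reconsider."),
    ("s1", "Oh wait, I was wrong. Actually, oh wait..."),
    ("s2", "Looks correct, proceeding."),
]
print(sessions_with_loops(entries))
```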
(I've even had it write code where it has "Wait, actually, we should do X" in comments in the code!)
The worst is the dodging; it said, literally, "not my code, not my problem" to a build failure it created 5 messages ago in the same session.
```
I had to tell Claude "there's no such thing as [an issue that existed before your changes]" on average:
Once per week in January
2-3 times per week in February
Nearly daily from March 8 onward
```
Honestly, just venting, because I'm extremely depressed. I had the equivalent of a team of engineers I could trust, and overnight someone at Anthropic flicked a switch and killed them. I'm getting better results from random models on OpenRouter now (and OmniCoder 9B! 9B!). They aren't _good_ results, mind you, but they aren't idiotic.
Reading your report made me quite depressed too. This is world changing technology. It feels like I was only allowed to have a glimpse of it before it was taken away. I hope things get better in the future...
I hear you and I am really hoping more people notice this obvious degradation than dismiss this as workflow or prompt or context saturation issues.
It isn't obvious, but I hope the people managing this realize what kind of confusion and doubt (or self-doubt) this creates in people; it will have a long-term impact on usage of their models.
I am going to try removing any and all plugins (I only have Anthropic's plugins, like superpowers) and see if that makes any difference.
Yeah, I went through a week or two of configuration changes trying to figure out what I could have done to make it behave that way, and it wasn't until it repaired itself and then the next morning went back to idiot-mode mid-response that I finally knew it was not me. Same task, same session, same cc version, same prompts, same context, so I'm confident it was a configuration change on their end.
In case anyone can correlate, the recovery happened on March 24th and then re-regressed at approximately 3:09 PM PST (23:09 UTC) on March 25. Flipped right back into "simplest" solutions, and "You're right, I'm sorry" mode:
> "You're right. That was lazy and wrong. I was trying to dodge a code generator issue instead of fixing it."
> "You're right — I rushed this and it shows. Let me be deliberate about the structure before writing."
> "You're right, and I was being sloppy. The CPU slab provider's prefault is real work."
No joke - I've used Windows (and a bit of OS X) my entire life and am old enough now that I didn't think I'd ever be able to switch. A few weeks back I hit the point where I had to upgrade from Windows 10 to 11 and just could not stomach the UX, so in frustration I set up Kubuntu w/ Plasma... and it's been amazing. I've tried switching before without the same luck, and I think agents like Claude/Codex/etc. are the only reason it has stuck this time. Something that's always been unique to Linux is that if there's something I want to change I can generally do that, but now when I want something customized I can _actually_ do it instead of just slotting it into the infinite "if only I had time" bucket. There are quirks for sure (I'm looking at you, PipeWire), but the tinkery-ness of Linux on the desktop went from being friction to a superpower for me just this month - maybe others will catch on next year.
I've distro hopped and DE hopped a lot before settling, but it's been amazing for me as someone who has switched over from Windows. It just doesn't get in the way, is super familiar for me, AND lets me do a lot of things I wish I had in Windows.
I was worried about the "choice fatigue" due to it being super configurable and all, but honestly the defaults are so sensible I haven't really had a reason to tinker with it much if at all.
+1. I switched from Pop OS to Debian + KDE last week, and KDE has been solid. I too read a handful of articles calling out the choice fatigue, and other than a few tweaks (maybe half an hour?) I was ready to go. I run old-ish hardware (circa 2013) without any issues.
Something notable is that all the hotkeys felt 'just right'. I had to tinker a bunch in Pop OS to get satisfying hotkey combos, and the COSMIC upgrade reset them all.
As a Settlers 1/2 fan I spent quite a bit of time in The Colonists - can recommend it if you liked the road building/flag mechanics and the chill gameplay.
It also feels like they couldn't use the GOOGLE ANTIGRAVITY logo enough times in this blog post. Gigantic image with the logo and a subtitle, plastered over and over again.
I no longer bother reading their press releases. I'd much rather read the comments and threads like these to get the real story. And I say that as a former googler.
Neat! As someone working in this space and feeling like I've been taking crazy pills from how these "duh, CPUs solved this 30 years ago" things keep slipping, it's great to see more people bridging the gap! Unfortunately, CUDA/HIP (and the entire stack beneath them) virtual memory management ops are very expensive host APIs (remapping a big block of pages can be O(n^2) in page count, fully synchronize host/device (forced wait-idle), take kernel locks, etc.), so it hasn't been viable in all cases. If your workloads are submit/wait with the host in the loop, the VM tricks are OK, but if you are trying to never block the GPU (pipeline depth > 0) you really want to avoid anything that does a page table modification (until we get GPUs that can pipeline those). vkQueueBindSparse is one of the few async APIs I've seen, and CUDA has cuMemMapArrayAsync, but I haven't yet used it (because arrays are annoying and, without being able to inspect the driver, I'm sure it's probably doing the wrong thing).
I've had good luck with indirection tables used during lookup inside of the kernels consuming/producing the kvcache data - it's essentially user-mode remapping like they do here: you can publish a buffer offset table and threads are uniform, have coalesced reads to the table, and cache the offsets no problem. You have the same memory locality issues as VM (contiguous virtual but potentially random physical) but are not limited to device page sizes and since you can update while work is in-flight you can be much more aggressive about reuse and offload (enqueue DMA to cold storage to evict from VRAM, enqueue DMA to copy from cold memory into reused VRAM, enqueue offset table update, enqueue work using them, repeat - all without host synchronization). You can also defrag in-flight if you do want to try to restore the physical locality. It's nothing crazy and fairly normal in CPU land (or even classic virtual texturing), but in ML GPU land I could write a big paper on it and call it SuperDuperFancyAttention4 and publish press releases...
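A toy model of that indirection scheme, with pure Python standing in for the device-side lookup (class, names, and block size are all made up for the sketch; on a GPU the offset table would be a device buffer the kernel reads through):

```python
# Logical KV-cache blocks are addressed through an offset table, so physical
# placement can change (eviction, reuse, defrag) without any page-table op:
# publish a new table, and subsequent kernels read through it.
BLOCK_TOKENS = 16  # tokens per logical block (arbitrary for this sketch)

class IndirectKVCache:
    def __init__(self, num_physical_blocks):
        self.physical = [None] * num_physical_blocks  # stand-in for VRAM slabs
        self.offset_table = {}                        # logical block -> physical slot

    def bind(self, logical_block, physical_slot, data):
        """What the host/DMA path would do: fill a slot, then publish the mapping."""
        self.physical[physical_slot] = data
        self.offset_table[logical_block] = physical_slot

    def read(self, token_index):
        """What a kernel thread would do: one uniform, cacheable table lookup,
        then the data itself."""
        logical_block = token_index // BLOCK_TOKENS
        slot = self.offset_table[logical_block]
        return self.physical[slot]

cache = IndirectKVCache(num_physical_blocks=4)
cache.bind(logical_block=0, physical_slot=2, data="kv-block-A")
cache.bind(logical_block=1, physical_slot=0, data="kv-block-B")
print(cache.read(5))   # token 5  -> logical block 0
print(cache.read(17))  # token 17 -> logical block 1

# Defrag/reuse: move block 0 to a new slot and republish - readers just
# pick up the new mapping on their next table read, no host sync.
cache.bind(logical_block=0, physical_slot=3, data="kv-block-A")
```

The point of the indirection is the last three lines: remapping is an ordinary buffer write you can enqueue between kernels, not a driver call.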
(Disclaimer: I am one of the authors of the project) Thank you for the thoughtful and insightful comment. I really love the depth of your first paragraph. You highlighted a concern in this space that is often overlooked, and I am glad you raised it. We spent a significant amount of time dealing with the cost of dynamic GPU memory operations.
One useful observation is that LLM inference has almost no host API calls during steady state, since the GPU must stay busy with continuous kernel launches or CUDA graph replay. You are absolutely right that CUDA and HIP virtual memory operations are expensive on the host side and involve heavy driver work. However, they introduce only small stalls in the GPU pipeline, because most of the cost is paid on the host. These operations are also relatively infrequent compared to kernel launches in practice, so we offload them to a background thread to keep them off the critical path. The APIs are not cheap in general, but they happen to fit LLM inference surprisingly well.
On your second point, I think I follow your idea, though please correct me if I misunderstood. Virtual memory does open the door to paging and offloading, which is also important for LLM systems. We are actively working in this direction in kvcached. Your defragmentation point also reminds me of classic techniques such as compaction and garbage collection. They could certainly help, though I suspect the trade-off between benefit and complexity would need more careful evaluation.
Thank you again for the thoughtful analysis. It was a pleasure to read. I would be happy to continue the discussion.
As an old school TT/TTD fan this gives me so many good vibes :)
Been fun watching the progress and I do recommend people check out the demos on Steam if you just want to have a good nostalgia break even if the game isn't fully there yet.