So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it. They found that artifact conversations have lower rates of fact-checking and reasoning challenges across the board. That's kind of an uncomfortable loop for a company selling increasingly capable models.
This makes me think of checklists. We have decades of experience in uncountable areas showing that checklists reminding users to question the universe improve outcomes: Is the chemical mixture at the temperature indicated by the chart? Did you get confirmation from Air Traffic Control? Are you about to amputate the correct limb? Is this really the file you want to permanently erase?
Yet our human brains are usually primed to skip steps, take shortcuts, and see what we expect rather than what's really there. It's surprisingly hard to keep doing the work both consistently and to notice deviations.
> lower rates of fact-checking and reasoning challenges
Now here we are with LLMs, geared to produce a flood of superficially-plausible output which strikes at our weak-point, the ability to do intentional review in a deep and sustained way. We've automated the stuff that wasn't as-hard and putting an even greater amount of pressure on the remaining bottleneck.
Rather than the old definition involving customer interaction and ads, I fear the new "attention economy" is going to be managing the scarce resource of human inspection and validation.
Sounds like having a strong checklist of steps to take for every pull request will be crucial for creating reliable and correct software when AIs write most of the code.
But the temptation to short change this step when it becomes the bottleneck for shipping code will become immense.
> So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it.
This is exactly what I worry about when I use AI tools to generate code. Even if I check it, and it seems to work, it's easy to think, "oh, I'm done." However, I'll (often) later find obvious logical errors that make all of the code suspect. I don't bother, most of the time though.
I'm starting to group code in my head by code I've thoroughly thought about, and "suspect" code that, while it seems to work, is inherently not trustworthy.
Sure, but the study is saying something slightly different, it's not that people write bad prompts for artifacts, they actually write better ones (more specific, more examples, clearer goals,...). They just stop evaluating the result. So the input quality goes up but the quality control goes down.
Seems like it’s impossible for output to be good if the prompt is bad. Unless the AI is ignoring the literal instructions and just guessing “what you really want” which would be bad in a different way.
> On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
EDIT: This is a new iteration of an old problem. Even GIGO [1] arguably predates computers and describes a lot of systemic problems. It does seem a lot more difficult to distinguish between a "garbage" or "good" prompt though. Perhaps this problem is just going to keep getting harder.
What does prompting quality even mean, empirically? I feel like the LLM providers could/should provide prompt scoring as some kind of metric and provide hints to users on ways they can improve (possibly including ways the LLM is specifically trained to act for a given prompt).
The real insight buried in here is "build what programmers love and everyone will follow." If every user has an agent that can write code against your product, your API docs become your actual product. That's a massive shift.
I'm very much looking forward to this shift. It is SO MUCH more pro-consumer than the existing SaaS model. Right now every app feels like a walled garden, with broken UX, constant redesigns, enormous amounts of telemetry and user manipulation. It feels like every time I ask for programmatic access to SaaS tools in order to simplify a workflow, I get stuck in endless meetings with product managers trying to "understand my use case", even for products explicitly marketed to programmers.
Using agents that interact with APIs represents people being able to own their user experience more. Why not craft a frontend that behaves exactly the the way YOU want it to, tailor made for YOUR work, abstracting the set of products you are using and focusing only on the actual relevant bits of the work you are doing? Maybe a downside might be that there is more explicit metering of use in these products instead of the per-user licensing that is common today. But the upside is there is so much less scope for engagement-hacking, dark patterns, useless upselling, and so on.
> Right now every app feels like a walled garden, with broken UX, constant redesigns, enormous amounts of telemetry and user manipulation
OK, but: that's an economic situation.
> so much less scope for engagement-hacking, dark patterns, useless upselling, and so on.
Right, so there's less profit in it.
To me it seems this will make the market more adversarial, not less. Increasing amounts of effort will be expended to prevent LLMs interacting with your software or web pages. Or in some cases exploit the user's agentic LLM to make a bad decision on their behalf.
the "exploit the user's agentic LLM" angle is underappreciated imo. we already see prompt injection attacks in the wild -- hidden text on web pages that tells the agent to do things the user didn't ask for. now scale that to every e-commerce site, every SaaS onboarding flow, every comparison page.
it's basically SEO all over again but worse, because the attack surface is the user's own decision-making proxy. at least with google you could see the search results and decide yourself. when your agent just picks a vendor for you based on what it "found," the incentive to manipulate that process is enormous.
we're going to need something like a trust layer between agents and the services they interact with. otherwise it's just an arms race between agent-facing dark patterns and whatever defenses the model providers build in.
Maybe. Or maybe services will switch to charging per API call or whatever instead of monthly or per-seat. Who can predict the future?
I mean, services _could_ make it harder to use LLMs to interact with them, but if agents are popular enough they might see customers start to revolt over it.
This extends further than most people realize. If agents are the primary consumers of your product surface, then the entire discoverability layer shifts too. Right now Google indexes your marketing page -- soon the question is whether Claude or GPT can even find and correctly describe what your product does when a user asks.
We're already seeing this with search. Ask an LLM "what tools do X" and the answer depends heavily on structured data, citation patterns, and how well your docs/content map to the LLM's training. Companies with great API docs but zero presence in the training data just won't exist to these agents.
So it's not just "API docs = product" -- it's more like "machine-legible presence = existence." Which is a weird new SEO-like discipline that barely has a name yet.
The "start over in an hour" philosophy is underrated. I've been running my own infrastructure for years and the single most empowering thing isn't the setup, it's the peace of mind that you can just nuke it and spin up somewhere else.
Knowing that, I started looking at every SaaS subscription very differently.
At the lower or easier end, there’s your standard containerisation tools like Docker Compose or the Podman equivalents. Just move your compose files and zip the mount folders and you can move stuff easily enough.
Middle ground you’ve got stuff like Ansible for if you want to install things without containers, but still want it to be scripted. I don’t use these much since they feel like the worst of both worlds.
Higher end in terms of effort is using something like NixOS, where you get basically Terraform for everything in your distro.
The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.
Boris Cherny, creator of Claude Code, posted about how he used Claude a month ago. He’s got half a dozen Opus sessions on the burners constantly. So yes, I expect it’s unmetered.
Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see
Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track after two small paragraphs of context into a conversation (looking at you early Gemini).
Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.
If you're getting on along with 4.5, then that suggests you didn't actually need the large context window, for your use. If that's true, what's the clear tell that it's working well? Am I misunderstanding?
Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.
Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?
Living in the EU, I'm skeptical any of this happens. Our leaders have been pretty reluctant to push back on anything so far and most of these assets are private anyway.
Hi Troy, just wanted to let you know that I just sent you an email! :)
Also, just to be sure, I sent it to on-board.ai domain as well, as that seemed like the correct website (onboard.ai just showed "for sale" page). Might help some others too.
Google login also seems to be having issues, multiple people reported to me that the login isn’t working and they’ve been logged out of their Google accounts.
Yes, I tried logging in today in two distinct Google accounts on separate Chrome profiles and it would sign me out in about ~ 5 seconds after logging in. And the login process was very sluggish.