This article is a really good summary of current thinking on the “world model” conundrum that a lot of people are talking about, either directly or indirectly, with respect to current-day deployments of LLMs.
The benchmarks look good. The slide decks and spreadsheets look even better. People will have to use Claude Cowork, have their Claude Code moment, and figure out the consequences for themselves. It will be really interesting to see articles like this one (https://mitchellh.com/writing/my-ai-adoption-journey) written by people who actually care about accuracy in places like KPMG, to get their perspective on things.
I remember overhearing some normal people on the bus: one of them had essentially orchestrated an agent scraper to pull and summarise news from 40 different sites he had identified as important, which put him well ahead of his peers. These were non-technical people orchestrating an agent workflow to make themselves better at work.
There’s not much here that tickles my software brain, but the agents are coming for us all.
This is OpenAI taking the concept of AI coworkers seriously down to the level of “identity” for these agents.
This reminded me of Kairos, which came up a few days ago (https://www.kairos.computer/), but I actually feel much better about, and more inspired by, the angle OpenAI took than the one Kairos took. OpenAI’s genuinely feels like a platform for a coworker, while Kairos is yet another cool landing page, yet another agent platform with X number of data integrations. The use cases in OpenAI’s article also felt more concrete and impressive, to be honest.
The claim that “as agents have gotten more capable, the opportunity gap between what models can do and what teams can actually deploy has grown” is definitely true. An analogy whose source I have forgotten described us as having F1 cars driving at 60 km/h: a lot of enterprises are not even at the deployment limit where improving benchmarks matters. They are still at the level of not being able to provide the right info, not having the right evaluation and improvement frameworks, etc.
Using “Opening the AI Frontier” as a heading would have been in really poor taste before OpenAI released their OSS models (having earned their ClosedAI moniker), but I guess it’s a bit less offensive now. I think this product, combined with OpenAI FDEs, is going to make a lot of large industries inaccessible to startups, but there may still be value in companies like Kairos watching what OpenAI does in this space and copying it.
Large-scale AI deployment has led to a complete change in what signal code actually conveys and what it means for maintainers. Code is no longer a yardstick for effort, care, or expertise. If anything, a large amount of it can signal the opposite.
I read an article a while ago about how “taste is a moat” (https://wangcong.org/2026-01-13-personal-taste-is-the-moat.h...) and it kind of applies here. In that article, a technically correct kernel patch was rejected because it simply re-implemented functionality that was already available elsewhere. In the tldraw repo, users seem to clone the repo, spin up Claude, and then make a PR without any kind of “taste” involved.
What confuses me is that tldraw is actually very well set up for getting the best out of models; indeed, internally at tldraw, models are expected to be used and the author gets value out of them. And yet people leave sloppy, unvetted PRs. This is a social issue we didn’t really have before, because producing code used to be the difficult part. Now that producing code and PRs is easy, the signal-to-noise ratio has collapsed completely and it’s just not worth it for maintainers to actually review this stuff.
It would be better for people to leave one-line issues with video demonstrations and allow the internal team to /fix them: “In a world of AI coding assistants, is code from external contributors actually valuable at all? If writing the code is the easy part, why would I want someone else to write it?”. Is code really needed to convey problems with open source repos, or is it something unnecessary that we are now unshackled from? In the case of tldraw, a lot of the PRs are just the result of people running Claude on issues, and therefore they add absolutely zero value.
Compilers are another point of honor and pride that the models have taken from the nerds. In the past, people would debate for hours about the “dragon book” vs. “writing interpreters” and present their cool bespoke compilers in Show HN posts. Now models can produce 100,000 lines of code over two weeks with no human intervention that actually work and can compile a significant project. Which way now, nerd? The models are getting better; are you?
The article has some really odd low-level descriptions of bash orchestration, which I suppose are important to illustrate how barebones the setup was. Still, I always find it odd that when we talk about agents lauded as borderline superintelligence, there is still low-level bash being slung around; it feels like we’re talking about things at the wrong level.
The point about writing extremely high-quality tests reminds me a bit of the “hot mess theory of AI” (https://alignment.anthropic.com/2026/hot-mess-of-ai/), also from Anthropic, where they essentially say that long-horizon tasks are more likely to fail through incoherence than through a model purposefully pursuing incorrect results. The article phrases this as “Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem”.
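To make that concrete, here is a rough sketch of my own (not code from the article) of what a strict task verifier might look like; `pipeline.py` and `EXPECTED_HEADER` are made-up stand-ins for a real task spec. The point is that a loose check such as “exit code is zero” lets the agent pass while solving the wrong problem, whereas an exact check on the artifact does not.

```python
# A rough sketch of a strict task verifier; the script name and expected
# columns are hypothetical placeholders, not taken from the article.
import csv
import subprocess

EXPECTED_HEADER = ["date", "site", "headline", "summary"]

def verify(output_path: str) -> bool:
    # Loose check: did the command exit cleanly? On its own this lets the
    # agent "pass" while producing the wrong artifact.
    result = subprocess.run(
        ["python", "pipeline.py", "--out", output_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    # Strict check: the artifact must match the spec exactly, with the right
    # columns and no empty cells, so a plausible-but-wrong result still fails.
    with open(output_path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != EXPECTED_HEADER:
        return False
    return all(
        len(row) == len(EXPECTED_HEADER) and all(cell.strip() for cell in row)
        for row in rows[1:]
    )
```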
The author also observes something that I realised after the initial joy of seeing an agent one-shot a task wore off: for a 30-minute agent task, 25 minutes may be spent exploring the environment. While it would be an offence to hand a human unvetted model-generated documentation and runbooks (I’m looking at you, emoji-ridden README.md files becoming ever more common across Show HN), models should commit things like this to memory for themselves to avoid repeatedly paying the “discovery tax” on every new action. Errors, hallucinations, or changes that cause the generated docs to fail create more busywork for the agent, but agent time is less valuable than finite human life.
The author makes the point that you should redo every manual commit with AI to align your mental model of actions with how models work. This is something I’m going to need to try. It’s related to my desire to reduce things like the “discovery tax” (the phenomenon whereby a 5-minute agent task is 4 minutes of environment exploration and 1 minute of execution) and to make sure that models get things right the first time around; however, my AI improvement plan didn’t really account for how to improve the model in cases where I ended up manually resolving issues or implementing features.
Some arguments are made about retaining focus and single-mindedness while working with AI, and I think these points are important. They relate to the article on cutting out over-eager orchestration and focusing on validation work (https://sibylline.dev/articles/2026-01-27-stop-orchestrating...). There are a few sides to this covered in the article. You should always have a high-value task to switch to while the agent is working (instead of scrolling TikTok, Instagram, X, YouTube, Facebook, Hacker News, etc.); in my case I might start reading some books I have on the backburner, like Ghost in the Wires. You should disable agent notifications and take control of when you return to check the model’s context, so that you’re less ADHD-ridden when programming with agents and actually make meaningful progress on the side task, since you only context switch when you are satisfied. The final one is to always have at least one agent, and preferably only one agent, running in the background. The idea is that always having an agent running gives you a slow burn of productivity improvements and a process by which you can gradually improve the background agent’s performance. It’s also a good way to stay on top of what current model capabilities are.
I also really liked the idea of overnight agents for library research, redevelopment of projects to test out new skills, tests and AGENTS.md modifications.
A 77% score on Terminal-Bench 2 is really impressive. I remember reading the article about the pi coding agent (https://mariozechner.at/posts/2025-11-30-pi-coding-agent/) getting into the top ten percent of agents on that benchmark with a score of about 50%. While it may still be in the top ten, that category has just turned into one champion and a long tail of inferior offerings.
I was shocked to see that in the prompt for one of the landing pages the text “lavender to blue gradient” was included as if that’s something that anybody actually wants. It’s like going to the barber and saying “just make me look awful”.
This was my first time actually seeing what the GDPval benchmark looks like. Essentially, they benchmark all the artifacts that HR/finance might make or work on (onboarding documents, accounting spreadsheets, PowerPoint presentations, etc.). I think it’s good that models are trained to generate things like this well, since people are going to use AI for such work anyway. If the middlemen passing AI outputs around are going to be lazy, I’m grateful that at least OpenAI researchers are cooking something behind the scenes.
This article convinced me to try to set up OpenClaw locally on my Raspberry Pi, but I realised that it had no microSD card installed AND it uses micro HDMI instead of regular HDMI for display, which I didn't have…
Some of the takes in this article relate to the "Agent Native Architecture" piece (https://every.to/guides/agent-native), an article that I critiqued quite heavily for being AI generated. This article presents many of the concepts explored there through a real-world, pragmatic lens. The author brings up how they initially wanted their agent to invoke specific pre-made scripts but ultimately found that letting go of the process is where the model’s inner intelligence was really able to shine. Here, parity, the property whereby anything a human can do an agent can also do, was achieved most powerfully by simply giving the agent a browser-use agent, which cracked open the whole web for it to navigate.
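As a toy illustration of that shift (my own sketch, not the author’s code, and all names are made up), compare a tool list of pre-made scripts with a single generic browser capability:

```python
# Made-up tool definitions to illustrate the contrast described above.

# Before: the agent can only invoke pre-made scripts, so its ceiling is
# whatever scripts someone has already written.
narrow_tools = [
    {"name": "check_flight_prices", "description": "Run scripts/flights.py and return its output."},
    {"name": "fetch_news_digest", "description": "Run scripts/news.py and return its output."},
]

# After: one generic capability that gives the agent rough parity with a human
# sitting at a browser, and lets the model decide the process itself.
parity_tools = [
    {
        "name": "browse",
        "description": "Open a URL in a headless browser, interact with the page, and report back.",
        "parameters": {"url": "string", "goal": "string"},
    },
]
```

The narrow list caps the agent at whatever scripts exist; the generic tool is the “letting go of the process” the author describes.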
The gradual-improvement property of agent-native architectures was also mentioned directly in the article, where the author commented that giving the model more and more context allowed him to “feel the AGI”.
ClawdBot is often reduced to “just AI and cron”, but that is overly reductive in the same way that one could call it a “GPT wrapper”, or call a laptop an “electricity wrapper”. It seems like the scheduler is a significant part of what makes ClawdBot so powerful. For example, instead of looking for sophisticated scraper apps online to monitor the prices of certain items, the author will simply ask ClawdBot something like “Hey, monitor hotel prices”, and ClawdBot will handle the rest asynchronously and communicate back over Slack. Any performance issues due to repeated agent invocations are ameliorated by automatically generated problem context and runbooks, which probably cost less time than maintaining pipelines written in plain code, at least for a single individual who wants a hands-off agent solution.
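For what it’s worth, here is a minimal sketch of the “AI + cron” pattern as I understand it; ClawdBot’s actual internals aren’t documented in the article, and `run_agent` / `post_to_slack` are placeholders you would wire up to your own agent and a Slack webhook.

```python
# A minimal sketch of "AI + cron"; run_agent and post_to_slack are placeholders,
# and `schedule` is the small pip package of the same name.
import time
import schedule

TASK = "Monitor hotel prices for my dates and message me if anything drops below my budget."

def run_agent(task: str, memory_path: str) -> str:
    """Invoke your agent of choice with the task plus the runbook/context it
    wrote for itself on previous runs, and return its report (placeholder)."""
    ...

def post_to_slack(message: str) -> None:
    """Send the agent's report to a Slack channel (placeholder)."""
    ...

def job() -> None:
    report = run_agent(TASK, memory_path="runbooks/hotel-prices.md")
    post_to_slack(report)

schedule.every(6).hours.do(job)  # the unglamorous scheduler does a lot of the work

while True:
    schedule.run_pending()
    time.sleep(60)
```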
Also, the article actually explains the obsession with Mac Minis, which I had assumed was some kind of convoluted scam (though Apple doesn’t need scams to sell Macs…). Essentially, you need one to run a browser, or multiple browsers, for your agents. Unfortunately, that’s the state of the modern web.
I actually have my own note-taking system and a pipeline that gives me an overview of all of the concepts, blogs and daily events from the past week for me to look at. But it is much more rigid than ClawdBot: 1) I can only access it from my laptop, 2) it only supports text at the moment, 3) the actions I can take are hard-coded rather than agent-refined and naturally occurring (e.g. tweet pipeline, lessons pipeline, YouTube video pipeline), 4) there’s no intelligent scheduler logic or agent at all, so I manually run the script every evening. Something like ClawdBot could replace this whole pipeline.
Long story short, I need to try this out at some point.
This is one of those “don’t be evil” like articles that companies remove when the going gets tough but I guess we should be thankful that things are looking rosy enough for Anthropic at the moment that they would release a blog like this.
The point about filtering signal vs. noise in search engines can’t really be overstated. At this point, using a search engine, and the conventional internet in general, is an exercise in frustration. It’s simply a user-hostile place: infinite cookie banners for sites that shouldn’t collect data at all, autoplay advertisements, engagement farming, sites generated by AI to shill and pad a word count. You could argue that AI exacerbates this situation, but you also have to agree that it is much more pleasant to ask Perplexity, ChatGPT, or Claude a question than to put yourself through the torture of conventional search. Introducing ads into this would completely deprive the user of a way of navigating the web that actually respects their dignity.
I also agree in the sense that the current crop of AIs do feel like a space to think, as opposed to a place where I am being manipulated, controlled, or treated like some sheep in a flock to be sheared for cash.
The current crop of LLM-backed chatbots do have a bit of that “old, good internet” flavor: a mostly unspoiled frontier where things are changing rapidly, potential seems unbounded, and the people molding the actual tech and discussing it are enthusiasts with a sort of sorcerer’s apprentice vibe. Not sure how long it can persist, since I’ve seen this story before and we all understand the incentive structures at play.
Does anyone know if there are precedents for PBCs or B-Corp-type businesses being held accountable for betraying their stated values? Or is it just window dressing with no legal clout? Can they convert to a standard corporation on a whim and ditch the non-shareholder-maximization goals?
There’s nothing old internet about these AI companies. The old internet was about giving things away and asking for nothing in return. These companies take everything and give back nothing, unless you are willing to pay, that is.
I get the sentiment, but if you can't acknowledge that AI is useful and currently a lot better than search for a great many things, then it's hard to have a rational conversation.
No, they don't. They soak up tons of your most personal and sensitive information like a sponge, and you don't know what's done with it. In the "good old Internet", that did not happen. Also in the good old Internet, it wasn't the masses all dependent on a few central mega-corporations shaping the interaction, but a many-to-many affair, with people and organizations of different sizes running the sites where interaction took place.
Ok, I know I'm describing the past with rosy glasses. After all, the Internet started as a DARPA project. But still, current reality is itself rather dystopic in many ways.
> This is one of those “don’t be evil” like articles that companies remove when the going gets tough but I guess we should be thankful that things are looking rosy enough for Anthropic at the moment that they would release a blog like this.
Exactly this. Show me the incentive, and I'll show you the outcome, but at least I'm glad we're getting a bit more time ad-free.
Right, if there's no legal weight to any of their statements then they mean almost nothing. It's a very weak signal and just feels like marketing. All digital goods can and will be made worse over time if it benefits the company.
Current LLMs often produce much, much worse results than manually searching.
If you need to search the internet on a topic that is full of unknown unknowns for you, they're a pretty decent way to get a lay of the land, but beyond that, off to Kagi (or Google) you go.
Even worse is that the results are inconsistent. I can ask Gemini five times at what temperature I should take a waterfowl out of the oven, and get five different answers, 10°C apart.
What do you mean, "are you sure"? I literally saw, and still see, it happen in front of my eyes. I just now tested it with slight variations of "ideal temperature waterfowl cooking", "best temperature waterfowl roasting", etc., and all of these questions yield different answers, with temperatures ranging from 47°C to 57°C (ignoring the 74°C food-safety ones).
That's my entire point. Even adding an "is" or "the" can get you wildly different advice. No human would give you different info when you ask "what's the waterfowl's best cooking temperature" vs. "what is the waterfowl's best roasting temperature".
Did you point that out to one of them… like “hey bro, I’ve asked y’all this question in multiple threads and get wildly different answers. Why?”
And the answer is probably that there is no such thing as an ideal temperature for waterfowl, because the real answer is “it depends” and you didn’t give it enough context to answer your question any better.
Context is everything. Give it poor prompts, you’ll get poor answers. LLMs are no different than programming a computer or anything else in this domain.
And learning how to give good context is a skill. One we all need to learn.
But that isn't how normal people interact with search engines. Which is the whole argument everyone here is making: that LLMs are now better 'correct answer generators' than search engines. They're not. My mother directly experienced that. Her food would have come out much better if she had completely ignored Gemini and checked a site.
One of the best things LLMs could do (and that no one seems to be doing) is allow them to admit uncertainty. If the average probability of the tokens in a response drops below some threshold X, it should just say "I don't know, you should check a different source."
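A minimal sketch of that idea, purely illustrative: threshold on the average per-token probability. How you obtain per-token logprobs is provider-specific, the threshold value is arbitrary here, and token probability is at best a crude proxy for factual confidence.

```python
import math

def answer_or_abstain(tokens_with_logprobs, threshold=0.8):
    """Return the model's text, or abstain when the average per-token
    probability falls below `threshold` (illustrative values only)."""
    fallback = "I don't know, you should check a different source."
    if not tokens_with_logprobs:
        return fallback
    avg_prob = sum(math.exp(lp) for _, lp in tokens_with_logprobs) / len(tokens_with_logprobs)
    return "".join(tok for tok, _ in tokens_with_logprobs) if avg_prob >= threshold else fallback

# A confident answer passes, a shaky one abstains (numbers are made up):
print(answer_or_abstain([("165", -0.05), ("°F", -0.2)]))  # -> "165°F"
print(answer_or_abstain([("47", -1.6), ("°C", -1.2)]))    # -> abstains
```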
At any rate, if my mother has to figure out some 10-sentence, stunted, multiform question for the LLM to finally give a good, consistent answer, or she can just type "best Indian restaurant in Brooklyn" (maybe even with "site:restaurantreviews.com"), which experience is superior?
> LLMs are no different than programming a computer or anything else in this domain.
Just feel like reiterating against this: virtually no one programs their search queries or query-engineers a 10-sentence search query.
If I made a new, non-AI tool called 'correct answer provider' which provided definitive, incorrect answers to things, you'd call it bad software. But because it is AI, we're going to blame the user for not second-guessing the answers or for holding it wrong, i.e. bad prompting.
I created an account just to point out that this is simply not true. I just tried it! The answers were consistent across all 5 samples with both "Fast" mode and Pro (which I think is really important to mention if you're going to post comments like this - I was thinking maybe it would be inconsistent with the Flash model)
It obviously takes discipline, but using something like Perplexity as an aggregator typically gets me better results, because I can click through to the sources.
It's not a perfect solution because you need the discipline/intuition to do that, and not blindly trust the summary.
My mother did, for Christmas. It was a goose that ended up being raw in a lot of places.
I then pointed out this same inconsistency to her, and that she shouldn't put stock in what Gemini says. Testing it myself, it would give results between 47°C and 57°C. And sometimes it would just trip out and give the health-approved temperature, which is 74°C (!).
Edit: just tested it again and it still happens. But inconsistency isn't a surprise for anyone who actually knows how LLMs work.
I just asked Gemini 3 five times: `what temperature I should take a waterfowl out of the oven`
and received generic advice; every single time it gave nearly identical charts, and 165°F was in every response. LLMs are unpredictable, yes, but I am more skeptical that it gave incorrect answers (raw goose) than that your mother prepared the fowl wrong.
Cooking correctly is a skill, just as prompting is. Ask 10 people how to cook fowl and their answers will mimic the LLM's.
> But inconsistency isn't a surprise for anyone who actually knows how LLMs work
Exactly. These people saying they've gotten good results for the same question aren't countering your argument. All they're doing is proving that sometimes it can output good results. But a tool that's randomly right or wrong is not a very useful one. You can't trust any of its output unless you can validate it. And for a lot of the questions people ask of it, if you have to validate it, there was no reason to use the LLM in the first place.
Despite having read articles on when to delegate to AI, covering agent completion time, agent success probability and human verification time, the thought of genuinely systematising and solving the problem of verification and QA never occurred to me. My mind is still in the mode where “building” and “shipping” are noble goals to be sought after, even though that era is dead given how low the difficulty bar has dropped (the bar is six feet deep). We should build and we should ship faster, but considering only those aspects is irresponsible and childish. With these new automated reasoning systems, we ought to validate as much as possible before presenting anything to the user.
Possibly the most salient point in the article is the following: “for the love of god, put [...] whatever tool du jour you're using to blow up your codebase, and make sure every claim in your README, every claim in your docs (you have docs, right?), every claim on your website is 100% tested and validated. Run actual rigorous benchmarks. Set up E2E tests driven by behavioral specs. Take your users seriously enough to deliver a good experience out of the box rather than trying to use hype to drive uptake then hoping they'll provide you with free QA”.
Personally, this really resonated with the absolute fatigue I feel inside when I see a new “Show HN” pointing to a GitHub repository in the year of our lord 2026. I’ve been burned by “slop” repos so much that I already feel the Claude emoji drivel coming, and sure enough, a lot of the time that’s all a repo is: the abandoned and uncared-for orphan child born of a passionate one-night stand with Claude Code. Not a single screenshot or demo video in sight, just plausible promises dumped into a file for end users to figure out.
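Going back to the quote about testing every claim: a hypothetical example of pinning a README claim with an E2E test. `mytool`, its flags, and the fixture path are all made up; the shape of the test is the point.

```python
# Hypothetical E2E test for a README claim like "`mytool summarise --json`
# emits valid JSON with a non-empty summary". The CLI name and flags are
# placeholders, not a real tool.
import json
import subprocess

def test_readme_json_flag_claim():
    result = subprocess.run(
        ["mytool", "summarise", "--json", "tests/fixtures/sample.txt"],
        capture_output=True, text=True, timeout=60,
    )
    assert result.returncode == 0, result.stderr
    payload = json.loads(result.stdout)                 # claim: output is valid JSON
    assert payload.get("summary", "").strip()           # claim: non-empty summary
```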
It synthesizes comments on “RL Environments” (https://ankitmaloo.com/rl-env/), “World Models” (https://ankitmaloo.com/world-models/) and the real reason that the “Google Game Arena” (https://blog.google/innovation-and-ai/models-and-research/go...) is so important to powering LLMs. In a sense it also relates to the notion of “taste” (https://wangcong.org/2026-01-13-personal-taste-is-the-moat.h...) and how, or whether, its moat-worthiness can be eliminated by models.