Much of the world believed in geocentrism for a very long time, yet we've known better for a long time now. Currently, this "Ptolemaic" idea of human intelligence being the center of everything sounds similar. What if human intelligence is not at the center? What if other forms, such as AI, emerge as the new "center"?
Basically, as long as you are using an Anthropic library or tool, you can use your OAuth credentials. For example, you can use the Claude Agent SDK with your OAuth credentials. This is sweet because I can prototype all sorts of agents with Claude Code embedded inside, at a predictable monthly cost. One nice use case is turning skills into standalone tools or apps.
You can also do convoluted things like run Claude Code within tmux and send input to it and read the output.
MCP Channels are interesting too for bidirectional communication between your app and a running Claude Code instance, with an MCP server sitting in between. It's slow, but allows for some interesting use cases when you want to step out of an existing CLI session to do work that is easier in a graphical interface, have Claude Code respond and do work, then when you're done, go back to the CLI session and continue, never losing context.
Says this:
"Unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead."
Again, it seems Anthropic prefers to bill API token rates (long run), not subscriber effective token rates.
You don't really need to tmux at all for Claude Code CLI. Claude Code CLI supports streaming json input, and streaming json output; you can use stdin/out as a pipe to control Claude Code CLI.
1. playwright-cli for exploration and ad-hoc scraping, in order to determine what works.
2. playwright code generation based on 1, which captures a repeatable workflow
3. agent skills - these can be playwright based, but in some cases if I can just rely on built-in tools like Web Search and Web Fetch, I will.
playwright is one of the unsung heroes of agentic workflows. I heavily rely on it. In addition to the obvious DOM inspection capabilities, the fact that the console and network can be inspected is a game changer for debugging. watching an agent get rapid feedback or do live TDD is one of the most satisfying things ever.
Browser automation and being able to record the graphics buffer as video, during a run, open up many possibilities.
"Claude, reverse engineer the APIs of this website and build a client. Use Dev Tools."
I have succeed 8/8 websites with this.
Sites like Booking.com, Hotels.com, try to identify real humans with their AWS solution and Cloudflare, but you can just solve the captcha yourself, login and the session is in disguishable from a human. Playwright is detected and often blocked.
Agreed! One thing that we felt was missing from the existing MCP tools was user recording. For old and shitty healthcare websites it's easier to just show the workflow than explain it
The playwright codegen tool exists, but the script it generates is super simple and it can't handle loops or data extraction.
So for libretto we often use a mix of instructions + recording my actions for the agent. Makes the process faster than just relying on a description and waiting for the agent to figure out the whole flow
Same playwright is phenomenal. You can also have the agent browse with MCP to figure out the workflow, then bang out a repeatable playwright script for it. It's a great combo
"Test harness is everything, if you don't have a way of validating the work, the loop will go stray"
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
I'd encourage you to try the -codex family with the highest reasoning.
I can't comment on Opus in CC because I've never bit the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).
I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.
I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".
Thanks, I'll try those out. I've used Codex CLI itself on a few small projects as well, and fired it up on a feature branch where I had it implement the same feature that Claude Code did (they didn't see each other's implementations). For that specific case, the implementation Codex produced was simpler, and better for the immediate requirements. However, Claude's more abstracted solution may have held up better to changing requirements. Codex feels more reserved than Claude Code, which can be good or bad depending on the task.
I've heard Codex CLI called a scalpel, and this resonates. You wouldn't use a scalpel for a major carving project.
To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details.
On a side note, I've also spent a fair amount of time messing around around in Codex CLI as I have a Pro subscription. It rapidly becomes apparent that it does exactly what you tell it even if an obvious improvement is trivial. Opus is on the other end of the spectrum here. you have to be fairly explicit with Opus intructing it to not add spurious improvements.
"To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details."
Very interesting. I'm going to try this out. Thanks!
I've tried nearly all the models, they all work best if and only if you will never handle the code ever again. They suck if you have a solution and want them to implement that solution.
I've tried explaining the implementation word and word and it still prefers to create a whole new implementation reimplementing some parts instead of just doing what I tell it to. The only time it works is if I actually give it the code but at that point there's no reason to use it.
There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.
Yes, I only plan/implement on fully AI projects where it's easy for me to tell whether or not they're doing the thing I want regardless of whether or not they've rewritten the codebase.
For actual work that I bill for, I go in with intructions to do minimal changes, and then I carefully review/edit everything.
That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.
There are domains of programming (web front end) where lots of requests can be done pretty well even when you want them done a certain way. Not all, but enough to make it a great tool.
> Claude Opus 4.5 by far is the most capable development model.
At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.
I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.
I have literally the opposite experience and so does most of AI pilled twitter and the AI research community of top conferences (NeurIPS, ICLR, ICML, AAAI) Why does this FUD keep appearing on this site?
Edit: It's very true that the big 4 labs silently mess with their models and any action of that nature is extremely user hostile.
I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.
What I find most frustrating is that I am not sure if it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs like writing random text continuously and burning output tokens, Grok seems to have system prompts that result in odd behaviour...no bugs just doing weird things, Gemini Flash models seem to output massive quantities of text for no reason...it is often feels like very stupid things.
Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?
I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.
The day for open models will come but it still feels so close and so far.
Hey HN! In 2025, I've spent more time than ever conversing with AI coding agents, particularly Claude Code. These conversations are an intimate look into how we think and solve problems. Every chat with the agent contains valuable solutions, patterns, decisions, and mistakes. So being able to search, analyze, and learn from those interactions isn't just convenient, it's becoming essential.
To help me do this, I built a tool to process Claude Code conversations:
* Import and search your entire conversation history across projects
* Analyze sessions, choosing from over 300 LLM models, via OpenRouter, to extract insight and patterns (decisions made, error patterns, how you use AI agents)
* Share insights as GitHub Gists (as long as the text passes a security scan)
* View basic aggregate statistics on Claude Code usage
The tool is built with Python, Streamlit, SQLite with FTS5, OpenRouter, and Gitleaks.
I made this for myself, and sharing it in case it helps you too. Once your conversations are in a database, you can start asking questions like “What were the key technical decisions on this project?”, “How did the agent help to research and prototype this feature?”, "What steps did I take to implement this?" and “What errors does the agent commonly make?”
It’s a work in progress, and I'm planning on adding more features. Currently only tested on macOS 14.7 with Claude Code 2.0.21. If you’re curious what your Claude Code sessions may reveal, take it for a spin!
But the AI coding agent can then ask you follow up questions, consider angles you may not have, and generate other artifacts like documentation, data generation and migration scripts, tests, CRUD APIs, all in context. If you can reliably do all that from plain pseudo code, that's way less verbose than having to write out every different representation of the same underlying concept, by hand.
Sure, some of that, like CRUD APIs, you can generate via templates as well. Heck, you can even have the coding agent generate the templates and the code that will process/compile them, or generate the code that generates the templates given a set of parameters.