(blogpost author here)
I've noticed this too. My top guess for any such thing would be that this type of sub-agent routing is outside the training distribution. Its possible that this gets better overnight with a model update. The second reason is that sub-agents make it very hard to debug - was the issue with the router prompt or the agent prompt? Flat tools and loop make this a non-issue without loss of any real capability.
(blogpost author here)
Haha, that's totally fair. I've read a whole bunch of posts comparing CC to other tools, or with a dump of the the architecture. This post was mainly for people who've used CC extensively, know for a fact that it is better and wonder how to ship such an experience in their own apps.
I've used Claude Code, Cursor, and Copilot is Vscode and I don't "know" that Claude Code is better apart from the fact that it runs in the terminal, which makes it a little faster but less ergonomic than tools running inside the editor. All of the context tricks can be done with Copilot instructions as well, so I simply can't see how Claude Code is superior.
I’ve been so into Claude code that I haven’t used cursor or copilot in vs code in a while.
Do they also allow you to view the thinking process and planning, and hit ESC to correct if it’s going down a wrong path? I’ve found that to be one of my favorite features of Claude code. If it says “ah, the the implementation isn’t complete, I’ll update test to use mocks” I can interrupt it and say no, it’s fine for the test to fail until the implementation is finished, so not mock anything. Etc.
It may be that I just discovered this after switching, but I don’t recall that being an interaction pattern on cursor or copilot. I was always having to revert after the fact (which might have been me not seeing the option).
Cursor does show the “thinking” in smaller greyer text, then hides it behind a small grey “thought for 30 seconds” note. If it’s off track, you just hit the stop button and correct the agent, or scroll up and restart from an earlier interaction (same thing as double-ESC in Claude Code).
For code generation, nothing so far beats Opus. More likely than not it generated working code and fixed bugs that Gemini 2.5 pro couldn't solve or even Gemini Code Assist. Gemini Code Assist is better than 2.5 pro, but has way more limits per prompt and often truncates output.
I found Anthropic’s models untrustworthy with SQL (e.g. confused AND and OR operator precedence - or simply forgot to add parens, multiple times), Gemini 2.5 pro has no such issues and identified Claude’s mistakes correctly.
Don’t sleep on Codex-CLI + gpt-5. While the Codex-CLI scaffolding is far behind CC, the gpt-5 code seems solid from what I’ve seen (you can adjust thinking level using /model).
(blogpost author here)
I actually found none of them useful. I think MCP is an incomplete idea. Tools and the system prompt cannot be so cleanly separated (at least not yet). Just slapping on tools hurts performance more than it helps.
I've now gone back to just using vanilla CC with a really really rich claude.md file.
(blogpost author here)
You're right! I did make the distinction in an earlier draft, but decided to use "RAG" interchangeably with vector search, as it is popularly known today in code-gen systems. I'd probably go back to the previous version too.
But I do think there is a qualitative different between getting candidates and adding them to context before generating (retrieval augmented generation) vs the LLM searching for context till it is satisfied.
(author of the blogpost here)
Yeah, you can extract a LOT of performance from the basics and don't have to do any complicated setup for ~99% of use cases. Keep the loop simple, have clear tools (it is ok if tools overlap in function). Clarity and simplicity >>> everything else.
Function / tool calling is actually super simple. I'd honestly recommend either doing it through a single LLM provider (e.g., OpenAI or Gemini) without a hard framework first, and then moving to one of the simpler frameworks if you feel the need to (e.g., LangChain). Frameworks like LangGraph and others can get really complicated really quickly.
Check the OpenAI REST API reference. Most engines implement that and you can see how tool calls work. It’s just a matter of understanding the responses they give you, how to put them in the messages history and how to invoke a tool when the LLM asks for it.
There may be other reasons to use ai sdk, but I'd highly recommend starting with a simple loop + port most relevant tools from Claude Code before using any framework.
Nice, do share a link, would love to check out your agent!
We love that feature too and use it quite a bit ourselves!
> Not quite sure if this should be a separate category?
We see ourselves at the intersection of generic browser-automation agents and generic coding agents. MinusX integrates deeply into jupyter/metabase (we had to do a lot of shenanigans to get the entire jupyter app context) and has more context than RPA agents do today. It is possible that eventually all these apps will converge, but we think MinusX will be more useful for anything data related than any of them for the foreseeable future.
To paraphrase geohot, we think that the path to advanced agents runs through specialized, useful intermediaries.
I really like you retrofit analogy - not sure if you coined it or geohot has.
It seems to me that's where a ton of start-ups are currently converging - not repairing the old, which would be too complicated, but understanding and "mending" for new usages, or functionalities.
Thanks! Not sure, I think the term has been in the ether for a while.
Yeah, I see that too. I think for the longest time there was no leverage in doing this sort of retrofitting (except for grammarly type of use cases). But with better intent capture (llms help here), we can actually fix up any existing gaps!
Haha, the analogy is totally yours to use :)
Nice, yeah, there is a lot of leverage in building agent-like hooks into current workflows. Even if the agents are pretty mid right now (they are for any complex use case that needs long horizon planning), it's a great place to be in time for the next generation models to drop!
Haha, yes! We were doing the exact same thing. Also, there is so much context you can't capture with just table schema that you can if you integrate the extension deep into the tool. It also unlocks cross-app contexts (we're working on a way to import context from a doc to a metabase query, or from a sheet/dashboard to a jupyter notebook etc.
> Is there a way to select which model is being used?
Not at the moment, but this is in our pipeline! We will enable this (and the ability to edit the prompts, etc.) very soon.
I totally share your concerns about data (especially data that may be sensitive). We have a simple non-legal-speak privacy policy here: https://minusx.ai/privacy-simplified.
> Are your LLMs running entirely locally on your own hardware, and if not, how can you say the data is not shared with third parties? (EDIT: you mentioned GPT-4o in another comment so this statement cannot be correct.)
We're currently only using API providers (OAI + Claude) that do not themselves train on data accessed through APIs. Although they are technically third parties, they're not third parties that harvest data.
I recognize that even this may just be empty talk. We're currently working on 2 efforts that I think will further help here:
- opensourcing the entire extension so that users can see exactly what data is being used as LLM context (and allow users to extend the app further)
- support local models so that your data never leaves your computer (ETA for both is ~1-2 weeks)
We are genuinely motivated by the excitement + concerns you may have. We want to give an assistant-in-the-browser alternative to people who don't want to move to AI-native-data-locked-in platforms. I regret that was not transparent in our copy.
Thanks for pointing the error in the FAQs, we somehow missed it. It is fixed now!