What I learned building an opinionated and minimal coding agent (mariozechner.at)
391 points by SatvikBeri 1 day ago | 166 comments




The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.

This is great work; I am looking forward to seeing how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, the pay-per-token premium with an optimized experience will probably be a better deal than suffering through Claude Code's glitches and paper cuts.

The realization is that, in the end, an agent framework kit that is customizable and can be recursively improved by agents is going to be better than a rigid proprietary client app.


> but when the market corrects and the prices will get closer to API prices

I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.

I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.


> I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous.

I think this is absolutely true. There will likely be caps to stop the people running Ralph loops/GasTown with 20 clients 24/7, but for general use they will probably start to drop the API prices rather than vice-versa.

> We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think

Inference is generally accepted to be a very profitable business (outside the HN bubble!).

Claude Code subscriptions are more complicated of course, but I think they probably follow the general pattern of most subscription software - lots of people who hardly use it, and a few who push it very hard, whom they lose money on. Capping the usage solves the "losing money" problem.


FWIW, you can use subscriptions with pi. OpenAI has blessed pi, allowing users to use their ChatGPT subscriptions. The same holds for other providers, except the Flicker Company.

And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.


Good to know, that makes it even better. I still find Opus 4.5 to be the best model currently. But if the next generation of GPT/Gemini closes the gap, that will cross the inflection point for me and make third-party harnesses viable. Or if they jump ahead, that should put more pressure on the Flicker Company to fix the flicker or relax the subscriptions.

Is this something that OpenAI explicitly approves per project? I have had a hard time understanding what their exact position is.


This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"

ChatGPT is quite different from GPT. Using GPT directly to have a nice dialogue simply doesn't work for most intents and purposes. Making it usable for a broad audience took quite some effort, including RLHF, which was not a trivial extension.

This is the first I'm hearing of this pi-agent thing and HOW DO PEOPLE IN TECH DECIDE TO NAME THINGS?

Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? Sigh.


The creator is very aware. Its original name was "shitty coding agent".

https://shittycodingagent.ai/


then do SCA and backronym it into something acceptable! That's even better lore :)

You mean Software Component Architecture? Do you want to bring down the wrath of IBM!

Good call, he'll have to name it Shitty COdingagent, or "SCO". No one will sue over that name.

Developers are the worst at naming things. This is a well known fact.

From the article: "So what's an old guy yelling at Claudes going to do? He's going to write his own coding agent harness and give it a name that's entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be?"

And like ollama it will no doubt start to get enshittified.

Only if it enters YC (like Ollama).

Really awesome and thoughtful thing you've built - bravo!

I'm so aligned with your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're about to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.

I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...

Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.


> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff

My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work and update it at the end. It reduces the context needed to resume or to spawn subagents. Memory across sessions.

https://pastebin.com/VLq4CpCT


Very very cool. Going to try this out on some of my codebases. Do you have the gist that helps the agent populate the mindmap for an existing codebase? Your pastebin mentions it, but I don't see it linked anywhere.


Thank you!

This is incredible. It never occurred to me to even think of marrying memory gather/update slash commands with a mind map that follows the appropriate nodes and edges. It makes so much sense.

I was using a table structure with column 1 as the key and column 2 as the data, and told the agents to match the key before looking at column 2. It worked, but sometimes it failed spectacularly.

I’m going to try this out. Thanks for sharing your .md!


I love this idea, and have immediately put it to use in my own work.

Would you mind publishing the `PROJECT_MIND_MAPPING.md` file that's referenced in `MIND_MAP.md`?


> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.

Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI Studio, which makes an API call to count_tokens every time you type in the prompt box.


AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.

Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.


Same for Claude Code. It's constantly sending token-counting requests

tbf neither does anthropic

> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.

At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".


Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.

"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.


You'll just end up approving things blindly, because 95% of what you'll read will seem obviously right and only 5% will look wrong. I would prefer to let the agent do whatever they want for 15 minutes and then look at the result rather than having to approve every single command it runs.

Works until it has access to write to external systems and your agent is slopping up Linear or GitHub without you knowing, identified as you.

Sure; I mean this is what I _would like_; I’m not saying this would work 100% of the time.

> I would prefer to let the agent do whatever they want

Lol, good luck to you!


This is like having a firewall on your desktop where you manually approve each and every connection.

Secure, yes? Annoying, also yes. Very error-prone too.


It's not just annoying; at scale it makes using the agent CLIs impossible. You can tell someone spends a lot of time in Claude Code: they can type --dangerously-skip-permissions with their eyes closed.

Yep. The agent CLIs have the wrong level of abstraction. Needs more human in the loop.

That kind of blanket demand doesn't persuade anyone and doesn't solve any problem.

Even if you get people to sit and press a button every time the agent wants to do anything, you're not getting the actual alertness and rigor that would prevent disasters. You're getting a bored, inattentive person who could be doing something more valuable than micromanaging Claude.

Managing capabilities for agents is an interesting problem. Working on that seems more fun and valuable than sitting around pressing "OK" whenever the clanker wants to take actions that are harmless in a vast majority of cases.


I don't mean to sound like I'm demanding this. I'm saying you will get better outcomes if you choose to do this as a developer.

You're right it's an interesting problem that seems fun to work on. Hopefully we'll get better harnesses. For now I'm checking everything.


It's not reliable. The AI can just not prompt you to approve, or hide things, etc. AI models are crafty little fuckers and they like to lie to you and find secret ways to do things with ulterior motives. This isn't even a prompt injection thing, it's an emergent property of the model. So you must use an environment where everything can blow up and it's fine.

The harness runs the tool call for the LLM. It is trivial to not run the tool call without approval, and many existing tools do this.
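Concretely, the gate can be this small. A sketch only: ToolCall, READ_ONLY, and runTool are placeholder names, not any particular harness's API.

    import * as readline from "node:readline/promises";

    interface ToolCall { name: string; args: Record<string, unknown>; }

    // Only these tools skip the prompt; everything else blocks on a human "y".
    const READ_ONLY = new Set(["read", "grep", "find", "ls"]);

    async function confirm(question: string): Promise<boolean> {
      const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
      const answer = await rl.question(`${question} [y/N] `);
      rl.close();
      return answer.trim().toLowerCase() === "y";
    }

    // The harness owns execution: the model only ever *requests* a tool call.
    async function executeToolCall(
      call: ToolCall,
      runTool: (c: ToolCall) => Promise<string>,
    ): Promise<string> {
      if (!READ_ONLY.has(call.name)) {
        const ok = await confirm(`Run ${call.name}(${JSON.stringify(call.args)})?`);
        if (!ok) return "Tool call rejected by user."; // fed back to the model as the tool result
      }
      return runTool(call);
    }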

My Codex just uses Python to write files around the sandbox when I ask it to patch an SDK outside its path.

Is it asking you permission to run that python command? If so, then that's expected: commands that you approve get to run without the sandbox.

The point is that Codex can (by default) run commands on its own, without approval (e.g., running `make` on the project it's working on), but they're subject to the imposed OS sandbox.

This is controlled by the `--sandbox` and `--ask-for-approval` arguments to `codex`.


It's definitely not a sandbox if you can just "use python to write files" outside of it o_O

Hence the article’s security theatre remark.

I’m not sure why everyone seems to have forgotten about Unix permissions, proper sandboxing, jails, VMs etc when building agents.

Even just running the agent as a different user with minimal permissions and jailed into its home directory would be simple and easy enough.


I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.

`chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.


> I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.

I don't think there is such a good heuristic. The user wants the agent to do the right thing and not to do the wrong thing, but the capabilities needed are identical.

> `chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.

That's a good, safe, and sane default for project-focused agent use, but it seems like those playing it risky are using agents for general-purpose assistance and automation. The access required to do so chafes against strict sandboxing.


Here's OpenAI's docs page on how they sandbox Codex: https://developers.openai.com/codex/security/

Here's the macOS kernel-enforced sandbox profile that gets applied to processes spawned by the LLM: https://github.com/openai/codex/blob/main/codex-rs/core/src/...

I think skepticism is healthy here, but there's no need to just guess.


That still doesn't seem ideal. Run the LLM itself in a kernel-enforced sandbox, lest it find ways to exploit vulnerabilities in its own code.

The LLM inference itself doesn't "run code" per se (it's just doing tensor math), and besides, it runs on OpenAI's servers, not your machine.

There still needs to be a harness running on your local machine to spawn the processes in their sandboxes. I consider that "part of the LLM" even if it isn't doing any inference.

If that part were running sandboxed, then it would be impossible for it to contact the OpenAI servers (to get the LLM's responses), or to spawn an unsandboxed process (for situations where the LLM requests it from the user).

That's obviously not true. You can do anything you want with a sandbox. Open a socket to the OpenAI servers and then pass that off to the sandbox and let the sandboxed process communicate over that socket. Now it can talk to OpenAI's servers but it can't open connections to any other servers or do anything else.

The startup process which sets up the original socket would have to be privileged, of course, but only for the purpose of setting up the initial connection. The running LLM harness process would not have any ability to break out of the sandbox after that.

As for spawning unsandboxed processes, that would require a much more sophisticated system whereby the harness uses an API to request permission from the user to spawn the process. We already have APIs like this for requesting extra permissions from users on Android and iOS, so it's not in-principle impossible either.

In practice I think such requests would be a security nightmare and best avoided, since essentially it would be like a prisoner asking the guard to let him out of jail and the guard just handing the prisoner the keys. That unsandboxed process could do literally anything it has permissions to do as a non-sandboxed user.


You are essentially describing the system that Codex (and, I presume, Claude Code et al.) already implements.

The devil is in the details. How much of the code running on my machine is confined to the sandbox vs how much is used in the bootstrap phase? I haven't looked, but I would hope it can survive some security audits.

If I'm following this, it means you need to audit all code that the LLM writes, though, as anything you run from another terminal window will run as you with full permissions.

The thing is that on macOS at least, Codex does have the ability to use an actual sandbox that I believe prevents certain write operations and network access.

You really shouldn’t be running agents outside of a container. That’s 101.

What happens if I do?

What's the difference between resetting a container or resetting a VPS?

On local machine I have it under its own user, so I can access its files but it cannot access mine. But I'm not a security expert, so I'd love to hear if that's actually solid.

On my $3 VPS, it has root, because that's the whole point (it's my sysadmin). If it blows it up, I wanna say "I'm down $3", but it doesn't even seem to be that since I can just restore it from a backup.


Bit more general: don't run agents without some sort of OS-provided restriction on what they can do. Containers are one way, VMs another; in most cases it's enough to just use a chroot and the Unix permission system the rest of your system already uses.

I'm trying to understand this workflow. I have just started using Codex. Literally 2 days in. I have it hooked up to my GitHub repo and it just runs in the cloud and creates a PR. I have it touching only UI and middle-layer code. No DB changes; I always tell it to not touch the models.

Does Codex randomly decide to disable the sandbox like Claude Code does?

I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:

- Minimal, configurable context - including system prompts [2]

- Minimal and extensible tools; for example, todo tasks extension [3]

- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]

Full control over context is a high-leverage capability. If you're aware of the many limitations of context on performance (in-context retrieval limits [6], context rot [7], contextual drift [8], etc.), you'd truly appreciate that Pi lets you fine-tune the WHOLE context for optimal performance.

It's clearly not for everyone, but I can see how powerful it can be.

---

[1] https://lucumr.pocoo.org/2026/1/31/pi/

[2] https://github.com/badlogic/pi-mono/tree/main/packages/codin...

[3] https://github.com/mitsuhiko/agent-stuff/blob/main/pi-extens...

[4] https://github.com/nicobailon/pi-mcp-adapter

[5] https://github.com/steipete/mcporter

[6] https://github.com/gkamradt/LLMTest_NeedleInAHaystack

[7] https://research.trychroma.com/context-rot

[8] https://arxiv.org/html/2601.20834v1


Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.

The Claude sub is the only thing keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.


Author of Pi is Mario, not Armin, but Armin is a contributor

> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode

Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...

It may be due to the size of my codebase -- I'm 6 months into a solo-developer bootstrapped startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively, it feels like Claude Code spends an eternity to do something.

(That said, Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred it to instead actively use something integrated in git (Staged vs Unstaged hunks). It's more important to have a good code review experience than to remember which changes I made vs which changes AI made..)


For me, Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. It sometimes feels like Claude Code is presented more as a YOLO option where you put more trust in the agent about what it will produce.

I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.

I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options


Same. For actual production apps I'm typically reviewing the thinking messages and code changes as they happen to ensure it stays on the rails. I heavily use the "revert" to a previous state so I can update the prompt with more accurate info that might have come out of the agent's trial and error. I find that if I don't do this, the agent makes a mess that often doesn't get cleaned up on its way to the actual solution. Maybe a similar workflow is possible with Claude Code...

Yeah, autonomy has the cost of your mental model getting desynchronized. You either follow along interactively or spend time catching up later.

You can ask Claude to work with you step by step and use /rewind. It only shows the diff though, which hides some of the problem, since diffs can seem fine in isolation but have obvious issues when viewed in context.

Ya I guess if you have the IDE open and monitor unstaged git, it's a similar workflow. The other cursor feature I use heavily is the ability to add specific lines and ranges of a file to the context. Feels like in the CLI this would just be pasted text and Claude would have to work a lot harder to resolve the source file and range

Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.

Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I had on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone come up with a proposal on how to implement something; even if it's not the perfect way, it's good as a starter. Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing. Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...

Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.

When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.

I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
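For anyone curious, the basic shape of such a tool can be as simple as this (a generic sketch, not my actual implementation; exact-string matching is exactly where the edge cases bite):

    import { readFile, writeFile } from "node:fs/promises";

    interface Edit { file: string; oldText: string; newText: string; }

    // Apply several exact-string replacements across several files in one tool
    // call, and report per-edit results back to the model.
    async function applyEdits(edits: Edit[]): Promise<string[]> {
      const results: string[] = [];
      for (const { file, oldText, newText } of edits) {
        const content = await readFile(file, "utf8");
        if (!content.includes(oldText)) {
          results.push(`SKIPPED ${file}: oldText not found`);
          continue;
        }
        await writeFile(file, content.replace(oldText, newText), "utf8");
        results.push(`EDITED ${file}`);
      }
      return results;
    }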


> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git

We're making this better very soon! In the coming weeks hopefully.


That's great news.

I see in your public issue tracker that a lot of people are desperate simply for an option to turn that thing off ("Automatically accept all LLM changes"). Then we could use any kind of plugin really for reviews with git.


> remember which changes I made vs which changes AI made..

They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.

You'll be able to hover over lines to see if you wrote it, or an AI. If it was an AI, it will show which model and a reference to the prompt that generated it.

I do like Cursor quite a lot.


Sounds good, but we also need an option to auto-approve all the changes in their local "replica of git".

(if one already exists, someone needs to tell the public Cursor issue tracker)


Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.

Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.


I mainly use Opus as well; Cursor isn't tied to any one AI model, and Opus, Sonnet, and a lot of others are available. Of course there are differences in how the context is managed, but Opus is usually amazing in Cursor at least.

I will very quickly @ the parts of the code that are relevant to get the context up and running right away. Seems like in Claude that's harder...

(They also have their own model, "Composer 1", which is just lightning fast compared to the others... and sometimes feels as smart as Opus, but now and then it doesn't find the solution if it's too complicated and I have to ask Opus to clean it up. But for simple stuff I switch to it.)


Armin Ronacher wrote a good piece about why he uses Pi here: https://lucumr.pocoo.org/2026/1/31/pi/

I hadn't realized that Pi is the agent harness used by OpenClaw.


Pi probably has the best architecture, and being written in JavaScript it is well positioned to use the browser sandbox architecture that I think is the future for AI agents.

I only wish the author changed his stance on vendor extensions: https://github.com/badlogic/pi-mono/discussions/254


“standardize the intersection, expose the union” is a great phrase I hadn’t heard articulated before

I got the wording from an LLM. I knew there was this pattern in all traditional tools - but I did not know the name.

You’ve never heard it before because explicitly signaling “I know basic set theory” is kind of cringy

I don't know how to feel about being the only one refusing to run YOLO mode until the tooling is there, which is still about 6 months away for my setup. Am I years behind everyone else by then? You can get pretty far without completely giving in. Agents really don't need to execute that many arbitrary commands. Linting, search, edit, and web access should all be bespoke tools integrated into the permission and sandbox system. Agents should not even be allowed to start and stop applications that support dev mode: they edit files, can test and get the logs; what else would they need to do? And as the number of external dependencies that make sense shrinks to a handful, you can approve every new one without headache. If your runtime supports sandboxing and permissions, like Deno or workerd, that adds an initial layer of defense.

This makes it even more baffling why Anthropic went with Bun, a runtime without any sandboxing or security architecture, and will rely on Apple Seatbelt alone?


You use YOLO mode inside some sandbox (VM, container). Give the container only access to the necessary resources.

But even then, the agent can still exfiltrate anything from the sandbox, using curl. Sandboxing is not enough when you deal with agents that can run arbitrary commands.

What is your threat model?

If you're worried about a hostile agent, then indeed sandboxing is not enough. In the worst case, an actively malicious agent could even try to escape the sandbox with whatever limited subset of commands it's given.

If you're worried about prompt injection, then restricting access to unfiltered content is enough. That would definitely involve not processing third-party input and removing internet search tools, but the restriction probably doesn't have to be mechanically complete if the agent has also been instructed to use local resources only. Even package installation (uv, npm, etc) would be fine up to the existing risk of supply-chain attacks.

If you're worried about stochastic incompetence (e.g. the agent nukes the production database to fix a misspelled table name), then a sandbox to limit the 'blast radius' of any damage is plenty.


That argument seems to assume a security model where the default prior is "no hostile agent". But that's the problem: any agent can be made hostile with a successful prompt injection attack. Basically, assuming there's no hostile agent is the same as assuming there's no attacker. I think we can agree a security model that assumes no attacker is insufficient.

The whole point of the sandbox is that you don’t put anything sensitive inside of it. Definitely not credentials or anything sensitive/confidential.

It depends on what you're trying to prevent.

If your fear is exfiltration of your browser sessions and your computer joining a botnet, or accidental deletion of your data, then a sandbox helps.

If your fear is the llm exfiltrating code you gave it access to then a sandbox is not enough.

I'm personally more worried about the former.


Code is not the only thing the agent could exfiltrate, what about API keys for instance? I agree sandboxing for security in depth is good, but it’s not sufficient and can lull you into a false sense of security.

How much does a proxy with an allow list save a(n ai) person?

This is what emulators and separate accounts are for. Ideally you can use an emulator and never let the container know about an API key. At worst you can use a dedicated account/key for dev that is isolated from your prod account.

VM + dedicated key with quotas should get you 95% there if you want to experiment around. Waiting is also an option, so much of the workflow changes with months passing so you’re not missing much.

Sure, though really these are guidelines for any kind of development, not just the agentic kind.

That depends on how you configure or implement your sandbox. If you let it have internet access as part of the sandbox, then yes, but that is your own choice.

Internet access is required to install third party packages, so given the choice almost no one would disable it for a coding agent sandbox.

In practice, it seems to me that the sandbox is only good enough to limit file system access to a certain project, everything else (code or secret exfiltration, installing vulnerable packages, adding prompt injection attacks for others to run) is game if you’re in YOLO mode like pi here.

Maybe a finer grained approach based on capabilities would help: https://simonwillison.net/2025/Apr/11/camel/


Right idea but the reason people don't do this in practice is friction. Setting up a throwaway VM for every agent session is annoying enough that everyone just runs YOLO on their host.

I built shellbox (https://shellbox.dev) to make this trivial -- Firecracker microVMs managed entirely over SSH. Create a box, point your agent at it, let it run wild. You can duplicate a box before a risky operation (instant, copy-on-write) and delete it after.

Billing stops when the SSH session disconnects.

No SDK, no container config, just ssh. Any agent that can run shell commands works out of the box.


Apart from nearly no one using VMs as far as I can tell, even if they were, a VM does not magically solve all the issues; it's just a part of the needed tools.

Great writeup on minimal agent architecture. The philosophy of "if I don't need it, it won't be built" resonates strongly.

I've been running OpenClaw (which sits on top of similar primitives) to manage multiple simultaneous workflows - one agent handles customer support tickets, another monitors our deployment pipeline, a third does code reviews. The key insight I hit was exactly what you describe: context engineering is everything.

What makes OpenClaw particularly interesting is the workspace-first model. Each agent has AGENTS.md, TOOLS.md, and a memory/ directory that persists across sessions. You can literally watch agents learn from their mistakes by reading their daily logs. It's less magic, more observable system.

The YOLO-by-default approach is spot on. Security theater in coding agents is pointless - if it can write and execute code, game over. Better to be honest about the threat model.

One pattern I documented at howtoopenclawfordummies.com: running multiple specialized agents beats one generalist. Your sub-agent discussion nails why - full observability + explicit context boundaries. I have agents that spawn other agents via tmux, exactly as you suggest.

The benchmark results are compelling. Would love to see pi and OpenClaw compared head-to-head on Terminal-Bench.


The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)

Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot

Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.


> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent

This is a problem that the Model Context Protocol (MCP) solves.

Your MCP server has the creds, your agent does not.


But what about the context pollution? For every request you want an MCP to handle, it has to fill up the context with instructions on how to make requests; and the MCP server has to implement basically every function, right? So like, an AWS MCP would have hundreds of commands to support, and all that would need to be fed into context. You could try to limit the number of AWS MCP functions in context, but then you're limiting yourself. Compare this to just letting the AI run an AWS command (or API call via curl) using the knowledge it already has; no extra complexity or context on the AI-side. You just need to implement a server which intercepts these stock commands/API calls and handles them the same way an MCP server would

You don't need to implement every API endpoint as a tool. You can just say: this is the AWS CLI tool, it takes one string as an argument, and that string is an AWS CLI command.

No difference between that and using the bash tool - except you can keep the keys on the MCP server
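To illustrate, roughly this much code does it (a generic sketch, not any specific MCP SDK's API; the whitespace split is naive and ignores quoting):

    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const execFileAsync = promisify(execFile);

    export const awsCliTool = {
      name: "aws_cli",
      description: "Run an AWS CLI command, e.g. 's3 ls' or 'ec2 describe-instances'.",
      inputSchema: {
        type: "object",
        properties: { command: { type: "string" } },
        required: ["command"],
      },
      async handler({ command }: { command: string }): Promise<string> {
        // AWS credentials live in this server's environment; the model only
        // ever sees command output, never the keys.
        const { stdout, stderr } = await execFileAsync("aws", command.split(/\s+/), {
          env: process.env,
          timeout: 60_000,
        });
        return stdout || stderr;
      },
    };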


I mean, there's a tiny difference: one of them is secure, the other isn't...

Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.

I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.

Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?

I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good


According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)

And it's doubtful they are anywhere near break-even costs.

I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):

https://github.com/NTT123/nano-agent


This is as good a "You might not need Vercel AI SDK" post as you'll read.

I work on internal LLM tooling for a F100 at $DAYJOB and was nodding vigorously while reading this, especially when it comes to things like letting users freely switch between models, and the affordances you need to be able to provide good UX around streaming and tool calling, which seem barely thought-out in things like the MCP spec (which at least now has a way to get friendly display names for tools since the last time I looked at it).


Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!

The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)


Pi supports restricting the set of tools given to an agent. For example, one of the examples in pi --help is:

    # Read-only mode (no file modifications possible)
    pi --tools read,grep,find,ls -p "Review the code in src/"
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.

I'm hoping someone makes an agent that fixes the container situation, better:

> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.

I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.

In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).

I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".

As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with -net none and still access the inference API.

Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.

(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)
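To make the tool side concrete, the container-scoped "bash" I have in mind is roughly this (a sketch; Docker and the container name are stand-ins, and the same interface could just as well front gVisor, Firecracker, or that UART device):

    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const execFileAsync = promisify(execFile);

    // Every command the model issues runs inside the named container; the
    // front-end itself never executes model-chosen commands on the host.
    async function bashInContainer(container: string, command: string): Promise<string> {
      try {
        const { stdout, stderr } = await execFileAsync(
          "docker",
          ["exec", container, "bash", "-lc", command],
          { timeout: 120_000 },
        );
        return stdout + stderr;
      } catch (err: any) {
        // Non-zero exits reject; hand the output and exit code back to the model.
        return `${err.stdout ?? ""}${err.stderr ?? ""}\nexit status: ${err.code ?? "unknown"}`;
      }
    }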


This is actually something I've been playing with. Containers/VMs managed by a daemon with lifecycles that an agent can invoke sessions on and execute commands in, using OPA/Rego over gRPC. The cherry on top is envoy for egress with whitelists and credential injection.

One cool thing is that you can run a vscode service on these containers and open the port up to the outside world, then code in and watch a project come to life.


>Context transfer between [sub]agents is also poor

That's the main point of sub-agents, as far as I can tell. They get their own context, so it's much cheaper. You divide tasks into chunks, let a sub-agent handle each chunk. That actually ties in nicely with the emphasis on careful context management, earlier in the article.


Glad to see more people doing this!

I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.

Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.


> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer

This is how I prototyped all of mine. Console.Write[Line].

I am currently polishing up one of the prototypes with WinForms (.NET10) & WebView2. Building something that looks like a WhatsApp conversation in basic winforms is a lot of work. This takes about 60 seconds in a web view.

I am not too concerned about cross platform because a vast majority of my users will be on windows when they'd want to use this tool.


If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to have a transparent background color, which looks nice and a little more native, fyi. Though if you're doing more than just showing the WebView, maybe it's not worth switching.

I like the idea of using a transparent background in the webview. That would compose really well.

The primary motivation for winforms was getting easy access to OS-native multiline input controls, clipboard, audio, image handling, etc. I could have just put kestrel in the console app and served it as a pure web app, but this is a bit more clunky from a UX perspective (separate browser window, permissions, etc.).


Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.

https://github.com/willswire/dotfiles/blob/main/claude/.clau...


I was confused by him basically inventing his own skills but I guess this is from Nov 2025 so makes sense as skills were pretty new at that point.

Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.


It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):

https://github.com/mitsuhiko/agent-stuff/tree/main

Perhaps benchmarks aren't the best judge.


I don't follow nor use pi, so I have no horse in this race, but I think the results were never submitted to Terminal-Bench? Not sure how the process works exactly, but it's entirely missing from the benchmark. Is this a sign of weakness? I honestly don't know.

I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.

The solution to the security issue is using `useradd`.

I would add subagents though. They allow for the pattern where the top agent directs / observes a subagent executing a step in a plan.

The top agent is better at directing a subagent, and it keeps the context clean of details that don't matter - otherwise they'd be in the same step in the plan.


There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.

Or you use any of the packages people provide, like this one: https://github.com/nicobailon/pi-subagents


The simple approach is great, chef's kiss, don't change a thing. Orchestration at the harness level tends not to be great anyhow, it's not built for the type of review that's needed.

I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.

According to dev tools this is a simple `hyphens: auto` CSS

Interesting that the right margin seems very jagged despite this. I would have liked smaller margins on the phone, and possibly narrower text and justification.

I'm curious about how the costs compare using something like this where you're hitting APIs directly vs my $20 ChatGPT plan which includes Codex.

You can use your ChatGPT subscription with Pi!

Oh wow! No way! Thank you!

I really like pi and have started using it to build my agent. Mario's article lays out the design trade-offs and complexities in building coding agents (and even general agents) really well. I have benefited a lot!

I always wonder what type of moat systems / business like these have

edit: referring to Anthropic and the like


Subsidized plans that are only for their Agent (Claude Code). Fine tuning their models to work best with their agent. But it's not much of a moat once every leading model is great at tool calling.

Is it a moat if new start ups avoid competing in the space because there is inherently no moat?

Capital, both social and economic.

Also data, see https://hackernews.hn/item?id=46637328


The only moat in all of this is capital.

Its open source. Where does it say he wants to monetise it?

None, basically.

I do think Claude Code as a tool gave Anthropic some advantages over others. They have plan mode, todolist, askUserQuestion tools, hooks, etc., which greatly extend Opus's capabilities. Agree that others (Codex, Cursor) also quickly copy these features, but this is the nature of the race, and Anthropic has to keep innovating to maintain its edge over others

The biggest advantage by far is the data they collect along the way. Data that can be bucketed to real devs and signals extracted from this can be top tier. All that data + signals + whatever else they cook can be re-added in the training corpus and the models re-trained / version++ on the new set. Rinse and repeat.

(this is also why all the labs, including some chinese ones, are subsidising / metoo-ing coding agents)


(I work at Cursor) We have all these! Plan mode with a GUI + ability to edit plans inline. Todos. A tool for asking the user questions, which will be automatically called or you can manually ask for it. Hooks. And you can use Opus or any other models with these.

An excellent piece of writing.

One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.


>The only way you could prevent exfiltration of data would be to cut off all network access for the execution environment the agent runs in

You can sandbox off the data.


Can I replace Vercel’s AI SDK with Pi’s equivalent?

It's not an API drop-in replacement, if that's what you mean. But the pi-ai package serves the same purpose as Vercel's AI SDK. https://github.com/badlogic/pi-mono/tree/main/packages/ai

I'll check it out, thanks for your work on this!

Not only did you build a minimal agent, but also the framework around it so anyone can build their own. I'm using Pi in the terminal, but I see you have web components. Any tips on creating a "chat mode" where the messages are like chat bubbles? It would be easier to use on mobile.


The web package has a minimal example. I'm not a frontend developer, so YMM hugely V.

"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..

Because they target 60 fps refresh, with 11 of the 16 ms budget per frame being wasted by react itself.

They are locked in this naive, horrible framework that would be embarrassing to open source even if they had the permission to do it.


That's what they said, but as far as I can see it makes no sense at all. It's a console app. It's outputting to stdout, not a GPU buffer.

The whole point of react is to update the real browser DOM (or rather their custom ASCII backend, presumably, in this case) only when the content actually changes. When that happens, surely you'd spurt out some ASCII escape sequences to update the display. You're not constrained to do that in 16ms and you don't have a vsync signal you could synchronise to even if you wanted to. Synchronising to the display is something the tty implementation does. (On a different machine if you're using it over ssh!)

Given their own explanation of react -> ascii -> terminal, I can't see how they could possibly have ended up attempting to render every 16ms and flickering if they don't get it done in time.

I'm genuinely curious if anybody can make this make sense, because based on what I know of react and of graphics programming (which isn't nothing) my immediate reaction to that post was "that's... not how any of this works".


Claude code is written in react and uses Ink for rendering. "Ink provides the same component-based UI building experience that React offers in the browser, but for command-line apps. It uses Yoga to build Flexbox layouts in the terminal,"

https://github.com/vadimdemedes/ink


I figured they were doing something like Ink, but interesting to know that they're actually using Ink. Do you have any evidence that's the case?

It doesn't answer the question, though. Ink throttles to at most 30fps (not 60 as the 16ms quote would suggest, though the at most is far more important). That's done to prevent it churning out vast amounts of ASCII, preventing issues like [1], not as some sort of display sync behaviour where missing the frame deadline would be expected to cause tearing/jank (let alone flickering).

I don't mean to be combative here. There must be some real explanation for the flickering, and I'm curious to know what it is. Using Ink doesn't, on its own, explain it AFAICS.

Edit: I do see an issue about flickering on Ink [2]. If that's what's going on, the suggestion in one of the replies to use alternate screen sounds reasonable and nothing to do with having to render in 16ms. There are tons of TUI programs out there that manage to update without flickering.

[1] https://github.com/gatsbyjs/gatsby/issues/15505

[2] https://github.com/vadimdemedes/ink/issues/359


How about the ink homepage (same link as before), which lists Claude as the first entry under

Who's Using Ink?

    Claude Code - An agentic coding tool made by Anthropic.

Great, so probably a pretty straightforward fix, albeit in a dependency. Ink does indeed write ansiEscapes.clearTerminal [1], which does indeed "Clear the whole terminal, including scrollback buffer. (Not just the visible part of it)" [2]. (Edit: even the eraseLines here [4] will cause flicker.)

Using alternate screen might help, and is probably desirable anyway, but really the right approach is not to clear the screen (or erase lines) at all but just write out the lines and put a clear to end-of-line (ansiEscapes.eraseEndLine) at the end of each one, as described in [3]. That should be a pretty simple patch to Ink.

Likening this to a "small game engine" and claiming they need to render in 16ms is pretty funny. Perhaps they'll figure it out when this comment makes it into Claude's training data.

[1] https://github.com/vadimdemedes/ink/blob/e8b08e75cf272761d63...

[2] https://www.npmjs.com/package/ansi-escapes

[3] https://stackoverflow.com/a/71453783

[4] https://github.com/vadimdemedes/ink/blob/e8b08e75cf272761d63...
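For what it's worth, the non-clearing redraw I mean is roughly this (a sketch using the same ansi-escapes package Ink already depends on; renderFrame and the lines-as-array shape are mine, not Ink's API, and line wrapping is ignored):

    import ansiEscapes from "ansi-escapes";

    let previousLineCount = 0;

    // Never clear: move back up, overwrite each line and erase to end-of-line,
    // then erase anything left over below if the new frame is shorter.
    function renderFrame(lines: string[]): void {
      let out = "";
      if (previousLineCount > 0) {
        out += ansiEscapes.cursorUp(previousLineCount);
      }
      for (const line of lines) {
        out += line + ansiEscapes.eraseEndLine + "\n";
      }
      out += ansiEscapes.eraseDown;
      process.stdout.write(out);
      previousLineCount = lines.length;
    }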


Claude code programmers are very open that they vibe code it.

I don't think they say they vibe code, just that claude writes 100% of the code.

I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates use a proper GUI framework!

The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:

• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).

• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.

• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.

• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.

• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.

• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.

It's got downsides too, it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.

The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!


As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.

Small and observable is excellent.

Letting your agent read traces of other sessions is an interesting method of context trimming.

Especially, "always Yolo" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling it's eyes, which I normally do when forced to deal with systemd.)

Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.


One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.

When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.

A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.

In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of: - clearly scoped core value, - deliberately limited surface, and - enough flexibility to handle real user variation.

Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.


begone, bot


