One pod is an instance of a repo; you can set the number of instances of each agent/task that can run on a pod at a time. For >1, each agent should be using its own worktree.
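For anyone unfamiliar with git worktrees: each concurrent agent gets its own checkout of the same clone, so edits can't collide while the object store stays shared. A quick throwaway demo (paths and branch names are made up):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
# one worktree + branch per concurrent agent
git worktree add -q -b agent-1-task ../agent-1
git worktree add -q -b agent-2-task ../agent-2
git worktree list   # three checkouts, one shared object store
```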
Maybe - I do think as the models get better they'll be able to handle more and more difficult tasks. And yet, even if they can only solve the simplest issues now, why not let them, so you can focus on the more important things?
Yup. MCP can be configured at the repo level. At task execution time, enabled MCP servers are written as a .mcp.json file into the agent's worktree. Enabled skills are written as .claude/commands/{name}.md files in the worktree, making them available as slash commands to the agent.
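In case it helps, this is the rough shape of a .mcp.json that would get dropped into the worktree — the server name, package, and env var below are just illustrative, not what optio actually writes:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```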
Generally I've found agents are capable of self-correcting as long as they can bash up against a guardrail and see the errors. So in optio the agent is resumed and told to fix any CI failures or address review feedback.
Recently I used it to finish up my re-implementation of curl/libcurl in rust (https://hackernews.hn/item?id=47490735). At first I tried to have a single claude code session run in an iterative loop, but eventually I found it was way too slow.
I started tasking subagents with each remaining chunk of work, and then found I was really just repeating a normal sprint tasking cycle, but where subagents completed the tasks with the unit tests as exit criteria. So optio came to mind: I asked an agent to run the test suite, see what was failing, and make tickets for each group of remaining failures. Then I used optio to manage instances of agents working on and closing out each ticket.
Oh good question, I haven't thought deeply about this.
Right now nothing special happens, so claude/codex can access their normal tools and make web calls. I suppose that also means they could figure out they're running in a k8s pod and do service discovery and start calling things.
What kind of features would you be interested in seeing around this? Maybe a toggle to disable internet connections or other connections outside of the container?
Network policies controlling egress would be one thing. I haven't seen how you make secrets available to the agent, but I would imagine you'd need to proxy calls through a MITM proxy that replaces placeholder tokens with real secrets, or find some other way to make sure the agent can't access the secrets themselves. Specifically for an agent that works with code, I could imagine docker-in-docker being requested at some point, which means you'll need gVisor or something.
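For the egress piece, a default-deny NetworkPolicy scoped to the agent pods would be a reasonable starting point. A minimal sketch (the pod label is made up, and this only permits DNS — anything else would need an explicit allow rule):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-deny-egress
spec:
  podSelector:
    matchLabels:
      app: optio-agent   # hypothetical label on the agent pods
  policyTypes:
    - Egress
  egress:
    # allow DNS lookups only; all other egress is dropped
    - ports:
        - protocol: UDP
          port: 53
```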
You've really got to be careful with absolute language like this in reference to LLMs. A review agent provides no guarantees whatsoever; it just shifts the distribution of acceptable responses, hopefully in a direction the user prefers.
Fair, it's something like semantic enforcement rather than a hard guarantee. I think current AI agents are good enough that if you tell one, "Review this PR and request changes anytime a user uses a variable name that is a color", it will do a pretty good job. But for complex things I can still see them falling short.
I mean, having unit tests and not allowing PRs in unless they all pass is pretty easy (or requiring human review to remove a test!).
A software engineer takes a spec which "shifts the distribution of acceptable responses" for their output. If they're 100% accurate (snort), how good does an LLM have to be for you to accept its review as reasonable?
We've seen public examples where LLMs literally disable or remove tests in order to pass. I'm not sure it matters much that having tests and telling LLMs not to merge until they pass is "easy", when the failure modes here are so plentiful and broad in nature.
You'd want to have the tests run as a github action and then fail the check if the tests don't pass. Optio will resume agents when the actions fail and tell them to fix the failures.
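As a sketch, that check can be as small as the workflow below — the workflow name and test command are assumptions (cargo test, since the project discussed is Rust), and any failing step fails the check that optio watches:

```yaml
# .github/workflows/ci.yml (hypothetical)
name: ci
on:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --all   # swap in your project's test command
```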
So... add another presubmit test that fails when a test is removed. Require human reviews.
It's not like a human being always pushes correct code. My risk assessment for an LLM reading a small bug report and just making a PR is that thinking too hard about it is a waste of time, and my risk assessment for a human is very similar, because actually catching issues during code review is best done by tests anyway. If the tests can't tell you whether your code is good, then it really doesn't matter if it's a human or an LLM: you're mostly guessing whether things are going to work, and you WILL push bad code that gets caught in prod.
Yes, in testing I did add four fuzzing targets to the repo:
1. fuzz_xml_parse: throws arbitrary bytes at the XML parser in both strict and recovery mode
2. fuzz_html_parse: throws arbitrary bytes at the HTML parser
3. fuzz_xpath: throws arbitrary XPath expressions at the evaluator
4. fuzz_roundtrip: parse → serialize → re-parse, checking that the pipeline never panics
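To illustrate the shape of the fourth target: the snippet below checks the same roundtrip invariant as a plain function, with a toy whitespace-token "parser" standing in for the real XML parser (whose API isn't shown here) so it can run standalone. Under cargo-fuzz, the same check would sit inside a `fuzz_target!` closure fed arbitrary bytes.

```rust
// Toy stand-ins for the real parse/serialize pair (assumptions, not the
// actual project API): "parsing" splits on whitespace, "serializing"
// joins with single spaces.
fn parse(input: &str) -> Vec<String> {
    input.split_whitespace().map(|s| s.to_string()).collect()
}

fn serialize(doc: &[String]) -> String {
    doc.join(" ")
}

/// The invariant fuzz_roundtrip checks: parse -> serialize -> re-parse
/// must not panic, and the second parse must equal the first.
fn roundtrip_holds(data: &str) -> bool {
    let first = parse(data);
    let reserialized = serialize(&first);
    let second = parse(&reserialized);
    first == second
}

fn main() {
    // A fuzzer would feed arbitrary inputs here; fixed samples stand in.
    for case in ["<a> <b/> </a>", "  lots   of   gaps  ", ""] {
        assert!(roundtrip_holds(case));
    }
    println!("roundtrip invariant held on all samples");
}
```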
Because this project uses memory-safe rust, there isn't really the need to find the memory bugs that made up the majority of libxml2's CVEs.
There is a valid point about logic bugs or infinite loops, which I suppose could be present in any software package, and which I'm not sure how to totally rule out here.