
comboy · 2026-04-06T20:00:47 1775505647

It's not a well-written book. It's an interesting book (more like a story).

FrustratedMonky · 2026-04-06T20:02:21 1775505741

Oh no, a book that tells a story.

comboy · 2026-04-04T18:07:56 1775326076

Hey, you seem to have a similar view on this. I know ideas are cheap but hear me out:

You talk with agent A, and it only modifies the spec. You still chat and can say "make it prettier", but that agent only touches the spec. The spec could also separate "explicit" from "inferred".

And of course agent B which builds only sees the spec.

The user can actually care about the diffs generated by agent A again, because nobody wants to verify diffs on agent-generated code that is full of repetition and created by search and replace. I believe that if somebody implements this right, it will be the way things are done.

And of course with better models the spec can be used to actually meaningfully improve the product.

Long story short, what the industry currently misses and what you seem to understand is that intent is sacred. It should always be stored, preferably verbatim and always with relevant context ("yes exactly" is obviously not enough). The current generation of LLMs can already handle all that. It would mean something like 2-3x the cost, but it seems well worth it (and the cost in the long run could likely go below 1x given typical workflows and repetitions).
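To make the idea concrete, here is a minimal sketch of what such a spec could look like as data. Every name in it (`Origin`, `Intent`, `SpecEntry`, `Spec`, `render_for_builder`) is hypothetical and only illustrates the explicit/inferred split and the "store intent verbatim, with context" rule; it is not how any existing tool does it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    EXPLICIT = "explicit"  # the user actually said it
    INFERRED = "inferred"  # agent A filled it in


@dataclass
class Intent:
    verbatim: str  # the exact user message
    context: str   # what the message was a reply to


@dataclass
class SpecEntry:
    text: str
    origin: Origin
    intent: Optional[Intent] = None  # inferred entries may have none


@dataclass
class Spec:
    entries: List[SpecEntry] = field(default_factory=list)

    def add(self, text: str, origin: Origin, intent: Optional[Intent] = None) -> None:
        # Assumption of this sketch: explicit entries must keep the user's words.
        if origin is Origin.EXPLICIT and intent is None:
            raise ValueError("explicit entries must carry the user's words")
        self.entries.append(SpecEntry(text, origin, intent))

    def render_for_builder(self) -> str:
        # Agent B only ever sees the spec text, never the chat.
        return "\n".join(f"- {e.text}" for e in self.entries)


spec = Spec()
spec.add("Mobile client for HN", Origin.EXPLICIT,
         Intent("I want HN client for mobile", "initial request"))
spec.add("Follows the system theme by default", Origin.INFERRED)
```

A diff of `spec.entries` between two chat turns is then the thing the user reviews, and the `origin` field tells them which lines they never actually asked for.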

beshrkayali · 2026-04-04T19:52:32 1775332352

Right, the spec/build separation is exactly the idea and Ossature is already built that way on the build side.

I agree a dedicated layer for intent capture makes a lot of sense. I thought about that as well, I am just not fully convinced it has to be conversational (or free-form conversational). Writing a prompt to get the right spec change is still a skill in itself, and it feels like it'd just be shifting the problem upstream rather than actually solving it. A structured editing experience over specs feels like it'd be more tractable to me. But the explicit vs inferred distinction you mention is interesting and worth thinking through more.

comboy · 2026-04-04T20:01:45 1775332905

A spec manually crafted by the user is ideal.

It's just that we're lazy. After being able to chat, I don't see people going back. You can't just paste some error into the specs, and you can't paste an image and say "make it look more like this". Plus, however well designed the spec is, something like "actually make it always wait for the user feedback" can trigger changes in many places (even just for the sake of removing contradictions).

ithkuil · 2026-04-04T22:15:05 1775340905

The spec can be wrong for many reasons:

1. You can write a spec that builds something that is not what you actually wanted

2. You can write a spec that is incoherent with itself or with the external world

3. You can write a spec that doesn't have sufficient mechanical sympathy with the tooling you have, so it requires you to spec out more and more of the surrounding tech than you practically can.

All of those issues can be addressed by iterating on the spec with the help of agents. It's just an engineering practice, one that we have to become better at understanding.

beshrkayali · 2026-04-05T14:01:44 1775397704

All three of these are real. The audit pass in Ossature is meant to catch the first two before generation starts, it reads across all specs and flags underspecified behavior, missing details, and contradictions. You resolve those, update the specs, and re-audit until the plan is clean. It's not perfect but it shifts a lot of the discovery earlier in the process.

The third point is harder. You still need to know your tooling well enough to write a spec that works with it. That part hasn't gone away.

whattheheckheck · 2026-04-05T03:32:54 1775359974

And what is a spec other than a program in a programming language? How do you prove the code artifact matches the spec or state machine?

comboy · 2026-04-05T08:51:47 1775379107

A program defines the exact computer instructions. Most of the time you don't care about that level of detail. You just have some intent and some constraints.

Say "I want HN client for mobile", "must notify me about comments", you see it and you add "should support dark mode". Can you see how that is much less than anything in any programming language?

visarga · 2026-04-05T05:26:12 1775366772

My own approach also has intent sitting at the top: intent justifies plan justifies code justifies tests. And the other way around, tests satisfy code, satisfy plan, satisfy intent. These threads bottom up and top down are validated by judge agents.

I also make individual tasks into md files (task.md), which lets them carry intent and plan. They are not just checkbox-driven "- [ ]" gates: they get annotated with outcomes and become a workbook after execution. The same task.md is seen twice by judge agents which run without extra context, the plan judge and the implementation judge.

I ran tests to see which component of my harness contributes the most, and it came out that it is the judges. Apparently Claude Code can solve a task with or without a task file just as well, but the existence of this task file makes plans and work more auditable, and not just for bugs, but for intent follow.

Coming back to user intent, I have a post user message hook that writes user messages to a project-scoped chat_log.md file, which means all user messages are preserved (user text << agent text, so it is efficient). When we start a new task, the chat log is checked to see if intent was properly captured. I also use it to recover context across sessions and remember what we did last.
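A rough sketch of what such a hook could do, not the plugin's actual code: it takes the hook payload as a dict and appends the user's text under a timestamp heading. The `"prompt"` key and the heading layout are assumptions made for this example, and `append_user_message` is a hypothetical name.

```python
import io
from datetime import datetime, timezone
from typing import TextIO


def append_user_message(payload: dict, out: TextIO) -> bool:
    """Append one user message to the log; return False if there was nothing to log."""
    # Assumption: the user's text arrives under the "prompt" key.
    text = (payload.get("prompt") or "").strip()
    if not text:
        return False
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    out.write(f"## {stamp}\n\n{text}\n\n")
    return True


# Demo against an in-memory buffer; a real hook would open chat_log.md in append mode.
buf = io.StringIO()
logged = append_user_message({"prompt": "make it prettier"}, buf)
skipped = append_user_message({"prompt": "   "}, buf)
```

In an actual hook script the payload would typically come from `json.load(sys.stdin)` and `out` would be `open("chat_log.md", "a", encoding="utf-8")`. Only user text is written, which is what keeps the log small relative to the full session.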

Once every 10-20 tasks I run a retrospective task that inspects all task.md files since last retro and judges how the harness performs and project goes. This can detect things not apparent in task level work, for example when using multiple tasks to implement a more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself and how we use it can be refined.

    claude plugin marketplace add horiacristescu/claude-playbook-plugin

    source at https://github.com/horiacristescu/claude-playbook-plugin/tree/main

beshrkayali · 2026-04-05T10:51:58 1775386318

The hierarchy you describe (intent -> plan -> code -> tests) maps well to how Ossature works. The difference is that your approach builds scaffolding around Claude Code to recover structure that chat naturally loses, whereas Ossature takes chat out of the generation pipeline entirely. Specs are the source of truth before anything is generated, so there's no drift to compensate for, the audit and build plan handle that upfront.

The judge finding is interesting though. Right now verification during build for each task in Ossature is command-based, compile, tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.

visarga · 2026-04-05T14:39:50 1775399990

Yes, judges should not just look for bugs, they should also validate intent follow, but that can only happen when intent was preserved. I chose to save the user messages as a compromise, they are probably 10 or 100x smaller than full session. I think tasks themselves are one step lower than pure user intent. Anyway, if you didn't log user messages you can still recover them from session files if they have not been removed.

One interesting data point: I compared the word count of my chat messages with the final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective, I found I have created 10x more projects since I started relying on Claude and my harness than before. So it looks like user intent is 10x more effective than manual coding now.

4b11b4 · 2026-04-05T00:01:37 1775347297

close

miki123211 · 2026-04-05T06:43:17 1775371397

See also: https://juxt.github.io/allium/ (not affiliated in any way, just an interesting project)

I'm using something similar-ish that I built for myself (much smaller, less interesting, not yet published and with prettier syntax). Something like:

    a->b # b must always be true if a is true
    a<->b # works both ways
    a=>b # when a happens, b must happen
    a->fail, a=> fail # a can never be true / can never happen
    a # a is always true

So you can write:

    Product.alcoholic? Product in Order.lineItems -> Order.customer.can_buy_alcohol?
    u1 = User(), u2=User(), u1 in u2.friends -> u2 in u1.friends
    new Source() => new Subscription(user=Source.owner, source=Source)
    Source.subscriptions.count>0 # delete otherwise

This is a much more compact way to write desired system properties than writing them out in English (or Allium), and it helps you reason better about what you actually want.
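For readers who think in code, the `a->b` form above can be read as a plain boolean implication checked over the current state. Below is a minimal Python sketch of the friendship rule under that reading. The `User` class and helper names are hypothetical, and the temporal `=>` form ("when a happens, b must happen") is not covered because it needs an event log rather than a state snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Set


def implies(a: bool, b: bool) -> bool:
    # a -> b: b must be true whenever a is true
    return (not a) or b


@dataclass
class User:
    name: str
    friends: Set[str] = field(default_factory=set)  # names of friends


def friendship_is_symmetric(users: List[User]) -> bool:
    # u1 in u2.friends -> u2 in u1.friends, checked for every pair
    return all(
        implies(u1.name in u2.friends, u2.name in u1.friends)
        for u1 in users
        for u2 in users
    )


ann = User("ann", {"bob"})
bob = User("bob", {"ann"})
carl = User("carl", {"ann"})  # carl lists ann, but ann does not list carl
```

Here `friendship_is_symmetric([ann, bob])` holds, while adding `carl` breaks the invariant, which is the kind of violation the one-line rule is meant to surface.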

beshrkayali · 2026-04-05T14:55:23 1775400923

Allium looks interesting, making behavioral intent explicit in a structured format rather than prose is very close to what I'm trying to do with Ossature actually.

Ossature uses two markdown formats, SMD[1] for describing behavior and AMD for structure (components, file paths, data models). AMDs[2] link back to their parent SMD so behavior and structure stay connected. Both are meant to be written, reviewed, and/or owned by humans, the LLM only reads the relevant parts during generation. One thing I am thinking about for the future is making the template structure for this customizable per project, because "spec" means different things to different teams/projects. Right now the format is fixed, but I am thinking about a schema-based way to declare which sections are required, their order, and basic content constraints, so teams can adapt the spec structure to how they think about software without having to learn a grammar language to do it (though maybe peg-based underneath anyway, not sure).
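As a rough illustration of the schema idea (not Ossature's actual format or API), a per-project schema could be as small as an ordered list of section names with a required flag, checked against the `##` headings of a spec file. The section names and the `check_sections` helper below are made up for the example.

```python
import re
from typing import List, Tuple

# Hypothetical schema: (heading, required), in the order sections must appear.
SCHEMA: List[Tuple[str, bool]] = [
    ("Overview", True),
    ("Behavior", True),
    ("Open Questions", False),
]


def check_sections(markdown: str, schema: List[Tuple[str, bool]]) -> List[str]:
    """Return a list of problems; an empty list means the document fits the schema."""
    found = re.findall(r"^## +(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    known = [name for name, _ in schema]
    problems = [
        f"missing required section: {name}"
        for name, required in schema
        if required and name not in found
    ]
    problems += [f"unknown section: {h}" for h in found if h not in known]
    # Sections that are present must follow the schema order.
    present = [h for h in found if h in known]
    if present != sorted(present, key=known.index):
        problems.append("sections out of order")
    return problems


good = "## Overview\nWhat it is.\n\n## Behavior\nWhat it does.\n"
bad = "## Behavior\nWhat it does.\n\n## Notes\nMisc.\n"
```

`check_sections(good, SCHEMA)` returns an empty list, while `bad` is reported as missing `Overview` and containing an unknown `Notes` section. Content constraints per section would need more than this, which is presumably where a grammar underneath starts to pay off.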

The formal approach you describe is probably more precise for expressing system properties. Would be interesting to see how practical it is to maintain it as a project grows.

1: https://docs.ossature.dev/specs/smd.html

2: https://docs.ossature.dev/specs/amd.html

4b11b4 · 2026-04-05T00:01:11 1775347271

yep but spec isn't the root

comboy · 2026-04-04T17:45:34 1775324734

GPUs can do graphics too?

aobdev · 2026-04-04T21:54:25 1775339665

I can’t tell if you’re making a joke about the current state of AI and GPUs or refuting the purpose of this driver

comboy · 2026-04-01T12:23:49 1775046229

It's hard to tell how much it says about the difficulty of harnessing vs. how much it says about the difficulty of maintaining a clean, unbloated codebase when coding with AI.

amangsingh · 2026-04-01T12:32:41 1775046761

Why not both? AI writes bloated spaghetti by default. The control plane needs to be human-written and rigid -> at least until the state machine is solid enough to dogfood itself. Then you can safely let the AI enhance the harness from within the sandbox.

whiplash451 · 2026-04-01T13:45:14 1775051114

Were human organizations (not individuals) any good at the latter anyway?

comboy · 2026-04-01T09:05:11 1775034311

I mean, tools change, but I'd be happy to hear if any tool can create that from just "create 'Claude Code Unpack' with nice graphics" or some other single prompt. It was likely an iterative process, and it would be lovely if more people started sharing that, because the process itself is also very interesting.

I've created a Chinese character learning website and it took me typing 1/3 of LoTR to get there[1]. I would have typed something like 1% of that writing code directly. It is a different process, but it still needs some direction.

1. https://hanzirama.com/making-of

comboy · 2026-03-31T10:03:14 1774951394

As things stand today, even when doing research tasks, the time spent by the model is >> the time spent fetching websites. I don't see that changing any time soon, except when some deals happen behind the scenes where agents get access to CF-guarded resources that are normally blocked from automated access.

comboy · 2026-03-28T13:43:19 1774705399

Add CI to check that new laws don't contradict any existing ones.

bertil · 2026-03-28T13:56:59 1774706219

You might need to turn laws into formal proofs, and the existence of judges makes me think that’s not as likely as you would like. A commenting system might work, though, if trained on countries’ precedents, jurisprudence and traditions.

whattheheckheck · 2026-03-28T14:00:02 1774706402

Can you imagine rebases with merge conflicts?

bentcorner · 2026-03-28T16:08:11 1774714091

This could in theory already happen without any tech, but I suspect since the government is pretty monolithic, any changes in a specific law are all being done by the same set of people.

You might not have merge conflicts but I imagine you could end up with conflicting guidance from two separate pieces of law (e.g., law A says you must wear green on St. Patrick's day, law B outlaws green pajamas).

comboy · 2026-03-27T16:35:25 1774629325

*that we know of

criddell · 2026-03-27T17:03:52 1774631032

Which is exactly what they said:

> “We are not aware of any successful mercenary spyware attacks against a Lockdown Mode-enabled Apple device,” Apple spokesperson Sarah O’Rourke told TechCrunch on Friday.

ectospheno · 2026-03-27T16:43:03 1774629783

Which is infinitely better than the cases we know about without the feature enabled.

comboy · 2026-03-27T14:40:08 1774622408

Haha, here's some random AI generated content:

    At least 225 judges have ruled in more than 700 cases that the administration's mandatory immigration detention policy likely violates the right to due process[1] The Fifth Amendment's Due Process Clause generally requires those having federal funds cut off to receive notice and an opportunity for a hearing, which was not provided in many of DOGE's spending freezes[2]

(there's more but what's the point)

1. https://www.justsecurity.org/107087/tracker-litigation-legal...

2. https://www.cbpp.org/research/federal-budget/many-trump-admi...

comboy · 2026-03-27T08:23:49 1774599829

Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.

XCSme · 2026-03-27T08:27:46 1774600066

Oh, I didn't think about this, that's a good idea. I also feel generally model performance changes over time (usually it gets worse).

The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.

comboy · 2026-03-27T09:54:20 1774605260

Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.

I was thinking that tokens spent in such a case could also be an interesting measure, but some agents may do a small useful refactoring along the way, which would inflate it. Although the prompt could specify making only the minimal change required to achieve the goal.