Which models were you using for this? If you used the quality default in the interface, ~4x the cost makes sense, since that would be 3 frontier models judged by one of them.
The idea would be to use fusion with simpler, cheaper models.
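A rough sketch of that cost arithmetic, assuming fusion means several candidate models each answer and one of them then judges the candidates. The per-call costs are made-up placeholders in arbitrary units, not real pricing, and the judge call is assumed to cost about the same as one candidate call (in practice it also has to read every candidate answer, so it may cost more).

```python
def fusion_cost(candidate_costs, judge_cost):
    """Total cost of one fused answer: every candidate call plus one judge call."""
    return sum(candidate_costs) + judge_cost


# Hypothetical per-request costs in arbitrary units (assumptions, not real prices).
FRONTIER = 1.0
CHEAP = 0.1

single_frontier = FRONTIER
quality_default = fusion_cost([FRONTIER] * 3, judge_cost=FRONTIER)
cheap_fusion = fusion_cost([CHEAP] * 3, judge_cost=CHEAP)

# Cost relative to a single frontier call.
quality_ratio = quality_default / single_frontier
cheap_ratio = cheap_fusion / single_frontier

print(f"quality default: {quality_ratio:.1f}x, cheap fusion: {cheap_ratio:.1f}x")
```

Under these assumptions the quality default comes out at 4x a single frontier call, while the same setup built from models a tenth of the price lands well below one frontier call.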
I’m a little lost, though this seems like it could be fun. On mobile Safari, whatever I build keeps losing some of its connections as I tap on other things, so it’s hard to get far with it.
> software development, as a “decide-execute-deliver sandwich”. AI compresses the “execute” layer — the middle of the sandwich — but the other two layers resist automation in a way that will not be overcome by capability improvements alone.
I really struggle to see why improved capabilities cannot deal with those other layers. I do not believe you have substantiated the claim that this stays impossible as capabilities improve.
> At one end of the pipeline, development teams need to decide what to build.
Developers are largely not the ones who do this. This role sits far more on the side of "Product Owner". Sometimes your job covers both, but this is not the majority of the work and mostly does not require SE knowledge, just some input usually.
> This layer is hard to automate because it requires thinking about user needs, market signals, organizational priorities, and in some cases regulatory constraints.
Hmm, these are language models that can talk through much of this already - but more importantly none of what is mentioned there requires software engineering. For parts that do (I'm sure someone would come to correct me if I said that there was none or seemed to suggest it is never ever ever relevant) this is a much smaller slice.
> As AI capabilities improve, the kinds of decisions that can be delegated to AI increase over time. But this does not make the “decide” layer thinner — once a decision can be delegated to AI, it is no longer a source of competitive advantage, and the value of human decision-making migrates upward. Software increases in complexity over time, so there is no ceiling to this process.
Now this is rather hidden, but it is a huge leap in logic. The decide layer does get thinner for any given project, and then you simply assert that software will get more complex and so this cancels it all out.
A team of 5 may end up being able to ship what a team of 50 used to, and maybe now there are 10 teams outputting more - but is there not a clear limit to this? At some point do we not just need 45 fewer people? That there needs to be some engineers is not the same as needing anywhere near as many as we have.
For a time I think we will see increased output meaning more software, but that tails off as the models get better.
> At the other end of the sandwich, human teams need to be accountable for what they deliver.
Why? And if we assume so, why does that need a software engineer?
> It is possible that some day in the future teams will ship mission-critical code without fully testing and understanding it,
You don't need to read code to test it, and people choose to ship products without fully understanding the code all the time. Literally any decision maker who is not a software engineer who knows the entire codebase does this. Companies fully ship systems that are far too complex for any single developer to even understand.
And much of software isn't mission critical. Or at least, if you want to say it is then the mission is low stakes.
> today’s AI is so unreliable that such haphazard practices would represent an existential threat to software teams and their customers.
I'd argue for a bunch of stuff this isn't true, and the whole point of the article is "never even if they get better" which is different.
> A central insight of AI as Normal Technology is that we can collectively choose to keep humans accountable through shared norms, law, and policy.
Sure, we can ban AI writing code, but will we? Is there a huge collective concern for all of us highly paid engineers being replaced by AI?
Yep, this is what I do. I use a high-quality VLM to generate labelled boxes (in my case, around tardigrades in a microscope image), do some light editing to fix the small number of errors, and then train YOLO26 with it. Works great, saved me tens of hours of labelling. It's a bit scary that there is a VLM that works as well as my fine-tuned model (although much slower).
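A minimal sketch of the glue step between the VLM and training, assuming the VLM returns pixel-coordinate boxes as `(x_min, y_min, x_max, y_max)` with a text label. That output shape, the `vlm_boxes` list, and the `class_ids` mapping are hypothetical; the target is the usual YOLO detection label line, `class x_center y_center width height`, normalised to the image size.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w   # normalised box centre, x
    cy = (y_min + y_max) / 2 / img_h   # normalised box centre, y
    w = (x_max - x_min) / img_w        # normalised box width
    h = (y_max - y_min) / img_h        # normalised box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical VLM output for one 1000x800 image (an assumption, not a real API response).
vlm_boxes = [{"label": "tardigrade", "box": (100, 200, 300, 400)}]
class_ids = {"tardigrade": 0}

lines = [
    to_yolo_line(class_ids[b["label"]], b["box"], img_w=1000, img_h=800)
    for b in vlm_boxes
]
print("\n".join(lines))
```

Each image gets one `.txt` file of such lines, which is also a convenient place to do the light manual editing before training.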
That's a fantastic strategy, thank you, and thanks to all the other helpful posters here as well. Do you have any tips for how to choose the base YOLO model? Or will any generic one do?
I'm not going to say it's a perfect prediction, but the trajectory from "can write something reasonable" to "oh can write snippets of code" and on towards larger and larger systems feels like it has played out as predicted. The common thing I see more now is people talking about "taste" as what the humans are contributing, more than the raw coding part.
I get what you mean about this rather automated research. I've done it on a smaller scale with performance work, because it can run, test, measure, propose changes, debug, and loop. I can throw a vague idea at it, guide it or discuss with it, and go and make a coffee.
Diaspora is mind-blowing, yet highly improbable and speculative, even though carefully threaded to sound plausible on all levels. The whole introdus idea and simulation from VM perspective sounds incredible, but I don't think the body runs a simulation of anything. It is something else.
51% does not mean it randomly gets things wrong half the time.
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
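A toy illustration of that point, with entirely made-up numbers (not real benchmark data): an aggregate score of 51% can hide one task type that almost always works and another that almost always fails, and only delegating the predictable type gives a much higher effective success rate.

```python
# Made-up results as (task_type, succeeded) pairs. Not real benchmark data.
results = (
    [("refactor", True)] * 48 + [("refactor", False)] * 2
    + [("novel_algorithm", True)] * 3 + [("novel_algorithm", False)] * 47
)


def success_rate(rows):
    """Fraction of rows whose task succeeded."""
    return sum(ok for _, ok in rows) / len(rows)


overall = success_rate(results)

# Group results by task type and compute a success rate per type.
by_type = {}
for task_type, ok in results:
    by_type.setdefault(task_type, []).append((task_type, ok))
rates = {t: success_rate(rows) for t, rows in by_type.items()}

# Only delegate task types that clear a reliability threshold (hypothetical value).
THRESHOLD = 0.9
delegated = [row for row in results if rates[row[0]] >= THRESHOLD]
delegated_rate = success_rate(delegated)

print(f"overall {overall:.2f}, per type {rates}, delegated only {delegated_rate:.2f}")
```

The hard part in practice is the prediction step itself: here the task types are given, whereas with real work you have to learn which kinds of task fall on which side.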
Why would you refuse to use a patch that deals with a valid PoC exploit?
If a random contributor posted an explanation of an exploit, showed it worked in an executable way, presented a patch and you could see that the exploit no longer worked - would you refuse to use the fix until the contributor showed how they figured it out?
Given where Mythos is alleged to go, reproducibility is necessary, far beyond a hash promise, an alleged (but not really proven) existence of a PoC, and “Trust me bro”.
When an ungated (or even abliterated) public model can repeatedly, easily, and accurately embarrass Anthropic’s models, that might change.