This very closely resembles my own experience, right down to the timeline.
I don't have an answer, but just seeing this thread has been cathartic for me.
Some of the options I'm considering (all speculative):
- It's okay to be a "hired gun" and switch companies every few years just to ensure you stay interested. Some people's minds are stimulated by novelty and learning; that's not a bad thing! In fact, some of the engineers I most respect work as consultants, not traditional employees.
- Try working at a more "stodgy" company. Your average Fortune 500 employs more developers than most unicorns and is probably a decade behind the curve in terms of technology-- maybe you can go to one of those, take it easy, and be a hero.
- If it's an option financially, "hire yourself" for a few months to go do a passion project-- hobbyist app? major OSS improvement? creative endeavor?-- and see how it feels.
Ideally, if you can convince yourself something cannot happen, you can also convince the compiler, and get rid of the branch entirely by expressing the predicate as part of the type (or a function on the type, etc.)
Language support for that varies. Rust is great, but not perfect. Typescript is surprisingly good in many cases. Enums and algebraic type systems are your friend. It'll never be 100% but it sure helps fill a lot of holes in the swiss cheese.
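As a toy illustration (hypothetical names, not from any particular codebase) of what "expressing the predicate as part of the type" buys you in Rust:

    // Instead of a struct with an optional error *and* an optional value --
    // where "both set" and "neither set" are representable but nonsensical --
    // an enum makes exactly one of the two states exist at a time.
    enum FetchResult {
        Ok(String), // response body
        Err(u16),   // status code
    }

    fn describe(r: &FetchResult) -> String {
        // The compiler requires both arms and forbids a third, so the
        // "impossible" defensive branch never needs to be written.
        match r {
            FetchResult::Ok(body) => format!("got {} bytes", body.len()),
            FetchResult::Err(code) => format!("failed with status {}", code),
        }
    }

TypeScript's discriminated unions get you much of the same effect.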
Because there's no such thing as a purely internal error in a well-constructed program. Every "logic error" has to bottom out in data from outside the code eventually-- otherwise it could be refactored to be static. Client input is wrong? Error the request! Config doesn't parse? Better specify defaults! Network call fails? Yeah, you should have a plan for that.
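For the config case, the "plan" can be as small as a fallback to known defaults -- a rough sketch in Rust (the Config type and its one field are made up for illustration):

    #[derive(Debug)]
    struct Config { port: u16 }

    impl Default for Config {
        // Known-good defaults to fall back on when parsing fails.
        fn default() -> Self { Config { port: 8080 } }
    }

    fn load_config(raw: &str) -> Config {
        // Treat bad input as an expected case and recover, rather than
        // asserting "this can never fail".
        raw.trim()
            .parse::<u16>()
            .map(|port| Config { port })
            .unwrap_or_else(|_| Config::default())
    }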
Not every piece of logic lends itself to being expressed in the type system.
Let's say you're implementing a sorting algorithm. After step X you can be certain that the values at locations A, B, and C are sorted such that A <= B <= C. You can be certain of that because you read the algorithm in a prestigious journal, or better, you read it in Knuth and you know someone else would have caught the bug if it was there. You're a diligent reader and you've convinced yourself of its correctness, working through it with pencil and paper. Still, even Knuth has bugs and perhaps you made a mistake in your implementation. It's nice to add an assertion that at the very least reminds readers of the invariant.
Perhaps some Haskeller will pipe up and tell me that any type system worth using can comfortably describe this PartiallySortedList<A, B, C>. But most people have to use systems where encoding that in the type system would, at best, make the code significantly less expressive.
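In that more common situation, the assertion version is cheap -- a sketch (the slice and indices here are placeholders for whatever "step X" actually establishes):

    // After "step X" we believe positions a, b, c of v are already ordered.
    // debug_assert! documents the invariant for readers and checks it in
    // debug/test builds, while compiling to nothing in release builds.
    fn after_step_x(v: &[i32], a: usize, b: usize, c: usize) {
        debug_assert!(
            v[a] <= v[b] && v[b] <= v[c],
            "partial-sort invariant violated: {} {} {}",
            v[a], v[b], v[c]
        );
        // ...the rest of the algorithm continues, relying on the invariant...
    }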
Yes, this has been my experience too! Another tool in the toolbox is property / fuzz testing. Especially for data structures, and anything that looks like a state machine. My typical setup is this:
1. Make a list of invariants. (Eg if Foo is set, bar + zot must be less than 10)
2. Make a check() function which validates all the invariants you can think of. It’s ok if this function is slow.
3. Make a function which takes in a random seed. It initializes your object and then, in a loop, calls random mutation functions (using a seeded RNG) and then calls check(). 100 iterations is usually a good number.
4. Call this in an outer loop, trying lots of seeds.
5. If anything fails, print out the failing seed number and crash. This provides a reproducible test so you can go in and figure out what went wrong.
If I had a penny for every bug I’ve found doing this, I’d be a rich man. It’s a wildly effective technique.
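A minimal, dependency-free sketch of that setup in Rust (the Counter type, its 0..=10 invariant, and the xorshift RNG are all stand-ins, not anything from a real library):

    struct Counter { value: i32 }

    impl Counter {
        fn new() -> Self { Counter { value: 0 } }
        // Step 3's "random mutation functions":
        fn increment(&mut self) { if self.value < 10 { self.value += 1; } }
        fn decrement(&mut self) { if self.value > 0 { self.value -= 1; } }
        fn reset(&mut self) { self.value = 0; }
        // Step 2: one slow-but-thorough function validating every invariant.
        fn check(&self) -> bool { (0..=10).contains(&self.value) }
    }

    // Tiny seeded RNG so each run is reproducible from its seed.
    fn xorshift(state: &mut u64) -> u64 {
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        *state
    }

    // Step 3: one seeded run of 100 random mutations, checking after each.
    fn run_one_seed(seed: u64) -> bool {
        let mut rng = seed | 1; // xorshift state must be nonzero
        let mut c = Counter::new();
        for _ in 0..100 {
            match xorshift(&mut rng) % 3 {
                0 => c.increment(),
                1 => c.decrement(),
                _ => c.reset(),
            }
            if !c.check() { return false; }
        }
        true
    }

    fn main() {
        // Steps 4-5: try lots of seeds; on failure, print the seed and crash.
        for seed in 0..10_000u64 {
            if !run_one_seed(seed) {
                panic!("invariant violated with seed {}", seed);
            }
        }
    }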
This is indeed a great technique. The only way it could be improved is to expand on step 3 by keeping a list of the random mutation functions called and the order in which they were called, then if the test passes you throw that list away and generate a new list with the next seed. But if the test fails then you go through the following procedure to "shrink" the list of mutations down to a minimal (or nearly minimal) repro:
1. Drop the first item in the list of mutations and re-run the test.
2. If the test still fails and the list of mutations is not empty, goto step 1.
3. If the test passes when you dropped the first item in the mutation list, then that was a key part of the minimal repro. Add it to a list of "required for repro" items, then repeat this whole process with the second (and subsequent) items on the list.
In other words, go through that list of random mutations and, one at a time, check whether that particular mutation is part of the scenario that makes the test fail. This is not guaranteed to reach the smallest possible minimal repro, but it's very likely to reach a smallish repro. Then in addition to printing the failing seed number (which can be used to reproduce the failure by going through that shrinking process again), you can print the final, shrunk list of mutations needed to cause the failure.
Printing the list of mutations is useful because then it's pretty simple (most of the time) to turn that into a non-RNG test case. Which is useful to keep around as a regression test, to make sure that the bug you're about to fix stays fixed in the future.
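If it helps, here's a rough Rust sketch of that shrinking pass, assuming the mutations were recorded as a list and there's some way to replay them (the mutation type and the `still_fails` predicate are hypothetical; in practice `still_fails` would replay the given mutations against a fresh object and run check()):

    // Returns a smallish "required for repro" subset of the recorded mutations.
    fn shrink<M: Clone>(mutations: &[M], still_fails: impl Fn(&[M]) -> bool) -> Vec<M> {
        let mut required: Vec<M> = Vec::new();
        let mut remaining: Vec<M> = mutations.to_vec();

        while !remaining.is_empty() {
            // Tentatively drop the first remaining mutation...
            let candidate = remaining.remove(0);
            let mut attempt = required.clone();
            attempt.extend_from_slice(&remaining);

            // ...and keep it only if the test stops failing without it.
            if !still_fails(&attempt[..]) {
                required.push(candidate);
            }
        }
        required
    }

Each pass re-runs the test once per recorded mutation, so it costs O(n) replays for n mutations -- cheap compared to the time you'd otherwise spend staring at a 100-step repro.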
There is no inherent benefit in going and expressing that fact in a type. There are two potential concerns:
1) You think this state is impossible but you've made a mistake. In this case you want to make the problem as simple to reason about as possible. Sometimes types can help, but other times they add complexity when you have to force the problem to fit the type system.
People get too enamored with the fact that immutable objects or certain kinds of types are easier to reason about, other things being equal, and miss the fact that the same logic can be expressed in any Turing-complete language. These tools only result in a net reduction in complexity if they are a good conceptual match for the problem domain.
2) You are genuinely worried about the compiler or CPU not honoring its theoretical guarantees -- in this case rewriting it only helps if you trust the code compiling those cases more for some reason.
I think those concerns are straw men. The real concern is that the invariants we rely on should hold when the codebase changes in the future. Having the compiler check that automatically, quickly, and definitively every time is very useful.
This is what TFA is talking about with statements like "the compiler can track all code paths, now and forever."
Sometimes the "error" is more like, "this is a case that logically could happen but I'm not going to handle it, nor refactor the whole program to stop it from being expressable"
Funny you should mention the floating point rounding mode -- I actually had to fix a bug like that once. Our program worked fine until you printed to an HP printer; then it crashed shortly after. It took forever to discover the cause: the printer driver was changing the floating point rounding mode and not restoring it. The fix was to set the mode to a known value each and every time after you printed something.
These are vanishingly unlikely if you mostly target consumer/server hardware. People who code for environments like satellites or nuclear facilities have to worry about it, sure, but it's not a realistic issue for the rest of us.
Bitflips are waaay more common than you think they are. [0]
> A 2011 Black Hat paper detailed an analysis where eight legitimate domains were targeted with thirty one bitsquat domains. Over the course of about seven months, 52,317 requests were made to the bitsquat domains.
> Bitflips are waaay more common than you think they are... Over the course of about seven months, 52,317 requests...
Your data does not show them to be common - less than 1 in 100,000 computing devices seeing an issue during a 7 month test qualifies as "rare" in my book (and in fact the vast majority of those events seem to come from a small number of server failures).
And we know from Google's datacenter research[0] that bit flips are highly correlated with hard failures (i.e. they tend to result from a faulty DRAM module, and so affect a small number of machines repeatedly).
It's hard to pin down numbers for soft failures, but it seems to be somewhere in the realm of 100 events/gigabyte/year - and that's before any of the many ECC mechanisms do their thing. In practice, no consumer software worries about bit flips in RAM (whereas bit flips in storage are much more likely, hence checksumming DB rows, etc).
> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.
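(Working through the quoted numbers: 120 FIT is 120 failures per 10^9 device-hours per megabit, and 4 GiB is 32,768 megabits, so that's about 3.9 million FIT for the whole module -- roughly 0.004 errors per hour, or about 3 per month, which matches the quote.)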
The guarantees offered by our hardware suppliers today are not "never happens" but "accounted for in software".
So if you ignore it and start to operate at any scale, you will start to see random, irreproducible faults.
Sure, you can close all the tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software will have fewer glitches than the competitor's.
1 in 40,000 customer devices experiencing a failure annually is considerably better than 4 9s of reliability (1/40,000 = 0.0025%, i.e. about 99.9975% of devices unaffected each year). So we are debating whether going from 4 9s to 5 9s is worth it.
And like, sure, if the rest of your stack is sufficiently polished (and your scale is sufficiently large) that the once-a-year bit flip event becomes a meaningful problem... then by all means do something about it.
But I maintain that the vast majority of software developers will never actually reach that point, and there is a lot of lower-hanging fruit on the reliability tree.
https://en.wikipedia.org/wiki/Prototaxites has more context. A tree-sized, tree-shaped living thing that wasn't a plant, which probably fed and reproduced like a fungus but per this latest research wasn't a fungus either. It was by far the largest known organism on land up to that point, in a time when land animals barely existed. Unsettling!
Rust can go that fast because of guarantees its compiler enforces about what the code is doing, guarantees that JS emphatically doesn't make.
By all means build your tooling and runtime in Rust if it helps, but "you can write high-performance code in JS with no Rust-like constraints" is fundamentally a nonsense pitch.
I disagree. Right now, feedback on correctness is a major practical limitation on the usefulness of AI coding agents. They can fix compile errors on their own, they can _sometimes_ fix test errors on their own, but fixing functionality / architecture errors takes human intervention. Formal verification basically turns (a subset of) functionality errors into compile errors, making the feedback loop much stronger. "Come up with what software we need better than us in the first place" is much higher on the ladder than that.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
I completely agree it's appealing, I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant. In other words, I also want agents to be able to consistently prove software correct, but if they're actually able to accomplish that, then they could just as likely be able to do everything else in the production of that software (including gathering requirements and writing the spec) without me in the loop.
>I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant.
But that is much simpler to understand: eventually, finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable, succinct proof is much cheaper than having to search for the proof anew. Proofs will become the inevitable lingua franca (as they are among capable humans) for computers as well. Basic economics will result in the adoption of formal verification.
Whenever humans have found an original proof, their notes contain a lot of deductions that were ultimately not used: they were searching for a proof, using intuition gained from reading and finding proofs of other theorems. It's just that an LLM is similarly gaining intuition, and at some point becomes better than humans at finding proofs. It is currently already much better than the average human at finding proofs. The question is how long it takes until it gets better than any human being at finding proofs.
The future you see, where the whole proving exercise (whether by humans or by LLMs) becomes redundant because the model immediately emits the right code, is nonsensical: the frontier of what LLMs are capable of will move gradually, so for each generation of LLMs there will always be problems for which it cannot instantly generate provably correct software (while omitting the, according to you, unnecessary proof). That doesn't mean they can't find the proofs, just that they would have to search by reasoning, with no guarantee of ever finding a proof.
That search heuristic is Las Vegas, not Monte Carlo.
Companies will compare the levelized operating costs of different LLMs to decide which LLMs to use in the future on hard proving tasks.
Satellite data centers will consume ever more resources in a combined space/logic race for cryptographic breakthroughs.
> eventually, finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable, succinct proof is much cheaper than having to search for the proof anew. Proofs will become the inevitable lingua franca (as they are among capable humans) for computers as well. Basic economics will result in the adoption of formal verification.
But AI isn't used to verify the proof. It's used to find it. And if it can find it - one of the hardest things in software development - there's no reason to believe it can't do anything else associated with software development. If AI agents find formal verification helpful, they would probably opt to use it, but there would also be no need for a human in the loop at all.
> It is currently already much better than the average human at finding proofs.
The average human also can't write hello world. LLMs are currently significantly worse at finding proofs than the average formal-verification person (I say this as someone who's done formal verification for many years), though, just as they're significantly worse at writing code than the average programmer. I'm not saying they won't become better, it's just strange to expect that they'll become better than the average formal-verification person and at the same time they won't be better than the average product manager.
> there will always be problems for which it cannot instantly generate provably correct software
Nobody said anything about "instantly". If the AI finds formal verification helpful, it will choose to use it, but if it can find proofs better than humans, why expect that it won't be able to do easier tasks better than humans?
> I also want agents to be able to consistently prove software correct...
I know this is just an imprecision-of-language thing, but they aren't 'proving' the software is correct; they're writing the proofs instead of the C++ (or whatever).
I had a bit of a discussion with one of them about this a while ago, to determine the viability of having one generate the proofs and use those to generate the actual code -- just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite, but that was (maybe, I don't remember) in the context of proving existing software was correct.
I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...
It's getting better every day, though, at "closing the loop."
When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.
Aside from just being immensely satisfying, these patterns of defensive programming may be a big part of how we get to quality GAI-written code at scale. Clippy (and the Rust compiler proper) can provide so much of the concrete iterative feedback an agent needs to stay on track and not gloss over mistakes.