Ideally, if you can convince yourself something cannot happen, you can also conv...

dmurray · 2025-12-31T20:40:16 1767213616

Not every piece of logic lends itself to being expressed in the type system.

Let's say you're implementing a sorting algorithm. After step X you can be certain that the values at locations A, B, and C are sorted such that A <= B <= C. You can be certain of that because you read the algorithm in a prestigious journal, or better, you read it in Knuth and you know someone else would have caught the bug if it was there. You're a diligent reader and you've convinced yourself of its correctness, working through it with pencil and paper. Still, even Knuth has bugs and perhaps you made a mistake in your implementation. It's nice to add an assertion that at the very least reminds readers of the invariant.

Perhaps some Haskeller will pipe up and tell me that any type system worth using can comfortably describe this PartiallySortedList<A, B, C>. But most people have to use systems where encoding that in the type system would, at best, make the code significantly less expressive.

josephg · 2025-12-31T20:16:22 1767212182

Yes, this has been my experience too! Another tool in the toolbox is property / fuzz testing. Especially for data structures, and anything that looks like a state machine. My typical setup is this:

1. Make a list of invariants. (Eg if Foo is set, bar + zot must be less than 10)

2. Make a check() function which validates all the invariants you can think of. It’s ok if this function is slow.

3. Make a function which takes in a random seed. It initializes your object and then, in a loop, calls random mutation functions (using a seeded RNG) and then calls check(). 100 iterations is usually a good number.

4. Call this in an outer loop, trying lots of seeds.

5. If anything fails, print out the failing seed number and crash. This provides a reproducible test so you can go in and figure out what went wrong.

If I had a penny for every bug I’ve found doing this, I’d be a rich man. It’s a wildly effective technique.

rmunn · 2026-01-01T08:38:50 1767256730

This is indeed a great technique. The only way it could be improved is to expand on step 3 by keeping a list of the random mutation functions called and the order in which they were called, then if the test passes you throw that list away and generate a new list with the next seed. But if the test fails then you go through the following procedure to "shrink" the list of mutations down to a minimal (or nearly minimal) repro:

1. Drop the first item in the list of mutations and re-run the test. 2. If the test still fails and the list of mutations is not empty, goto step 1. 3. If the test passes when you dropped the first item in the mutation list, then that was a key part of the minimal repro. Add it to a list of "required for repro" items, then repeat this whole process with the second (and subsequent) items on the list.

In other words, go through that list of random mutations and, one at a time, check whether that particular mutation is part of the scenario that makes the test fail. This is not guaranteed to reach the smallest possible minimal repro, but it's very likely to reach a smallish repro. Then in addition to printing the failing seed number (which can be used to reproduce the failure by going through that shrinking process again), you can print the final, shrunk list of mutations needed to cause the failure.

Printing the list of mutations is useful because then it's pretty simple (most of the time) to turn that into a non-RNG test case. Which is useful to keep around as a regression test, to make sure that the bug you're about to fix stays fixed in the future.

TruePath · 2026-01-01T23:06:41 1767308801

There is no inherent benefit in going and expressing that fact in a type. There are two potential concerns:

1) You think this state is impossible but you've made a mistake. In this case you want to make the problem as simple to reason about as possible. Sometimes types can help but other times it adds complexity when you need to force it to fit with the type system.

People get too enamored with the fact that immutable objects or certain kinds of types are easier to reason about other things being equal and miss the fact that the same logic can be expressed in any Turing complete language so these tools only result in a net reduction in complexity if they are a good conceptual match to the problem domain.

2) You are genuinely worried about the compiler or CPU not honoring it's theoretical guarantees -- in this case rewriting it only helps if you trust the code compiling those cases more for some reason.

chriswarbo · 2026-01-08T17:07:54 1767892074

I think those concerns are straw men. The real concern is that the invariants we rely on should hold when the codebase changes in the future. Having the compiler check that automatically, quickly, and definitively every time is very useful.

This is what TFA is talking about with statements like "the compiler can track all code paths, now and forever."

wakawaka28 · 2025-12-31T22:55:25 1767221725

Sometimes the "error" is more like, "this is a case that logically could happen but I'm not going to handle it, nor refactor the whole program to stop it from being expressable"

cozzyd · 2025-12-31T23:46:07 1767224767

Until you have a bit flip or a silicon error. Or someone changed the floating point rounding mode.

mark-r · 2026-01-02T14:03:15 1767362595

Funny you should mention the floating point rounding mode, I actually had to fix a bug like that once. Our program worked fine, until you printed to an HP printer - then it crashed shortly after. It took forever to discover the cause - the printer driver was changing the floating point rounding mode and not restoring it. The fix was to set the mode to a known value each and every time after you printed something.

cozzyd · 2026-01-02T23:08:27 1767395307

That is amazingly devious. Well done HP.

swiftcoder · 2026-01-01T08:16:00 1767255360

> Until you have a bit flip

These are vanishingly unlikely if you mostly target consumer/server hardware. People who code for environments like satellites, or nuclear facilities, have to worry about it, sure, but it's not a realistic issue for the rest of us

shakna · 2026-01-01T08:38:38 1767256718

Bitflips are waaay more common than you think they are. [0]

> A 2011 Black Hat paper detailed an analysis where eight legitimate domains were targeted with thirty one bitsquat domains. Over the course of about seven months, 52,317 requests were made to the bitsquat domains.

[0] https://en.wikipedia.org/wiki/Bitsquatting

swiftcoder · 2026-01-01T09:57:03 1767261423

> Bitflips are waaay more common than you think they are... Over the course of about seven months, 52,317 requests...

Your data does not show them to be common - less than 1 in 100,000 computing devices seeing an issue during a 7 month test qualifies as "rare" in my book (and in fact the vast majority of those events seem to come from a small number of server failures).

And we know from Google's datacenter research[0] that bit flips are highly correlated hard failures (i.e. they tend to result from a faulty DRAM module, and so affect a small number of machines repeatedly).

It's hard to pin down numbers for soft failures, but it seems to be somewhere in the realm of 100 events/gigabyte/year - and that's before any of the many ECC mechanisms do their thing. In practical sense, no consumer software worries about bit flips in RAM (whereas bit flips in storage are much more likely, hence checksumming DB rows, etc).

[0]: https://static.googleusercontent.com/media/research.google.c...

shakna · 2026-01-01T14:02:07 1767276127

1 in 100,000 devices is about 1 in about 40,000 customers due to how many devices most people own.

Which means if you're about medium business or above, one of your customers will see this about once a year.

That classifies more as "inevitable" than "rare" in my book.

swiftcoder · 2026-01-01T15:13:15 1767280395

> That classifies more as "inevitable" than "rare" in my book.

But also pretty much insignificant. Is any other component in your product achieving 5 9s reliability?

shakna · 2026-01-01T15:58:27 1767283107

We're not talking 5 9s, here.

> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.

The guarantees offered by our hardware suppliers today, is not "never happens" but "accounted for in software".

So, if you ignore it, and start to operate at any scale, you will start to see random irreproducible faults.

Sure, you can close all tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software has less glitches than the competitor.

swiftcoder · 2026-01-01T17:07:37 1767287257

> We're not talking 5 9s, here.

1 in 40,000 customer devices experiencing a failure annually is considerable better than 4 9s of reliability. So we are debating whether going from 4 9s to 5 9s is worth it.

And like, sure, if the rest of your stack is sufficiently polished (and your scale is sufficiently large) that the once-a-year bit flip event becomes a meaningful problem... then by all means do something about it.

But I maintain that the vast majority of software developers will never actually reach that point, and there are a lot of lower-hanging fruit on the reliability tree

graemep · 2026-01-01T14:23:17 1767277397

Link to the paper: https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabu...

Yoric · 2026-01-01T15:29:50 1767281390

Of course, any attempt at safety or security requires defense in depth.

But usually, any effort spent on making one layer sturdy is worth it.