> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.
The guarantees offered by our hardware suppliers today, is not "never happens" but "accounted for in software".
So, if you ignore it, and start to operate at any scale, you will start to see random irreproducible faults.
Sure, you can close all tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software has less glitches than the competitor.
1 in 40,000 customer devices experiencing a failure annually is considerable better than 4 9s of reliability. So we are debating whether going from 4 9s to 5 9s is worth it.
And like, sure, if the rest of your stack is sufficiently polished (and your scale is sufficiently large) that the once-a-year bit flip event becomes a meaningful problem... then by all means do something about it.
But I maintain that the vast majority of software developers will never actually reach that point, and there are a lot of lower-hanging fruit on the reliability tree
Which means if you're about medium business or above, one of your customers will see this about once a year.
That classifies more as "inevitable" than "rare" in my book.