Here's the craziest one that actually happened to me.
The company I worked for had installed what's best described as a mini-supercomputer (though we avoided the term) at a site in Boulder. We started getting reports of failures on the internal communication links between the compute nodes ... only at high load, late in the day. Since I was responsible for the software that managed those links, I got sent out. Two days in a row, after trying everything we could to reproduce or debug the problem, I got paged minutes after I'd left (and couldn't get back in) to be told that it had failed again.
Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well-known problem for installations in that area, having caused multi-month delays at some of the larger supercomputer sites nearby. But we'd already corrected for that. It wasn't the problem.
What it ultimately turned out to be was airflow and cooling. The air's thinner up there, so it carries less heat. But it wasn't the processors or links that were getting too hot. It was the power supply. When a power supply gets warmer, it gets less efficient. Earlier in the day, or during the shorter runs we did while trying different things, this wasn't enough to cause a problem. Later in the day, when it was warmer, continuous load over longer periods was enough to cause slight brown-outs, and those were making our links flaky. And of course it would always restart just fine, because by then it had cooled down a bit.
The fix ended up being one line in a fan-controller config.
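For a sense of what a one-line fix like that can look like (not our actual fan controller; the hwmon paths and numbers here are invented purely for illustration), here's the moral equivalent on a commodity Linux box using lm-sensors' fancontrol: raise the floor on fan duty so the supply never gets a chance to heat-soak during a long afternoon run.

    # Illustrative /etc/fancontrol fragment; hwmon numbering is machine-specific
    INTERVAL=10
    FCTEMPS=hwmon2/pwm1=hwmon2/temp1_input
    FCFANS=hwmon2/pwm1=hwmon2/fan1_input
    MINTEMP=hwmon2/pwm1=20
    MAXTEMP=hwmon2/pwm1=55
    # The "one line": keep a higher minimum fan duty (0-255) so the PSU
    # stays cool through sustained load.
    MINPWM=hwmon2/pwm1=120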
I had a loaner machine (an RS/6000 minicomputer) that would have unrecoverable ECC errors when the cover was on. The tech would come and try to diagnose it, but with the cover off, everything would work fine. He'd swap the memory anyway and put the cover back on. Within a few hours the memory bank would be failing again. Turned out the machine had been a loaner in a lab where it had acquired some alpha-emitting goo on the inside of the side panel. The lab had just run it with the side panel off to solve the problem, never noticing the goo and never mentioning it to IBM when they packed it up to ship.
It's a long story, but the gist is that after multiple board swaps we realized we'd isolated the panel as the fault. I noticed the goo and, on a hunch, checked it with a scintillator, deducing it was alpha when cardboard blocked it. Turns out the ultra-precious-metal IBM heat sink on the board had an open path that effectively channeled the alpha particles into one of those multi-chip carrier thingies, which featured exposed chips.
As for why I had a scintillator lounging in my desk at a portfolio management company, don't ask. Let's just note the iconic IT anti-hero of that era was the Bastard Operator From Hell, and leave it at that.
Unrelated to a strange bug story or anything, but you just reminded me of when I was helping someone set up a mini-supercomputer, as you called it. It was for quantum simulations. While we were setting it up, the researcher who was going to use it named the root user skynet. I know that joke has probably been played out at campuses around the world, but it just seems unnecessary to tempt the fates like that.
> Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well-known problem for installations in that area, having caused multi-month delays at some of the larger supercomputer sites nearby. But we'd already corrected for that.
Wow, I sense a more interesting story in here. Care to reveal how it was first found out and how common it actually is?
In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more of a thing at higher altitude because there's less atmosphere overhead to absorb them. It's rarely a problem at sea level. At higher altitude you really need to use ECC memory, and do some sort of scrubbing (handled in Linux by the EDAC subsystem, Error Detection And Correction) to correct single-bit errors before they accumulate and some word somewhere becomes uncorrectable.
The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough, and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were bringing down nodes constantly. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).
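For anyone curious what "checking the EDAC settings" actually involves, here's a minimal sketch that reads the counters Linux exposes under sysfs. The exact layout depends on the kernel and the memory-controller driver, and not every controller exposes sdram_scrub_rate, so treat it as a starting point rather than gospel. A steadily climbing ce_count is the early warning; a nonzero ue_count is what takes nodes down.

    #!/usr/bin/env python3
    # Rough check of Linux EDAC counters and scrub rate.
    # Sysfs layout varies by kernel and memory-controller driver;
    # sdram_scrub_rate may be absent on some controllers.
    import glob
    import os

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "n/a"

    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc[0-9]*")):
        ce = read(os.path.join(mc, "ce_count"))            # corrected single-bit errors
        ue = read(os.path.join(mc, "ue_count"))            # uncorrectable errors
        scrub = read(os.path.join(mc, "sdram_scrub_rate")) # scrub bandwidth (bytes/s), if supported
        print(f"{os.path.basename(mc)}: ce={ce} ue={ue} scrub_rate={scrub}")

If edac-util (from the edac-utils package) happens to be installed, it will report much the same counters without the homegrown script.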
> It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).
Yeah, part of the nightmare of cosmic-ray bitflips (or any random bitflips, I suppose) is precisely that they don't look like anything. A server randomly locks up. A packet has a bad checksum (and is silently resent). A process gets into an unexpected state. That buggy batch job fails 1% more frequently than it used to. Nothing ever points to memory errors, except that there is no pattern.