CPU reliability – Linus Torvalds (2007)

zxcdw · on Dec 9, 2013

I don't work in environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone seen a failed CPU piece which has failed during normal operation? I am under an impression that it is very rare for a CPU itself to fail so that it would need to be replaced.

The only times I've even heard about failing CPUs has been if they've been overclocked or insufficiently cooled(add in overvolting, and you get both :)) or physical damage during mounting/unmounting or otherwise handling hardware. And even then the failure has usually been elsewhere than the CPU itself.

Of course I am not saying it'd be unheard of, but for me frankly, right now it is.

ChuckMcM · on Dec 9, 2013

" has anyone seen a failed CPU piece which has failed during normal operation?"

Several. But I'm lucky in that I worked for NetApp for 5 years which have several million NetApp filers in the field that were all calling home when they had issues, and Google which has a very large number of CPUs all around the planet doing their bidding. With visibility into a population like that you see faults that are once in a billion happen about once a month :-).

Two general kinds of failures though, the more common one is a system machine check (the internal logic detected the fault condition and put the CPU into the machine check state) which happens when 3 or more bits go sideways in the various RAM inside the mesh of execution units. Nominally ECC protected it can detect but not correct multi-bit errors. Power it off, power it on, and restart from a known condition and it's good as new.

The more rare occurrence is that something in the CPU fails which results in the CPU not coming out of RESET even, or immediately going into a machine check state. When you find those Intel often wants you to send it back to them so they can do failure analysis on it. The most common root cause analysis for those is some moderate electrostatic damage which took a while to finally finish the process of failing.

Some of the more interesting papers at the ISSCC are sometimes on lifetime expectancy of small geometry transistors. They are a lot more susceptible to damage and disruption due to cosmic rays and other environmental agents.

dale-cooper · on Dec 9, 2013

Interesting. How would these failures manifest themselves for a user?

ithkuil · on Dec 9, 2013

In the best case your app crashes or misbehaves randomly not unlike bad RAM (yes it can happen even with ECC). In the worst case there is a subtle numeric error that could potentially percolate to users. Usually it's about floating point issues, since usually you don't use floats to access arrays or do pointer arithm, issues in that unit may not cause an early failure and hence go undetected.

ChuckMcM · on Dec 9, 2013

Simple case is that your machine just halts. If you have a way of looking at them, it might puts a bios code on the LEDs. On servers the baseboard management chip (BMC) will usually record a machine check event as well.

forrestthewoods · on Dec 9, 2013

I think you're only considering catastrophic unable to boot failure. A CPU could be going bad with the effects less obvious.

Microsoft released research that shows enthusiast overclocking has a clear increase in hardware faults.

[1] http://www.extremetech.com/gaming/131739-microsoft-analyzes-... [2] http://research.microsoft.com/pubs/144888/eurosys84-nighting...

alelefant · on Dec 9, 2013

While that research is certainly interesting, the poster above you acknowledged that overclocking could result in failures.

tonyarkles · on Dec 9, 2013

I had a Celeron that had its cache go bad once. It would work "fine" in Windows, but Linux would report that the CPU was throwing some kind of exception. If you went into the BIOS and disabled the cache it would be stable, but with the cache on it would crash after a day or two. Swapped out the CPU and the machine lived a great life for a long time.

zymhan · on Dec 9, 2013

> but Linux would report that the CPU was throwing some kind of exception

Does this mean it didn't boot into Linux?

tonyarkles · on Dec 9, 2013

It booted but dmesg was very noisy!

It would actually run for a day or two, serving up a modest PHP load, before crapping out.

dntrkv · on Dec 9, 2013

What a crazy coincidence. I was actually just helping a friend troubleshoot his dead PC and I told him to test his ram, video card, motherboard, powersupply, in that order, and not to bother with the CPU since I have never had one fail on me. It ended up being the CPU that went out. I go on Hacker News a couple hours later and see this post. heh.

ac29 · on Dec 9, 2013

https://en.wikipedia.org/wiki/Baader-Meinhof_phenomenon

tfinniga · on Dec 9, 2013

Man, I just learned about the Baader-Meinhof phenomenon the other day, and now it seems to be popping up everywhere.

josephagoss · on Dec 9, 2013

Is this the same effect that when told to count yellow cars whilst driving, you end up noticing far more yellow cars than you thought were on road.

Or how you notice how much more frequent your model of car is on the roads?

olalonde · on Dec 9, 2013

Also when you learn a new word in a different language, it seems like suddenly everyone is using that word all the time.

option_greek · on Dec 9, 2013

Are there any explanations for this phenomenon. I keep experiencing it all the time :)

ehsanu1 · on Dec 9, 2013

Me too, but I had always figured it as survivorship bias. That is, there are many such things that I newly hear of, but most of them don't get repeated mentions right way. The few of them that do catch my attention, and it seems like it happens a lot. But it actually happens only for a few of the many, many new things I hear about all the time. Similar to my friend who thought that every time he looked at a dead street lamp, it would turn on. :)

Another possible factor is that I find I pay more attention to things I've recently learned about more closely when I come across them, but ignore them when I either know them very well or don't know them at all. I don't know if this is true for anyone else.

coldtea · on Dec 9, 2013

It's, of course the second explanation.

The first would include the Kardasians for example. But just because they get mentioned a lot, you don't feel anything strange about hearing about them a lot.

Whereas the very essense of the phenomenon is that encoutering something multiple times seems strange to you.

coldtea · on Dec 9, 2013

Well, after you learned about (or focused on) something, you have an increased awareness of it mentioned, whereas other times in your life you can encounter it 2-3 times in the same day without paying much attention (just one more unknown word).

rodgerd · on Dec 9, 2013

From one of my personal machines:

[14865975.000023] Machine check events logged

MCE 0

CPU 0 BANK 2

ADDR 1438280

TIME 1384859595 Wed Nov 20 00:13:15 2013

STATUS d40040000000011a MCGSTATUS 0

MCGCAP 104 APICID 0 SOCKETID 0

CPUID Vendor AMD Family 6 Model 8

This will, of course, be getting replaced shortly, preferably before it does any real damage. Given it's a 2002-era chip, 11 years of service isn't exactly terrible.

I've seen quite a few UltraSPARC chips (especially IIIs) go over the years at work, and often had a shit of a time trying to get Sun to accept them as faulty and replace them.

aimhb · on Dec 9, 2013

It was posted here on HN a few days ago, but you might find the DEFCON talk on single bit domain errors relevant: https://www.youtube.com/watch?v=ZPbyDSvGasw

In summary, some data centers are run hotter than recommended, which leads to a lot of mostly ignored domain resolution errors, which leads to a security risk.

vanderZwan · on Dec 9, 2013

My dad had a laptop which would not boot unless he put it in the fridge first for half an hour. As long as he didn't reboot everything then worked "fine". Does that count?

raverbashing · on Dec 9, 2013

That's probably a bad contact or a tiny fissure that makes contact again when cooled

And the failed part is either needed only on bootup or when the current gets going it doesn't stop until it's powered off again

vanderZwan · on Dec 9, 2013

Thanks for clearing that up - it's always been bothering what it could possibly be (that and the "how on earth did he figure that out in the first place?")

danudey · on Dec 9, 2013

The cold would shrink the motherboard to re-connect the contacts; it might have been fixed by doing the old Xbox/nvidia trick of putting it in the oven (which would soften the solder, causing it to shift and re-connect). With the Xbox, you could apparently even just wrap it in a towel and let its own heat do the trick.

serans · on Dec 9, 2013

how on earth did he come up with putting the laptop in the fridge?

valarauca1 · on Dec 9, 2013

I had this problem once when overclocking an AMD phenon. The short story is (I don't know the whole cause) the on-board crypto units stopped being random.

Which wasn't a real problem for 'some' day-to-day use. This was in the mid to later 00's, so https wasn't quiet everywhere yet.

The problem manifested slowly. When ever I'd connect to HTTPS, my browser would crash. One sound card would phone home for an update, and my computer would crash. Randomly certain games would crash when ever anti-cheat software attempted to run.

It was just odd, and took a few days of hunting to find out what was actually going wrong.

naner · on Dec 9, 2013

Yes, CPUs can fail just like any other hardware component. On desktop systems the most common case is you'll try and boot the system and just be presented with blank video or a beep code. On server systems with multiple CPUs there usually will be an error reported via blinking light or the little info LCD on the front of the system. In some cases the damage is actually visible on the CPU (e.g. discoloration of some of the gold contact points on the bottom). CPUs under normal operation fail less frequently than most other components in my experience.

zhemao · on Dec 9, 2013

I think the MTBF is generally longer than people would normally go without replacing their CPU. Also, CPUs are generally designed to degrade more gracefully. For instance, they may have circuity that scales the frequency down as delays get longer. Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.

anon_cownerd · on Dec 9, 2013

> Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.

That sounds like a huge cost to bear. Looking at e.g. a Haswell die photo [1], there are just four physical cores present for a four-core part. With that die area per core, you would take a ~15-20% area hit (that translates to 15-20% cost) just to have a spare core in case one failed some years later.

I have heard of manufacturers selling otherwise "defective" parts where a core or cache slice has a defect by relabeling as a part with fewer cores. But that's a manufacture-time decision, not a dynamic reconfiguration in the field.

[1] http://cdn2.wccftech.com/wp-content/uploads/2013/05/Intel-Ha...

ssafejava · on Dec 9, 2013

I think the GP was on the right track, but somewhat confused. They obviously don't do this on all models, but some dual-core models are disabled quad-cores. Remember the Athlon X3? That was a binned chip that usually was created from X4s with a broken core. Most buyers didn't mind, and some of them got lucky and were able to re-enable the disabled core.

It seems like the GP might be suggesting that a quad-core CPU will swap in another core when one dies. That doesn't happen. But the binning process allows them to still sell slightly defective silicon with disabled parts (cores, cache), which saves money.

On a related note, a lot of GPUs actually do have a few dozen execution units that are disabled by default and can be swapped in after stress testing at the factory. I believe some can even do that in the wild, but I could be wrong.

zhemao · on Dec 9, 2013

Oh whoops, I guess my VLSI professor lied to me. Maybe it only happens with certain multiprocessors. But yeah, there are definitely disabled cores in a lot of processors for the reasons you mentioned.

cantfindmypass · on Dec 9, 2013

ISTR the PS3's Cell processor had eight physical cores to increase yeild.

zhemao · on Dec 12, 2013

You're mostly right

https://en.wikipedia.org/wiki/PlayStation_3#Technical_specif...

The cell processor had a main PowerPC core and eight floating-point SIMD co-processors called synergistic processing elements (SPE). For the PS3, one of the eight SPEs was disabled and another was reserved for the operating system, leaving the other six for developers.

cnvogel · on Dec 9, 2013

A machine here at work logged cache related machine check excel exceptions, at a rate of roughly 1/day but not regularly or deterministic. Not related to load or temperature and even after clocking lower than spec'ed. Changing the CPU fixed it.

Those were correctable errors, prime95 or memtest did not detect anything.

grumps · on Dec 9, 2013

I use to work in situations where we had to account for failures. We had lab equipment that would just run all that time, and with terrible wiring and we were horrible to it too. We left covers off, piled stuff on top of it and just Frankenstein the hell out of all of it. I even managed to flash a new OS/App at the same time of a power failure but it still lived....

Failures were more prominent in memory...but they did happen. We also sent equipment through environmental testing that would force failures. I don't recall of hearing of any CPU failures. Although most of our equipment was DSP & FPGA based but there were some tiny 'lil CPU's there.

gnoway · on Dec 9, 2013

I have seen it happen one time that I can remember, where I was sure it was the CPU. We had 8 dual socket 5400-era Xeon servers in a VMware cluster. Whenever 64bit Windows 2008+ virtual machines were started on or were vmotioned to one of the hosts, they would bluescreen. We did not experience this behavior at all, then one day we did. I replaced both CPUs and the problem disappeared. I have to assume it was one of the CPUs.

It's entirely possible that they overheated, but if they did it was due to poor cooling; we did not overvolt or overclock these machines.

zurn · on Dec 9, 2013

It's not easy to get a CPU chip failure reliably diagnosed in the field. Even if you manage to do the trial and error component swapping dance pointing to the CPU, you don't get very good confidence. Might be that new CPU taxes the power feed less or there was misapplied cooling paste or a bad contact in the pins etc.

cafard · on Dec 9, 2013

A couple of times: Sparc chips that did some notable damage on their way out. One was pretty old, one not so much.

Shorel · on Dec 9, 2013

I saw a 386 let the magic blue smoke out.

Taniwha · on Dec 9, 2013

so not even mentioned here is metastability - basically signals that cross clock domains within traditional clocked logic where the clocks are not carefully organized to be multiples of each other can end up being sampled just as they change - the result is a value inside of a flip-flop that's neither a 1 or a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency - worst worst case this unknown bad value can end up propagating into a chip causing havoc, a buzzing mess of chaos.

In the real world this doesn't happen very often and there are techniques to mitigate it when it does (usually at a performance or latency cost) - core CPUs are probably safe, they're all one clock but display controllers, networking, anything that touches the real world has to synchronize with it.

For example I was involved with designing a PC graphics chip in the mid '90s - we did the calculations around metastability (we had 3 clock domains and 2 crossings), we calculated that our chip would suffer from metastability (might be as simple as a burble on one frame of a screen, or a complete breakdown) about once every 70 years - we decided we could live with that as they were running on Win95 systems - no one would ever notice

Everyone who designs real world systems should be doing that math - more than one clock domain is a no no in life support rated systems - your pacemaker for example

caf · on Dec 9, 2013

If a failure mode was likely to happen once every 70 chip-years of operation, then it seems like if you sold a few hundred thousand chips then you would expect several instances of that failure mode to occur across the population of chips every day?

Taniwha · on Dec 9, 2013

simply yes - but as I mentioned in our case by far the most most were going to be pixel burbles - you'd likely see one in the lifetime of your video card - the chances of the more serious sort of jabbering core sort of meltdown are much less likely - we design against them - but, one has to stress, not impossible.

You can design to be metastablity tolerant - use high-gain, high clk->Q flops as synchronizers, uses multiple synchronizers in a row (trading latency for reliability), you can do things to reduce frequencies (run multiple synchronizers in parallel, synchronize edges rather than absolute values etc), but in the end if you're synchronizing an asynchronous event you can't engineer metastability out of your design - you just have to make it "good enough" for some value of good enough that will keep marketing and legal happy.

It's our dirty little secret (by 'our' I mean the whole industry)

elwell · on Dec 9, 2013

This field really interests me.

pedrocr · on Dec 8, 2013

It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.

bcoates · on Dec 9, 2013

Here's one for RAM in servers, from Google: http://news.cnet.com/8301-30685_3-10370026-264.html

Found it here, which also goes into some testing the Guild Wars guys did on their population of gamer PCs: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway (scroll down to "Your computer is broken", around 1% of the systems they tested failed a CPU-to-RAM consistency stress test)

Both of them indicate intermittently defective components in running systems are way more common than anybody assumes.

raverbashing · on Dec 9, 2013

This is very interesting

But when they say "Memory Error", even though it's something detected/corrected by ECC I'm not sure we can say 'the memory is defective'

It may be a combination of the conditions of power/load/data/time since last refresh and variance between modules.

Since Google appears not to show all the data they have, we probably are not going to get that from them though :/

rkangel · on Dec 8, 2013

They'd have to be careful with how they quoted the numbers though. As Linus accurately points out, MTBF varies wildly depending on the usage pattern. If you want to quote it in a unit of time, e.g. "years", then you have to specify the usage the part has been under, which will be very different for a server part compared to a desktop part. You could quote it per instruction or equivalent, I suppose, taking into account how hard the component is used, but even that isn't perfect.

bobdvb · on Dec 9, 2013

MTBF is quite well tied to a contract purchasing when you buy large quantities of components for manufacturing. Intel don't know where people are going to use their products and the range of environments they get exposed to can't be easily equated for.

I know Intel has been working for some time on the idea of high temperature data centres, this will impact the MTBF of all components but you can always calculate the cost of the losses vs the cost of the cooling: http://www.datacenterdynamics.com/focus/archive/2012/08/inte...

amscanne · on Dec 8, 2013

Interesting relevant paper: http://www.cs.cmu.edu/~bianca/fast07.pdf

Table 3 suggests that there are data sets that include all components (CPU, memory, power supplies, etc.).

avn2109 · on Dec 8, 2013

Data of the sort shown in Table 3 certainly exists. However, it's often inaccessible to the public because companies tend to treat it as a trade secret.

Any large company that makes things employs a bunch of reliability engineers, who are usually EE's or ME's who make Weibull plots and bathtub curves all day (to set the warranty duration, mostly). These guys have all the data you could ever want on this topic, but they're not sharing. Especially at Intel.

zebra · on Dec 8, 2013

I'm almost sure that the components without moving parts will become technologically obsolete long before they start to fail. When I buy used laptop I always change the HDD, the DVD and its reliability jumps sharply up.

pedrocr · on Dec 8, 2013

That may very well be true on average but I'd bet there are plenty of CPUs and memory modules that fail in the first year of usage for example. After all CPUs are tested and sorted into high/low performance parts, so sample variation itself would be enough to generate some early failures.

As a consumer it's hard enough to keep up with what's reliable in hard drives. Keeping the manufacturers honest with good stats for the most common parts would be great.

jfim · on Dec 8, 2013

Even things with moving parts, it would be nice to know that model X of brand Y has a MTBF of 4.5 years, but hunting the same model X 4.5 years later isn't likely to yield the exact same hardware but some later revision of the same specced hardware.

superuser2 · on Dec 9, 2013

I've seen a lot of failed laptop motherboards.

wging · on Dec 8, 2013

They might very well do this already. But it'd seem they're disincentivized to make this information public... do they publish the disk information?