Cosmic Rays: what is the probability they will affect a program?


Question


Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.

\"Since 2-128 is 1 out of 340282366920938463463374607431768211456, I think we\'re justified in taking our chances here, even if these computations are off by a factor of a few billion... We\'re way more at risk for cosmic rays to screw us up, I believe.\"

Is this programmer correct? What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?


Answer 1:


From Wikipedia:

Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.[15]

This means a probability of 3.7 × 10⁻⁹ per byte per month, or 1.4 × 10⁻¹⁵ per byte per second. If your program runs for 1 minute and occupies 20 MB of RAM, then the failure probability would be

    1 - (1 - 1.4 × 10⁻¹⁵)^(60 × 20 × 1024²) ≈ 1.8 × 10⁻⁶

which corresponds to about 99.9998% reliability, a.k.a. "5 nines".
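
If you want to reproduce that arithmetic, here is a minimal Python sketch of the same computation (the 30-day month is my assumption; everything else comes from the figures above):

    # One cosmic-ray error per 256 MB per month (IBM, 1990s), applied to a
    # program that occupies 20 MB of RAM and runs for one minute.
    SECONDS_PER_MONTH = 30 * 24 * 3600                # assuming a 30-day month
    p_byte_month = 1 / (256 * 1024**2)                # ~3.7e-9 per byte per month
    p_byte_second = p_byte_month / SECONDS_PER_MONTH  # ~1.4e-15 per byte per second

    bytes_used = 20 * 1024**2                         # 20 MB resident size
    runtime_s = 60                                    # 1 minute

    p_fail = 1 - (1 - p_byte_second) ** (bytes_used * runtime_s)
    print(f"{p_fail:.2e}")                            # ~1.81e-06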

Error checking can reduce the aftermath of a failure. Also, as Joe commented, because chips have become more compact, the failure rate could be different from what it was 20 years ago.




Answer 2:


Apparently, not insignificant. From this New Scientist article, a quote from an Intel patent application:

"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "

You can read the full patent here.




Answer 3:


Note: this answer is not about physics, but about silent memory errors with non-ECC memory modules. Some of the errors may come from outer space, and some from the "inner space" of the desktop itself.

There are several studies of ECC memory failures on large server farms, like CERN clusters and Google datacenters. Server-class hardware with ECC can detect and correct all single-bit errors, and detect many multi-bit errors.

We can assume that there are lots of non-ECC desktops (and non-ECC mobile smartphones). If we check the papers for ECC-correctable error rates (single bit flips), we can estimate the silent memory corruption rate of non-ECC memory.

  • The large-scale CERN 2007 study "Data integrity": vendors declare a "Bit Error Rate of 10⁻¹² for their memory modules", while "the observed error rate is 4 orders of magnitude lower than expected". For data-intensive tasks (8 GB/s of memory reading) this means that a single bit flip may occur every minute (at the vendors' 10⁻¹² BER) or once in two days (at the observed 10⁻¹⁶ BER).

  • The 2009 Google paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion device-hours), which by my calculation works out to 1-5 bit errors per hour for 8 GB of RAM (a quick conversion is sketched after this list). The paper gives a similar figure: "mean correctable error rates of 2000-6000 per GB per year".

  • The 2012 Sandia report "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing": "double bit flips were deemed unlikely", yet at ORNL's dense Cray XT5 they occur "at a rate of one per day for 75,000+ DIMMs" even with ECC. Single-bit error rates should be even higher.
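
The FIT conversion behind the Google numbers is a one-liner; a minimal sketch (8 GB of RAM, as in the bullet above):

    # FIT = failures in time = failures per 10^9 device-hours.
    ram_mbit = 8 * 1024 * 8                  # 8 GB of RAM in Mbit (65,536 Mbit)
    for fit_per_mbit in (25_000, 75_000):
        errors_per_hour = fit_per_mbit * ram_mbit / 1e9
        print(f"{fit_per_mbit} FIT/Mbit -> {errors_per_hour:.1f} bit errors/hour")
    # 25000 FIT/Mbit -> 1.6 bit errors/hour
    # 75000 FIT/Mbit -> 4.9 bit errors/hour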

So, if a program has a large dataset (several GB), or a high memory read or write rate (GB/s or more), and runs for several hours, we can expect up to several silent bit flips on desktop hardware. This rate is not detectable by memtest, and the DRAM modules themselves are healthy.

Long cluster runs on thousands of non-ECC PCs, like BOINC internet-wide grid computing, will always have errors from memory bit flips, and also from silent disk and network errors.

And for bigger machines (tens of thousands of servers), even with ECC protection against single-bit errors, Sandia's 2012 report shows there can be double-bit flips every day, so you will have no chance of running a full-size parallel program for several days without regular checkpointing and restarting from the last good checkpoint after a double error. Huge machines will also get bit flips in their caches and CPU registers (both architectural state and internal chip latches, e.g. in the ALU datapath), because not all of these are protected by ECC.

PS: Things will be much worse if the DRAM module is bad. For example, I installed new DRAM into a laptop, and it died several weeks later: it started to give lots of memory errors. What I got: the laptop hangs, Linux reboots and runs fsck, which finds errors on the root filesystem and asks to reboot after correcting them. But at each following reboot (I did around 5-6 of them) errors were still found on the root filesystem.




Answer 4:


Wikipedia cites a study by IBM in the 90s suggesting that "computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month." Unfortunately, the citation was to an article in Scientific American, which didn't give any further references. Personally, I find that number very high, but perhaps most memory errors induced by cosmic rays don't cause any actual or noticeable problems.

On the other hand, people talking about probabilities when it comes to software scenarios typically have no clue what they are talking about.




Answer 5:


Well, cosmic rays apparently caused the electronics in Toyota cars to malfunction, so I would say that the probability is very high :)

Are cosmic rays really causing Toyota woes?




Answer 6:


With ECC you can correct the single-bit errors caused by cosmic rays. To handle the roughly 10% of cases where cosmic rays result in 2-bit errors, the ECC cells are typically interleaved over chips so that no two cells of the same word are next to each other. A cosmic-ray event that affects two adjacent cells will therefore result in two correctable single-bit errors in different words.

Sun states (Part No. 816-5053-10, April 2002):

Generally speaking, cosmic ray soft errors occur in DRAM memory at a rate of ~10 to 100 FIT/MB (1 FIT = 1 device fail in 1 billion hours). So a system with 10 GB of memory should show an ECC event every 1,000 to 10,000 hours, and a system with 100 GB would show an event every 100 to 1,000 hours. However, this is a rough estimation that will change as a function of the effects outlined above.
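
Those numbers are easy to double-check; a quick sketch of the same estimate:

    # 1 FIT = 1 failure per 10^9 device-hours; Sun quotes ~10-100 FIT/MB.
    for ram_gb in (10, 100):
        ram_mb = ram_gb * 1024
        for fit_per_mb in (10, 100):
            hours_per_event = 1e9 / (fit_per_mb * ram_mb)
            print(f"{ram_gb} GB at {fit_per_mb} FIT/MB: "
                  f"one ECC event per {hours_per_event:,.0f} hours")
    # 10 GB: one event per ~1,000-10,000 hours
    # 100 GB: one event per ~100-1,000 hours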




Answer 7:


Memory errors are real, and ECC memory does help. Correctly implemented ECC memory will correct single bit errors and detect double bit errors (halting the system if such an error is detected.) You can see this from how regularly people complain about what seems to be a software problem that is resolved by running Memtest86 and discovering bad memory. Of course a transient failure caused by a cosmic ray is different to a consistently failing piece of memory, but it is relevant to the broader question of how much you should trust your memory to operate correctly.
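
To make "correct single-bit, detect double-bit" concrete, here is a toy sketch of the principle using a Hamming(7,4) code plus an overall parity bit. Real ECC memory uses wider codes implemented in the memory controller, so treat this as an illustration of the idea, not anyone's actual implementation:

    def secded_encode(nibble):
        """Encode 4 data bits as Hamming(7,4) plus an overall parity bit."""
        c = [0] * 8                       # c[1..7]: Hamming codeword, c[0]: overall parity
        c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
        c[1] = c[3] ^ c[5] ^ c[7]         # covers positions with bit 0 set
        c[2] = c[3] ^ c[6] ^ c[7]         # covers positions with bit 1 set
        c[4] = c[5] ^ c[6] ^ c[7]         # covers positions with bit 2 set
        c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
        return c

    def secded_decode(c):
        syndrome = 0
        for pos in range(1, 8):
            if c[pos]:
                syndrome ^= pos           # XOR of the positions of all set bits
        parity_ok = sum(c) % 2 == 0
        if syndrome == 0 and parity_ok:
            return "ok"
        if not parity_ok:                 # odd number of flips: assume one, fix it
            c[syndrome] ^= 1              # syndrome 0 means the parity bit itself
            return "corrected"
        return "double error detected"    # two flips: detectable, not correctable

    word = secded_encode(0b1011)
    one_flip = word.copy();  one_flip[5] ^= 1
    two_flips = word.copy(); two_flips[5] ^= 1; two_flips[6] ^= 1
    print(secded_decode(one_flip))        # corrected
    print(secded_decode(two_flips))       # double error detected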

An analysis based on a 20 MB resident size might be appropriate for trivial applications, but large systems routinely have multiple servers with large main memories.

Interesting link: http://cr.yp.to/hardware/ecc.html

The Corsair link in the page unfortunately seems to be dead.




Answer 8:


If a program is life-critical (it will kill someone if it fails), it needs to be written in such a way that it will either fail-safe, or recover automatically from such a failure. All other programs, YMMV.

Toyotas are a case in point. Say what you will about a throttle cable, but it is not software.

See also http://en.wikipedia.org/wiki/Therac-25




Answer 9:


This is a real issue, and that is why ECC memory is used in servers and embedded systems. And why flying systems are different from ground-based ones.

For example, note that Intel parts destined for "embedded" applications tend to add ECC to the spec sheet. A Bay Trail for a tablet lacks it, since it would make the memory a bit more expensive and possibly slower. And if a tablet crashes a program every once in a blue moon, the user does not care much; the software itself is far less reliable than the HW anyway. But for SKUs intended for use in industrial machinery and automotive, ECC is mandatory, since there we expect the SW to be far more reliable, and errors from random upsets would be a real issue.

Systems certified to IEC 61508 and similar standards usually have both boot-up tests that check that all RAM is functional (no bits stuck at zero or one) and error handling at runtime that tries to recover from errors detected by ECC, often plus memory-scrubber tasks that continuously read and write memory to make sure that any errors that occur get noticed.
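
As a toy illustration of the boot-up test part (the function name and the bytearray standing in for a RAM region are mine; a real IEC 61508-style test walks physical addresses before the application starts):

    def stuck_bit_test(mem: bytearray) -> bool:
        """Write/read-back test for bits stuck at zero or one."""
        for i in range(len(mem)):
            saved = mem[i]
            for bit in range(8):
                for pattern in (1 << bit, 0xFF ^ (1 << bit)):  # walking one, walking zero
                    mem[i] = pattern
                    if mem[i] != pattern:                      # stuck bit found
                        return False
            mem[i] = saved
        return True

    print(stuck_bit_test(bytearray(4096)))                     # True on healthy "memory"

A scrubber is essentially the same kind of loop run continuously in the background, reading memory so that the ECC hardware corrects latent single-bit errors before a second flip in the same word turns them into uncorrectable double-bit errors.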

But for mainstream PC software? Not a big deal. For a long-lived server? Use ECC and a fault handler. If an uncorrectable error kills the kernel, so be it. Or you go paranoid and use a redundant system with lock-step execution so that if one core gets corrupted, the other one can take over while the first core reboots.




Answer 10:


I once programmed devices that were to fly in space, and there you could supposedly expect cosmic rays to induce errors all the time (no one ever showed me any paper about it, but it was said to be common knowledge in the business).




Answer 11:


"cosmic ray events" are considered to have a uniform distribution in many of the answers here, this may not always be true (i.e. supernovas). Although "cosmic rays" by definition (at least according to Wikipedia) comes from outer space, I think it's fair to also include local solar storms (aka coronal mass ejection under the same umbrella. I believe it could cause several bits to flip within a short timespan, potentially enough to corrupt even ECC-enabled memory.

It's well known that solar storms can cause quite some havoc with electric systems (like the Quebec power outage in March 1989). It's quite likely that computer systems can also be affected.

Some 10 years ago I was sitting right next to another guy, each of us with our own laptop, during a period of quite "stormy" solar weather (sitting in the Arctic, we could observe this indirectly: lots of aurora borealis to be seen). Suddenly, in the very same instant, both our laptops crashed. He was running OS X, and I was running Linux. Neither of us is used to laptops crashing; it's quite a rare thing on Linux and OS X. Common software bugs can more or less be ruled out since we were running different OSes (and it didn't happen during a leap second). I've come to attribute that event to "cosmic radiation".

Later, "cosmic radiation" has become an internal joke at my workplace. Whenever something happens with our servers and we cannot find any explanation for it, we jokingly attribute the fault to "cosmic radiation". :-)




Answer 12:


More often, ordinary noise can corrupt data. Checksums are used to combat this on many levels; in a data cable there is typically a parity bit that travels alongside the data, which greatly reduces the probability of corruption going unnoticed. Then, at the parsing level, nonsense data is typically ignored, so even if some corruption did get past the parity bit or other checksums, it would in most cases be discarded.
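
For the simplest of these checks, a single even-parity bit, the whole mechanism fits in a few lines (a minimal sketch; real links compute this in hardware per transmitted word):

    def add_even_parity(byte):
        """Append a parity bit so the 9-bit word has an even number of 1s."""
        parity = bin(byte).count("1") & 1
        return (byte << 1) | parity

    def parity_ok(word):
        """Any single flipped bit makes the count of 1s odd."""
        return bin(word).count("1") % 2 == 0

    w = add_even_parity(0b10110010)
    print(parity_ok(w))            # True: intact
    print(parity_ok(w ^ 0b100))    # False: one flipped bit is detected

Note that a parity bit only catches an odd number of flipped bits; two flips cancel out, which is one reason stronger checksums are layered on top.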

Also, some components are electrically shielded to block out noise (though probably not cosmic rays, I guess).

But in the end, as the other answerers have said, there is the occasional bit or byte that gets scrambled, and it's left up to chance whether that's a significant byte or not. Best case scenario, a cosmic ray scrambles one of the empty bits and has absolutely no effect, or crashes the computer (this is a good thing, because the computer is kept from doing harm); but worst case, well, I'm sure you can imagine.




Answer 13:


I have experienced this. It's not rare for cosmic rays to flip one bit, but it's very unlikely that a person observes it happening.

I was working on a compression tool for an installer in 2004. My test data was some Adobe installation files of about 500 MB or more decompressed.

After a tedious compression run, and a decompression run to test integrity, FC /B showed one byte different.

Within that one byte the MSB had flipped. I also flipped, worrying that I had a crazy bug that would only surface under very specific conditions - I didn't even know where to start looking.

But something told me to run the test again. I ran it and it passed. I set up a script to run the test 5 times overnight. In the morning all 5 had passed.

So that was definitely a cosmic ray bit flip.




Answer 14:


You might want to have a look at Fault Tolerant hardware as well.

For example, Stratus Technology builds Wintel servers called ftServer, which have 2 or 3 "mainboards" running in lockstep, comparing the results of the computations (this is also done in space vehicles sometimes).

The Stratus servers evolved from a custom chipset to lockstep on the backplane.

A very similar (but software-based) system is VMware Fault Tolerance, lockstep execution based on the hypervisor.




Answer 15:


As a data point, this just happened on our build:

02:13:00,465 WARN  - In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:133:
02:13:00,465 WARN  - /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/locale:3180:65: error: use of undeclared identifier '_'
02:13:00,465 WARN  - for (unsigned __i = 1; __i < __trailing_sign->size(); ++_^i, ++__b)
02:13:00,465 WARN  - ^
02:13:00,465 WARN  - /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/locale:3180:67: error: use of undeclared identifier 'i'
02:13:00,465 WARN  - for (unsigned __i = 1; __i < __trailing_sign->size(); ++_^i, ++__b)
02:13:00,465 WARN  - ^

That looks very strongly like a bit flip happening during a compile, by chance in a very conspicuous place in a source file. Indeed, "__i" becoming "_^i" is exactly a single-bit error: ASCII "_" is 0x5F and "^" is 0x5E.

I'm not necessarily saying this was a "cosmic ray", but the symptom matches.



Source: https://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-probability-they-will-affect-a-program
