Cores that don't count

This paper is from Google and appeared at HotOS 2021. There is also a very nice 10-minute video presentation for it.

So Google found fail-silent Corrupt Execution Errors (CEEs) in CPU cores. This is interesting because we assumed tested CPUs do not have logic errors, and that if a CPU failed, it would be fail-stop or at least fail-noisy, with hardware errors triggering machine checks. We already knew about fail-silent storage and network errors due to bit flips, but CEEs are new because they are computation errors. While it is easy to detect data corruption due to bit flips, it is hard to detect CEEs because they are rare and because detecting/correcting them in real time requires expensive methods.
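To make the contrast concrete, here is a minimal Python sketch (my own illustration, not from the paper; the function names are hypothetical). Storing or transmitting data is the identity function, so a checksum computed at write time can be re-verified at read time; a CEE instead produces a wrong-but-plausible result, and there is no precomputed answer to check it against.

import zlib

# Data at rest or in flight: the "right result" is the input itself, so a
# checksum computed at write time catches bit flips at read time.
def write_with_checksum(data: bytes):
    return data, zlib.crc32(data)

def read_and_verify(data: bytes, checksum: int) -> bytes:
    if zlib.crc32(data) != checksum:
        raise IOError("bit flip detected: data no longer matches its checksum")
    return data

# A CEE is different: the core computes a wrong-but-plausible result.
# A checksum over the output would only certify that we faithfully stored
# the wrong answer; there is no reference value to compare against.
def multiply(x: int, y: int) -> int:
    return x * y  # on a mercurial core this could silently return garbage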


What are the causes of CEEs?

This is mostly due to ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse manufacturing defects, especially those defects that manifest in corner cases (under certain voltage, frequency, or temperature conditions) or only after post-deployment aging. Chip manufacturing is magic, and with 5nm technology some gates are only about 10 atoms long, which can lead to flaky behavior.


Are CEEs reproducible? How do they manifest themselves?

The paper says the following. CEEs are harder to root-cause than software bugs, which we usually assume we can debug by reproducing on a different machine. In just a few cases, we can reproduce the errors deterministically; usually the implementation-level and environmental details have to line up. Data patterns can affect corruption rates, but it is often hard for us to tell. Some specific examples where we have seen CEEs:

  • Violations of lock semantics leading to application data corruption and crashes.
  • Data corruptions exhibited by various load, store, vector, and coherence operations.
  • A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
  • Corruption affecting garbage collection in a storage system, causing live data to be lost.
  • Database index corruption leading to some queries being non-deterministically corrupted, depending on which replica (core) serves them.
  • Repeated bit-flips in strings, at a particular bit position (which stuck out as unlikely to be coding bugs).
  • Corruption of kernel state resulting in process and kernel crashes and application malfunctions.


How dangerous is this?

It is a serious problem. The paper says that Google has already applied many engineering decades to the problem. Because CEEs may be correlated with specific execution units within a core, they expose us to large risks that appear suddenly and unpredictably due to seemingly-minor software changes, such as an innocuous change to a low-level library. Only a small subset of cores (called mercurial cores) are affected by CEEs.


Which chips do they occur on, and how frequently?

The paper does not reveal much information about CEEs. They don't even mention which chips they observed these errors on. They don't reveal the rate of mercurial cores either, but in one place they mention that 1 in 1000 is possible.


How do we detect and mitigate fail-silent CEEs?

With storage and networking, the "right result" is obvious and simple to check: it's the identity function. That enables the use of coding-based techniques to tolerate moderate rates of correctable low-level errors in exchange for better scale, speed, and cost. Detecting CEEs, in contrast, seems to imply a factor of two of extra work, and automatic correction seems to require triple modular redundancy. Most computational failures cannot be addressed by coding. Storage and networking can better tolerate low-level errors because they typically operate on relatively large chunks of data, such as disk blocks or network packets. This allows corruption-checking costs to be amortized, which seems much harder to do at a per-instruction scale.
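To illustrate the cost argument, here is a minimal Python sketch of the two options (my own illustration, not from the paper; run_on_core is a hypothetical placeholder for pinning a computation to a particular core). Detection by duplicate execution roughly doubles the work; correction by triple modular redundancy (TMR) with majority voting roughly triples it.

from collections import Counter

def run_on_core(fn, args, core_id):
    # Placeholder: in practice this would set CPU affinity to core_id before
    # running fn, so each redundant execution lands on a different core.
    return fn(*args)

def detect_cee(fn, args):
    # Factor-of-two overhead: compute twice on different cores and compare.
    r1 = run_on_core(fn, args, core_id=0)
    r2 = run_on_core(fn, args, core_id=1)
    if r1 != r2:
        raise RuntimeError("silent corruption suspected: results disagree")
    return r1

def correct_cee(fn, args):
    # Triple modular redundancy: compute three times and take the majority,
    # masking a single bad result (assumes results are hashable).
    results = [run_on_core(fn, args, core_id=i) for i in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: corruption cannot be masked")
    return value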


Is this Byzantine failure?

I think fail-silent CEEs are weaker than the adversarial Byzantine failure model: the chips do not arbitrarily/fully deviate from the protocols. On the other hand, they are likely stronger than transient memory corruption, because the corruption may keep getting reintroduced since it comes from the computation itself.


Further reading?

This recent report from Facebook also describes fail-silent CEEs.


What happens going forward?

Maybe this will lead to the abandonment of complex, deeply-optimizing chips like Intel's, and make simpler chips, like ARM chips, more popular for datacenter deployments. AWS has started using ARM-based Graviton processors due to their energy-efficiency and cost benefits, and avoiding CEEs could give a boost to this trend.
