HPTS'17 day 2

- November 03, 2017

On HPTS day 2, there were 4 presentation sessions (I mention 2 below) and an evening session on remembering Jim Gray and Ed Lassettre.

(Follow these links for HPTS'17 day 0 and day 1.)

Verification of Systems session

The first talk was Jepsen VIII by Kyle Kingsbury, who breaks databases for a living. He gave a very fast-paced talk. A good rule of thumb for presentations is to budget 2 minutes per slide. Kyle flips this rule upside down and then goes even further, presenting 5 slides per minute. He presented 150+ slides in less than 30 minutes, and somehow he made this work. I can't believe how quickly and smoothly he was able to transition from one slide to the next, and how he managed to memorize all those transitions.

Kyle's Jepsen toolkit tests databases as black boxes, using a client to submit overlapping operations, where the start and end of each operation define its interval. To prepare these tests, Kyle first carefully reads through the documentation to see which guarantees are claimed, and then he writes a bunch of targeted tests in Jepsen to check whether the database indeed behaves as advertised under faults such as partitions, crashes, and clock skew.
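To make the idea of operation intervals concrete, here is a minimal sketch of a history of client operations on a single register, with a check that flags stale reads. This is not Jepsen's actual checker; the `Op` record, the `overlaps` and `stale_reads` helpers, and the assumption that every write stores a distinct value are all mine, for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """One client operation, as seen from outside the database (hypothetical record)."""
    kind: str     # "write" or "read"
    value: int    # value written, or value returned by the read
    start: float  # time the client invoked the operation
    end: float    # time the client received the response


def overlaps(a: Op, b: Op) -> bool:
    """Two operations are concurrent when their intervals intersect."""
    return a.start <= b.end and b.start <= a.end


def stale_reads(history: List[Op]) -> List[Tuple[Op, Op]]:
    """Return (read, newer_write) pairs where the read began after newer_write
    had completed, yet returned a value from a write that finished before
    newer_write even started. Assumes each write stores a distinct value."""
    writes = [op for op in history if op.kind == "write"]
    by_value = {w.value: w for w in writes}
    stale = []
    for r in history:
        if r.kind != "read":
            continue
        source = by_value.get(r.value)
        if source is None:
            continue  # the read returned a value nobody wrote; not handled here
        for w in writes:
            if source.end < w.start and w.end < r.start:
                stale.append((r, w))
                break
    return stale


# Example history: write 1, then write 2, then two reads.
w1 = Op("write", 1, start=0.0, end=1.0)
w2 = Op("write", 2, start=2.0, end=3.0)
late_read = Op("read", 1, start=4.0, end=5.0)        # begins after w2 completed
concurrent_read = Op("read", 2, start=2.5, end=6.0)  # overlaps w2
violations = stale_reads([w1, w2, late_read, concurrent_read])
```

In this example only `late_read` is flagged: it started after the write of 2 had already completed but still returned 1. The read that overlaps a write is not flagged, because for concurrent operations either order is acceptable.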

Kyle first reported on Jepsen test results of VoltDB, an in-memory SQL database. The documentation claimed that all transactions are strictly serializable, but this was violated in v6.3, because stale reads, dirty reads, and lost updates were possible in the presence of network partitions. VoltDB 6.4 passed all Jepsen tests for strict serializability, at the cost of a 20% performance hit for reads.

He then talked about MongoDB. Using wall clocks as optime timestamps led to lost updates due to clock skew. MongoDB fixed this by making the optime a tuple of logical term and wall-clock time. Despite these improvements, Jepsen tests identified data loss issues even with the replication v1 protocol of MongoDB.
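As a rough illustration of why pairing a logical term with the wall clock helps, the sketch below compares two optimes both ways. It is not MongoDB's implementation; the `OpTime` type, its field names, and the sample values are hypothetical.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Hypothetical optime: logical term first, wall-clock time second."""
    term: int
    wall_clock: float


# An old primary whose clock runs fast, and a newer primary elected in a later term.
old_primary = OpTime(term=1, wall_clock=1005.0)
new_primary = OpTime(term=2, wall_clock=1000.0)

# Comparing wall clocks alone picks the old primary's write as the newest.
newest_by_clock = max([old_primary, new_primary], key=lambda o: o.wall_clock)

# Comparing the tuples lexicographically puts the term first, so skew cannot
# make an entry from an older term look newer.
newest_by_optime = max([old_primary, new_primary])
```

With wall clocks alone, the skewed old primary wins and the newer write can be discarded. With the tuple, the higher term always wins, and the clock only breaks ties within a term.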

Kyle also talked about the Tendermint blockchain, which was found to lose documents and fail to tolerate Byzantine faults.

He concluded with this advice for distributed system design: be formal and specific, figure out the invariants your system needs, consider your failure modes (e.g., crash, clock skew, process pause, partition), and test the system end to end.

The next 2 talks in the session were from Peter Alvaro and his student, on the topic of Lineage Driven Fault Injection for testing. Peter had 3 other students presenting at HPTS, which was an impressive accomplishment.