FoundationDB Record Layer: A Multi-Tenant Structured Datastore

This is a 2019 arXiv report. Back in 2019, when the report came out, I wrote a review of it, but did not publish it then because I felt I didn't yet have enough information on FoundationDB. With the FoundationDB SIGMOD 2021 paper out recently, I am now releasing that earlier write-up. I will follow up soon with a review of the SIGMOD 2021 paper on FoundationDB.

Introduction

FoundationDB made the bold design choice of being an ACID key-value store. They had released a transaction manifesto: "Everyone needs transactions. Transactions make concurrency simple. Transactions enable abstraction. Transactions enable efficient data representations. Transactions enable flexibility. Transactions are not as expensive as you think. Transactions are the future of NoSQL." FoundationDB, available as open source, consists of a minimalist transactional storage engine as the base layer, with other layers developed on top of it to extend its functionality. The record layer, that the report d
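The layering idea can be illustrated with a toy. Here is a minimal Python sketch, assuming an in-memory stand-in for the base layer: the hypothetical `KVStore` and `Transaction` classes buffer writes and apply them all at once on commit, and the hypothetical `RecordStore` layer on top stores a record and its secondary-index entry in the same transaction, so readers never see one without the other. None of these names come from FoundationDB or the Record Layer, and the toy has no concurrency control or conflict detection, which a real transactional store must provide.

```python
class KVStore:
    """Toy in-memory key-value store standing in for the base layer (hypothetical)."""

    def __init__(self):
        self.data = {}

    def transaction(self):
        return Transaction(self)


class Transaction:
    """Buffers writes; commit() applies them together (all-or-nothing visibility)."""

    def __init__(self, store):
        self.store = store
        self.writes = {}  # key -> value, or None for a pending delete

    def get(self, key):
        # Read-your-own-writes first, then fall back to committed data.
        if key in self.writes:
            return self.writes[key]
        return self.store.data.get(key)

    def set(self, key, value):
        self.writes[key] = value

    def clear(self, key):
        self.writes[key] = None

    def get_range(self, prefix):
        # Keys are tuples; return live (key, value) pairs under `prefix`, in key order.
        merged = dict(self.store.data)
        merged.update(self.writes)
        hits = [(k, v) for k, v in merged.items()
                if k[:len(prefix)] == prefix and v is not None]
        return sorted(hits, key=lambda kv: kv[0])

    def commit(self):
        for key, value in self.writes.items():
            if value is None:
                self.store.data.pop(key, None)
            else:
                self.store.data[key] = value
        self.writes = {}


class RecordStore:
    """Hypothetical record layer: records by id plus a secondary index on "name"."""

    def __init__(self, kv):
        self.kv = kv

    def save(self, rec_id, record):
        tr = self.kv.transaction()
        old = tr.get(("rec", rec_id))
        if old is not None:
            tr.clear(("idx", old["name"], rec_id))  # drop the stale index entry
        tr.set(("rec", rec_id), record)
        tr.set(("idx", record["name"], rec_id), b"")
        tr.commit()  # record and index entry become visible together

    def find_by_name(self, name):
        tr = self.kv.transaction()
        return [tr.get(("rec", key[2])) for key, _ in tr.get_range(("idx", name))]


people = RecordStore(KVStore())
people.save(1, {"name": "alice"})
people.save(2, {"name": "bob"})
people.save(1, {"name": "carol"})  # rename: old index entry cleared in the same commit
print(people.find_by_name("alice"))  # []
print(people.find_by_name("carol"))  # [{'name': 'carol'}]
```

Because the rename of record 1 clears the old index entry and writes the new one in a single commit, a lookup by the old name returns nothing and a lookup by the new name returns the updated record.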

Genius: The Life and Science of Richard Feynman

This is a 1992 biography of Feynman by James Gleick. The book provides good coverage of both the life and the science of Feynman. Having read most of the books about Feynman, I can say that this is the best one out there. In contrast to the other books, which were anecdote-heavy, this one gives a more balanced account of Feynman's life and science together. The book also gives good insights into aspects of Feynman's personality that were missing from the other books. I hadn't read before about Feynman's nervous breakdown at his father's grave, his depressive episodes, or his rivalries with other physicists. The book also has a chapter discussing Feynman's unacceptable attitude towards women. The science coverage in the book is top notch, and gives a detailed explanation of how the field of quantum physics started and grew into a discipline alongside Feynman's life. We get a good picture of how the proposed theories get refined and evolve as they interact with oth

Silent data corruptions at scale

This paper from Facebook (arXiv, Feb 2021) is cited in the Google fail-silent Corruption Execution Errors (CEEs) paper as the most closely related work. Both papers discuss the same phenomenon, and say that we need to update our belief that quality-tested CPUs do not have logic errors, and that if they did have an error, it would be fail-stop, or at least fail-noisy, triggering machine checks. This paper provides an account of how Facebook has observed CEEs over several years. After running a wide range of silent-error test scenarios across 100K machines, they identified hundreds of CPUs as having these errors, showing that CEEs are a systemic issue across generations. Like the Google paper, this paper does not name specific vendors or chipset types. Also, the ~1/1000 ratio reported here matches the ~1/1000 mercurial-core ratio that the Google paper reports. The paper claims that silent data corruptions can occur due to device characteristics and are repeatable at scale
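At its core, fleet-wide screening of this kind runs computations whose correct results are known in advance and flags any machine that disagrees. Here is a minimal sketch of such a known-answer test, assuming a hypothetical `faulty_mul` that simulates a defective core by flipping one bit for one specific operand. It is not Facebook's test suite, and a real screen would pin the workload to each individual core rather than pass in a Python function.

```python
import hashlib


def workload(mul):
    """Deterministic computation: a chain of multiplications, hashed into a digest."""
    acc = 1
    h = hashlib.sha256()
    for i in range(1, 2000):
        acc = mul(acc, i) % 1_000_000_007
        h.update(acc.to_bytes(8, "little"))
    return h.hexdigest()


def healthy_mul(a, b):
    return a * b


def faulty_mul(a, b):
    # Hypothetical defect: silently flips the lowest bit for one specific operand.
    result = a * b
    return result ^ 1 if b == 1234 else result


def screen(machines, golden):
    """Return names of machines whose digest differs from the known-good one."""
    return [name for name, mul in machines.items() if workload(mul) != golden]


golden = workload(healthy_mul)  # computed once on trusted hardware
fleet = {"m1": healthy_mul, "m2": faulty_mul, "m3": healthy_mul}
print(screen(fleet, golden))  # ['m2']
```

Hashing every intermediate value, rather than only the final result, means a corruption anywhere in the chain changes the digest, even if later steps happened to mask it.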

Cores that don't count

This paper is from Google and appeared at HotOS 2021. There is also a very nice 10-minute video presentation for it. So Google found fail-silent Corruption Execution Errors (CEEs) at CPUs/cores. This is interesting because we thought tested CPUs do not have logic errors, and that if they had an error, it would be fail-stop, or at least fail-noisy, triggering machine checks. Previously we had known about fail-silent storage and network errors due to bit flips, but CEEs are new because they are computation errors. While it is relatively easy to detect data corruption due to bit flips (e.g., with checksums), it is hard to detect CEEs because they are rare and require expensive methods to detect/correct in real time. What are the causes of CEEs? They are mostly due to ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse
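Redundant execution is one generic, costly way to catch CEEs in real time: run the same computation on more than one core and compare the results. Here is a minimal sketch, assuming hypothetical `good_core` and `mercurial_core` functions that simulate cores (nothing here actually pins work to hardware). With two cores a mismatch can only be detected; with three, a majority vote masks a single bad core, at roughly three times the compute cost.

```python
from collections import Counter


class SilentCorruptionDetected(Exception):
    pass


def run_redundant(compute, cores, arg):
    """Run compute(arg) on each simulated core and return the majority result.

    Two cores can only detect a disagreement; three or more can mask one bad core.
    """
    results = [core(compute, arg) for core in cores]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise SilentCorruptionDetected(f"no majority among {results}")
    return value


def good_core(compute, arg):
    return compute(arg)


def mercurial_core(compute, arg):
    # Hypothetical defect: returns a slightly wrong answer without raising any error.
    return compute(arg) + 1


def square(x):
    return x * x


print(run_redundant(square, [good_core, good_core], 12))                  # 144
print(run_redundant(square, [good_core, mercurial_core, good_core], 12))  # 144
try:
    run_redundant(square, [good_core, mercurial_core], 12)
except SilentCorruptionDetected as err:
    print("detected:", err)  # detected: no majority among [144, 145]
```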

Tale of two cities

In January, I took a one-year leave of absence from the University at Buffalo and joined Amazon's Automated Reasoning Group at AWS S3. At S3-ARG, our mission is to apply formal methods to verify large-scale distributed systems in order to provide durability, availability, and security guarantees. It has been 5 months, and I am loving it. Going forward, I will have more opportunities to talk about my work at S3-ARG. Today, I want to reflect on the differences in objectives/incentives between industry and academia, and how these shape the corresponding landscapes. What I write reflects *my subjective experience*. At both places I have been blessed with great colleagues and great working environments, so my comparison is mostly about the relative merits of ideal positions in academia and industry.

What's the goal?

The goal of academic research in CSE is to sell a new vision of 10 years in the future. In academia, there is a very perverse prioritization of novelty over practicality, usefulnes

Building Distributed Systems With Stateright

Stateright is a model checker for distributed systems. It is provided as a Rust library, and it allows you to verify systems implemented in Rust. It is openly available on GitHub, and the developer, Jon Nadal, is looking for contributors and new users. On Tuesday, Jon gave a presentation to us on Zoom. He made his presentation slides available here. We have also recorded Jon's presentation and the Q&A and demo sessions in their entirety. The highlights of Stateright are: great visualization support; a time-travel debugger, which helps you go back and forth and choose to explore another branch from a given point of the current execution (in the Figure below, the Next Steps heading provides the possible next steps to choose from); an actor-based model; an embedded linearizability tester; and extensive docs and a Rust book introducing the concepts. The model trait has state, init_states, actions, next_state, and properties. Similarly there is an actor trait you can implement, and model che
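The model interface described above (initial states, actions, a transition function, and properties) is enough to drive an explicit-state search. As a rough illustration, here is a minimal Python analogue, assuming a hypothetical `Model` base class and `check` function; these are not Stateright's Rust API, which has its own types and checkers. `check` does a breadth-first search over reachable states and returns the shortest trace that violates an invariant; it only checks properties that must hold in every reachable state. The toy `LostUpdate` model (also hypothetical) has two processes doing a non-atomic read-then-write increment of a shared counter.

```python
from collections import deque


class Model:
    """Hypothetical Python analogue of the model interface described above."""

    def init_states(self):
        raise NotImplementedError

    def actions(self, state):
        raise NotImplementedError

    def next_state(self, state, action):
        raise NotImplementedError

    def properties(self):
        """Return (name, predicate) pairs that must hold in every reachable state."""
        return []


def check(model):
    """Breadth-first search over reachable states; return (property, trace) or None."""
    frontier = deque((s, [s]) for s in model.init_states())
    seen = {s for s, _ in frontier}
    while frontier:
        state, trace = frontier.popleft()
        for name, holds in model.properties():
            if not holds(state):
                return name, trace
        for action in model.actions(state):
            nxt = model.next_state(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, trace + [nxt]))
    return None


class LostUpdate(Model):
    """Two processes each do a non-atomic counter += 1 (read, then write)."""

    def init_states(self):
        return [(0, (0, 0), (0, 0))]  # (counter, program counters, local copies)

    def actions(self, state):
        _, pcs, _ = state
        return [i for i in range(2) if pcs[i] < 2]  # process i can still step

    def next_state(self, state, i):
        counter, pcs, local = state
        pcs, local = list(pcs), list(local)
        if pcs[i] == 0:
            local[i] = counter      # read the shared counter
        else:
            counter = local[i] + 1  # write back the incremented local copy
        pcs[i] += 1
        return (counter, tuple(pcs), tuple(local))

    def properties(self):
        def both_counted(state):
            counter, pcs, _ = state
            return pcs != (2, 2) or counter == 2

        return [("both increments counted", both_counted)]


name, trace = check(LostUpdate())
print(name)        # both increments counted
print(trace[-1])   # final state with counter == 1
```

The reported trace has five states in which both processes read 0 before either writes, the classic lost update. Breadth-first order is what makes the reported counterexample a shortest one.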

Book review. Storyworthy: Engage, Teach, Persuade, and Change Your Life through the Power of Storytelling

The most powerful person in the world is the storyteller. The storyteller sets the vision, values and agenda of an entire generation that is to come. -- Steve Jobs

This book is by Matthew Dicks, a 48-time Moth StorySLAM winner and 6-time GrandSLAM champion. The book gives great tips about crafting stories. Earlier I had covered "Made to Stick" and "Talk Like TED" on presenting and storytelling. This book is at a different level than those. I strongly recommend it; it is as entertaining as it is informative. It is like a short-story version of the Hollywood movie-script storytelling format, which I covered briefly with "Nobody Wants to Read Your Shit". Both books have the same message, really: "You must streamline your message (staying on theme), and make its expression fun (organizing around an interesting concept)."

My highlights from the book

No one ever made a decision because of a number. They need a st

Sundial: Fault-tolerant Clock Synchronization for Datacenters

This paper appeared recently in OSDI 2020. It is about clock synchronization in the datacenter. I presented this paper to our distributed systems Zoom meeting group. I took a wider view of the problem by explaining time synchronization challenges and the fundamental techniques for achieving precise time synchronization. I will take the same path in this post as well. It is a bit of a circuitous road, but it makes for a scenic, pleasurable journey. So let's get going.

The benefits of better time synchronization

For any distributed system, timestamping and ordering of events are very important. Processes in a distributed system run concurrently without knowing what the other processes are doing at the moment. Processes learn about each other's states only by sending and receiving messages, and this information by definition comes from the past states of the nodes. A process needs to compose a coherent view of the system from these messages, and all the while the system is movi
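A basic building block behind most clock synchronization schemes is the two-way timestamp exchange used by NTP (and, with hardware timestamps, PTP): the client records its send time t1, the server records its receive time t2 and reply time t3, and the client records its receive time t4. Assuming the network delay is the same in both directions, the clock offset and round-trip delay follow directly. This is a generic sketch of that calculation, not Sundial's own algorithm; the function name `estimate_offset` is illustrative.

```python
def estimate_offset(t1, t2, t3, t4):
    """Two-way exchange: return (server clock minus client clock, round-trip delay).

    t1, t4 are read from the client clock; t2, t3 from the server clock.
    The offset estimate assumes symmetric one-way delays.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)  # excludes the server's processing time
    return offset, delay


# Worked example (microseconds): the server clock is 5 ahead of the client,
# each network direction takes 10, and the server spends 2 before replying.
print(estimate_offset(t1=100, t2=115, t3=117, t4=122))  # (5.0, 20)

# Asymmetric paths (14 out, 6 back) bias the estimate: it reports 9 instead of 5,
# an error of (14 - 6) / 2 = 4, which stays within delay / 2.
print(estimate_offset(t1=100, t2=119, t3=121, t4=122))  # (9.0, 20)
```

The second case shows why path asymmetry, which the exchange cannot observe, bounds the achievable precision: the error can be as large as half the round-trip delay, so shorter and more symmetric paths give tighter synchronization.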

Your attitude determines your success

This may sound like a cliché your dad used to tell you, but after many years of going through new areas, ventures, and careers, I find this to be the most underrated career advice. This is the number one piece of advice I would like my kids to internalize as they grow up. This is the most important idea I would like everyone undertaking a new venture to know. If you think you are not good enough, it becomes a self-fulfilling prophecy. If you think you are not enjoying something, you start to hate it. I have given examples of this several times before. Let this one suffice: In graduate school, I had read "Hackers: Heroes of the Computer Revolution" by Steven Levy and enjoyed it a lot. (I still keep the dog-eared paper copy with affection.) So, I should have read Steven Levy's Crypto book a long time ago. But for some reason I didn't, even though I was aware of the book. I guess that was due to a stupid quirk of mine; I had some aversion to the security/cryptography res

Defending Computer Science & Engineering in a life raft debate

What is a life raft debate? In the Life Raft Debate, we imagine that there has been a nuclear war, and the survivors (the audience) are setting sail to rebuild society from the ground up. There is a group of academic types vying to win the coveted Oar and get on the raft, and only one seat is left. Each professor has to argue that his or her discipline is the one indispensable area of study that the new civilization will need to flourish. At the end of the debate, the audience votes, and the lucky winner claims the Oar and climbs aboard, waving goodbye to the others. Maybe a discipline worth its salt would be able to build its own boat, no? Or a good discipline would have documented its findings so well, making itself a science rather than an art, that a practitioner would not be needed to transfer the knowledge. Which discipline do I think should be the one to go? Let me tell you, I would oppose having a computer science and engineering person on the boat before we make sur
