tag:blogger.com,1999:blog-84363307621363443792024-03-18T20:09:27.521-04:00MetadataOn distributed systems broadly defined and other curiosities. The opinions on this site are my own.Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.comBlogger660125tag:blogger.com,1999:blog-8436330762136344379.post-16993303226251821752024-03-15T10:03:00.003-04:002024-03-15T13:18:43.073-04:00 The demise of coding is greatly exaggerated<p><a href="https://twitter.com/Carnage4Life/status/1761483377365152234">NVIDIA CEO Jensen Huang recently made very controversial remarks</a>:</p><p><i>"Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite.</i></p><p><i>It is our job to create computing technology such that nobody has to program and that the programming language is human. Everybody in the world is now a programmer. </i><i>This is the miracle of artificial intelligence."</i></p><p>I am not going to wisecrack and say that this is power poisoning, and that this is what happens when your company's valuation more than triples in a year and surpasses Amazon and Google. (Although I don't discount this effect completely.)</p><p>Jensen is very smart and also has <a href="https://x.com/RachaelRad/status/1767612555063955891">some great wisdom</a>, so I think we should give this the benefit of the doubt and try to respond in a thoughtful manner. </p><p>A response is warranted because this statement got a lot of publicity and created confusion for a wide range of people, as it comes with some authority behind it. My brother asked me about this, presumably because he wanted to see how he might want to direct the education of his children.</p><p>My response is not motivated by turf-defending or by job security concerns about the rise of AI. 
I am a researcher, and my day-to-day job is not coding/programming. I don't feel threatened one bit by the proliferation of AI tools.</p><p><br /></p><h1 style="text-align: left;"><a href="https://en.wikipedia.org/wiki/The_king_is_dead,_long_live_the_king!">Coding is dead, long live coding</a></h1><p>With every new advancement in programming languages and technology, this concern came up anew. Some people declared coding dead, and some people freaked out. What ended up happening, over and over, is that we just got a higher-level specification/abstraction, and the demand for coding/programming went up thanks to those new developments. Moreover, old programming languages and their niches stayed mostly undisturbed. After more than six decades, <a href="https://en.wikipedia.org/wiki/COBOL">Cobol</a> is still widely used in applications deployed on mainframe computers, such as large-scale batch and transaction processing jobs.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgpqBNNdWhvVkv5p047uCpQ-0C1D8DiF8wpiSAeRYc-1mN_ACz1IIk4o4H0hrpY7IGWrv4pjzLX3TKHJ-jBt4j-Pceq9TJtvzf0Ql-tRv7J1ClFMj9jn0Jus44TOSgYmnpObAZTSNjZV81hax5nGPZJctZLWjYcTfYSBa0kkIC1Oqc9R5ogEZt9qJdJQqE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="613" data-original-width="650" height="605" src="https://blogger.googleusercontent.com/img/a/AVvXsEgpqBNNdWhvVkv5p047uCpQ-0C1D8DiF8wpiSAeRYc-1mN_ACz1IIk4o4H0hrpY7IGWrv4pjzLX3TKHJ-jBt4j-Pceq9TJtvzf0Ql-tRv7J1ClFMj9jn0Jus44TOSgYmnpObAZTSNjZV81hax5nGPZJctZLWjYcTfYSBa0kkIC1Oqc9R5ogEZt9qJdJQqE=w640-h605" width="640" /></a></div><p></p><p><a href="https://www.commitstrip.com/en/2016/08/25/a-very-comprehensive-and-precise-spec/">This comic strip, from CommitStrip (2016)</a>, sums it up well. There will always be coding. 
The abstraction level may go up, and we may start using domain-specific languages (DSLs), but we will still need to be precise and comprehensive in our specifications to solve real-world problems. The world is very complicated, and <a href="https://muratbuffalo.blogspot.com/2018/10/debugging-designs-with-tla.html">there are corner cases everywhere.</a></p><p><a href="https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/EWD667.html">Natural language is ambiguous and not suitable for programming</a>. LLMs still need to generate code to get things done. If not inspected carefully, this incurs tech debt at the monumental speed of computers. Natural language prompts are not repeatable/deterministic; they are subject to breaking at any time. This makes "natural language programming" unsuitable even for small-sized projects, let alone medium to large projects. </p><p>Moreover, some things are inherently very hard: they are AI-complete (to adapt the term NP-complete to the occasion: the hardest problems, whose solutions can be verified quickly but not necessarily found in any reasonable time). <a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">I use TLA+ for modeling and designing distributed systems and algorithms</a>, and I don't see AI replacing that anytime soon. 
There is simply too much subtlety, and a great deal of intelligence and expertise is required to work on the design of distributed systems and algorithms.</p><p>As my final argument (<a href="https://muratbuffalo.blogspot.com/2021/03/defending-computer-science-engineering.html">borrowing from one of my previous posts</a>), I would like to mention that a career in computer science and software technology (practicing coding) gives you vital and generally applicable skills: hacking, debugging, abstract thinking, quick learning/adaptation, and organizational skills.</p><p>Being supported by AI tools is not a substitute for <a href="https://muratbuffalo.blogspot.com/2018/03/master-your-tools.html">mastering these skills</a>. You cannot borrow skills/wisdom; you need to earn and own them. As the Turkish proverb says: "You cannot drive a water mill with hand-carried buckets of water". Or as the Amazonian proverb says: "There is no compression algorithm for (hands-on) experience".</p><p><br /></p><h1 style="text-align: left;">What next?</h1><p><a href="https://www.youtube.com/watch?v=vMKNUylmanQ#t=720s">Innovation begets innovation.</a> The emergence of new problems and domains is a great equalizer. As we discover things, we find new terrains opening up. And a new terrain is a good opportunity to make an impact without needing immense resources. AI is taking off (with a long, arduous journey ahead), so this is a great opportunity to take up computer science and coding. AI is software, and one day it will start producing software, so this only means it is a ripe opportunity to learn and work on software. </p><p>For the future, Jensen Huang suggested that "students should focus more on fields like biology, teaching, industry, or farming." This is bad advice again. Let people pursue their passion. (Unlike Calvin Newport, I am strongly in the passion camp.) If any of biology, teaching, industry, or farming is your passion (you will know if it is; it won't be ambiguous), pursue them. 
But it is very misguided to direct people away from computer science and software technology by saying that AI will take care of that and make it obsolete.</p><p>I think it is time to double down on computer science and software technology. I think we will start seeing computer science and software technology going further into the K-12 school curriculum. We will start to see more Pi-shaped people, who have depth in two areas and who pursue generalist applications. After building some depth, being a <a href="https://muratbuffalo.blogspot.com/2019/06/book-review-range-why-generalists.html">generalist</a> is a <a href="https://muratbuffalo.blogspot.com/2024/01/recent-reads.html">good strategy</a>.</p><p>Finally, let me air a grievance about a pet peeve of mine. Imagine the breakthroughs we could achieve, if only we could channel 1% of the resources/effort/interest being directed to researching/developing machine learning into <a href="https://youtu.be/dY7QNxXbziA?t=14">researching/developing human learning.</a></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com2tag:blogger.com,1999:blog-8436330762136344379.post-30916329397250030352024-03-14T17:21:00.001-04:002024-03-18T10:01:41.719-04:00Checking Causal Consistency of MongoDB<p><a href="https://hengxin.github.io/papers/2022-JCST-MongoDB-CCC.pdf">This paper</a> declares the Jepsen testing of MongoDB for causal consistency a bit lacking in rigor, and <a href="https://github.com/Tsunaou/Checking-Causal-Consistency-of-MongoDB">goes on to test MongoDB</a> against three variants of causal consistency (CC, CCv, and CM) under node failures, data movement, and network partitions. They also add <a href="https://github.com/hengxin/tla-causal-consistency">TLA+ specifications of causal consistency</a> and their checking algorithms, verify them using the TLC model checker, and discuss how the TLA+ specifications can be related to Jepsen testing.</p><p>This is a journal paper, so it is long. 
The writing could have been better, but it is apparent that a lot of effort went into the paper.</p><p>One thing I didn't like in the paper was the wall of formalism around defining causal consistency in Section 2: Preliminaries. This was hard to follow. And I was upset about the use of operational definitions, such as "bad patterns", for testing. Why couldn't they define this in a state-centric manner, as in <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">the client-centric database isolation paper</a>?</p><p>It turns out that Section 2 did not introduce any new content; rather, it was just reviewing the previous model established by the <a href="https://arxiv.org/abs/1611.00580">"On verifying causal consistency (POPL'17)" paper</a> by Bouajjani et al. So the reason the Preliminaries section fell flat was that it was trying to compress that 22-page Bouajjani paper into a couple of pages without much context. By quickly skimming that POPL'17 paper, I got sufficient context about the area. And boy, did I learn some interesting things about causal consistency.</p><p>So we will first look at that POPL'17 paper, and then we will come back to this paper for the TLA+ specs for causal consistency and the Jepsen testing of MongoDB.</p><p><br /></p><h2 style="text-align: left;">On Verifying Causal Consistency (POPL'17)</h2><p><a href="https://arxiv.org/abs/1611.00580">This is a good old distributed systems theory paper </a>without any experiments, and it is an immensely useful and practical paper as well. I learned a lot by just skimming this in an hour --yeah, I call that skimming, not reading.</p><p>The paper opens with a stunner: verifying whether all the executions of an implementation are causally consistent is undecidable! 
They explain this neatly in one paragraph: "This undecidability result might be surprising, since it is known that linearizability (stronger than CC) and eventual consistency (weaker than CC) are decidable to verify in that same setting. This result reveals an interesting aspect in the definition of causal consistency. Intuitively, two key properties of causal consistency are that (1) it requires that the order between operations issued by the same site to be preserved globally at all the sites, and that (2) it allows an operation o1 which happened arbitrarily sooner than an operation o2 to be executed after o2 (if o1 and o2 are not causally related). Those are the essential ingredients that are used in the undecidability proofs (that are based on encodings of the Post Correspondence Problem). In comparison, linearizability does not satisfy (2) because for a fixed number of sites/threads, the reordering between operations is bounded (since only operations which overlap in time can be reordered), while eventual consistency does not satisfy (1)."</p><p>But fear not: they also show that in practice this is not a problem. "We prove that reasoning about causal consistency w.r.t. the RWM abstraction becomes tractable under the natural assumption of data independence (i.e., read and write instructions is insensitive to the actual data values that are read or written)."</p><p>Data independence implies that it is sufficient to consider executions where each value is written at most once, i.e., differentiated histories. Differentiated histories mean we can determine, only by looking at the operations of a history, which write each read is reading from. There is no ambiguity, as each value can only be written once on each variable. In practice this comes from a timestamp or versionstamp attached to the data by the write operation.</p><p>I have seen this differentiated histories trick simplify and improve the checking of linearizability in practice. 
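</p><p>To make the trick concrete, here is a minimal sketch (my own illustration, not code from the paper) of how the read-from relation falls out of a differentiated history; the tuple encoding of operations is made up for the example.</p>

```python
def read_from(history):
    """Map each read op to the unique write it reads from.

    history: list of (op_id, kind, var, value) tuples, kind in {"w", "r"}.
    Assumes a differentiated history: each value is written at most once
    per variable, so (var, value) identifies at most one write.
    """
    writes = {}
    for op_id, kind, var, value in history:
        if kind == "w":
            assert (var, value) not in writes, "history is not differentiated"
            writes[(var, value)] = op_id
    # Each read's source is now unambiguous -- no search over candidate writes.
    return {op_id: writes.get((var, value))   # None models a read of the initial value
            for op_id, kind, var, value in history if kind == "r"}

h = [(1, "w", "x", 1), (2, "w", "x", 2), (3, "r", "x", 2), (4, "r", "x", 1)]
print(read_from(h))  # {3: 2, 4: 1}
```

<p>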
And the paper leverages this for checking causal consistency, and explains the importance of differentiated histories as follows: "In this characterization, the fact that we consider only differentiated executions is crucial. The reason is that all relations used to express bad patterns include the read-from relation that associates with each read operation the write operation that provides its value. This relation is uniquely defined for differentiated executions, while for arbitrary executions where writes are not unique, reads can take their values from an arbitrarily large number of writes. This is actually the source of complexity and undecidability in the non-data independent case." </p><p>A bad pattern is a set of operations occurring within an execution in some particular order, corresponding to a causal consistency violation. They show that, for a given execution, checking whether it contains a bad pattern can be done in polynomial time. They also show that for each bad pattern it is possible to effectively construct an observer (a state machine of sorts) that is able, when running in parallel with an implementation, to detect all the executions containing the bad pattern. The efficiency insight here is that proving causal consistency for any given implementation with differentiated histories reduces to proving its causal consistency for a bounded data domain. </p><p>The paper then defines different flavors of causal consistency and relates them to each other formally.</p><p></p><ul style="text-align: left;"><li>CC: allows non-causally dependent operations to be executed in different orders by different sites, and decisions about these orders to be revised by each site. This models mechanisms for resolving conflicts between non-causally dependent operations, where each site speculates on an order between such operations and possibly rolls back some of them if needed later in the execution, e.g., 
Bayou and <a href="https://muratbuffalo.blogspot.com/2012/09/dont-settle-for-eventual-scalable.html">COPS</a>.</li><li>CCv: assumes that there is a total order between non-causally dependent operations and each site can execute operations only in that order (when it sees them). Therefore, a site is not allowed to revise its ordering of non-causally dependent operations, and all sites execute in the same order the operations that are visible to them, e.g., Gentle-Rain, Bolt-On Causal Consistency.</li><li>CM: a site is allowed to diverge from another site on the ordering of non-causally dependent operations, but is not allowed to revise its ordering later on.</li></ul><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgED7bVUKG4BJGG_Qy1DNcwcfFync3QD-AQEuVsRpqXimPwPz_wphxNb0bt2xrmGCQe6vx9waFow5PJs6SHYWdq5aEruOu09HiJGsGMDfZOmKEMcgWPmPkSVsTzntbpEY0vPqn6bDz7LSmLRTcYK0B9JK88hVqmS9Cvj5OPFJpmw02z7plS-09R2RZjaHg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1576" data-original-width="1132" src="https://blogger.googleusercontent.com/img/a/AVvXsEgED7bVUKG4BJGG_Qy1DNcwcfFync3QD-AQEuVsRpqXimPwPz_wphxNb0bt2xrmGCQe6vx9waFow5PJs6SHYWdq5aEruOu09HiJGsGMDfZOmKEMcgWPmPkSVsTzntbpEY0vPqn6bDz7LSmLRTcYK0B9JK88hVqmS9Cvj5OPFJpmw02z7plS-09R2RZjaHg=s16000" /></a></div><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgvysg8W_uweO1jiVFK5Lbhvf6mlh6cuk3-ICEu6NCNUkZzC2kvl1AWTdCEfuBPDXoqAELfCq2-AARljm7f5V2Nd06J31y0Sg2-4hsDAOljLZnI_tn-dTbs78hvKF361NnxtxcO7RkgGVoSwgYfaObcApGjdX6mO1ikFYxqY-JtWU27A81l5hVx9em6Rj0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="624" data-original-width="1132" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEgvysg8W_uweO1jiVFK5Lbhvf6mlh6cuk3-ICEu6NCNUkZzC2kvl1AWTdCEfuBPDXoqAELfCq2-AARljm7f5V2Nd06J31y0Sg2-4hsDAOljLZnI_tn-dTbs78hvKF361NnxtxcO7RkgGVoSwgYfaObcApGjdX6mO1ikFYxqY-JtWU27A81l5hVx9em6Rj0=s16000" /></a></div><p></p><p>Both CCv and CM strengthen CC in independent and incomparable ways. And here are the bad patterns for checking whether a trace satisfies CC, CCv, or CM. Below, RF is the ReadFrom relation. PO is the program order, e.g., from operations executing in the same thread. The relation CO, defined as (PO \union RF)^+, represents the smallest causality order possible.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhVAWtew73tCo7Dchgg4foTR5b4MujcSwHOPjGsF4boGU4PAYggcXH-2RimsE1FhI_tR9qQyzbtOSce8UKJhbvA0iLiQnf_wsk9KbPnYcsq8k3Er9XKRaToyNfuLaIl8nEqMmLQUX-jDRVDlDzvxSeaUcpQIvqmEud-GOO5vrSSyLgjccNIgZiL0upvsqw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1460" data-original-width="1132" src="https://blogger.googleusercontent.com/img/a/AVvXsEhVAWtew73tCo7Dchgg4foTR5b4MujcSwHOPjGsF4boGU4PAYggcXH-2RimsE1FhI_tR9qQyzbtOSce8UKJhbvA0iLiQnf_wsk9KbPnYcsq8k3Er9XKRaToyNfuLaIl8nEqMmLQUX-jDRVDlDzvxSeaUcpQIvqmEud-GOO5vrSSyLgjccNIgZiL0upvsqw=s16000" /></a></div><br /><p><br /></p><h1 style="text-align: left;">TLA+ specifications</h1><p>Ok, back to the paper at hand. The paper presents <a href="https://github.com/hengxin/tla-causal-consistency">TLA+ specifications for CC, CCv, and CM</a>. They mention that model checking histories against CC, CCv, or CM as defined in Fig. 7 is prohibitively inefficient. 
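</p><p>To ground the terminology, here is a toy version (my own sketch, not the paper's tooling) of checking a single execution for the most basic kind of bad pattern: a cycle in CO. Since CO is supposed to be a causality order, any cycle in PO ∪ RF is already a violation.</p>

```python
def has_cyclic_co(ops, po, rf):
    """Detect a cycle in CO = (PO ∪ RF)^+ via depth-first search.

    ops: iterable of operation ids; po, rf: sets of (a, b) edges.
    """
    adj = {o: [] for o in ops}
    for a, b in set(po) | set(rf):
        adj[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on the DFS stack / done
    color = {o: WHITE for o in ops}

    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True               # back edge: CO has a cycle
        color[u] = BLACK
        return False

    return any(color[o] == WHITE and dfs(o) for o in ops)

# Program order at two sites contradicts the read-from edges between them:
print(has_cyclic_co([1, 2, 3, 4], po={(1, 2), (3, 4)}, rf={(2, 3), (4, 1)}))  # True
```

<p>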
So they go through some optimizations to improve the checking time, which culminates in implementing an efficient partial order enumeration algorithm in Python, and letting TLC call it when necessary.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi2hVQK4ymvJvNzbzBcnpg-KjNM_1WzKhEVIHcw_yGnC3xbZ3U5qxpakX_c83GiviAyR4fms1M8zM4nWCXc7rDRR6b-vDK-mSewave3_FcfZkU4wF7Wur5_U9ONXF2QbDavp8XIvwxQh4ZmXcBCYoDkjOB_e914oGV_hnsnKlg79qJFhIK6Z3YhoIYJizA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1546" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEi2hVQK4ymvJvNzbzBcnpg-KjNM_1WzKhEVIHcw_yGnC3xbZ3U5qxpakX_c83GiviAyR4fms1M8zM4nWCXc7rDRR6b-vDK-mSewave3_FcfZkU4wF7Wur5_U9ONXF2QbDavp8XIvwxQh4ZmXcBCYoDkjOB_e914oGV_hnsnKlg79qJFhIK6Z3YhoIYJizA=s16000" /></a></div><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiHw7Hv9YcKfTakTBo5gMF0RyQVJxb6wPWkxbvWXi58pTHx2w38y5CnFmYKyRRrwkqajvQIgQr-VrXdqjbav4emxNd-NdBHdBUnhhbCkyBaNPuNUfTyBoIjPKNJ8HUxW2QtrDYm2HUE9fgVzeWnN1Pf0OYSL_dkuU6zQMG5ryLhrpo56nHaHefJ-EVOaTo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1488" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEiHw7Hv9YcKfTakTBo5gMF0RyQVJxb6wPWkxbvWXi58pTHx2w38y5CnFmYKyRRrwkqajvQIgQr-VrXdqjbav4emxNd-NdBHdBUnhhbCkyBaNPuNUfTyBoIjPKNJ8HUxW2QtrDYm2HUE9fgVzeWnN1Pf0OYSL_dkuU6zQMG5ryLhrpo56nHaHefJ-EVOaTo=s16000" /></a></div><p>They use very small traces to check with TLA+ model checking. But they also mention that it is possible to combine/apply this to traces gathered from directing Jepsen to MongoDB. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgAylbPdBw6gtHkamkjKux2c_1uPtfc5G3duafqVOz7MsowxKhch3cFURX3kYX4F25XS0inXqyrbtkHvJ8QhqLkXz2yDN726qGa4Fm3rPx6ZBIjT1AImpiBQ77PVPF1S_3iDUVSlp6Pa4y0SU1IRCs7nBjVkJwevfAvGU1C8N6u2HvQ3rLTw9JqGCmrjxA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1096" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEgAylbPdBw6gtHkamkjKux2c_1uPtfc5G3duafqVOz7MsowxKhch3cFURX3kYX4F25XS0inXqyrbtkHvJ8QhqLkXz2yDN726qGa4Fm3rPx6ZBIjT1AImpiBQ77PVPF1S_3iDUVSlp6Pa4y0SU1IRCs7nBjVkJwevfAvGU1C8N6u2HvQ3rLTw9JqGCmrjxA=s16000" /></a></div><br /><p><br /></p><h1 style="text-align: left;">Extended Jepsen testing</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiAzDsY9g2v91hCuh7WdmRegYoJQPhpVthM7eYmOKaWfRd2mnffL7npKhDIaxBodjosHsRYEmgCUYLbVXb7LR8JLMhD4TC6faBWAVDFNXjALWsvULHMvpgrLBEkmo7huMJLFget-zEosF_4XsHJ0kCIzezC3T03hAMk3PmPLtAAr027s6RReEcj5UotXxc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="406" data-original-width="2160" src="https://blogger.googleusercontent.com/img/a/AVvXsEiAzDsY9g2v91hCuh7WdmRegYoJQPhpVthM7eYmOKaWfRd2mnffL7npKhDIaxBodjosHsRYEmgCUYLbVXb7LR8JLMhD4TC6faBWAVDFNXjALWsvULHMvpgrLBEkmo7huMJLFget-zEosF_4XsHJ0kCIzezC3T03hAMk3PmPLtAAr027s6RReEcj5UotXxc=s16000" /></a></div><p>They use Jepsen to get histories from MongoDB, and apply <a href="https://github.com/Tsunaou/Checking-Causal-Consistency-of-MongoDB">a Java-based implementation of CC, CCv, and CM to check causal consistency of these histories.</a></p><p>The experimental results confirm the claim in <a href="https://www.mongodb.com/docs/manual/core/causal-consistency-read-write-concerns/">MongoDB’s documentation</a> that in the presence of node failures or network partitions, causally consistent sessions guarantee causal consistency only for 
reads with majority readConcern and writes with majority writeConcern. In the absence of node failures or network partitions, causal consistency is guaranteed by all configurations, including read concern local and write concern w1.</p><p>Testing all of the CC variants gave identical results: there was no configuration where the CC, CCv, and CM checks disagreed with each other. So it was not clear whether studying these variants explicitly bought us anything. </p><p><br /></p><h1 style="text-align: left;">Discussion</h1><p>Ok, let me return to the question I had at the beginning. Is it possible to define causal consistency in a state-centric manner, rather than in an operational manner using bad patterns?</p><p>I checked <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">the client-centric database isolation paper,</a> and it does not include anything close to causal consistency (which is fair, as this is a consistency property and not an isolation property). The paper presents SER, SI, and ReadCommitted. And of course ReadCommitted could be arbitrarily stale, and that is not good. But I still don't see why it would not be possible to have a state-centric definition for causal consistency. Maybe the problem is how much metadata that would involve, and whether there would be a convenient way to represent/refer to that metadata.</p><p><br /></p><p>Ok, moving on to the next discussion point. Causal consistency is interesting. It provides a nice tradeoff point in the consistency space, similar to how <a href="https://muratbuffalo.blogspot.com/2024/01/scalable-oltp-in-cloud-whats-big-deal.html">SnapshotIsolation</a> does in the transactional isolation space. Causal consistency is the strictest consistency level where you can still make progress in the presence of partitions, as discussed in the <a href="http://www.cs.cornell.edu/lorenzo/papers/cac-tr.pdf">"Consistency, Availability, and Convergence" paper</a>. 
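</p><p>In MongoDB's terms, that tradeoff is opted into per session. Here is a configuration sketch (mine, not from the paper; it assumes PyMongo, a running replica set, and made-up database and document names) of the setup the experiments above validated: a causally consistent session with majority read/write concerns.</p>

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.get_database(
    "shop",                                  # hypothetical names for illustration
    read_concern=ReadConcern("majority"),    # the combination that holds up
    write_concern=WriteConcern("majority"),  # under failures and partitions
)

# Causal consistency is per-session: the session threads cluster/operation
# times through, so the read below observes the preceding write.
with client.start_session(causal_consistency=True) as s:
    db.orders.insert_one({"_id": 1, "state": "placed"}, session=s)
    doc = db.orders.find_one({"_id": 1}, session=s)  # read-your-writes
```

<p>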
And causal consistency provides "Read your own writes", which is very helpful and sought after for developing applications. That sounds like a good tradeoff point, no?</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgOjjrxBCBNTu2l8XzsSPYGewJvSQhNfwXROaPt07SAXfTanEnDoRapXmyjRnKKHAuXNvpKLZnvnsEc9cReYICq9hCZZc6B8csSDLliKHVLR7_8LB-rTeClqrqPzIfUipurC3RDrb3QEFqRRu06gGFCUWb0AcyQh7z6BTSEPrEqTcbVbuAE2wT2ZyOHEao" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1032" data-original-width="1514" height="436" src="https://blogger.googleusercontent.com/img/a/AVvXsEgOjjrxBCBNTu2l8XzsSPYGewJvSQhNfwXROaPt07SAXfTanEnDoRapXmyjRnKKHAuXNvpKLZnvnsEc9cReYICq9hCZZc6B8csSDLliKHVLR7_8LB-rTeClqrqPzIfUipurC3RDrb3QEFqRRu06gGFCUWb0AcyQh7z6BTSEPrEqTcbVbuAE2wT2ZyOHEao=w640-h436" width="640" /></a></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-38317690788303427402024-03-08T21:54:00.004-05:002024-03-08T22:06:37.865-05:00 Transaction Processing Monitors (Chapter 5. Transaction processing book)<p>"Transaction Processing Monitors" is chapter 5 in <a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">our transaction processing book reading journey. </a></p><p>This chapter has been the hardest chapter to read and understand. It went into implementation concerns, in contrast to the previous chapters, which discussed design principles and concepts. It turns out the 1980s were a very different time, and it is very hard to relate to the systems and infrastructure of that era. The people in our reading group were also lost, and found this chapter very hard to engage with. 
</p><p><br /></p><h1 style="text-align: left;">1980s</h1><p><a href="https://www.pcworld.com/article/423125/9-awesome-photos-of-school-computer-labs-from-the-1980s.html"><img alt="1980s lab 09" src="https://images.techhive.com/images/article/2015/08/1980s_lab_09-100608917-large.jpg?auto=webp&quality=85,70" /></a></p><p>Ok, we are back in the year 1990, when this book was being written. This was when the internet and the web were still obscure. Client-server processing was the modern thing then, and had just started to replace time-sharing for databases and transaction processing.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgbfB0orkNF7UDrNspXovumiKLdURnjNGOIDXcYUSiA2UwFEw9PtReZRfMjmKX38BTrfiNil9WcVhJ-b_P7_eYPiXgpbU0KwFVv7SVItq3qyhLWJiyMNCvJ2-LvP_wXED4Xigzv_uE0TedmH0ZT_F23dyOsPlr_Q1LQF9S59vyBERWGdtmLTYemP1_Ovf8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="828" data-original-width="1200" src="https://blogger.googleusercontent.com/img/a/AVvXsEgbfB0orkNF7UDrNspXovumiKLdURnjNGOIDXcYUSiA2UwFEw9PtReZRfMjmKX38BTrfiNil9WcVhJ-b_P7_eYPiXgpbU0KwFVv7SVItq3qyhLWJiyMNCvJ2-LvP_wXED4Xigzv_uE0TedmH0ZT_F23dyOsPlr_Q1LQF9S59vyBERWGdtmLTYemP1_Ovf8=s16000" /></a></div><p>This chapter talks about transaction-oriented processing. If I squint at it, I can see the concepts of SLAs and shared responsibility models between a cloud customer and provider in these descriptions.</p><p><i>"Despite these additional responsibilities, the performance requirements of TP systems are similar to those of real-time systems. Even though no maximum response time for all requests of a certain type must be guaranteed, the usual requirement is that around 90% of all requests have a response time less than x seconds. This qualifies transaction processing systems as soft real-time systems."</i></p><p><i>"System does recovery. 
Because of the use of shared data, there must be formal guarantees of consistency that are automatically maintained. After a crash, all users must be informed about the current state of their environment, which functions were executed, which were not, and so on. The guiding principle here is determined by the ACID properties of transactions."</i></p><p><br /></p><h2 style="text-align: left;">Transaction processing monitor (TP monitor)</h2><p>This was the term we found very confusing. We were not able to see which concept in modern databases the TP monitor corresponds to. This seems to be a fuzzy term even at that time, as the book warns: <i>"In a contest for the least well-defined software term, TP monitor would be a tough contender."</i></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgYG1qHvtFBX6ljz0uxIpsRwKOpRlIS8nlvLBr2oRqknFC6drCrmh1Twt1V02ruo5yXP4zpkQ9bpPItSBXKqqvmo-nzJ6onZtbCUNFLvTIiAUITwi-0f4uyMH-kbi6L3kg5wMgGRzwS-cthJPMU1No39KXpIQ8RPq0jmWrKw9XtBiYgrlCE9LkYlPRBjyY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1508" data-original-width="1172" src="https://blogger.googleusercontent.com/img/a/AVvXsEgYG1qHvtFBX6ljz0uxIpsRwKOpRlIS8nlvLBr2oRqknFC6drCrmh1Twt1V02ruo5yXP4zpkQ9bpPItSBXKqqvmo-nzJ6onZtbCUNFLvTIiAUITwi-0f4uyMH-kbi6L3kg5wMgGRzwS-cthJPMU1No39KXpIQ8RPq0jmWrKw9XtBiYgrlCE9LkYlPRBjyY=s16000" /></a></div><p>Do you notice the TP monitor box on the left? It interacts with the application programs: it dispatches them and gets notified when they are done. The application program invokes transactions, which are managed by a transaction model. 
This is a very confusing diagram, and it is hard to make anything of the TP monitor, since we are missing context about the workloads and software architectures of the time.</p><p><br /></p><p>This is where I emailed <a href="https://pathelland.substack.com/">Pat Helland</a>, <a href="https://muratbuffalo.blogspot.com/search?q=pat+helland">the apostate philosopher of database systems</a>, and my dear friend, for help. I imagine he had a smile emerge on his lips, and he looked into the distance out of his office window, and traveled down memory lane. Below is what he wrote back. This was very helpful.</p><p>TLDR: <b>A tp-monitor is an application execution environment for stylized applications focused on connecting humans to databases in an OLTP environment.</b></p><p>-----</p><p>Most computing in the 1960s & 1970s was memo-post batch work:</p><p></p><ul style="text-align: left;"><li>Load the state of the business from last night's mag tapes</li><li>Load the changes to "post" against last night's state of the system</li><li>Process the changes</li><li>Dump tonight's state on mag tapes.</li></ul><p></p><p>In the 1960s, 1970s, & early 80s, a "transaction" had two broad meanings:</p><p></p><ul style="text-align: left;"><li>An interaction with a human</li><li>A type of computation to support that interaction with a human</li><ul><li>Example 1) a "debit-credit" transaction (e.g., TPC-B) meant the work a teller at a bank did to deposit or withdraw</li><li>Example 2) an airline reservation transaction would book a flight through the Sabre system </li></ul></ul><p></p><p>There were a number of application execution environments helping run "transactions" -- the code for work by humans against the computer.</p><p></p><ul style="text-align: left;"><li>CICS (Customer Information Control System) : This worked with different databases to store data -- notably IMS/DB</li><li>IMS : Both IMS/DB (a hierarchical database) and IMS/DC (a combination of an app execution environment and 
data communications to front-end terminals; this was later rebranded IMS/TM)</li><li>Tandem NonStop Pathway: A Tandem-specific TP monitor for block-mode terminal interaction with the NonStop SQL database</li><ul><li>Handled requests from block mode terminals</li><li>Processed a transaction per terminal interaction against the database</li><li>Designed for transparent fault tolerance with idempotent (transactional) processing of the work.</li></ul><li>Lots more...</li></ul><p>So, in summary, a tp-monitor is an application execution environment for stylized applications focused on connecting humans to databases in an OLTP environment. </p><p>----</p><p><br /></p><h2 style="text-align: left;">The application database duality</h2><p>The application and the database were one at the beginning. In the 1970s, a "database" included apps and managing directly attached block mode terminals. Networking meant hooking up to your terminals. These were data management systems that were more like workflow systems, where the data management parts were interleaved within the application logic. </p><p>Then the database (as a concept) gradually got peeled out of the integrated whole, as abstractions like the relational model and transactions started to emerge. But for performance reasons, a new kind of coupling started. Stored procedures were commonly employed to execute application code/logic in the database. We saw the application reach into the database.</p><p>Since then we have seen the pendulum swing back and forth between these directions. <a href="https://a16z.com/the-modern-transactional-stack/">This post provides a recent discussion of the workflow-centric versus database-centric transactions concept.</a> </p><p><i>"As Ian Livingstone (who provided feedback on this piece) put it, “It’s the classic ‘Do you bring the application logic to the database, or the database to the application logic?’ playing out again ... 
this time brought on by breaking up the monolith.” Having had that dichotomy for decades, it’s clear both models will persist in the short term. It’s far less clear that’ll remain the case in the long run. "</i></p><p><br /></p><p>Finally, the transaction processing book mentions an interesting direction: "<i>Some believe it would be best if the operating systems just swallowed the TP monitor, thus making transactions a basic system service. This issue is reconsidered in Chapter 6, after the similarities between operating systems and TP systems have become clear."</i> </p><p>I think today <a href="https://dbos-project.github.io/blog/intro-blog.html">the DBOS project</a> is trying to do research in this direction.</p><p><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-50798923530015864462024-03-08T14:32:00.002-05:002024-03-08T15:08:36.097-05:00 Why I blog<p>My blog has been going for 14 years now, and has just passed 4 million pageviews. Yay! I remember <a href="https://muratbuffalo.blogspot.com/2017/02/1-million-pageviews.html">the 1 million pageviews moment in 2017</a>!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjt2T5Z2mnxbHL2npS1HmIgm7vsoFQltWbBjh9E_jsel6sjo2Du9o3pfXYZmQLVFDgooiL0dL_hGun9cljKuknkdvpOsnNMh3P_eB8UdzHsfilG55SYSSPTlYINwbQS0EqcohVDQJBNed3RYljx-nE2GevqI9-rbKMvSO8fCURT6NeJIzBE0hODJO-geh4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="608" data-original-width="1964" src="https://blogger.googleusercontent.com/img/a/AVvXsEjt2T5Z2mnxbHL2npS1HmIgm7vsoFQltWbBjh9E_jsel6sjo2Du9o3pfXYZmQLVFDgooiL0dL_hGun9cljKuknkdvpOsnNMh3P_eB8UdzHsfilG55SYSSPTlYINwbQS0EqcohVDQJBNed3RYljx-nE2GevqI9-rbKMvSO8fCURT6NeJIzBE0hODJO-geh4=s16000" /></a></div><p>The main reason I was able to persist for so long is that I blog for selfish reasons.
Let me try to unpack why I blog, and why I keep blogging.</p><p><br /></p><h1 style="text-align: left;">I write for myself</h1><p>The audience I have in mind is myself. I blog to clarify my understanding and thinking about a topic. </p><p>Reading a research/technical paper is already time-consuming. <a href="https://muratbuffalo.blogspot.com/2022/02/deep-reading.html">I can't do it in less than 4 hours.</a> Period.</p><p><a href="https://muratbuffalo.blogspot.com/2021/12/learning-technical-subject.html">I love learning.</a> And I am fortunate that I get to read research papers as part of my work. I double-dip on this effort to blog about them, to improve my understanding of these papers. <a href="https://muratbuffalo.blogspot.com/2013/07/how-i-read-research-paper.html">Writing a blog post is the final step in my pipeline for reading a paper.</a></p><p>I think my blog reviews of papers hit a good niche. Research papers are written for the wrong audience (or rather maybe the right audience but for the wrong reason): <a href="https://muratbuffalo.blogspot.com/2020/02/how-to-write-papers-so-they-get-accepted.html">they are written to please 3 specific expert reviewers who are overwhelmingly from academia</a>. Thus much of the benefit from the research and writing goes to waste. If we didn't have this objective of having to look impressive for peer review (and the resulting <a href="https://en.wikipedia.org/wiki/Handicap_principle">costly signaling effect</a>), I believe we would be able to learn way more from the research papers. The authors would aim to educate rather than impress. They would not need to be defensive about their work, and would introspect about their learnings and<a href="https://muratbuffalo.blogspot.com/2012/01/tell-me-about-your-thought-process-not.html"> their thought processes</a>.
In effect, this is what I do on their behalf when I write a blog review for my understanding of the papers.</p><p>Writing for myself also keeps the voice/tone of the posts natural. Not condescending, not too pedantic. Inquisitive and somewhat playful. This puts that small bit of personal touch there. </p><p><br /></p><h1 style="text-align: left;">I set a goal of blogging once a week</h1><p>I set this goal not for the quantity of the output (more than 650 posts total, yay!), but for keeping myself in the writing frame-of-mind. What I refer to as the writing frame-of-mind is basically noticing the world around me, starting with the sphere of distributed and database systems and expanding outwards and inwards. Staying in a writing frame-of-mind helps me notice things, and be mindful. The writing act following the noticing helps me process my understanding and sometimes my feelings.</p><p>I find that I need to have the intention to write before I can find something worthy of writing. Sometimes to satisfy my one-post-per-week goal, I say "OK, fine, this topic is not interesting, but I can attempt to write something about it". And holy moly, am I surprised to find a fount of interesting things about that topic when I start to write. Many posts that I would not have written without this weekly goal turned out to be very insightful. I don't know what I think about something till I write about it: "Writing is nature's way of telling you how sloppy your thinking is."</p><p>The weekly posting rule helps me maintain a cadence and prevents my writing muscles from atrophying. This also helps me lower my standards for writing, which paradoxically maintains and raises my standards.</p><p>Another point of blogging regularly is to put myself out there in public with my understanding of a topic.
This is a low-stakes risk, but even this accountability works for improving the quality of my reasoning and understanding, and ensures that I learn something worth sharing every week.</p><p><br /></p><h1 style="text-align: left;">I don't fret the mechanics </h1><p>Well, I write on blogspot (thank you Google, and let me knock on wood!). Any platform that is open and usable would work. I don't spend time thinking about this, because I primarily blog for myself. The post being public, rather than remaining on the local disk, is important for putting myself out in the open and taking accountability.</p><p>I write on emacs, using org-mode, and then copy it to blogspot. I don't have a git pipeline to push to the blog, etc. Copying it and doing the final edits take me 5-10 minutes. </p><p>Emacs org-mode is my playground. I can be a text-wrangler there. And play with ideas. Again, I don't know what I think about till I write about it.</p><p><br /></p><h1 style="text-align: left;">Thank you for reading! I really appreciate it</h1><p>I don't ask anybody to review my posts and give me feedback before I post. My writing is personal; I write for myself with a low bar, and I am ok if no one reads it. Of course I am happy when people read it and find it useful. Or even when they criticize something about it, which leads me to compare/contrast their way of thinking with mine. But I don't ask for this or rely on this at all. </p><p>I have been continuously impressed by the reach my blog has had. Around 2016-18, I was surprised when people I met at conferences mentioned they read and liked my blog posts. I first thought of this as a fluke, but this kept coming up more frequently since then. Holy moly, people read blogs!</p><p>I think the reason is that textbooks are not that common anymore. Research papers are not very accessible, and they definitely lack the personal touch. Accessible information with a personal touch comes from blog posts.
</p><p>I really like doing some community service via these blog posts, even though they primarily serve to scratch my own itch. Boy, did I get lucky or what. By being selfish, I also help others, and I do enjoy the <a href="https://en.wikipedia.org/wiki/Ubuntu_philosophy">ubuntu</a> arising from this. I love learning together.</p><p>I was worried that other people reading would make me too shy to post. I think this may have curbed some half-baked thoughts or opinion pieces (I should work on this). But overall I don't think I have been affected too much. I know that I need to write for myself, and I try to post once a week to keep me going.</p><p><br /></p><h1 style="text-align: left;">Learnings</h1><p>I should be writing more. And I should be more open.</p><p>I can only be myself. And I like it better when I can be myself.</p><p>Ok, on that point of being more open and transparent, and maybe to criticize this post myself, here is another thing. This is what I think is my reason for blogging. But maybe I am just retroactively rationalizing why I blog. Maybe I am <a href="https://muratbuffalo.blogspot.com/2018/05/misc-rambling.html">predestined</a> or <a href="https://muratbuffalo.blogspot.com/2020/07/the-great-work-of-your-life-by-stephen.html">predetermined</a> to blog. I get an occasional creative itch, and blogging comes as a way to serve this. And I suspect I am wired to be a bit oblivious, and I don't mind learning in the open and sharing. So maybe these are just rationalizations rather than useful advice. But give it a try before you dispense it, no? We could use more people blogging and sharing their learning.
Transaction processing book)<p>Atomicity does not mean that something is executed as one instruction at the hardware level with some magic in the circuitry preventing it from being interrupted. Atomicity merely conveys the impression that this is the case, for it has only two outcomes: the specified result or nothing at all, which means in particular that it is free of side effects. Ensuring atomicity becomes trickier when faults and failures are involved.</p><p>Consider the disk write operation, which comes in four quality levels:</p><p></p><ul style="text-align: left;"><li>Single disk write: when something goes wrong, the outcome of the action is neither all nor nothing, but something in between.</li><li>Read-after-write: This implementation of the disk write first issues a single disk write, then rereads the block from disk and compares the result with the original block. If the two are not identical, the sequence of writing and rereading is repeated until the block is successfully written. This has problems: there is no abort path, no termination guarantee, and no guarantee against partial execution.</li><li>Duplexed write: In this implementation each block has a version number, which is increased upon each invocation of the operation. Each block is written to two places, A and B, on disk. First, a single disk write is done to A; after this operation is successfully completed, a single disk write is done to B. Of course, when using this scheme, one also has to modify the disk read. The block is first read from position A; if this operation is successful, it is assumed to be the most recent, valid version of the block. If reading from A fails, then B is read.</li><li>Logged write: The old contents of the block are first read and then written to a different place (on a different storage device), using the single disk write operation. Then the block is modified, and eventually a single disk write is performed to the old location.
To protect against unreported transient errors, the read-after-write technique could be used. If writing the modified block was successful, the copy of the old value can be discarded.</li></ul><p></p><p>Maybe due to their cost, these higher levels of atomicity were often not implemented. The book says:<i> "Many UNIX programmers have come to accept as a fact of life the necessity of running FSCHK after restart and inquiring about their files at the lost-and-found. Yet it need not be a fact of life; it is just a result of not making things atomic that had better not be interrupted by, for example, a crash."</i></p><p>I guess things improved only after journaled/log-based filesystems got deployed.</p><p>The book looks specifically at how to organize complex applications into units of work that can be viewed as atomic actions. The book provided some guidelines in 4.2.2, with the categorization of unprotected actions, protected actions, and real world actions.</p><p></p><ul style="text-align: left;"><li><b>Unprotected actions.</b> These actions lack all of the ACID properties except for consistency. Unprotected actions are not atomic, and their effects cannot be depended upon. Almost anything can fail.</li><li><b>Protected actions.</b> These are actions that do not externalize their results before they are completely done. Their updates are commitment controlled; they can roll back if anything goes wrong before the normal end. They have ACID properties.</li><li><b>Real actions.</b> These actions affect the real, physical world in a way that is hard or impossible to reverse.</li></ul><p></p><p>The book gives some guidelines.</p><p><i>Unprotected actions must either be controlled by the application environment, or they must be embedded in some higher-level protected action. If this cannot be done, no part of the application, including the users, must depend on their outcome.</i></p><p><i>Protected actions are the building blocks for reliable, distributed applications.
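Going back to the four disk write quality levels above, here is a toy sketch of a duplexed write (my own simplification in Python, not code from the book): each block carries a version number and a checksum, is written to place A and then to place B, and the read falls back to B when the copy at A turns out to be damaged.

```python
import zlib

class DuplexedStore:
    """Toy model of the book's duplexed write: two copies per block."""

    def __init__(self):
        self.place_a = {}  # block_id -> (version, data, checksum)
        self.place_b = {}

    def write(self, block_id, data):
        version = self.place_a.get(block_id, (0, b"", 0))[0] + 1
        record = (version, data, zlib.crc32(data))
        self.place_a[block_id] = record  # single disk write to A first...
        self.place_b[block_id] = record  # ...then, after A succeeds, to B

    def read(self, block_id):
        # Read A first; if it is damaged (checksum mismatch), fall back to B.
        for place in (self.place_a, self.place_b):
            record = place.get(block_id)
            if record is not None:
                version, data, checksum = record
                if zlib.crc32(data) == checksum:
                    return data
        raise IOError("both copies are damaged")

store = DuplexedStore()
store.write("b1", b"account balance: 100")
# Simulate a torn write that corrupts copy A:
store.place_a["b1"] = (1, b"garbled", 0)
assert store.read("b1") == b"account balance: 100"  # served from copy B
```

This is of course only the failure-masking half of the story; the version numbers matter when a crash happens between the write to A and the write to B, so a reader can tell which copy is current.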
Most of this text deals with them. Protected actions are easier to deal with than Real actions. They can always be implemented in such a way that repeated execution (during recovery) yields the same result; they can be made idempotent.</i></p><p><i>Real actions need special treatment. The system must be able to recognize them as such, and it must make sure that they are executed only if all enclosing protected actions have reached a state in which they will not decide to roll back. In other words, they suggested using real world actions only after the transaction is checked/committed. When doing something that is very hard or impossible to revoke, try to make sure that all the prerequisites of the critical action are fulfilled.</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjvBeqNB1l9xtiO5UuGghbhNpvaSF_n7CnW_TpQsCkarg6ffoniceVm6w7B3Vagc7j0cJjYnbxFGmR4WZ23e2Xpo2DfP5p6j58lyiYSlqLCmCQxU_X9OydFE1CnzfawjT8_qQyzg6BZHe9TLMYAhGb3VFXWRq8SHdFtcBtyu9rot4w7lEEfzGu5ZgkJZds" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="884" data-original-width="1112" height="318" src="https://blogger.googleusercontent.com/img/a/AVvXsEjvBeqNB1l9xtiO5UuGghbhNpvaSF_n7CnW_TpQsCkarg6ffoniceVm6w7B3Vagc7j0cJjYnbxFGmR4WZ23e2Xpo2DfP5p6j58lyiYSlqLCmCQxU_X9OydFE1CnzfawjT8_qQyzg6BZHe9TLMYAhGb3VFXWRq8SHdFtcBtyu9rot4w7lEEfzGu5ZgkJZds=w400-h318" width="400" /></a></div><br /><p>The book then talks about flat transactions, and defines the ACID components. For isolation, the book emphasizes the client observable behavior. 
A great paper that revisits isolation from observable behavior is the <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">"Seeing is Believing: A Client-Centric Specification of Database Isolation" paper</a>.</p><p><i>Isolation simply means that a program running under transaction protection must behave exactly as it would in single-user mode. That does not mean transactions cannot share data objects. Like the other definitions, the definition of isolation is based on observable behavior from the outside, rather than on what is going on inside.</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjUqmd1TbXgocPTTC5XmU2EjxQsA26kT2fLTy1SYjAppEn6Ikl_tOwi41loRd5k4CwdbA1sPGHXt_0LhXQKdx8Vrlgfwa4JEc85aDQvE4WNinOm1wfo2p49pCwV7vAr6hJHfW_Q1gRn1rCj4eBm3eJGn2EnWDgRWxEcSlFjLkW9Y3jY0moocCgmeXfoNoo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="734" data-original-width="1112" height="264" src="https://blogger.googleusercontent.com/img/a/AVvXsEjUqmd1TbXgocPTTC5XmU2EjxQsA26kT2fLTy1SYjAppEn6Ikl_tOwi41loRd5k4CwdbA1sPGHXt_0LhXQKdx8Vrlgfwa4JEc85aDQvE4WNinOm1wfo2p49pCwV7vAr6hJHfW_Q1gRn1rCj4eBm3eJGn2EnWDgRWxEcSlFjLkW9Y3jY0moocCgmeXfoNoo=w400-h264" width="400" /></a></div><br /><p>As an example of flat transactions, the book gives the classic credit/debit example. 
Strong smart contract energy emanates from this SQL program!</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj7sUT4fPNcg4z_CLEqWUJrag_hbCdOghe_wzONBpcWQL4OOb8Nqu0aI-xm5wd03pQ6Q5mfC4zsc2JBwFZsY05IuUVCiG5TIKwxbsuFUldgLD7GK-F92DWcJI_l8xx4I4rbkRLT9qmQBqRnegspyBud-f4GnOSojsA5049GIwrKuReR908q6ptD8NRDWiA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1012" data-original-width="1112" height="364" src="https://blogger.googleusercontent.com/img/a/AVvXsEj7sUT4fPNcg4z_CLEqWUJrag_hbCdOghe_wzONBpcWQL4OOb8Nqu0aI-xm5wd03pQ6Q5mfC4zsc2JBwFZsY05IuUVCiG5TIKwxbsuFUldgLD7GK-F92DWcJI_l8xx4I4rbkRLT9qmQBqRnegspyBud-f4GnOSojsA5049GIwrKuReR908q6ptD8NRDWiA=w400-h364" width="400" /></a></div><p></p><p>The book then considers generalizations of flat transactions, arguing that examples like trip planning and bulk updates are not adequately supported by flat transactions. The book acknowledges that these generalizations, like chained transactions, nested transactions, generalized transactions, are more a topic of theory than of practice. It says: Flat transactions and the techniques to make them work account for more than 90% of this book. No matter which extensions prove to be most important and useful in the future, flat transactions will be at the core of all the mechanisms required to make these more powerful models work.</p><p>It says that this generalization discussion serves a twofold purpose: to aid in understanding flat transactions and their ramifications through explanation of extended transaction models, and to clarify why any generalized transaction model has to be built from primitives that are low-level, flat, system transactions.</p><p>This is where we go through the notion of spheres of control, which sounds like the realm of self-improvement books.
</p><p>The key concept for this was proposed in the early 1970s and actually triggered the subsequent development of the transaction paradigm: the notion of spheres of control. Interestingly, Bjork and Davies started their work by looking at how large human organizations structure their work, how they contain errors, how they recover, and what the basic mechanisms are. At the core of the notion of spheres of control is the observation that controlling computations in a distributed multiuser environment primarily means:</p><p></p><ol style="text-align: left;"><li>Containing the effects of arbitrary operations as long as there might be a necessity to revoke them, and</li><li>Monitoring the dependencies of operations on each other in order to be able to trace the execution history in case faulty data are found at some point. </li></ol><p></p><p>Any system that wants to employ the idea of spheres of control must be structured into a hierarchy of abstract data types. Whatever happens inside an abstract data type (ADT) is not externalized as long as there is a chance that the result might have to be revoked for internal reasons.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgsvXUmkKZjrX94HcvzMXSsX3zUqd43GUpgVYpt5pLYGIlv27hZKRWWCyzG-R3UdHQw62lEuY5d5S0KpcuP0DIPqtY1peK-Dwu-iX2eRTXZhmYy6J8WdaPuPBVmUKNutRPm4kMgGDJARZ4jkvuhsmx1gFIvslS-tXEuz9X8Hm7LoB0ydOiOCNgK6ee8VHc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="988" data-original-width="1112" height="356" src="https://blogger.googleusercontent.com/img/a/AVvXsEgsvXUmkKZjrX94HcvzMXSsX3zUqd43GUpgVYpt5pLYGIlv27hZKRWWCyzG-R3UdHQw62lEuY5d5S0KpcuP0DIPqtY1peK-Dwu-iX2eRTXZhmYy6J8WdaPuPBVmUKNutRPm4kMgGDJARZ4jkvuhsmx1gFIvslS-tXEuz9X8Hm7LoB0ydOiOCNgK6ee8VHc=w400-h356" width="400" /></a></div><p>This is actually a great guideline for writing transactional applications.
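The containment idea can be made concrete with a tiny sketch (my own toy Python, not code from the book): effects produced inside a sphere are held back and externalized only at commit, so they remain revocable until then.

```python
class Sphere:
    """Toy sphere of control: contains effects until they can no longer be revoked."""

    def __init__(self, externalize):
        self.externalize = externalize  # callback that makes an effect visible outside
        self.pending = []               # contained effects, still revocable

    def do(self, effect):
        self.pending.append(effect)     # contain the effect; nothing leaves yet

    def commit(self):
        for effect in self.pending:     # only now do effects cross the boundary
            self.externalize(effect)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()            # revoke: the outside world saw nothing

outside_world = []
sphere = Sphere(outside_world.append)
sphere.do("ship order")
sphere.rollback()                       # contained, so revocation is trivial
assert outside_world == []
sphere.do("ship order")
sphere.commit()
assert outside_world == ["ship order"]
```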
The book also prescribes preparing compensating actions for earlier externalized effects. In general, I think it is very beneficial to show more application/pattern recipes for organizing applications into units of work that can be viewed as atomic actions.</p><p>The book differentiates between structural and dynamic dependencies. It wasn't clear to me if this differentiation becomes consequential later on. But structural dependencies may become useful for query/transaction planning, and dynamic dependencies for concurrency control. </p><p></p><ul style="text-align: left;"><li>Structural dependencies. These reflect the hierarchical organization of the system into abstract data types of increasing complexity.</li><li>Dynamic dependencies. As explained previously, this type of dependency arises from the use of shared data.</li></ul><p></p><p>The book introduces a graphical notation. I guess it makes for some pretty diagrams. But I would have preferred a formal or pseudocode notation.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiyPw4Fcg_kMELo3EwoUQm7QFJ6IienmKxYfucCfGwfVxq0vSUwYN7nemKnoWaJcIBSFjVf-Rs4TchP5AVgNZm5CV3-_WybtIoFSlcxB0V_xqX_VwsXY70Uh45wfsNG3m17_ShQug3jQKg5Lt15mzaOeai05QGDncajG-uGN34B8b9A5VWLBdQ1JvtIzfE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="798" data-original-width="1112" height="288" src="https://blogger.googleusercontent.com/img/a/AVvXsEiyPw4Fcg_kMELo3EwoUQm7QFJ6IienmKxYfucCfGwfVxq0vSUwYN7nemKnoWaJcIBSFjVf-Rs4TchP5AVgNZm5CV3-_WybtIoFSlcxB0V_xqX_VwsXY70Uh45wfsNG3m17_ShQug3jQKg5Lt15mzaOeai05QGDncajG-uGN34B8b9A5VWLBdQ1JvtIzfE=w400-h288" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a
href="https://blogger.googleusercontent.com/img/a/AVvXsEgWj6ADdxyjzr3Q23Q12-31tmkBsCEzmav_NXego0KR03cQTUSji8ShKo9SRCA1RPTQJaSk9WQH2mpjjcaaYSyu86DeUixayPwSVISqQ-tj0Ky9LfuC26rbsOpMlNYP9MZB6tTd4mz5UZdE-6INi28UXh605DECHk3qCkcwRzjJs9mAT9Ret9ZJvcZzXg8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1144" data-original-width="1112" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEgWj6ADdxyjzr3Q23Q12-31tmkBsCEzmav_NXego0KR03cQTUSji8ShKo9SRCA1RPTQJaSk9WQH2mpjjcaaYSyu86DeUixayPwSVISqQ-tj0Ky9LfuC26rbsOpMlNYP9MZB6tTd4mz5UZdE-6INi28UXh605DECHk3qCkcwRzjJs9mAT9Ret9ZJvcZzXg8=w388-h400" width="388" /></a></div><p>To settle this problem for the moment, we adopt the following view. If a transaction in the chain aborts during normal processing, then the chain is broken, and the application has to determine how to fix that. However, if a chain breaks because of a system crash, the last transaction, after having been rolled back, should be restarted. This interpretation is shown graphically in Figure 4.11.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgaJqXvQMpM2Rid7VwVk2jtk92sR83AmYb3wR9CneBGcWL0ihIjNCil2qzRH-3UxCx2Ir2SvG1BzMIG7XQQxigedMs3-mq0GgP3cnUuTP6LXTSK1JT1oTbyCqyphe9QbBB1uAEThpNkT8172QkoYuA7cmYUNa0vaOAozDOJP9NrQ7ygXUi30oAlZpsa56w" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="828" data-original-width="1112" height="297" src="https://blogger.googleusercontent.com/img/a/AVvXsEgaJqXvQMpM2Rid7VwVk2jtk92sR83AmYb3wR9CneBGcWL0ihIjNCil2qzRH-3UxCx2Ir2SvG1BzMIG7XQQxigedMs3-mq0GgP3cnUuTP6LXTSK1JT1oTbyCqyphe9QbBB1uAEThpNkT8172QkoYuA7cmYUNa0vaOAozDOJP9NrQ7ygXUi30oAlZpsa56w=w400-h297" width="400" /></a></div><br /><p><b>Nested Transactions: </b>Nested transactions are a generalization of savepoints. 
Whereas savepoints allow organizing a transaction into a sequence of actions that can be rolled back individually, nested transactions form a hierarchy of pieces of work.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhKCWRMVFRGFEFhlKbpD96kktG3VxiN1ct9zE8krWEFZtTdmP0yyiaXlldQO48JdccEAvEUEKsAzSB9FX34PsGvKxJahF-1-_lk5UZhKiI_Ujsd62XKt7Dtgl20Lh9XaKo52KjcBZM7-1IakiF3AYaqn2_QyQG_ZHn7QS9uyMcx4IQ7Y5KGp8hqb_2u8HM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="978" data-original-width="1112" height="352" src="https://blogger.googleusercontent.com/img/a/AVvXsEhKCWRMVFRGFEFhlKbpD96kktG3VxiN1ct9zE8krWEFZtTdmP0yyiaXlldQO48JdccEAvEUEKsAzSB9FX34PsGvKxJahF-1-_lk5UZhKiI_Ujsd62XKt7Dtgl20Lh9XaKo52KjcBZM7-1IakiF3AYaqn2_QyQG_ZHn7QS9uyMcx4IQ7Y5KGp8hqb_2u8HM=w400-h352" width="400" /></a></div><br /><p><b>Distributed Transactions:</b> A distributed transaction is typically a flat transaction that runs in a distributed environment and therefore has to visit several nodes in the network, depending on where the data is. The conceptual difference between a distributed transaction and a nested transaction can be put as follows: The structure of nested transactions is determined by the functional decomposition of the application, that is, by what the application views as spheres of control. The structure of a distributed transaction depends on the distribution of data in a network. In other words, even for a flat transaction, from the application's point of view a distributed transaction may have to be executed if the data involved are scattered across a number of nodes. Distributed subtransactions normally cannot roll back independently, either; their decision to abort also affects the entire transaction. 
This all means that the coupling between subtransactions and their parents is much stronger in our model.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjWOItSpws11SkiOz1MHVMxqJ6xgxX96Gcp0WkqXjlmcFybkEytiMSZxq36hueMyTLEGKFPHuV4rFMBYqtT_YNpT8s_ks8e3_85AIxT2_mi7AziWcE1kvnMOkl-H0TcbMuXhnIFNKHY3sURHqemBgExXe6YpLEc94a55zwqFYxsQi6fkXF_e77r9-IYIlc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1052" data-original-width="1112" height="378" src="https://blogger.googleusercontent.com/img/a/AVvXsEjWOItSpws11SkiOz1MHVMxqJ6xgxX96Gcp0WkqXjlmcFybkEytiMSZxq36hueMyTLEGKFPHuV4rFMBYqtT_YNpT8s_ks8e3_85AIxT2_mi7AziWcE1kvnMOkl-H0TcbMuXhnIFNKHY3sURHqemBgExXe6YpLEc94a55zwqFYxsQi6fkXF_e77r9-IYIlc=w400-h378" width="400" /></a></div><p>The book also talks about sagas. The concept is in two respects an extension of the notion of chained transactions:</p><p></p><ol style="text-align: left;"><li>It defines a chain of transactions as a unit of control; this is what the term saga refers to.</li><li>It uses the compensation idea from multi-level transactions to make the entire chain atomic.</li></ol><p></p><p>For all the transactions executed before, there must be a semantic compensation, because the updates have already been committed.</p><p><br /></p><p>Finally, the spheres of control framework's generality helps think out of the box as well. </p>
<blockquote class="twitter-tweet"><p dir="ltr" lang="en">Not only Gray has described simulations and chaos engineering, but also event sourcing, in a short off-the-cuff remark. Wondering how many valuable things from that book have gotten overlooked over the years. <a href="https://t.co/YCNsp2vppX">pic.twitter.com/YCNsp2vppX</a></p>— Alex P (@ifesdjeen) <a href="https://twitter.com/ifesdjeen/status/1762012801847960047?ref_src=twsrc%5Etfw">February 26, 2024</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<p><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-26035782209628429322024-02-26T21:38:00.013-05:002024-02-26T21:44:59.019-05:00 Recent reads (Feb 2024)<p>Here are the three books I have read recently. Argh, I wish I took some notes while going through these books. It feels hard to write these reviews in retrospect.</p><p><br /></p><h1 style="text-align: left;">The Culture Code: The Secrets of Highly Successful Groups (2018)</h1><p>"What do Pixar, Google and the San Antonio Spurs basketball team have in common?" That was the pitch for the Culture Code book when it came out. That didn't age well, for Google's case at least. Well, the Google example was not about teamwork, but rather Jeff Dean fixing search for AdWords over a weekend, so this is neither here nor there. We can forgive the book for trying to choose sensational examples.</p><p>I did like the book overall. It identifies three things to get the culture right: 1) creating belonging, 2) sharing vulnerability, and 3) establishing purpose.</p><p>Creating belonging is about safety/security. Maslow's hierarchy emphasizes safety and security as fundamental human needs. In a work environment where we feel judged or constantly need to prove ourselves, we struggle to be our authentic selves and contribute at a high level. Without a sense of belonging, even constructive feedback becomes difficult to receive. We need to feel supported and valued, not judged, to truly benefit from feedback. So creating belonging is the foundation of it all.</p><p>Sharing vulnerability goes hand-in-hand with creating safety and trust. It is important to establish this to prevent fail-silent failures in teams/companies.
<a href="https://muratbuffalo.blogspot.com/2019/12/on-advisor-mentee-relationship.html">As I previously discussed in a blog post</a>: "The worse thing that can happen is a fail-silent fault: masking problems/issues and pretending everything is fine, and then failing the other party in the project/effort with little heads up." Sharing vulnerability also helps for establishing intellectual honesty, which is important for any knowledge work.</p><p>Finally, establishing purpose is important for aligning everyone and giving them a greater meaning for their work. The prototypical example is that of the cleaning staff in the hospital viewing their job as being associates in helping nurse patients back to health. It is important to overcommunicate purpose and mission, if possible using slogans and catchy phrases/posters. A study by Once Inc. magazine asked executives at 600 companies to roughly estimate the percentage of their employees who could name the company’s top 3 priorities. Their answer was 64 percent. The truth was 2 percent.</p><p><br /></p><h1 style="text-align: left;">Skunk Works: A Personal Memoir of My Years at Lockheed (1994)</h1><p>This book was such a hard contrast to the Culture Code book. This book talks about extreme team productivity between 1940-1990, when <a href="https://en.wikipedia.org/wiki/Skunk_Works">Skunk Works engineers</a> developed the U-2, SR-71 Blackbird, F-117 Nighthawk, F-22 Raptor, and F-35 Lightning II. The book comes from trenches, it doesn't try to codify productivity advice, rather shows how Skunk Work people worked hard, took ownership, and showed pride in their craftsmanship. Kelly Johnson and Ben Rich were quite the characters. They led by example, and the book also establishes principles of extreme-productivity team work by showing not telling. 
Believe me, I have seen creating belonging, sharing vulnerability, and establishing purpose throughout the Skunk Works stories.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhjWCFecgYvgBfJ2NujGL9DgpS6wueVJD7xBDnT1kHGi9jaOKf58vdBRK5ZTy4aEDjl4-t6ir53HcWhzDbh8aZ4yGAeTFHsldBG0hsV3gFVfd3zXgadX0IrtPsHGMlx9t0vy_8NjGuTB3nE5KIJIGfMTLp8OgvYbh92po6LEDE3UTqvG_7fCYPXFGh8aNE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1200" data-original-width="1920" height="200" src="https://blogger.googleusercontent.com/img/a/AVvXsEhjWCFecgYvgBfJ2NujGL9DgpS6wueVJD7xBDnT1kHGi9jaOKf58vdBRK5ZTy4aEDjl4-t6ir53HcWhzDbh8aZ4yGAeTFHsldBG0hsV3gFVfd3zXgadX0IrtPsHGMlx9t0vy_8NjGuTB3nE5KIJIGfMTLp8OgvYbh92po6LEDE3UTqvG_7fCYPXFGh8aNE" width="320" /></a></div><br /><p></p><p>There is basic science in the book as well. In 1962, Petr Ufimtsev published a theoretical math physics paper titled "The method of edge waves in the physical theory of diffraction." Around 1972, Denys Overholser in Skunk Works stumbled on Petr's paper translated from Russian, and they started putting this theory into practice, which resulted in the F-117 stealth fighter. The F-117 was so far ahead of its time in terms of stealth technology that the equivalent nowadays would probably be producing UFOs that zip around with antigravity engines. </p><p>I loved this book. It was extremely engaging and inspiring. I highly recommend it!</p><p><br /></p><h1 style="text-align: left;">How to Know a Person: The Art of Seeing Others Deeply and Being Deeply Seen (2023)</h1><p>I also highly recommend this book. So-called "soft skills" and "people skills" are very important in living a successful, fulfilling, and meaningful life. It is a pity these are not taught in school. We expect people to know or intuit them, but most people get these wrong most of the time. Only a select few master them, and only at an advanced age.
</p><p>It is best to learn these skills not from an emotionally-gifted feeler but from a nerd who has had to master them painfully over time. This reasoning is somewhat captured <a href="https://muratbuffalo.blogspot.com/2012/01/tell-me-about-your-thought-process-not.html">by the Haruki Murakami quote</a>.</p><blockquote style="border: medium; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">Gifted writers write without effort; wherever they touch the ground, water pours out. Other writers have to strive (he gives himself as an example); they have to learn to dig wells to get to the water. But when the water dries up (inspiration leaves) for the gifted writer (which happens sooner or later), he becomes stuck and clueless because he has not trained for this. On the other hand, in the same situation, the other type of writer knows how to keep going and succeed.</p></blockquote><p>And for this purpose, David Brooks fits the bill. This is a great book, which I hope you will make time to read. You, your family, and your friends will thank me later. If you need more convincing and want to sample some lessons from the book, do watch this one-hour talk from David Brooks. </p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/YwENbKn3tqI" width="320" youtube-src-id="YwENbKn3tqI"></iframe></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>A couple of nights ago I watched "Everything Everywhere All At Once". It is a great movie. I would put it up there with The Matrix. At its core, this crazy 2-hour-20-minute movie is about <b>how to know/see a person</b>. 
"Of all the places I could be, I just want to be here with you."</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/wxN1T1uxQ2g" width="320" youtube-src-id="wxN1T1uxQ2g"></iframe></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-73778871586400671672024-02-22T17:14:00.009-05:002024-02-23T09:35:22.369-05:00 TLA+ modeling of MongoDB logless reconfiguration<p>Here we do a walkthrough of the TLA+ specs for <a href="https://muratbuffalo.blogspot.com/2024/02/design-and-analysis-of-logless-dynamic.html">the MongoDB logless reconfiguration protocol we have reviewed recently.</a></p><p>The specs are available at the <a href="https://github.com/will62794/logless-reconfig">https://github.com/will62794/logless-reconfig</a> repo provided by Will Schultz, Siyuan Zhou, and Ian Dardik. </p><p></p><ul style="text-align: left;"><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla">This is the protocol model for managing logless reconfiguration.</a> Let's call this the "config state machine" (CSM).</li><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla">This is the protocol model for the static MongoDB replication protocol based on Raft.</a> Let's call this the "oplog state machine" (OSM). </li><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla">Finally, this model composes the above two protocols</a> so they work in a superimposed manner.</li></ul><p></p><p>I really admire how these specs provide a modular composition of the reconfiguration protocol and the Raft-based replication protocol. 
I figured I would explain how this works here, since walkthroughs of advanced/intermediate TLA+ specifications, especially for composed systems, are rare.</p><p>I will cover the structure of the two protocols (CSM and OSM) briefly, before diving into how they are composed.</p><p>At the end I will also show you that by <a href="https://will62794.github.io/tla-web">using the tla-web application</a> (developed by Will Schultz), it is possible to interactively explore executions of the combined spec without installing any TLA+ tooling at all, just using the web browser. This is a great way to share specifications and counterexample traces with colleagues who don't dabble in TLA+, and I am excited about this.</p><p><br /></p><h1 style="text-align: left;">CSM: MongoLoglessDynamicRaft </h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla</a> if you'd like to follow along.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgmx3D7zHGoizeyJuoA2lq4w6bRK5RxcHAMTc_92sNLjn5OMNS6gEfFVHBhMhkgfjfArbLZakgKNoiTGwLxmMOOQePzcReltDSK-y-kuKN8QZxisq8t67Z4MfijSmw14KZ4ogj-9MdnkGKYaF8w1Ly_RvT8ft5ho9nKKQc4pmcrhdMESHgcBNiG7R2pM20" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="330" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgmx3D7zHGoizeyJuoA2lq4w6bRK5RxcHAMTc_92sNLjn5OMNS6gEfFVHBhMhkgfjfArbLZakgKNoiTGwLxmMOOQePzcReltDSK-y-kuKN8QZxisq8t67Z4MfijSmw14KZ4ogj-9MdnkGKYaF8w1Ly_RvT8ft5ho9nKKQc4pmcrhdMESHgcBNiG7R2pM20=s16000" /></a></div><p></p><p></p><ul style="text-align: left;"><li>Server is the set of all nodes.</li><li>state denotes whether a node is Primary or Secondary.</li><li>currentTerm is the Raft term of a node.</li><li>config is the current config the node knows 
of.</li><li>configVersion is the version associated with the config.</li><li>configTerm is the term associated with that config. </li></ul><p><a href="https://muratbuffalo.blogspot.com/2024/02/design-and-analysis-of-logless-dynamic.html">Refer back to our post on MongoDB logless reconfiguration</a> to see how configVersion and configTerm play out. For ensuring safety, the protocol uses (configTerm, configVersion) as a pair, and requires that the CSM commits the config on the most recent configTerm, and that the OSM follows the committed configs in sequential order. </p><p>These are the four possible actions in the CSM protocol.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjju1lJULAZbQKtO1LpHo2yCrospWBJcmQ5xeCBtT79wM9h40DxbHgYTWoEC6VfH100Aka-joyfOBXRx02Tr9LAsgcl9Z7rTNLAudh2o0NcNUN_sO8TzDF1cCeKd9Mg-h3YpF8l52UQJYBVZgv0C2il18VVfDzt0hi_4_0two6AbU9s0JBeAdSSN3vlV1U" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="206" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEjju1lJULAZbQKtO1LpHo2yCrospWBJcmQ5xeCBtT79wM9h40DxbHgYTWoEC6VfH100Aka-joyfOBXRx02Tr9LAsgcl9Z7rTNLAudh2o0NcNUN_sO8TzDF1cCeKd9Mg-h3YpF8l52UQJYBVZgv0C2il18VVfDzt0hi_4_0two6AbU9s0JBeAdSSN3vlV1U=s16000" /></a><span style="text-align: left;"> </span></div><p></p><p>The first action does not seem to constrain newConfig, appearing to allow it to be any subset of the Server set, suggesting an unrestricted/arbitrary next reconfiguration. But if you look at the preconditions of the <b>Reconfig</b> action on line 99, the <b>QuorumsOverlap</b> check requires that all quorums of newConfig share at least one overlapping node with all quorums of the current config. Phew, sanity restored. If we comment out the QuorumsOverlap condition, we would encounter a safety violation. 
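For intuition, here is a rough Python sketch of the majority-quorum and overlap checks, and of the (configTerm, configVersion) ordering. This is my own rendering of the idea; the authoritative definitions live in the TLA+ spec.

```python
from itertools import combinations

def quorums(config):
    """All majority subsets of a config (a set of server names)."""
    n = len(config)
    majority = n // 2 + 1
    return [set(q) for k in range(majority, n + 1)
            for q in combinations(sorted(config), k)]

def quorums_overlap(config1, config2):
    """Every quorum of config1 must intersect every quorum of config2."""
    return all(q1 & q2 for q1 in quorums(config1) for q2 in quorums(config2))

def is_newer_config(a, b):
    """Configs are ordered by (configTerm, configVersion) lexicographically."""
    return (a["configTerm"], a["configVersion"]) > (b["configTerm"], b["configVersion"])
```

For example, moving from {s1, s2, s3} to {s1, s2} passes the overlap check, while jumping from {s1} to {s2, s3} does not: the lone quorum {s1} shares no node with the quorum {s2, s3}, which is exactly the kind of reconfiguration that loses committed state.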
You can try this after I teach you about the tla-web app at the end of this post.</p><p><b>SendConfig</b> is used for sharing config between any two servers. The newer config (the one with the higher configTerm, or, when configTerms are equal, the higher configVersion) dominates and gets adopted by the other server.</p><p><b>UpdateTerms</b> is used for updating the currentTerm variable. And <b>BecomeLeader</b> is for one server to become a Primary by getting a quorum of votes from others and increasing its currentTerm. These two actions are shared with the OSM protocol, and we will later see how the composition model insists that these actions get jointly executed by both CSM and OSM for a superimposed composition of the two protocols.</p><p><br /></p><h1 style="text-align: left;">OSM: MongoStaticRaft</h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla</a> if you'd like to follow along. 
This is <a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">the pull-based replication protocol based on Raft that we covered in an earlier post</a>.</p><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjc9KyDgJZFR1j4BTWigfN0y9ksL4J8AV2cgmlC28Ln8w05eYXy1uSwa6NTsF3DEtb6tOUJ2JT_h0ShLSoaVfJPrtgoE5Odyg1EbrD4QvPu85h2DhEsu2kfv1vEODkqIJJLcqSK7jZbyFddl-RxXakFz6JKatzM3IASjvOv82CV0ZZu_IjiNYOc0fsg91A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="360" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEjc9KyDgJZFR1j4BTWigfN0y9ksL4J8AV2cgmlC28Ln8w05eYXy1uSwa6NTsF3DEtb6tOUJ2JT_h0ShLSoaVfJPrtgoE5Odyg1EbrD4QvPu85h2DhEsu2kfv1vEODkqIJJLcqSK7jZbyFddl-RxXakFz6JKatzM3IASjvOv82CV0ZZu_IjiNYOc0fsg91A=s16000" /></a></div></div><p>Many of these variables/constants are shared with the CSM above, so let's talk about the diff.</p><p></p><ul style="text-align: left;"><li>log is the oplog for the OSM protocol.</li><li>committed is the set of committed (majority replicated) entries in the log.</li></ul><p></p><p>Here are the actions of this OSM protocol.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQEX9ZZ9N3tyuwX5sTLiKlchg134bkPU4vTuhzXeFKgzhvQWb2ywo3DGAwumJleNl9kEIdwDCSKr-GCuTpetNW8hn6gP91wHrHTyF3du4RMy5fR9jCz3mH6vV3Mh0b-tm7LXgY3ymgzlNCapedI9Xfwd_utnHthbAhHzcycdURm-vFiGvqxfoHANWQEZw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="282" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQEX9ZZ9N3tyuwX5sTLiKlchg134bkPU4vTuhzXeFKgzhvQWb2ywo3DGAwumJleNl9kEIdwDCSKr-GCuTpetNW8hn6gP91wHrHTyF3du4RMy5fR9jCz3mH6vV3Mh0b-tm7LXgY3ymgzlNCapedI9Xfwd_utnHthbAhHzcycdURm-vFiGvqxfoHANWQEZw=s16000" /></a><span style="text-align: left;"> </span></div><p></p><p><b>BecomeLeader</b> is 
for becoming a Raft Primary, and <b>UpdateTerms</b> is for propagating the Raft currentTerm variable between two nodes. (These two actions are in fact the same as the identically named actions in the CSM: MongoLoglessDynamicRaft spec.) <b>ClientRequest</b> is for the Primary to accept a request for replication. <b>GetEntries</b> performs pull-based replication of oplog entries. <b>RollbackEntries</b> performs oplog rollback when warranted upon a primary change, and <b>CommitEntry</b> commits majority-replicated entries. Standard Raft stuff.</p><p><br /></p><h1 style="text-align: left;">MongoRaftReconfig</h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla</a></p><p>The MongoRaftReconfig file composes MongoLoglessDynamicRaft (CSM) and MongoStaticRaft (OSM), and regulates the superimposed execution of the two specs.</p><p></p><ul style="text-align: left;"><li>CSM and OSM share the BecomeLeader and UpdateTerms actions.</li><li>OSM reads the config from CSM and uses it as the process set for executing the replication/commit and leader election protocol.</li><li>CSM reads the OplogCommitment condition from OSM. (Note that the composition model guards the CSM's Reconfig action with OplogCommitment, and commenting this out will also lead to a safety violation, as we will explore below.)</li></ul><p></p><p>Because BecomeLeader and UpdateTerms are shared actions, the composition spec restricts both CSM's and OSM's possible computations/executions to become refinements of their respective original specs.</p><p>Note that the initial states of CSM and OSM also need to agree on their shared variables, which are currentTerm, state, and config. 
The variables of the composition spec are the union of the variables of CSM and OSM.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgtoMwRjBPUBPicPLSnrHyneLToxXw-sNEVNwZsfecQ9O1AHvAM9fYJ3tTlymekBN4jsWhHInCUCF0kn8_xSAJjrlVtRGi0NBx0E5D_J-lR44PEUvz3HeUsKByKNDcj3gWlATKNS7sLUcOBeRNmjACpr4BDrMbih2--i-k4qKwOczzF51mFdFr0ivLiDGE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="414" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgtoMwRjBPUBPicPLSnrHyneLToxXw-sNEVNwZsfecQ9O1AHvAM9fYJ3tTlymekBN4jsWhHInCUCF0kn8_xSAJjrlVtRGi0NBx0E5D_J-lR44PEUvz3HeUsKByKNDcj3gWlATKNS7sLUcOBeRNmjACpr4BDrMbih2--i-k4qKwOczzF51mFdFr0ivLiDGE=s16000" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhMcb5WV3l9R-9ykzotLqByq__9bpi_HxGV6tsZq9hFH9af79-uTiL-VKhv0MYctXHzzyfAlOo6mP0oY_VVYe1FawUSsgugy_3UXBbi_veJd4gH7-5mE-SvifNHVsNa7PLyEhqNwtIIr0XvgGqf6TOBIWzNRzc-E_wnlV1-MBXv0z30d4SgSgPhZNXc9Bc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="128" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEhMcb5WV3l9R-9ykzotLqByq__9bpi_HxGV6tsZq9hFH9af79-uTiL-VKhv0MYctXHzzyfAlOo6mP0oY_VVYe1FawUSsgugy_3UXBbi_veJd4gH7-5mE-SvifNHVsNa7PLyEhqNwtIIr0XvgGqf6TOBIWzNRzc-E_wnlV1-MBXv0z30d4SgSgPhZNXc9Bc=s16000" /></a></div><p></p><p>The composition spec defines OSMNext and CSMNext as follows. Note the OplogCommitment guarding the CSM reconfig action. The composition spec also restricts the shared actions BecomeLeader and UpdateTerms to execute jointly, meaning that both actions take effect in the corresponding specs at the same step. 
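To make the shape of this composition concrete, here is a toy Python sketch (my own simplification, not the actual TLA+): one combined state dict where an OSM-only action leaves the CSM variables untouched, a CSM-only action leaves the OSM variables untouched, and a shared action updates the shared variables exactly once for both sub-specs.

```python
# Toy model of the composed spec. OSM owns "log", CSM owns "configVersion",
# and the two sub-specs share "currentTerm" and "state".

def osm_next(s):
    # An OSM-only action (think ClientRequest): CSM variables are unchanged.
    return {**s, "log": s["log"] + [s["currentTerm"]]}

def csm_next(s):
    # A CSM-only action (think Reconfig): OSM variables are unchanged.
    return {**s, "configVersion": s["configVersion"] + 1}

def joint_next(s):
    # A shared action (think BecomeLeader): both sub-specs take the step
    # together, so the shared variables change exactly once and stay in sync.
    return {**s, "currentTerm": s["currentTerm"] + 1, "state": "Primary"}
```

A step of the composed system is then one of these three choices, which mirrors how the Next relation of the composition spec is assembled.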
This keeps the CSM and OSM states in sync and prevents the two state machines from diverging from each other's views.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgx1jkVdg_hKEDJvELeHo1DO2MOflXydjJQEjQfcyYND7XgWS7QFp_F2OCa4YT9mEWUWgK5rIyUkt9hPO3STKtWuge2t58M39REKq3YoTyoh_ITQ6nXuO1Td8KrQFxU0ng4NuYmbD5kBWutWXmuoZ4E83jRUgrlDzFqVSUkPNbJoSuq9KuQT6ShfbxzmrY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="242" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgx1jkVdg_hKEDJvELeHo1DO2MOflXydjJQEjQfcyYND7XgWS7QFp_F2OCa4YT9mEWUWgK5rIyUkt9hPO3STKtWuge2t58M39REKq3YoTyoh_ITQ6nXuO1Td8KrQFxU0ng4NuYmbD5kBWutWXmuoZ4E83jRUgrlDzFqVSUkPNbJoSuq9KuQT6ShfbxzmrY=s16000" /></a></div><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgJUsryBH6Ck68tygZEPqk9V1hVgE7QkH_voFG-bd_jtqlenlvkwBs1SOnJONnWoHlFT8l8UbS6f9umAG1X3CTrcHM0ztb4vLVQAI6UljmvMDrj8cZVn8y72RkC4ldwU_Ycc71FqhjAa6zuF1QhnIEjfCkFPrPJ4KPHxImLWM2G5SdOuiEOZ4Z0q1omd0k" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="916" data-original-width="1314" src="https://blogger.googleusercontent.com/img/a/AVvXsEgJUsryBH6Ck68tygZEPqk9V1hVgE7QkH_voFG-bd_jtqlenlvkwBs1SOnJONnWoHlFT8l8UbS6f9umAG1X3CTrcHM0ztb4vLVQAI6UljmvMDrj8cZVn8y72RkC4ldwU_Ycc71FqhjAa6zuF1QhnIEjfCkFPrPJ4KPHxImLWM2G5SdOuiEOZ4Z0q1omd0k=s16000" /></a></div><br />The Next relation here defines a step of the composition spec as either OSMNext, CSMNext, or JointNext. <p></p><p><br /></p><h1 style="text-align: left;">State exploration</h1><p>It is time to reach out to the tla-web app and play around with an interactive version of this spec. 
The spec here is equivalent to the compositional spec we reviewed above, except that it is not specified in the compositional manner but rather combined into a single file.</p><p>If you are ready, just follow my directions. This will be fun. </p><p><a href="https://will62794.github.io/tla-web/#!/home?specpath=./specs/MongoRaftReconfig.tla&constants%5BServer%5D=%7B%22s1%22,%22s2%22,%22s3%22%7D&constants%5BSecondary%5D=%22Secondary%22&constants%5BPrimary%5D=%22Primary%22&constants%5BNil%5D=%22Nil%22">Open this link</a> in another tab. This will readily load the joint TLA+ spec. And you can see the spec by clicking the spec button at top right. You can even modify the spec by commenting out a line.</p><p>And that is what we are going to do. The best way to learn something is to break it and see why it breaks, and then fix it back. Go to line 259, and comment out the precondition check for OplogCommitment by putting <b>\*</b> at the beginning of the line. Click the details button up top to come back to the initial page.</p><p>Now, the prompt there is asking us to choose a possible initial state. Choose the one where <b>config={s1,s2}</b>, which means we have two replicas in the initial configuration. This is the second choice from the top. Inspect the variable assignments in this initial state and click on it. Note that the right side of the page changed to show the history of the trace. We only got one step into this execution, so that is what we see on the right. The left pane also changed to list the new actions/transitions that are enabled in the next state resulting from our initial state choice.</p><p>Choose the transition where node 1 becomes the leader. This is the <b>BecomeLeader action</b> with s1 becoming the leader using the quorum s1,s2. Note that there are two sub-buttons enabled; the other option is to choose s2 to become the leader. 
So, follow me, choose the first button to make s1 the leader.</p><p>Note that our trace grew by another state transition on the right, and we have a new set of options opened up for us. Choose<b> SendConfig s1, s2</b>. Now our trace is of length 3.</p><p>Oh, I almost forgot: on the right pane, in the text entry at the top, enter LeaderCompleteness and press the AddTraceExpression button next to it. This will let us monitor the invariant LeaderCompleteness, which is on line 320 in the Spec tab if you'd like to check. This evaluates to TRUE now, but since we introduced a bug in the spec by commenting out the OplogCommitment precondition, we should expect to see it violated soon. </p><p>Let's go faster now.</p><p>Choose<b> Reconfig s1, {s1}</b> so that node 1 reconfigures the system down to only one replica, itself.</p><p>Choose the <b>ClientRequest s1</b> option. And all the while, monitor how the state variables evolve on the right as we take the system through this path of execution. Notice that the log variable on s1 becomes <<1>>.</p><p><b>CommitEntry s1 {s1}</b>. Note that s1 did not need another node to commit: since the reconfiguration took the system down to a single node, it can commit locally.</p><p>Now, choose <b>Reconfig s1, {s1,s2}.</b></p><p><b>SendConfig s1, s2</b></p><p><b>Reconfig s1, {s1,s2,s3}</b></p><p><b>SendConfig s1, s3</b></p><p><b>SendConfig s1, s2</b></p><p><b>BecomeLeader s2 {s2,s3}.</b> Wow, notice that LeaderCompleteness became FALSE. This is because s2 does not have what s1 committed (locally), so we lost a commit. Let's continue to see what more trouble this can cause. On the right pane, add the trace expression StateMachineSafety, which is on line 325 in the spec. Note that this still shows as TRUE.</p><p><b>SendConfig s2, s3</b></p><p><b>ClientRequest s2</b>, so s2 accepts another value, <<2>>, into the same slot (the first slot in the log).</p><p><b>GetEntries s3, s2</b></p><p><b>CommitEntry s2 {s2, s3}. </b>Wow! 
Notice that StateMachineSafety is violated now. This invariant said:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjzRHaQHizJwm_UFota8v0EGBVa652kWGih8cUlJOl4fCMWjveYXXJ40IfP6F3RUxCLv_GS1O7we9qYORmIDAVTMBFSumiCLGuhkXzplTwM4JxA1vGUU7hX4V5xp61H7OyAiqNfiDn_HcIH6tNycrFV2XS9DvtS7t63hgSVdTlkYuGpGgadQR2gE_Z28MY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="120" data-original-width="1502" src="https://blogger.googleusercontent.com/img/a/AVvXsEjzRHaQHizJwm_UFota8v0EGBVa652kWGih8cUlJOl4fCMWjveYXXJ40IfP6F3RUxCLv_GS1O7we9qYORmIDAVTMBFSumiCLGuhkXzplTwM4JxA1vGUU7hX4V5xp61H7OyAiqNfiDn_HcIH6tNycrFV2XS9DvtS7t63hgSVdTlkYuGpGgadQR2gE_Z28MY=s16000" /></a></div><p></p><p>By modifying the spec and introducing a bug, we violated this invariant. We have a 16-step counterexample for this violation. By copying the URL, or using the Copy Trace Link button, we can share this counterexample with our colleagues.</p><p>But there is a catch in our case. We made our modification in our local buffer, not at the spec URL, so this link will load the correct version and will only be able to follow the trace to step 8, where the faulty version diverges from the unmodified correct version. If you'd like the full 16-step counterexample, you can put the modified version at a publicly accessible link, load that version, and share the steps from that faulty version. </p><div><br /></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-29923795112627063562024-02-21T17:52:00.002-05:002024-02-26T08:23:35.714-05:00Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB<p><a href="https://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf">This paper appeared in VLDB 2019</a>.</p><p>Benchmarks are a necessary evil for database evaluation. 
Benchmarks often focus on narrow aspects and specific workloads, creating a misleading picture for broader real-world applications/workloads. However, for a quick comparative performance snapshot, they remain a crucial tool.</p><p>Popular benchmarks like YCSB, designed for simple key-value operations, fall short in capturing MongoDB's features, including secondary indexes, flexible queries, complex aggregations, and even multi-statement multi-document ACID transactions (since version 4.0).</p><p>Standard RDBMS benchmarks haven’t been a good fit for MongoDB either, since they require a normalized relational schema and SQL operations. Consider TPC-C, which simulates a commerce system with five types of transactions involving customers, orders, warehouses, districts, stock, and items represented with data in nine normalized tables. TPC-C requires a specific relational schema and prescribed SQL statements.</p><p>Adapting TPC-C to MongoDB demands a delicate balancing act. While mimicking the familiar TPC-C workload and adhering to its ACID requirements is essential for maintaining the benchmark's value for those accustomed to it, significant modifications are necessary to account for MongoDB's unique data structures and query capabilities. This paper provides such an approach, creating a performance test suite that incorporates MongoDB best practices while remaining consistent with TPC-C's core principles.</p><p>To build this benchmark, the paper leverages <a href="https://github.com/apavlo/py-tpcc/wiki">Andy Pavlo's 2011 PyTPCC repository</a>, a Python-based framework for running TPC-C benchmarks on NoSQL systems. While PyTPCC included an initial driver implementation for MongoDB, it lacked support for transactions since it was written for the NoSQL systems of 2011. This paper addresses this gap by adding transaction capability to PyTPCC. 
The modified benchmark, available at <a href="https://github.com/mongodb-labs/py-tpcc">https://github.com/mongodb-labs/py-tpcc</a>, enables a detailed evaluation of MongoDB multi-document transactions over a single replicaset deployment.</p><p><br /></p><h1 style="text-align: left;">Background</h1><p>Please note that this evaluation focuses only on transactions over a single replicaset deployment using MongoDB 4.0 in 2019. <a href="https://muratbuffalo.blogspot.com/2024/02/verifying-transactional-consistency-of.html">In a previous post, we reviewed the basics of transaction implementation across three different MongoDB deployments:</a> single-node WiredTiger, replicaset, and sharded cluster deployments. While MongoDB now supports general multi-document transactions across sharded deployments, that topic is not covered in this paper.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjxr7GcEGAjDAJ9UwMcwCUykCheG5Ali-Ivi5Rs-H5cZmy0qfHiKxt2JVIvsX9sFvvnPsmajiQiDBxUlQMqSqkuhyVCdl9bKJLQk86fdgpXt3ylNeZE3Mo1E0wMlFkFe0RvBRrUWSqw0F5TRHTM8oY1QiLNUAoW7c_C7DzBFAfEsZPzBwVkSJsbsRLlLKQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="218" data-original-width="594" height="117" src="https://blogger.googleusercontent.com/img/a/AVvXsEjxr7GcEGAjDAJ9UwMcwCUykCheG5Ali-Ivi5Rs-H5cZmy0qfHiKxt2JVIvsX9sFvvnPsmajiQiDBxUlQMqSqkuhyVCdl9bKJLQk86fdgpXt3ylNeZE3Mo1E0wMlFkFe0RvBRrUWSqw0F5TRHTM8oY1QiLNUAoW7c_C7DzBFAfEsZPzBwVkSJsbsRLlLKQ" width="320" /></a></div><p>The MongoDB query language (MQL) does not map directly to SQL, but it supports a similar set of CRUD operations, as shown in Table 1. MongoDB supports primary as well as secondary indexes and can speed up queries by looking up documents in indexes, including returning the full result from the index alone (covered index queries). 
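To illustrate the covered-query idea with a toy example (a conceptual sketch of my own, not MongoDB code): if an index already stores every field that a query filters on and returns, the query can be answered from the index alone, without fetching any documents.

```python
# Toy stock collection and a compound "index" on (warehouse, item) that
# also carries qty, standing in for a covering index.
documents = [
    {"_id": 1, "warehouse": "W1", "item": "bolt", "qty": 40},
    {"_id": 2, "warehouse": "W1", "item": "nut", "qty": 15},
    {"_id": 3, "warehouse": "W2", "item": "bolt", "qty": 7},
]

index = {(d["warehouse"], d["item"]): d["qty"] for d in documents}

def covered_stock_lookup(warehouse, item):
    # Served entirely from the index; the documents list is never touched.
    return index.get((warehouse, item))
```

Here `covered_stock_lookup("W1", "bolt")` returns 40 straight out of the index, which is the round-trip savings a covered index query gives you.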
Indexes can be created on regular fields as well as fields embedded inside arrays.</p><p>MongoDB transactions provide the ACID guarantees that TPC-C requires for correctness. Specifically, they provide a snapshot isolation guarantee. A single snapshot of the data is used for the duration of the transaction. A snapshot is a single point-in-time view of the data at a distinct cluster time, maintained via a cluster-wide logical clock. Once a transaction begins with a snapshot at a cluster time, no subsequent writes outside of that transaction's context occurring after that cluster time will be seen within the transaction. However, transactions will be able to view their own subsequent writes that occur after the snapshot’s cluster time, providing the "read your own writes" guarantee. Once a transaction starts, its snapshot view of the data is preserved until it either commits or aborts. When a transaction commits, all data changes made in the transaction are saved and made visible outside the transaction. When a transaction aborts, all data changes made in the transaction are discarded without ever becoming visible.</p><p>Within MongoDB transactions, readConcern is always set to "snapshot". Multi-document transactions in MongoDB are committed with "majority" writeConcern, which means two out of three nodes in the replicaset must commit all operations before acknowledgment.</p><p><a href="https://muratbuffalo.blogspot.com/2024/02/verifying-transactional-consistency-of.html">As we discussed in our previous post on MongoDB transactions</a>, while they are "OCC", thanks to the underlying WiredTiger holding the lock on first access, they are less prone to aborting than a pure OCC transaction. An in-progress transaction stops later writes (be they from other transactions or single writes) instead of getting aborted by them. In other words, transactions immediately obtain locks on documents being written, or abort if the lock cannot be obtained. 
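A toy model of this first-writer-locks behavior (my own sketch for intuition, not the WiredTiger implementation):

```python
class WriteConflict(Exception):
    pass

class DocumentLocks:
    """Toy model: a transaction takes a document lock on its first write and
    holds it until commit/abort; a second writer fails immediately."""

    def __init__(self):
        self.owner = {}  # doc_id -> id of the transaction holding the lock

    def write(self, txn, doc_id):
        holder = self.owner.setdefault(doc_id, txn)  # lock on first write
        if holder != txn:
            raise WriteConflict(f"{doc_id} is locked by {holder}")

    def finish(self, txn):
        # Commit or abort: release all locks this transaction holds.
        self.owner = {d: t for d, t in self.owner.items() if t != txn}
```

With this model, if t1 writes document "a", a later write by t2 to "a" raises WriteConflict right away; once t1 finishes, t2's retry succeeds. That is the sense in which the second transaction fails fast and can retry.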
This ensures that attempts by two transactions to write to the same document will immediately fail for the second transaction, at which point it can choose to retry as appropriate for the application.</p><p><br /></p><h1 style="text-align: left;">Evaluation</h1><p>The replicaset deployment used the default database configuration provided by the MongoDB Atlas cloud offering. Performance is reported for an M60 Atlas replica set with writeConcern "majority" for durability, along with readConcern "snapshot" for most transactions, and a committed-reads equivalent (readConcern "majority", causal consistency true) for the STOCK LEVEL transaction. Figure 1 shows transactions-per-minute-C (tpmC) values for varying numbers of warehouses and client thread counts. Using more warehouses results in reduced throughput, I think due to the need to coordinate transactions across more documents. And remember, we don't get to reap the benefits of sharding in the face of more warehouses, as this deployment is a single replicaset. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjYUvy2wR9aZzVYVABrDPWeCB8_xjrnXKJBw7-eQDMMGsJQgtL7Pgrb5cUFP_4MfhT1UzkityWxDo2T9nZKhCiMYQ3TEAC5zH_O_qN97XEbdMOUKg66s9sfANRsM1goRtKbTtjkWFzNdeSxkICCsGkVwIwuYB4cbQAWiVJVUL4M0HYOnDuQ2BVTIjjW_P8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="640" data-original-width="1096" src="https://blogger.googleusercontent.com/img/a/AVvXsEjYUvy2wR9aZzVYVABrDPWeCB8_xjrnXKJBw7-eQDMMGsJQgtL7Pgrb5cUFP_4MfhT1UzkityWxDo2T9nZKhCiMYQ3TEAC5zH_O_qN97XEbdMOUKg66s9sfANRsM1goRtKbTtjkWFzNdeSxkICCsGkVwIwuYB4cbQAWiVJVUL4M0HYOnDuQ2BVTIjjW_P8=s16000" /></a></div><p>The original PyTPCC benchmark provided a normalized option, which mirrored the RDBMS schema exactly, and a denormalized option, which embedded all customer information (orders, order lines, history) into the customer document. 
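Roughly, the two original schema options look like this (illustrative documents of my own, with made-up values, loosely following TPC-C naming):

```python
# Normalized option: mirrors the relational TPC-C tables, one small
# document per row, joined by id fields at query time.
customer   = {"_id": 42, "c_name": "Alice", "c_w_id": 1}
order      = {"_id": 7, "o_c_id": 42, "o_entry_d": "2019-01-01"}
order_line = {"ol_o_id": 7, "ol_number": 1, "ol_i_id": 555, "ol_qty": 3}

# Fully denormalized option: everything about a customer in one document,
# whose orders/history arrays keep growing over the customer's lifetime.
customer_embedded = {
    "_id": 42,
    "c_name": "Alice",
    "orders": [
        {"o_id": 7,
         "o_entry_d": "2019-01-01",
         "order_lines": [{"ol_number": 1, "ol_i_id": 555, "ol_qty": 3}]},
    ],
    "history": [{"h_date": "2019-01-01", "h_amount": 10.0}],
}
```

Note how in the second form the orders and history arrays have no natural bound: every new order grows the same document.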
This specific denormalization, however, is identified as an antipattern, as it leads to unbounded growth and performance degradation. Following recommended MongoDB schema practices, the evaluation adopted a modified denormalized schema, maintaining a normalized structure for most data and embedding only order lines within their respective order documents. This aligns with common document database best practices, since frequent access to order lines together with orders justifies the embedding. Order lines within an order are fixed in number, preventing unbounded growth. Interestingly, this optimized denormalized schema resulted in a smaller data footprint than the fully normalized one, because redundant information in order lines was eliminated. The results in Figure 2 further highlight the benefits of this modified denormalization. The performance win comes from reducing the number of round trips to the database: you can think of it as pre-joining order lines into the orders table.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgkHkTFVwhzOsNyY6hFiJo8AWyeX-6O1rBws2a82hvqPEWNZ8yGGndWcllvPz_DVa4x1chMYjjvS9FUKDAvmbChJk8kzNkIAlz50RvGq3-xvObL3LNCIOe7td07-T3rzpWWveziL2NTMech8Xv7Umr_IsSTAYauizQmcls1B5Kqj7TUEEf6_sSRG6P9_HU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="390" data-original-width="602" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEgkHkTFVwhzOsNyY6hFiJo8AWyeX-6O1rBws2a82hvqPEWNZ8yGGndWcllvPz_DVa4x1chMYjjvS9FUKDAvmbChJk8kzNkIAlz50RvGq3-xvObL3LNCIOe7td07-T3rzpWWveziL2NTMech8Xv7Umr_IsSTAYauizQmcls1B5Kqj7TUEEf6_sSRG6P9_HU=w400-h259" width="400" /></a></div><p><br /></p><p>Several areas presented opportunities for further latency reduction. Streamlining queries and requesting only necessary fields help reduce data transfer and processing time. 
Inspecting the logs showed that transaction retries stemmed from performing extensive operations before encountering write conflicts. Re-ordering write operations to expose write conflicts as early in the transaction as possible, as well as moving such writes before reads where possible helped address these inefficiencies. Several transactions followed a pattern of selecting and updating the same record. Using MongoDB's findAndModify operation reduced those two database interactions to one, and significantly improved performance as shown in Figure 3.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgsvZiXqAL5yh_EUfCQQozLnhlOcn1H38PdL0cVXuUsaYC_FpER8xhYBT1BV_r4EgnKOJ9CgnPuSfg5Xdm37ReidoP6GxWKo5uwcisGWjNLpBK0VHENlihl4pWPtHSg-R4Dt2R1VMCGcTJby4XBK_ROpcJ3zIMC7mn9pCdEdeY-e2TTMz5Bqe9FvW_4EXk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="586" data-original-width="648" height="362" src="https://blogger.googleusercontent.com/img/a/AVvXsEgsvZiXqAL5yh_EUfCQQozLnhlOcn1H38PdL0cVXuUsaYC_FpER8xhYBT1BV_r4EgnKOJ9CgnPuSfg5Xdm37ReidoP6GxWKo5uwcisGWjNLpBK0VHENlihl4pWPtHSg-R4Dt2R1VMCGcTJby4XBK_ROpcJ3zIMC7mn9pCdEdeY-e2TTMz5Bqe9FvW_4EXk=w400-h362" width="400" /></a></div><br /><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh33V957S7cfWPHkV_j17UR-P2h3Ca0HQaZ3v_Qy_m1SF4HipTEEBU2QS4Q_4Tif024NPHDRPNiboXppsX7D9aq13lH5al41mMqAkd7_19uuY6CMsTisdTLDapfgXEg8YduE0ZMPDiNl75VqiRVn0lFBx7KTQ5P2lhVtFskHZeo1mZMF9cGuyFC8FCdy2o" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1754" data-original-width="1136" src="https://blogger.googleusercontent.com/img/a/AVvXsEh33V957S7cfWPHkV_j17UR-P2h3Ca0HQaZ3v_Qy_m1SF4HipTEEBU2QS4Q_4Tif024NPHDRPNiboXppsX7D9aq13lH5al41mMqAkd7_19uuY6CMsTisdTLDapfgXEg8YduE0ZMPDiNl75VqiRVn0lFBx7KTQ5P2lhVtFskHZeo1mZMF9cGuyFC8FCdy2o=s16000" /></a></div><br 
/><div><br /></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-49449959542937704942024-02-20T16:47:00.006-05:002024-02-20T17:10:17.809-05:00 Fault tolerance (Transaction processing book)<p>This is Chapter 3 from <a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">the Transaction Processing Book Gray/Reuter 1992</a>.</p><p>Why does the fault-tolerance discussion come so early in the book? We haven't even started talking about transactional programming styles, concurrency theory, concurrency control. The reason is that the book uses dealing with failures as a motivation for adopting transaction primitives and a transactional programming style. I will highlight this argument now, and outline how the book builds to that crescendo in about 50 pages.</p><p>The chapter starts with an astounding observation. I'm continuously astounded by the clarity of thinking in this book: <i>"The presence of design faults is the ultimate limit to system availability; we have techniques that mask other kinds of faults."</i></p><p>In the coming sections, the book introduces the concepts of faults, failures, availability, reliability, and discusses hardware fault-tolerance through redundancy. It celebrates wins in hardware reliability through several examples, including experience from Tandem computer systems Gray worked on: <i>"This is just one example of how technology and design have improved the maintenance picture. Since 1985, the size of Tandem’s customer engineering staff has held almost constant and shifted its focus from maintenance to installation, even while the installed base tripled. This is an industry trend; other vendors report similar experience. 
Hardware maintenance is being simplified or eliminated."</i></p><p><a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">As I mentioned in my previous post</a>, I am impressed by the quantitative approach the book takes. It cites several surveys and studies to back up its claims. One question that occurs to me after Section 3.3 is whether this trend has extrapolated to the new hardware introduced since then. It seems like we are doing a great job with hardware reliability through isolating/discarding malfunctioning parts and carrying out operations using redundant copies. Well, of course, there are <a href="https://muratbuffalo.blogspot.com/2019/09/gray-failure-achilles-heel-of-cloud.html">hardware-gray-failures</a> and <a href="https://muratbuffalo.blogspot.com/2021/06/cores-that-dont-count.html">fail</a>-<a href="https://muratbuffalo.blogspot.com/2021/06/silent-data-corruptions-at-scale.html">silent</a> hardware faults that are hard to detect, but we seem to be managing OK overall. </p><p>A more interesting question to ask is: <i>"Why are we unable to have the same kind of reliability/maintenance gains for software as easily?"</i> The book acknowledges this is a hard question, again referring to many surveys and case studies. It sums these up as follows: <i>"Perfect software of substantial complexity is impossible until someone breeds a species of super-programmers. Few people believe design bugs can be eliminated. Good specifications, good design methodology, good tools, good management, and good designers are all essential to quality software. These are the fault-prevention approaches, and they do have a big pay-off. 
However, after implementing all these improvements, there will still be a residue of problems."</i></p><p>Building on these, towards the end of the chapter, the book makes its case for transactions:</p><p><b>"In the limit, all faults are software faults --software is responsible for masking all the other faults. The best idea is to write perfect programs, but that seems infeasible. The next-best idea is to tolerate imperfect programs. The combination of failfast, transactions, and system pairs or process pairs seems to tolerate many transient software faults."</b></p><p>This is the technical argument.</p><p><i>"Transactions, and their ACID properties, have four nice features:</i></p><p></p><ul style="text-align: left;"><li><i>Isolation. Each program is isolated from the concurrent activity of others and, consequently, from the failure of others.</i></li><li><i>Granularity. The effects of individual transactions can be discarded by rolling back a transaction, providing a fine granularity of failure.</i></li><li><i>Consistency. Rollback restores all state invariants, cleaning up any inconsistent data structures.</i></li><li><i>Durability. No committed work is lost.</i></li></ul><p></p><p><i>These features mean that transactions allow the system to crash and restart gracefully; the only thing lost is the time required to crash and restart. Transactions also limit the scope of failure by perhaps only undoing one transaction rather than restarting the whole system. </i><b>But the core issue for distributed computing is that the whole system cannot be restarted; only pieces of it can be restarted, since a single part generally doesn’t control all the other parts of the network. 
A restart in a distributed system, then, needs an incremental technique (like transaction undo) to clean up any distributed state.</b><i> Even if a transaction contains a Bohrbug, the correct distributed system state will be reconstructed by the transaction undo, and only that transaction will fail."</i></p><p>First of all, kudos to Gray/Reuter for thinking big, and aiming to address distributed systems challenges that would only start to loom large in the 2000s and have become ever more prominent since then. This is a solid argument in the book, especially from a 1990s point of view.</p><p>With 30+ years of hindsight, we notice a couple of problems with this argument. </p><p></p><ul style="text-align: left;"><li>There are fundamental information horizon limits to distributed transactions (at minimum due to speed of light), and <a href="https://muratbuffalo.blogspot.com/2024/01/scalable-oltp-in-cloud-whats-big-deal.html">scalability limits</a> (due to contended keys).</li><li><a href="https://queue.acm.org/detail.cfm?id=3458812">Fail-fast Is Failing... Fast!</a></li></ul><p>What we came to learn with experience is that it is futile to "paper over the distinction between local and remote objects... such a masking will be impossible," as <a href="https://scholar.harvard.edu/waldo/publications/note-distributed-computing">Jim Waldo famously stated in A Note on Distributed Computing.</a></p><p>So rather than trying to hide these through transactions in the middleware, we need to design end-to-end systems-level and application-level fault-tolerance approaches that respect distributed systems limitations.</p><p><br /></p><h1 style="text-align: left;">Questions / Comments</h1><p><br /></p><p>1. I really liked how Jim Gray tied software fault-tolerance to the transactions concept, and presented the all-or-nothing guarantee of transactions as a remedy/enabler for software fault-tolerance (from the 1990 point-of-view). 
I think <a href="https://muratbuffalo.blogspot.com/2011/01/crash-only-software-hotos03.html">crash-only software</a> was also a very good idea. It wasn't extended to distributed systems, but it provides a good base for fault-tolerance at the node level. Maybe transactional thinking can be relaxed towards crash-only software thinking, and the ideas could be combined.</p><p><br /></p><p>2. The book over-indexes on process-pair approaches with a primary and secondary. <i>"The concept of process pair (covered in Subsection 3.7.3) specifies that one process should instantly (in milliseconds) take over for the other in case the primary process fails. In the current discussion, we take the more Olympian view of system pairs, that is two identical systems in two different places. The second system has all the data of the first and is receiving all the updates from the first. Figure 3.2 has an example of such a system pair. If one system fails, the other can take over almost instantly (within a second). If the primary crashes, a client who sent a request to the primary will get a response from the backup a second later. Customers who own such system pairs crash a node once a month just as a test to make sure that everything is working—and it usually is."</i></p><p>This is expected, because distributed consensus and <a href="https://muratbuffalo.blogspot.com/search?q=paxos">Paxos approaches</a> were not well known in 1990. These process-pair approaches are prone to split-brain scenarios, where the secondary thinks the primary has crashed and takes over, while the primary, oblivious to this, keeps serving requests. There needs to be either leader election built via Paxos (which would require three-node deployments at minimum), or the use of Paxos as the configuration-metadata box to adjudicate over who is the primary and secondary. 
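A minimal sketch of this fencing idea follows; the names are hypothetical, and a real deployment would keep the epoch in a Paxos/Raft-backed store or lease service rather than an in-process object.

```python
class EpochStore:
    """Stands in for the consensus-backed configuration-metadata box."""
    def __init__(self):
        self.epoch = 0

    def promote(self):
        self.epoch += 1  # each new primary gets a strictly larger epoch
        return self.epoch


class Storage:
    """Rejects writes carrying a stale epoch, fencing off a deposed primary."""
    def __init__(self):
        self.highest_seen = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_seen:
            return False  # an old primary is still writing: reject
        self.highest_seen = epoch
        self.data[key] = value
        return True


store, disk = EpochStore(), Storage()
old_primary = store.promote()   # epoch 1
new_primary = store.promote()   # backup takes over with epoch 2
ok_new = disk.write(new_primary, "x", 1)   # accepted
ok_old = disk.write(old_primary, "x", 2)   # stale primary is fenced off
```

Because epochs only move forward, the oblivious old primary cannot clobber the new primary's writes even if it never learns it was deposed.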
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgoA7O2F4yftDEGI1yeGoSsxZn3Uqn10jcCkbWCFLgVjyBVhuluo54pPuLz-A4vBMBHitQ1KDccXt4UIGsyWQeOU8x3sJzwLfsezutAMSKxQZfJOG_z64Rmwz2Sv9lB4hqdiVOj4kpspgjIaw-TLo2ZbcIvAHIE-aLKUlqKIiztUF2XFPXrzDoqafdyQC4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1024" data-original-width="1006" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEgoA7O2F4yftDEGI1yeGoSsxZn3Uqn10jcCkbWCFLgVjyBVhuluo54pPuLz-A4vBMBHitQ1KDccXt4UIGsyWQeOU8x3sJzwLfsezutAMSKxQZfJOG_z64Rmwz2Sv9lB4hqdiVOj4kpspgjIaw-TLo2ZbcIvAHIE-aLKUlqKIiztUF2XFPXrzDoqafdyQC4=w393-h400" width="393" /></a></div><p><br /></p><p>3. <i>"To mask the unreliability of the ATMs, the bank puts two at each customer site. If one ATM fails, the client can step to the adjacent one to perform the task. This is a good example of analyzing the overall system availability and applying redundancy where it is most appropriate."</i> </p><p>What about today? Are there redundant computers in ATMs today? I think today this is mostly restart based fault-tolerance, no?</p><p><br /></p><p>4. The old-master, new-master technique in Section 3.1.2 reminded me of <a href="https://github.com/jonhoo/left-right/">the left-right primitive</a> in the <a href="https://muratbuffalo.blogspot.com/2022/12/noria-dynamic-partially-stateful.html">Noria paper</a>. Not the same thing, but I think it has similar ideas. And this kind of old-master new-master approach can even be used to provide some kind of tolerance to a poison-pill operation. </p><p><br /></p><p>5. <i>"Error recovery can take two forms. The first form of error recovery, backward error recovery, returns to a previous correct state. Checkpoint/restart is an example of backward error recovery. The second form, forward error recovery, constructs a new correct state. 
Redundancy in time, such as resending a damaged message or rereading a disk page, is an example of forward error recovery."</i></p><p>Today we don't hear the backward versus forward error recovery distinction frequently. It sounds like backward recovery is more suitable for larger-scale recovery/correction. And it seems to me that over time recovery got finer grained, and forward error recovery became the dominant model. There may have been some convergence and blurring of the lines between the two over time.</p><p><br /></p><p>6. <i>"As Figure 3.7 shows, software is a major source of outages. The software base (number of lines of code) grew by a factor of three during the study period, but the software MTTF held almost constant. This reflects a substantial improvement in software quality. But if these trends continue, the software will continue to grow at the same rate that the quality improves, and software MTTF will not improve."</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgG1KDWbPaPT_j1UGDiXIsEvL_lQvSuOTytZpwRcWOhAWIH4tI2fvO8pTGxANFEaDT0h2yxCPYpYP-kR1QKlhhva1paeQdTq0DKMhfn-nRppPgPz2WVqD2SRi_pTpAq_D4NzdnP63o60TTtmXomKyShZYLvNLAwAEjVhdcQlQ859b3uQuLesM8I6rgzKnc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="796" data-original-width="1006" height="317" src="https://blogger.googleusercontent.com/img/a/AVvXsEgG1KDWbPaPT_j1UGDiXIsEvL_lQvSuOTytZpwRcWOhAWIH4tI2fvO8pTGxANFEaDT0h2yxCPYpYP-kR1QKlhhva1paeQdTq0DKMhfn-nRppPgPz2WVqD2SRi_pTpAq_D4NzdnP63o60TTtmXomKyShZYLvNLAwAEjVhdcQlQ859b3uQuLesM8I6rgzKnc=w400-h317" width="400" /></a></div><p>Do we have a quantitative answer to whether this trend shaped up? 
Jim Gray had published the paper: "<a href="https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf">Why Do Computers Stop and What Can Be Done About It?</a>" This <a href="https://muratbuffalo.blogspot.com/2016/11/why-does-cloud-stop-computing-lessons.html">2016 study seems to be a follow-up</a> that revisits some of these questions. I think Jim's prediction was correct. For today's systems, software is still the main limit on availability/reliability. It also seems like the outage rate continued to shrink significantly.</p><p><br /></p><p>7. <i>"Production software has ≈3 design faults per 1,000 lines of code. Most of these bugs are soft; they can be masked by retry or restart. The ratio of soft to hard faults varies, but 100:1 is usual."</i> Does anybody know of a recent study that evaluated this and updated these numbers?</p><p><br /></p><p>8. In Section 3.6.1, the book describes N-version programming: <i>"Write the program n times, test each program carefully, and then operate all n programs in parallel, taking a majority vote for each answer. The resulting design diversity should mask many failures."</i> N-version programming indeed flopped, as the book predicted. It was infeasible to have multiple teams develop diverse versions of the software. But with the rise of AI and LLMs, could this become feasible?</p><p><br /></p><p>9. Is this a seedling of the later RESTful design idea? <i>"There is a particularly simple form of process pair called a persistent process pair. Persistent process pairs have a property that is variously called context free, stateless, or connectionless. Persistent processes are almost always in their initial state. They perform server functions, then reply to the client and return to their initial state. The primary of the persistent process pair does not checkpoint or send I’m Alive messages, but just acts like an ordinary server process. 
If the primary process fails in any way, the backup takes over in the initial state."</i></p><p><br /></p><p>10. Is this a seedling of disaggregated architecture and service-oriented-architecture ideas? <i>"A persistent process server should maintain its state in some form of transactional storage: a transaction-protected database. When the primary process fails, the transaction mechanism should abort the primary’s transaction, and the backup should start with a consistent state."</i></p><p><br /></p><p>11. This discussion in Section 3.9 seems relevant for <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastable failures</a>. It may be possible to boil metastable failures down to a state-desynchronization problem among the subparts of a system: cache, database, queue, client, etc. "Being out of touch with reality!" This is what Gray calls system delusion.</p><p><i>"The point of these two stories is that a transaction system is part of a larger closed-loop system that includes people, procedures, training, organization, and physical inventory, as well as the computing system. Transaction processing systems have a stable region; so long as the discrepancy between the real world and the system is smaller than some threshold, discrepancies get corrected quickly enough to compensate for the occurrence of new errors. However, if anything happens (as in the case of the improperly trained clerk) to push the system out of its stable zone, the system does not restore itself to a stable state; instead, its delusion is further amplified, because no one trusts the system and, consequently, no one has an incentive to fix it. If this delusion process proceeds unchecked, the system will fail, even though the computerized part of it is up and operating."</i></p><p><br /></p><p>12. <i>"System delusion doesn’t happen often, but when it does there is no easy or automatic restart. 
Thus, to the customer, fault tolerance in the transaction system is part of a larger fault-tolerance issue: How can one design the entire system, including the parts outside the computer, so that the whole system is fault tolerant?" </i></p><p>I think <a href="https://muratbuffalo.blogspot.com/2017/08/cloud-fault-tolerance.html">self-stabilization theory</a> provides a good answer to this question. The theory needs to be extended with control theory, and maybe queueing theory, to take care of workload-related problems as well.</p><p><br /></p><p>13. "Most large software systems have data structure repair programs that traverse data structures, looking for inconsistencies. Called auditors by AT&T and salvagers by others, these programs heuristically repair any inconsistencies they find. The code repairs the state by forming a hypothesis about what data is good and what data is damaged beyond repair. In effect, these programs try to mask latent faults left behind by some Heisenbug. Yet, their techniques are reported to improve system mean times to failure by an order of magnitude (for example, see the discussion of functional recovery routines)."</p><p>This reminds me of <a href="https://pawan-bhadauria.medium.com/distributed-systems-part-3-managing-anti-entropy-using-merkle-trees-443ea3fc6213">the anti-entropy processes</a> used by distributed storage systems. </p><p><br /></p><p>14. <i>"The presence of design faults is the ultimate limit to system availability; we have techniques that mask other kinds of faults."</i> </p><p>I understand the logic behind this, but are we sure there are no fundamental impossibility laws that prohibit perfect availability even when we have perfect design? <a href="https://muratbuffalo.blogspot.com/2015/02/paper-summary-perspectives-on-cap.html">CAP, FLP, attacking generals</a> impossibility results come to mind. Even without partitions, we seem to have emergent failures and metastability laws. 
So there may be another impossibility result there as well, involving a tradeoff between scale/throughput and availability. It is possible to build highly available systems, but they work only within well-characterized workload and environment conditions, so they are not suitable for general computing applications.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-22538075200713335602024-02-16T22:01:00.004-05:002024-02-16T22:01:51.828-05:00Transaction Processing Book, Gray/Reuter 1992<p> I started reading this book as part of <a href="https://discord.com/channels/824628143205384202/1193544147072729198">Alex Petrov's book club</a>.</p><p>I am loving the book. We will do Chapter 3 soon, and I am late in reporting on this, but here is some material from the first chapters. Unfortunately, I don't have the time to write fuller reviews of the first chapters, so I will try to give you the gist. </p><h1 style="text-align: left;">Foreword: Why We Wrote this Book</h1><p><i>The purpose of this book is to give you an understanding of how large, distributed, heterogeneous computer systems can be made to work reliably.</i></p><p><i>An integrated (and integrating) perspective and methodology is needed to approach the distributed systems problem.</i><b> Our goal is to demonstrate that transactions provide this integrative conceptual framework, and that distributed transaction-oriented operating systems are the enabling technology. </b><i>In a nutshell: without transactions, distributed systems cannot be made to work for typical real-life applications.</i></p><p>I am very much impressed by the distributed systems insight Gray provides throughout the book. Jim Gray was a distributed systems person. Fight me! Or bite me, I don't care.</p><p><i>Transaction processing concepts were conceived to master the complexity in single-processor online applications. 
If anything, these concepts are even more critical now for the successful implementation of massively distributed systems that work and fail in much more complex ways. This book shows how transaction concepts apply to distributed systems and how they allow us to build high-performance, high-availability applications with finite budgets and risks.</i></p><p><i>There are many books on database systems, both conventional and distributed; on operating systems; on computer communications; on application development—you name it. Such presentations offer many options and alternatives, but rarely give a sense of which are the good ideas and which are the not-so-good ones, and why. More specifically, were you ever to design or build a real system, these algorithm overviews would rarely tell you how or where to start.</i></p><p>This book indeed takes a very practical and quantitative approach to its topics, and I am very much impressed by that. For fault-tolerance analysis, the book uses many surveys (some of which were Gray's own studies) to back up the arguments/claims it puts forward. 
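In that quantitative spirit, here is a back-of-the-envelope availability calculation; the numbers are illustrative, not taken from the book's surveys.

```python
def availability(mttf_hours, mttr_hours):
    # Fraction of time a module is up: MTTF / (MTTF + MTTR).
    return mttf_hours / (mttf_hours + mttr_hours)

# One module: fails about every 5,000 hours, takes 10 hours to repair.
single = availability(mttf_hours=5_000, mttr_hours=10)

# A fail-independent pair is down only when both modules are down, so its
# unavailability is (roughly) the square of the single-module unavailability.
pair = 1 - (1 - single) ** 2
```

This quadratic improvement in unavailability is the arithmetic behind the process-pair and system-pair designs discussed in the fault-tolerance chapter.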
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigr9utktU1m7Rd4n46UseTxcKiWI7WU4_EABu31dDGW5R9pu-pOxY4SRrYEdC8LT4zwjTnmU8iWi_qeUVR3DPH6puyKpBghEm1T1fA7IsCP-4nmF1sYOTEodJOtlM0cATru2aoiIiGpg5KQP74ESwJm98vevpG7W-XGFP0JHplzgAORIa11j9HvESrfMs/s912/Screenshot%202024-02-16%20at%209.52.39%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="912" data-original-width="862" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigr9utktU1m7Rd4n46UseTxcKiWI7WU4_EABu31dDGW5R9pu-pOxY4SRrYEdC8LT4zwjTnmU8iWi_qeUVR3DPH6puyKpBghEm1T1fA7IsCP-4nmF1sYOTEodJOtlM0cATru2aoiIiGpg5KQP74ESwJm98vevpG7W-XGFP0JHplzgAORIa11j9HvESrfMs/w378-h400/Screenshot%202024-02-16%20at%209.52.39%E2%80%AFPM.png" width="378" /></a></div><p>As I read the book, I am in constant awe of the clarity of thinking behind the book. Gray was a <a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">master of abstraction.</a></p><h1 style="text-align: left;">1. Introduction</h1><p><i>Six thousand years ago, the Sumerians invented writing for transaction processing. An abstract system state, represented as marks on clay tablets/ledgers, was maintained. Today, we would call this the </i><b>database</b><i>. Scribes recorded state changes with new records (clay tablets) in the database. Today, we would call these state changes </i><b>transactions</b><i>.</i></p><p><i>This book contains considerably more information about the ACID properties. For now, however, a transaction can be considered a collection of actions with the following properties:</i></p><p></p><ul style="text-align: left;"><li><i>Atomicity. A transaction’s changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.</i></li><li><i>Consistency. 
A transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.</i></li><li><i>Isolation. Even though transactions execute concurrently, it appears to each transaction, T, that others executed either before T or after T, but not both.</i></li><li><i>Durability. Once a transaction completes successfully (commits), its changes to the state survive failures.</i></li></ul><p></p><h1 style="text-align: left;">2. Basic Computer System Terms</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhze2nxDgIBjzRNOgPX2sPChG83dDlrJK7qSGwrg5rtEkUtvSUEKBqOKjN5p0ZxcR_k22La8tVlJj_Hu5umEBHur00A-3Mzy9OaO0rWho9mBO33JzFEdbn1uKbmBXK_2SshL8coM38fComXXBjXpJqmYxZCGXsrLfxIVRKvEJJ8y7uLlU3RkvMp67HBcZw/s862/Screenshot%202024-02-16%20at%209.56.10%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="572" data-original-width="862" height="265" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhze2nxDgIBjzRNOgPX2sPChG83dDlrJK7qSGwrg5rtEkUtvSUEKBqOKjN5p0ZxcR_k22La8tVlJj_Hu5umEBHur00A-3Mzy9OaO0rWho9mBO33JzFEdbn1uKbmBXK_2SshL8coM38fComXXBjXpJqmYxZCGXsrLfxIVRKvEJJ8y7uLlU3RkvMp67HBcZw/w400-h265/Screenshot%202024-02-16%20at%209.56.10%E2%80%AFPM.png" width="400" /></a></div><p><b>The Five-Minute Rule. </b><i>How shall we manage these huge memories? The answers so far have been clustering and sequential access. However, there is one more useful technique for managing caches, called the five-minute rule. Given that we know what the data access patterns are, when should data be kept in main memory and when should it be kept on disk? The simple way of answering this question is, Frequently accessed data should be in main memory, while it is cheaper to store infrequently accessed data on disk. 
Unfortunately, the statement is a little vague: What does frequently mean? The five-minute rule says frequently means five minutes, but the rule reflects a way of reasoning that also applies to any cache-secondary memory structure. In those cases, depending on relative storage and access costs, frequently may turn out to be milliseconds, or it may turn out to be days.</i></p><p>The basic principles of the five-minute rule held well over the years.</p><p></p><ul style="text-align: left;"><li>The Five-Minute Rule: <a href="https://queue.acm.org/detail.cfm?id=1413264">20 Years Later and How Flash Memory Changes the Rules </a></li><li>The Five-Minute Rule <a href="https://dl.acm.org/doi/pdf/10.1145/3318163">30 Years Later and Its Impact on the Storage Hierarchy </a></li></ul><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRTdxC5H6RSjhsp6NrtpTutHr3bag0nCN2n7D4-mkf8LfKqgiHSnjm4NL9DGQdG4dkpJENS1Ph5hYvHS7MUiHiaXPgMmjgO-wtTGzmXyj2bhdaCMlkbqpDlEFdCtnN_PnUIaXRcMvxEDEhqcFVHlp8rROSAJDuE1kwsKl99Oy0MSnxLF16k8uj5_FVycU/s862/Screenshot%202024-02-16%20at%209.58.40%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="752" data-original-width="862" height="349" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRTdxC5H6RSjhsp6NrtpTutHr3bag0nCN2n7D4-mkf8LfKqgiHSnjm4NL9DGQdG4dkpJENS1Ph5hYvHS7MUiHiaXPgMmjgO-wtTGzmXyj2bhdaCMlkbqpDlEFdCtnN_PnUIaXRcMvxEDEhqcFVHlp8rROSAJDuE1kwsKl99Oy0MSnxLF16k8uj5_FVycU/w400-h349/Screenshot%202024-02-16%20at%209.58.40%E2%80%AFPM.png" width="400" /></a></div><i><b>Shared nothing.</b> In a shared-nothing design, each memory is dedicated to a single processor. All accesses to that data must pass through that processor. 
Processors communicate by sending messages to each other via the communications network.</i><p></p><p><i><b>Shared global.</b> In a shared-global design, each processor has some private memory not accessible to other processors. There is, however, a pool of global memory shared by the collection of processors. This global memory is usually addressed in blocks (units of a few kilobytes or more) and is RAM disk or disk.</i></p><p><i><b>Shared memory.</b> In a shared-memory design, each processor has transparent access to all memory. If multiple processors access the data concurrently, the underlying hardware regulates the access to the shared data and provides each processor a current view of the data.</i></p><p><br /></p><p><b>Unfortunately, both for today and for many years to come, software dominates the cost of databases and communications. </b><i>It is possible, but not likely, that the newer, faster processors will make software less of a bottleneck.</i></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-21303490355154420792024-02-13T15:20:00.008-05:002024-02-13T15:23:17.643-05:00 Verifying Transactional Consistency of MongoDB<p><a href="https://arxiv.org/abs/2111.14946">This paper</a> presents pseudocode for the transaction protocols for the three possible MongoDB deployments: WiredTiger, ReplicaSet, and ShardedCluster, and shows that these satisfy different variants of snapshot isolation: namely StrongSI, RealtimeSI, and SessionSI, respectively.</p><h1 style="text-align: left;">Background</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhRKUHzqn1zQVi3iX4HWslSosCz9pXgV3-eaSeVaT85OHKL10swTozrHOgl2j04-FqU_Ezj-eyHL3lmHThPupINUNjpDa96uaF64PYsbkYkYRRuhDWOzkEvNUKkPhbOr_bTVEuk1G6Fd7uZYokKJ6_L4xXsdzRunhrWlfxTT1SWI8NWw0r1zPh5h1Q8GcI" style="margin-left: 1em; margin-right: 1em;"><img alt="" 
data-original-height="918" data-original-width="1270" height="289" src="https://blogger.googleusercontent.com/img/a/AVvXsEhRKUHzqn1zQVi3iX4HWslSosCz9pXgV3-eaSeVaT85OHKL10swTozrHOgl2j04-FqU_Ezj-eyHL3lmHThPupINUNjpDa96uaF64PYsbkYkYRRuhDWOzkEvNUKkPhbOr_bTVEuk1G6Fd7uZYokKJ6_L4xXsdzRunhrWlfxTT1SWI8NWw0r1zPh5h1Q8GcI=w400-h289" width="400" /></a></div><p><a href="https://docs.mongodb.com/manual/core/transactions/#transactions-and-atomicity">MongoDB transactions</a> have evolved in three stages (Figure 1):</p><p></p><ul style="text-align: left;"><li>In version 3.2, MongoDB used the WiredTiger storage engine as the default storage engine. Utilizing the Multi-Version Concurrency Control (MVCC) architecture of the WiredTiger storage engine, MongoDB was able to support single-document transactions in the standalone deployment.</li><li>In version 4.0, MongoDB supported multi-document transactions in replica sets (which consist of a primary node and several secondary nodes).</li><li>In version 4.2, MongoDB further introduced distributed multi-document transactions in sharded clusters (each of which is a group of multiple replica sets among which data is sharded).</li></ul><p></p><p>I love that the paper managed to present the transactional protocols on these three deployment types in a layered/superimposed manner. We start with the bottom layer, the WiredTiger transactions in Algorithm 1. Then the replicaset algorithm, Algorithm 2, is presented, which uses primitives from Algorithm 1. Finally, the ShardedCluster transactions algorithm is presented, using primitives from Algorithm 2. Ignore the underlined and highlighted lines in Algorithms 1 and 2; they are needed for the higher layer algorithms, which are discussed later on.</p><p>If you need a primer on transaction isolation levels, you can check <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">this</a> and <a href="https://jepsen.io/consistency">this</a>. 
<a href="https://muratbuffalo.blogspot.com/2023/09/a-snapshot-isolated-database-modeling.html">The TLA+ model I presented for snapshot isolation</a> is also useful to revisit to understand how snapshot isolation works in principle. </p><h1 style="text-align: left;">WiredTiger (WT) transactions</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhKirhOFucrjpFq7agQEFt1KhB-1FbVCj0s6dSYrLeLgd8YfInAspoSZyPkOQcDwhh6NbK8bzZVANawOX7KmI_LNPFfdLbfrh4D9sBwtdUfyRjaW06e1ad7M4JlMcawPewrPHmofllYuaMl8jQY3Asfb5YNz27feU8OC7850JsUwSXDFLy-8NDCvP3i2eM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1536" data-original-width="1536" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEhKirhOFucrjpFq7agQEFt1KhB-1FbVCj0s6dSYrLeLgd8YfInAspoSZyPkOQcDwhh6NbK8bzZVANawOX7KmI_LNPFfdLbfrh4D9sBwtdUfyRjaW06e1ad7M4JlMcawPewrPHmofllYuaMl8jQY3Asfb5YNz27feU8OC7850JsUwSXDFLy-8NDCvP3i2eM" width="240" /></a></div><p>Clients interact with WiredTiger via sessions. Each client is bound to a single session with a unique session identifier wt_sid. At most one transaction is active on a session at any time. Intuitively, each transaction is only aware of the transactions that have already been committed before it starts. 
To this end, a transaction txn maintains txn.concur (the set of identifiers of currently active transactions that have obtained their identifiers) and txn.limit (the next transaction identifier, tid, when txn starts).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg4kxAZ7dpoxzjEqkxuTetBGpYEJ8UVeDICc-fFHLWotLgLNJhuZU0FFNUidWPP2XCp_crgEIQH_ZE2UFymL0XDTqEA2Cr24YG38GRBuy1-7_3rcKbHK9zAYRPZruVGAxTnddyXQ9pSkbtxjnxGYgTA3qOcuzxqNcwS_S5AJyXu2SU6pqiaIbZd5V049bg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="314" data-original-width="2206" src="https://blogger.googleusercontent.com/img/a/AVvXsEg4kxAZ7dpoxzjEqkxuTetBGpYEJ8UVeDICc-fFHLWotLgLNJhuZU0FFNUidWPP2XCp_crgEIQH_ZE2UFymL0XDTqEA2Cr24YG38GRBuy1-7_3rcKbHK9zAYRPZruVGAxTnddyXQ9pSkbtxjnxGYgTA3qOcuzxqNcwS_S5AJyXu2SU6pqiaIbZd5V049bg=s16000" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgp_HzmQt9awt2agKII7M5tvA05j8yi-1GL_yuXK8LRQFeigGabK5uk3menTGYDvKUixyUoPUoauUgfIZvrGWGZfihstC37jNPL2Q1TRV3RM9nK32IEyiv57xzniyUzkG478uZ0xzwa3-XY3oCRF4JRwMU3Nwv-Hs9ZfMFfAmVMao-ovIr9KURa5HRyS_A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="2520" data-original-width="2248" src="https://blogger.googleusercontent.com/img/a/AVvXsEgp_HzmQt9awt2agKII7M5tvA05j8yi-1GL_yuXK8LRQFeigGabK5uk3menTGYDvKUixyUoPUoauUgfIZvrGWGZfihstC37jNPL2Q1TRV3RM9nK32IEyiv57xzniyUzkG478uZ0xzwa3-XY3oCRF4JRwMU3Nwv-Hs9ZfMFfAmVMao-ovIr9KURa5HRyS_A=s16000" /></a></div><p>A client starts a transaction on a session wt_sid by calling wt_start, which creates and populates a transaction txn (lines 1:2–1:5). Particularly, it scans wt_global to collect the concurrently active transactions on other sessions into txn.concur. tid tracks the next monotonically increasing transaction identifier to be allocated. When a transaction txn starts, it initializes txn.tid to 0. 
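</p><p>To make the mechanics concrete, here is a rough Python sketch of Algorithm 1's visibility rule and its first-updater-wins conflict check on a key's update list. This is my own illustration under simplifying assumptions; Txn, Update, store, wt_read, wt_update, and next_tid are invented names, not the paper's pseudocode.</p>

```python
# Hypothetical sketch (invented names, not the paper's pseudocode) of
# snapshot visibility and the first-updater-wins conflict check on a
# key's update list, which is kept newest-first.

class Txn:
    def __init__(self, concur, limit):
        self.tid = 0              # assigned lazily on first update
        self.concur = concur      # tids active when this txn started
        self.limit = limit        # next tid to be allocated at start time

    def sees(self, tid):
        # Own updates are visible; otherwise only txns that had already
        # committed when this txn started (not concurrent, not later).
        return tid == self.tid or (tid not in self.concur and tid < self.limit)

class Update:
    def __init__(self, tid, val):
        self.tid, self.val, self.aborted = tid, val, False

def wt_read(txn, store, key):
    # Return the value of the first visible, non-aborted update (line 1:10).
    for u in store.get(key, []):
        if not u.aborted and txn.sees(u.tid):
            return u.val
    return None

def wt_update(txn, store, key, val, next_tid):
    # Roll back if an invisible, non-aborted txn updated key (lines 1:14-1:17).
    for u in store.get(key, []):
        if not u.aborted and not txn.sees(u.tid):
            return False
    if txn.tid == 0:
        txn.tid = next_tid()      # assign tid on first update (line 1:19)
    store.setdefault(key, []).insert(0, Update(txn.tid, val))
    return True

store, counter = {}, [0]
def next_tid():
    counter[0] += 1
    return counter[0]

t1 = Txn(concur=set(), limit=1)
assert wt_update(t1, store, "k", "v1", next_tid)      # t1 gets tid 1
t2 = Txn(concur={1}, limit=2)                         # started while t1 active
assert not wt_update(t2, store, "k", "v2", next_tid)  # write-write conflict
assert wt_read(t1, store, "k") == "v1"
```

<p>Note how t2, which started while t1 was still active, fails the conflict check even though t1 has not committed yet: the first writer claims the key.</p><p>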
The actual (non-zero) txn.tid is assigned when its first update operation is successfully executed. A transaction txn with txn.tid ≠ 0 may be aborted due to a conflict caused by a later update.</p><p>To read from a key, we iterate over the update list store[key] forward and return the value written by the first visible transaction (line 1:10). To update a key, we first check whether the transaction, denoted txn, should be aborted due to conflicts (lines 1:14–1:17). To this end, we iterate over the update list store[key]. If there are updates on key made by transactions that are invisible to txn and are not aborted, txn will be rolled back. If txn passes the conflict check, it is assigned a unique transaction identifier, i.e., tid, in case it has not yet been assigned one (line 1:19). Finally, the key-value pair ⟨key,val⟩ is added into the modification set txn.mods and is inserted at the front of the update list store[key].</p><p>To commit the transaction on session wt_sid, we simply reset wt_global[wt_sid] to ⊥tid, indicating that there is currently no active transaction on this session (line 1:32). To roll back a transaction txn, we additionally reset txn.tid in store to −1 (line 1:38). Note that read-only transactions (which are characterized by txn.tid=0) can always commit successfully.</p><h1 style="text-align: left;">Replica Set Transactions</h1><p>A replica set consists of a single primary node and several secondary nodes. All transactional operations, i.e., start, read, update, and commit, are first performed on the primary. 
Committed transactions are wholesale-replicated to the secondaries<a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html"> via a leader-based consensus protocol similar to Raft.</a> In other words, before the completion of the transaction, the entire effect of the transaction is sent to the secondaries and majority-replicated with the assigned timestamp.</p><p>We don't go into this in the protocol description here, but there is a clever speculative snapshot isolation algorithm used by the primary for transaction execution. <a href="https://muratbuffalo.blogspot.com/2024/02/tunable-consistency-in-mongodb.html">I summarized that at the end of my review for "Tunable Consistency in MongoDB"</a>. Here is the relevant part: MongoDB uses an innovative strategy for implementing readConcern within transactions that greatly reduced aborts due to write conflicts in back-to-back transactions. When a user specifies readConcern level “majority” or “snapshot”, the returned data is guaranteed to be committed to a majority of replica set members. Outside of transactions, this is accomplished by reading at a timestamp at or earlier than the majority commit point in WiredTiger. However, this is problematic for transactions: It is useful for write operations to read the freshest version of a document, since the write will abort if there is a newer version of the document than the one it read. This motivated the implementation of “speculative” majority and snapshot isolation for transactions. Transactions read the latest data for both read and write operations, and at commit time, if the writeConcern is w:“majority”, they wait for all the data they read to become majority committed. This means that a transaction only satisfies its readConcern guarantees if the transaction commits with writeConcern w:“majority”. 
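</p><p>A tiny Python sketch of this "speculative majority" commit-time wait, under my own simplified model (a single scalar majority commit point, and invented names throughout):</p>

```python
# Hypothetical sketch: a transaction reads at the latest timestamps and
# only at commit time (with w:"majority") waits for the majority commit
# point to cover everything it read.

def speculative_commit(read_timestamps, majority_point, advance_majority):
    need = max(read_timestamps)        # newest data this txn observed
    while majority_point[0] < need:    # not yet majority committed
        advance_majority()             # stand-in for replication progress
    return True                        # readConcern guarantee now holds

mp = [3]                               # current majority commit point
def advance():
    mp[0] += 1                         # simulate replication advancing

assert speculative_commit({2, 5}, mp, advance)
assert mp[0] >= 5
```

<p>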
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQcOwZLAHPdTZ1dHxnIv2BuRhzGVl8BRcpbw88IhYx16acsiphwNv7e83uft1orqr9w_B4kM6-NatFicaJvlWkKWHhbYGQ6fecHvKz9RaA-3uesYwV2wG9v9KdybLLYgktaWIektlvbhaJPdHsMVzMXTJXUAHX_t-1giSMlOu9DM87VvfQaEkR84Fp8iU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1372" data-original-width="2188" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQcOwZLAHPdTZ1dHxnIv2BuRhzGVl8BRcpbw88IhYx16acsiphwNv7e83uft1orqr9w_B4kM6-NatFicaJvlWkKWHhbYGQ6fecHvKz9RaA-3uesYwV2wG9v9KdybLLYgktaWIektlvbhaJPdHsMVzMXTJXUAHX_t-1giSMlOu9DM87VvfQaEkR84Fp8iU=s16000" /></a></div><p>ReplicaSet uses <a href="https://muratbuffalo.blogspot.com/2014/07/hybrid-logical-clocks.html">hybrid logical clocks (HLC)</a> as the read and commit timestamps of transactions. When a transaction starts, it is assigned a read timestamp on the primary such that all transactions with smaller commit timestamps have been committed in WiredTiger. That is, the read timestamp is the maximum point at which the oplog of the primary has no gaps (1:42).</p><p>When the primary receives the first operation of a transaction (lines 2:4 and 2:11), it calls open_wt_session to open a new session wt_sid to WiredTiger, start a new WiredTiger transaction on wt_sid, and, more importantly, set the transaction’s read timestamp. The primary delegates the read/update operations to WiredTiger (lines 2:7 and 2:14). If an update succeeds, the ⟨key,val⟩ pair is recorded in txn_mods[rs_sid] (line 2:16). To commit a transaction, the primary first atomically increments its cluster time ct via tick, takes it as the transaction’s commit timestamp (line 2:23), uses it to update max_commit_ts, and records it in wt_global (lines 2:24 and 1:46).</p><p>If this is a read-only transaction, the primary appends a noop entry to its oplog (line 2:27; Section 4.1.2). 
Otherwise, it appends an entry containing the updates of the transaction. Each oplog entry is associated with the commit timestamp of the transaction. Then, the primary asks WiredTiger to locally commit this transaction in wt_commit (line 2:30), which associates the updated key-value pairs in store with the commit timestamp (line 1:31). Note that wt_commit need not be executed atomically with tick and wt_set_commit_ts.</p><p>Finally, the primary waits for all updates of the transaction to be majority committed (line 2:31). Specifically, it waits for last_majority_committed ≥ ct, where last_majority_committed is the timestamp of the last oplog entry that has been majority committed.</p><h1 style="text-align: left;">Sharded cluster transactions</h1><p>A client issues distributed transactions via a session connected to a mongos. The mongos, as a transaction router, uses its cluster time as the read timestamp of the transaction and forwards the transactional operations to the corresponding shards. The shard that receives the first read/update operation of a transaction is designated as the transaction coordinator.</p><p>If a transaction has not been aborted due to write conflicts in sc_update, the mongos can proceed to commit it. If this transaction is read-only, the mongos instructs each of the participants to directly commit locally via rs_commit; otherwise, the mongos instructs the transaction coordinator to perform a variant of two-phase commit (2PC) that always commits among all participants (line 4:9). So, the coordinator sends a prepare message to all participants. After receiving the prepare message, a participant computes a local prepare timestamp and returns it to the coordinator in a prepare_ack message. When the coordinator receives prepare_ack messages from all participants, it calculates the transaction’s commit timestamp by taking the maximum of all prepare timestamps (line 4:14), and sends a commit message to all participants. 
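</p><p>The prepare/commit exchange just described can be sketched in a few lines of Python. This is a toy model of my own (FakeParticipant and all other names are invented, and real participants additionally wait for majority replication at each step):</p>

```python
# Hypothetical sketch of the coordinator side of the always-commit 2PC
# variant: gather prepare timestamps, take their max as the commit
# timestamp (line 4:14), and commit everywhere at that timestamp.

def two_phase_commit(participants):
    prepare_ts = [p.prepare() for p in participants]   # phase 1
    commit_ts = max(prepare_ts)
    for p in participants:                             # phase 2
        p.commit(commit_ts)
    return commit_ts

class FakeParticipant:
    def __init__(self, clock):
        self.clock = clock          # participant's cluster time
        self.committed_at = None
    def prepare(self):
        self.clock += 1             # advance cluster time; use as prepare ts
        return self.clock
    def commit(self, ts):
        self.committed_at = ts

ps = [FakeParticipant(5), FakeParticipant(9), FakeParticipant(7)]
assert two_phase_commit(ps) == 10                  # max(6, 10, 8)
assert all(p.committed_at == 10 for p in ps)
```

<p>Taking the maximum of the prepare timestamps ensures the commit timestamp is no smaller than any participant's cluster time at prepare, so every participant can order the commit consistently.</p><p>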
After receiving dec_ack messages from all participants, the coordinator replies to the mongos (line 4:18).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjC_e-w-WcFVGwXCl4CKJOAsdd8ahUUMJPRW5Gzm5sfekGls_hXx5lNVRKW_byqmY83hzHuMHY0znixYXqat3HqeA2f2R4m-U4xJVFxlQpSq8bONdsMW09MTa3j-It7-EDVe0dzHhCOH5vMTa3Y90Lzoh7ePi2Mwi7cOLShCqhuu2w6Dglrz_jtU4bmzp4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1872" data-original-width="2188" src="https://blogger.googleusercontent.com/img/a/AVvXsEjC_e-w-WcFVGwXCl4CKJOAsdd8ahUUMJPRW5Gzm5sfekGls_hXx5lNVRKW_byqmY83hzHuMHY0znixYXqat3HqeA2f2R4m-U4xJVFxlQpSq8bONdsMW09MTa3j-It7-EDVe0dzHhCOH5vMTa3Y90Lzoh7ePi2Mwi7cOLShCqhuu2w6Dglrz_jtU4bmzp4=s16000" /></a></div><p>Consider a session sc_sid connected to a mongos. We use read_ts[sc_sid] to denote the read timestamp, assigned by the mongos, of the currently active transaction on the session. ShardedCluster uses HLCs, which are loosely synchronized, to assign read and commit timestamps to transactions. Due to clock skew or pending commit, a transaction may receive a read timestamp from a mongos, but the corresponding snapshot may not yet be fully available at transaction participants. This leads to delaying the read/update operations until the snapshot becomes available. These cases are referred to as Case-XXX in the description below.</p><p>If this is the first operation the primary receives, it calls sc_start to set the transaction’s read timestamp in WiredTiger (line 4:23). In sc_start, it also calls wait_for_read_concern to handle Case-Clock-Skew and Case-Holes (line 4:24). The primary then delegates the operation to ReplicaSet (lines 4:3 and 4:7). To handle Case-Pending-Commit-Read, rs_read has been modified to keep retrying the read from WiredTiger until it returns a value updated by a committed transaction (line 2:8). 
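</p><p>A minimal sketch of that retry loop, in Python with invented names (in the real system the reader blocks until the pending transaction's commit or abort arrives; here the progress is simulated):</p>

```python
# Hypothetical sketch of Case-Pending-Commit-Read: keep retrying until
# the value to be returned was written by a *committed* transaction.

def rs_read_retry(key, store, status, max_tries=100):
    """store: key -> (tid, value); status: tid -> 'prepared'|'committed'."""
    for _ in range(max_tries):
        tid, val = store[key]
        if status.get(tid) == "committed":
            return val
        advance(status, tid)   # in reality: wait; here: simulate progress
    raise TimeoutError("writer never committed")

def advance(status, tid):
    status[tid] = "committed"  # stand-in for the commit message arriving

store = {"k": (42, "v")}
status = {42: "prepared"}      # writer is prepared but not yet committed
assert rs_read_retry("k", store, status) == "v"
```

<p>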
To handle Case-Pending-Commit-Update, rs_update first performs an sc_read on the same key (line 2:12). Moreover, if the update fails due to write conflicts, the mongos will send an abort message to the primary nodes of all other participants, without entering 2PC.</p><p>In 2PC, the transaction coordinator behaves as we described above. On the participant side, after receiving a prepare message, the participant advances its cluster time and takes it as the prepare timestamp (lines 4:27, 4:28, 1:54, and 1:58). Note that the transaction’s tid in wt_global is reset to ⊥tid (line 1:59). Thus, according to the visibility rule, this transaction is visible to other transactions that start later in WiredTiger. Next, the participant creates an oplog entry containing the updates executed locally or a noop oplog entry for the “speculative majority” strategy. Then, it waits until the oplog entry has been majority committed (line 4:34). When a participant receives a commit message, it ticks its cluster time. After setting the transaction’s commit timestamp (line 4:39), it asks WiredTiger to commit the transaction locally (line 4:40). Note that the status of the transaction is changed to committed (line 1:70). Thus, this transaction is now visible to other waiting transactions (line 2:8). Then, the participant generates an oplog entry containing the commit timestamp and waits for it to be majority committed.</p><h1 style="text-align: left;">Discussion</h1><p>The paper provides a nice simplified/understandable overview of MongoDB transactions. It mentions some limitations of this simplified model. The paper assumed that each procedure executes atomically, but the implementation of MongoDB is highly concurrent with intricate locking mechanisms. 
The paper also did not consider failures or explore the fault tolerance and recovery of distributed transactions.</p><p>As an interesting future research direction, I double down on the cross-layer opportunities I had mentioned in my previous post. The layered/superimposed presentation in this paper strengthens my hunch that we can have more cross-layer optimization opportunities in MongoDB going forward.</p><p>So what did we learn about MongoDB transactions? They are general transactions, rather than <a href="https://muratbuffalo.blogspot.com/2023/08/distributed-transactions-at-scale-in.html">limited one-shot transactions</a>. They use snapshot isolation, reading from a consistent snapshot, and aborting only on a write-write conflict, similar to major RDBMS transactions.</p><p>They are "OCC", but thanks to the underlying WiredTiger holding the lock on first access, they are less prone to aborting than a pure OCC transaction. An in-progress transaction stops later writes (be it from other transactions or single writes) instead of getting aborted by them. In other words, the first writer claims the item (for some time).</p><p>That being said, this is not truly locking and holding. MDB transactions still favor progress, as they do not like <a href="https://www.cs.colostate.edu/~cs551/CourseNotes/Deadlock/WaitWoundDie.html">waiting</a>. 
They would just die instead of waiting, ain't nobody got time for that.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-77499742982095126432024-02-08T19:45:00.002-05:002024-02-08T20:58:37.464-05:00Tunable Consistency in MongoDB<p><a href="https://www.vldb.org/pvldb/vol12/p2071-schultz.pdf">This paper appeared in VLDB 2019.</a> It discusses the tunable consistency models in <a href="https://www.mongodb.com/">MongoDB</a> and how MongoDB's speculative execution model and data rollback protocol enable this spectrum of consistency levels efficiently.</p><h1 style="text-align: left;">Motivation</h1><p>Applications often tolerate short or infrequent periods of inconsistency, so it may not make sense for them to pay the high cost of ensuring strong consistency at all times. These types of trade-offs have been partially codified in the <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">PACELC theorem</a>. The <a href="https://muratbuffalo.blogspot.com/2019/01/paper-review-probabilistically-bounded.html">Probabilistically Bounded Staleness work</a> (and <a href="https://muratbuffalo.blogspot.com/2014/02/consistency-based-service-level.html">many</a> <a href="https://muratbuffalo.blogspot.com/2016/03/paper-summary-measuring-and.html">followup</a> <a href="https://muratbuffalo.blogspot.com/2023/05/keep-calm-and-crdt-on.html">work</a>) explored the trade-offs between operation latency and data consistency in distributed database replication and showcased their importance. </p><p>To provide users with a set of tunable consistency options, MongoDB exposes writeConcern and readConcern levels as parameters that can be set on each database operation. writeConcern specifies what durability guarantee a write must satisfy before being acknowledged to a client. 
Similarly, readConcern determines what durability or consistency guarantees data returned to a client must satisfy. For safety, it is preferable to use readConcern “majority” and writeConcern “majority”. However, when users find stronger consistency levels to be too slow, they switch to using weaker consistency levels.</p><p>In MongoDB, when reading from and writing to the primary, users usually read their own writes and the system behaves like a single node. This has durability implications under faults, but many applications tolerate them well. An example of this is a game site that matches active players. This site has a high volume of writes, since its popularity means there are many active players looking to begin games. Durability is not important in this use case, since if a write is lost, the player typically retries immediately and is matched into another game.
However, double writes are painful, since it is undesirable user behavior to have the same post twice. For this reason, reads use readConcern level “majority” with causal consistency so that a user can definitively see whether their post was successful.</p><p>To characterize the consistency levels used by MongoDB application developers, the paper collected operational data from 14,820 instances running 4.0.6 that are managed by MongoDB Atlas. These counts are from 2019, and they are also fairly low because around the data collection time all nodes had been restarted in order to upgrade them to 4.0.6. But they give an idea about the spectrum of read and write concerns used by customer applications. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEggu5QvC6Atfl7jTni2ZO_ar1j-MW5vgMjTjB2n9C8u8e6dWzCmsfVOzFesFTw2f4IF7VzM4jX0A_tBUJczoWKfcHGH7BvhoPIK7hBgsXEqrVOh257rvayH9RMhvD8-igtIpc0R4_coWAaUmQDw2eUC_IQEuGO6ehKWTnjszxEPMttryRUAI8h6515lr0s" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1030" data-original-width="1292" height="319" src="https://blogger.googleusercontent.com/img/a/AVvXsEggu5QvC6Atfl7jTni2ZO_ar1j-MW5vgMjTjB2n9C8u8e6dWzCmsfVOzFesFTw2f4IF7VzM4jX0A_tBUJczoWKfcHGH7BvhoPIK7hBgsXEqrVOh257rvayH9RMhvD8-igtIpc0R4_coWAaUmQDw2eUC_IQEuGO6ehKWTnjszxEPMttryRUAI8h6515lr0s=w400-h319" width="400" /></a></div><h1 style="text-align: left;">Background</h1><p>MongoDB is a NoSQL, document oriented database that stores data in JSON-like objects. All data in MongoDB is stored in a binary form of JSON called BSON. A MongoDB database consists of a set of collections, where a collection is a set of unique documents. 
MongoDB utilizes <a href="https://github.com/wiredtiger">the WiredTiger storage engine</a>, which is a transactional multi-version concurrency control (MVCC) key value data store that manages the interface to a local durable storage medium.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisyN95YDzTvzAcKEvhUKwkJDReaN_fG_EKLNzNIVRah6CnSzQfg-Kgyyso0PPyIzCL_d80eFL6-Rsk4SLcjpNajoB5T51c7gjUf_5KWsaWfF9rPC-WNqxTH6FalSA1DvvKW_-UtfSJ3tUFDCwkAloW4VhX0pia46MlTlmaVLLDsLMXsRWnvJpxjqXMmSk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1158" data-original-width="1292" height="358" src="https://blogger.googleusercontent.com/img/a/AVvXsEisyN95YDzTvzAcKEvhUKwkJDReaN_fG_EKLNzNIVRah6CnSzQfg-Kgyyso0PPyIzCL_d80eFL6-Rsk4SLcjpNajoB5T51c7gjUf_5KWsaWfF9rPC-WNqxTH6FalSA1DvvKW_-UtfSJ3tUFDCwkAloW4VhX0pia46MlTlmaVLLDsLMXsRWnvJpxjqXMmSk=w400-h358" width="400" /></a></div><p>To provide high availability, MongoDB provides the ability to run a database as a replica set <a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">using a leader based consensus protocol based on Raft</a>. In a replica set there exists a single primary and a set of secondary nodes. The primary node accepts client writes and inserts them into a replication log known as the oplog, where each entry contains information about how to apply a single database operation. Each entry is assigned a timestamp; these timestamps are unique and totally ordered within a node’s log. 
Oplog entries do not contain enough information to undo operations.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhzo0KaQQ3aHf35GG49_wilbkdow_8akiso4K7AU0Oodosee5lOjlDbY5qXSOedPQEZUp6JQaLyGC0o5xV-WYclXetV_h9T2w62THn6S9J_Qf31mfiE3w5wqR9EfcZD8sf1FlnjvrC7isi7UrcUGl722WbRownYNtxtPk6gxnCsE2mxbQTBmD81CsOkqTQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="876" data-original-width="1292" height="271" src="https://blogger.googleusercontent.com/img/a/AVvXsEhzo0KaQQ3aHf35GG49_wilbkdow_8akiso4K7AU0Oodosee5lOjlDbY5qXSOedPQEZUp6JQaLyGC0o5xV-WYclXetV_h9T2w62THn6S9J_Qf31mfiE3w5wqR9EfcZD8sf1FlnjvrC7isi7UrcUGl722WbRownYNtxtPk6gxnCsE2mxbQTBmD81CsOkqTQ=w400-h271" width="400" /></a></div><p>The MongoDB replication system serializes every write that comes into the system into the oplog. When an operation is processed by a replica set primary, the effect of that operation must be written to the database, and the description of that operation must also be written into the oplog. All operations in MongoDB occur inside WiredTiger transactions. When an operation’s transaction commits, we call the operation locally committed. Once it has been written to the database and the oplog, it can be replicated to secondaries, and once it has propagated to enough nodes that meet the necessary conditions, the operation will become majority committed (marked as such in primary and later learned by secondaries) which means it is permanently durable in the replica set.</p><p>For horizontal scaling, MongoDB also offers sharding, which allows users to partition their data across multiple replica sets, but we won't discuss it in this paper.</p><h3 style="text-align: left;">writeConcern</h3><p>writeConcern can be specified either as a numeric value or as “majority”. 
Write operations done at w:N will be acknowledged to a client when at least N nodes of the replica set (including the primary) have received and locally committed the write. Clients that issue a w:“majority” write will not receive acknowledgement until it is guaranteed that the write operation is majority committed. This means that the write will be resilient to any temporary or permanent failure of any set of nodes in the replica set, assuming there is no data loss at the underlying OS or hardware layers. </p><h3 style="text-align: left;">readConcern</h3><p>For a read operation done at readConcern “local”, the data returned will reflect the local state of a replica set node at the time the query is executed. There are no guarantees that the data returned is majority committed in the replica set, but it will reflect the newest data known to a particular node. Reads with readConcern “majority” are guaranteed to only return data that is majority committed. For majority reads, there is no strict guarantee on the recency of the returned data: The data may be staler than the newest majority committed write operation. (We revisit this in the Consistency Spectrum section, and discuss how majority reads differs from that in Cassandra.)</p><p>MongoDB also provides “linearizable” readConcern, which, when combined with w:“majority” write operations provides the strongest consistency guarantees. <a href="https://muratbuffalo.blogspot.com/2021/10/linearizability.html">Reads with readConcern level “linearizable” are guaranteed to return the effect of the most recent majority write that completed before the read operation began.</a></p><p>Additionally, MongoDB provides “available” and “snapshot” read concern levels, and the ability for causally consistent reads. The “snapshot” read concern only applies to multi-document transactions, and guarantees that clients see a consistent snapshot of data i.e. snapshot isolation. 
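</p><p>The writeConcern acknowledgment rule described above can be condensed into a small predicate. This is my own simplification with invented names; a real replica set tracks per-node replication progress, journaling, and the majority commit point rather than a bare count:</p>

```python
# Hypothetical sketch: when can a write be acknowledged under a given
# writeConcern? acked_by counts the nodes (including the primary) that
# have received and locally committed the write.

def can_acknowledge(write_concern, acked_by, n_nodes):
    if write_concern == "majority":
        return acked_by >= n_nodes // 2 + 1   # majority committed
    return acked_by >= write_concern          # numeric w:N

assert can_acknowledge(1, 1, 5)               # w:1 acks on the primary alone
assert not can_acknowledge("majority", 2, 5)  # 2 of 5 is not a majority
assert can_acknowledge("majority", 3, 5)
```

<p>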
Causal consistency provides the ability for clients to get session guarantees, including read-your-writes behavior in a given session.</p><h3 style="text-align: left;">Speculative execution model</h3><p>In Raft, log entries are not applied to the state machine until they are known to be committed, which means that they will never be erased from the log. In contrast, in order to support the whole consistency spectrum under one roof, MongoDB replicas apply log entries to the database as soon as they are received. This means that a server may apply an operation in its log even if the operation is uncommitted. This allows MongoDB to provide the “local” read concern level. As soon as a write operation is applied on some server, a “local” read is able to see the effects of that write on that server, even before the write is majority committed in the replica set. Recall that in MongoDB, the database itself is the state machine, and entries in the oplog correspond to operations on this state machine. Without the log being applied to the database, the local read would not be possible.</p><h3 style="text-align: left;">Data Rollback</h3><p>MongoDB’s speculative execution model makes it necessary for the replication system to have a procedure for data rollback in case these log entries may need to be erased from the log due to a leader takeover. In a protocol like Raft, this rollback procedure consists of truncating the appropriate entries from a log. In MongoDB, in addition to log truncation, it must undo the effects of the operations it deletes from a log. This requires modifying the state of the database itself, and presents several engineering challenges. The process is initiated by the rollback node when it detects that its log has diverged from the log of the sync source node, i.e. its log is no longer a prefix of that node’s log. The rollback node will then determine the newest log entry that it has in common with the sync source. 
The timestamp of this log entry is referred to as t_common. The node then needs to truncate all oplog entries with a timestamp after t_common, and modify its database state in such a way that it can become consistent again.</p><h3 style="text-align: left;">Recover to Timestamp (RTT) Algorithm</h3><p>Since MongoDB version 4.0, the WiredTiger storage engine has provided the ability to revert all replicated database data to some previous point in time. The replication system periodically informs the storage engine of a stable timestamp (t_stable), which is the latest timestamp in the oplog that is known to be majority committed and also represents a consistent database state. The algorithm works as follows. First, the rollback node asks the storage engine to revert the database state to the newest stable timestamp, t_stable. Note that t_stable may be a timestamp earlier than the rollback common point, t_common. Then, the node applies oplog entries forward from t_stable up to and including t_common. From t_common onwards normal oplog replication commences.</p><h1 style="text-align: left;">Consistency spectrum</h1><p>To understand the impact of readConcern on the rest of the system, it is necessary to discuss reads in the underlying WiredTiger storage engine. All reads in WiredTiger are done as transactions with snapshot isolation. While a transaction is open, all later updates must be kept in memory. Once there are no active readers earlier than a point in time t, the state of the data files at time t can be persisted to disk, and individual updates earlier than t can be forgotten. 
Thus a long-running WiredTiger transaction will cause memory pressure, so MongoDB reads must avoid performing long-running WiredTiger transactions in order to limit their impact on the performance of the system.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg7dJjfjtadiey7udO4rfrFzU7HMKHdV-iWkuPtwcjJMG7XL9ZEDepZmUgyHLL4ZRHrkXvGkw85NSPzWW_cF3Vk8EXiX-uyxUZXpnLdj-ZuBCJduZMUmOr79UL9-54wIPz4YgnUXoz-kZVD3cHfjHDPCyD8r9D8B11QYjiQ6OBe3MXAsdX6qjJRpRwmZOQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="836" data-original-width="1292" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEg7dJjfjtadiey7udO4rfrFzU7HMKHdV-iWkuPtwcjJMG7XL9ZEDepZmUgyHLL4ZRHrkXvGkw85NSPzWW_cF3Vk8EXiX-uyxUZXpnLdj-ZuBCJduZMUmOr79UL9-54wIPz4YgnUXoz-kZVD3cHfjHDPCyD8r9D8B11QYjiQ6OBe3MXAsdX6qjJRpRwmZOQ=w400-h259" width="400" /></a></div><h3 style="text-align: left;">Local Reads</h3><p>Reads with readConcern “local” read the latest data in WiredTiger. However, local reads in MongoDB can be arbitrarily long-running due to the reach of the query. In order to avoid keeping a single WiredTiger transaction open for too long, they perform “query yielding” (Algorithm 1): While a query is running, it will read in a WiredTiger transaction with snapshot isolation and hold database and collection locks, but at regular intervals, the read will “yield”, meaning it aborts its WiredTiger transaction and releases its locks. After yielding, it opens a new WiredTiger transaction from a later point in time and reacquires locks (the read will fail if the collection or index it was reading from was dropped). This process ensures that local reads do not perform long-running WiredTiger transactions, which avoids memory pressure. 
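</p><p>The query yielding loop can be sketched as follows. This is my own toy model, where each "snapshot" is just a list of rows standing in for a fresh WiredTiger transaction, and all names are invented:</p>

```python
# Hypothetical sketch of query yielding: a long scan is broken into
# batches, each read in its own short snapshot, so no single
# storage-engine transaction stays open for the whole query.

def yielding_scan(open_snapshot, batch_size=2):
    """open_snapshot() returns a fresh snapshot (list of rows) per call."""
    results, pos = [], 0
    while True:
        snap = open_snapshot()          # begin WT txn, reacquire locks
        batch = snap[pos:pos + batch_size]
        if not batch:
            break                       # end of data
        results.extend(batch)
        pos += len(batch)
        # "yield" here: txn aborted, locks released; the next snapshot
        # may include concurrent writes, so the cut can be inconsistent
    return results

# A concurrent write (row 5) lands after the first batch and is picked up:
snapshots = iter([[1, 2, 3, 4], [1, 2, 3, 4, 5],
                  [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
assert yielding_scan(lambda: next(snapshots)) == [1, 2, 3, 4, 5]
```

<p>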
The consequence is that local reads do not see a consistent cut of data, but this is acceptable for this isolation level.</p><h3 style="text-align: left;">Majority Reads</h3><p>Reads with readConcern level “majority” also perform query yielding, but they read from the majority commit point of the replica set. Each time a majority read yields, if the majority commit point has advanced, then the read will be able to resume from a later point in time. Again, majority reads may not read a consistent cut of data. A majority read could return 5 documents, yield and open a WiredTiger transaction at a later point in time, then return 5 more documents. It is possible that a MongoDB transaction that touched all 10 documents would only be reflected in the last 5 documents returned, if it committed while the read was running. (It is worth recalling <a href="https://muratbuffalo.blogspot.com/2022/02/ramp-tao-layering-atomic-transactions.html">the fractured read problem in Facebook TAO</a> at this point.) This inconsistent cut is acceptable for this isolation level. Since the read is performed at the majority commit point, we guarantee that all of the data returned is majority committed.</p><p>It is instructional to contrast MongoDB's majority readConcern with Cassandra's majority reads here. Cassandra’s QUORUM reads do not guarantee that clients only see majority committed data, differing from MongoDB’s readConcern level “majority”. Instead Cassandra’s QUORUM reads reach out to a majority of nodes with the row and return the most recent update, regardless of whether that write is durable to the set.</p><p>Another point to note here is that the combination of write-majority and read-majority does not give us linearizability. This is to be expected in any Raft/Paxos state machine replication. 
To get linearizability, an additional client-side protocol is needed, <a href="https://muratbuffalo.blogspot.com/2019/09/linearizable-quorum-reads-in-paxos.html">as we discussed in our Paxos Quorum Reads paper.</a></p><h3 style="text-align: left;">Snapshot Reads</h3><p>Reads with readConcern level “snapshot” must read a consistent cut of data. This is achieved by performing the read in a single WiredTiger transaction, instead of doing query yielding. In order to avoid long-running WiredTiger transactions, MongoDB kills snapshot read queries that have been running longer than 1 minute.</p><h1 style="text-align: left;">Experiments</h1><p>The paper performed three experiments on 3-node replica sets using different geographical distributions of replica set members. Each experiment performed 100 single-document updates, and all operations specified that journaling was required in order to satisfy the given writeConcern.</p><h3 style="text-align: left;">Local Latency Comparison</h3><p>In this experiment, all replica set members and the client were in the same AWS Availability Zone (roughly the same datacenter) and Placement Group (roughly the same rack). All replica set members were running MongoDB 4.0.2 with SSL disabled. 
The cluster was deployed using sys-perf, the internal MongoDB performance testing framework.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjfwWNahdW9R87-yvT_ayoEhaKBAs2XZvEe_GqIn7JXZDYkkgUgWxW5CavIf8Tc1pu-w-_i9ivc-pwrO8H2ZTA0WMJ9tInmUAktTpHmR9UchbJOT9OhTfRh8epL355hsbHWFx0bY34ucS9TpxrTfSTQ9Uxbc1pZ0_nPE9FieqDU7EeJaLbEg38vIdmSUdo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1196" data-original-width="1292" height="371" src="https://blogger.googleusercontent.com/img/a/AVvXsEjfwWNahdW9R87-yvT_ayoEhaKBAs2XZvEe_GqIn7JXZDYkkgUgWxW5CavIf8Tc1pu-w-_i9ivc-pwrO8H2ZTA0WMJ9tInmUAktTpHmR9UchbJOT9OhTfRh8epL355hsbHWFx0bY34ucS9TpxrTfSTQ9Uxbc1pZ0_nPE9FieqDU7EeJaLbEg38vIdmSUdo=w400-h371" width="400" /></a></div><h3 style="text-align: left;">Cross-AZ Latency Comparison</h3><p>In this experiment, all replica set members were in the same AWS Region (the same geographic area), but they were in different Availability Zones. Client 1 was in the same Availability Zone as the primary, and Client 2 was in the same Availability Zone as a secondary. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjQWXQWY-izpOiUY4pB7dkX9pmkAG1891koXu_MQAU4e-dqxEu_g_uXrRorX5d1eeV_ri11hX9BxuKgW49oXRGRr3H07Sr0jdUqaa_dKSIKQ4MNyZzu5fYK2VwZvybNO8KEtfiCxjNP52LkcdW9289QA4TnS1TF8IdW3CND-i3xspkUq3qYvtIS141mIbk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1268" data-original-width="1292" height="392" src="https://blogger.googleusercontent.com/img/a/AVvXsEjQWXQWY-izpOiUY4pB7dkX9pmkAG1891koXu_MQAU4e-dqxEu_g_uXrRorX5d1eeV_ri11hX9BxuKgW49oXRGRr3H07Sr0jdUqaa_dKSIKQ4MNyZzu5fYK2VwZvybNO8KEtfiCxjNP52LkcdW9289QA4TnS1TF8IdW3CND-i3xspkUq3qYvtIS141mIbk=w400-h392" width="400" /></a></div><br /><h3 style="text-align: left;">Cross-Region Latency Comparison</h3><p>In this experiment, all replica set members were in different AWS Regions. The primary was in US-EAST-1, one secondary was in EU-WEST-1, and the other secondary was in US-WEST-2. Client 1 was in US-EAST-1, and Client 2 was in EU-WEST-1.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisyYB2L9I5Xp-v4rRgsFUlNzqPaOioCg8N4y0GYHG8nOK0RTJmIRi6dMwln8-ErNBcKEaoM8P1phKvJzGjbCJGmHXufBQQM6RbqPGXsdPBJm3eROYYiUt-Dw8UvbDsamE5hha0ceBcrT0xp7hQkA4CiKzsPSiA8jEhp2iyFIF0ikl3YLu4oQeDEe_dInI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1268" data-original-width="1292" height="392" src="https://blogger.googleusercontent.com/img/a/AVvXsEisyYB2L9I5Xp-v4rRgsFUlNzqPaOioCg8N4y0GYHG8nOK0RTJmIRi6dMwln8-ErNBcKEaoM8P1phKvJzGjbCJGmHXufBQQM6RbqPGXsdPBJm3eROYYiUt-Dw8UvbDsamE5hha0ceBcrT0xp7hQkA4CiKzsPSiA8jEhp2iyFIF0ikl3YLu4oQeDEe_dInI=w400-h392" width="400" /></a></div><h1 style="text-align: left;">Cross-layer optimization opportunities</h1><p>The paper also discusses multi-document transactions. 
I find it very interesting to think about how MongoDB interacts with the underlying WiredTiger storage system. (I read more about this in the <a href="https://arxiv.org/abs/2111.14946">"Verifying Transactional Consistency of MongoDB" paper</a>.) This paper only scratches the surface of this, but having a powerful storage engine like WiredTiger opens the way to powerful cross-layer optimization opportunities.</p><h3 style="text-align: left;">Speculative majority and snapshot isolation for multi-statement transactions</h3><p>MongoDB uses an innovative strategy for implementing readConcern within transactions that greatly reduces aborts due to write conflicts in back-to-back transactions. When a user specifies readConcern level “majority” or “snapshot”, the returned data is guaranteed to be committed to a majority of replica set members. Outside of transactions, this is accomplished by reading at a timestamp at or earlier than the majority commit point in WiredTiger. However, this is problematic for transactions: It is useful for write operations to read the freshest version of a document, since the write will abort if there is a newer version of the document than the one it read. This motivated the implementation of “speculative” majority and snapshot isolation for transactions. Transactions read the latest data for both read and write operations, and at commit time, if the writeConcern is w:“majority”, they wait for all the data they read to become majority committed. This means that a transaction only satisfies its readConcern guarantees if the transaction commits with writeConcern w:“majority”. </p><p>Waiting for the data read to become majority committed at commit time rarely adds latency to the transaction, since if the transaction did any writes, then to satisfy the writeConcern guarantees, we must wait for those writes to be majority committed, which will imply that the data read was also majority committed. 
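A toy sketch of this speculative scheme may help. ReplicaSet and Txn are my names, and timestamps stand in for the data read; this is an illustration of the commit-time wait, not MongoDB's implementation:

```python
# Hypothetical model of speculative majority reads in transactions.
class ReplicaSet:
    def __init__(self):
        self.latest_ts = 0          # newest write timestamp on the primary
        self.majority_point = 0     # newest majority-committed timestamp
    def write(self, ts):
        self.latest_ts = max(self.latest_ts, ts)
    def advance_majority(self, ts):
        self.majority_point = max(self.majority_point, ts)

class Txn:
    def __init__(self, rs):
        self.rs, self.max_read_ts, self.did_write = rs, 0, False
    def read(self):
        # speculative: read at the latest timestamp, not the majority point
        self.max_read_ts = max(self.max_read_ts, self.rs.latest_ts)
    def commit_w_majority(self):
        if self.did_write:
            # waiting for our own writes to become majority committed also
            # covers everything we read; model that replication wait here
            self.rs.advance_majority(self.rs.latest_ts)
        # a read-only transaction waits explicitly for the data it read
        return self.rs.majority_point >= self.max_read_ts

rs = ReplicaSet()
rs.write(5)                   # a write at ts=5, not yet majority committed
t = Txn(rs)
t.read()                      # speculatively reads the ts=5 data
print(t.commit_w_majority())  # False: a read-only txn must keep waiting
rs.advance_majority(5)        # replication catches up
print(t.commit_w_majority())  # True: the readConcern guarantee now holds
```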
Only read-only transactions require an explicit wait at commit time for the data read to become majority committed. Even for read-only transactions, this wait often completes immediately because by the time the transaction commits, the timestamp at which the transaction read is often already majority committed.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjAyzOmcTgJosxbUGOI2Pu1GPzmtKNBFI1nrE3ZvW7_d-Opwyk7WQn4GVpfSrno7WpkGRfUpOD0uiLCQacL-J-4gGw5u1q6yRykYVCbUsvRA06GOj7xsnYj3-g0xkiDTFc2UGez5TEbDUgsi7Gn2YuTslUCCR-LGEM9r6fqIhEWWDCcetXxvzm7pqBCVTY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="886" data-original-width="1292" height="274" src="https://blogger.googleusercontent.com/img/a/AVvXsEjAyzOmcTgJosxbUGOI2Pu1GPzmtKNBFI1nrE3ZvW7_d-Opwyk7WQn4GVpfSrno7WpkGRfUpOD0uiLCQacL-J-4gGw5u1q6yRykYVCbUsvRA06GOj7xsnYj3-g0xkiDTFc2UGez5TEbDUgsi7Gn2YuTslUCCR-LGEM9r6fqIhEWWDCcetXxvzm7pqBCVTY=w400-h274" width="400" /></a></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-59380119956599770152024-02-03T22:47:00.010-05:002024-02-06T08:59:01.698-05:00Design and Analysis of a Logless Dynamic Reconfiguration Protocol<p><a href="https://arxiv.org/pdf/2102.11960.pdf">This paper appeared in OPODIS'21 and describes dynamic reconfiguration in MongoDB.</a></p><p>So, what is dynamic reconfiguration? The core Raft protocol implements state machine replication (SMR) using a static set of servers. (<a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">Please read this to learn about how MongoDB adopted Raft for a pull-based SMR</a>.) To ensure availability in the presence of faults, SMR systems must be able to dynamically (and safely) replace failed nodes with healthy ones. 
This is known as dynamic reconfiguration.</p><h2 style="text-align: left;">MongoDB logless reconfiguration</h2><p>Since its inception, the MongoDB replication system has provided a custom, ad hoc, legacy protocol for dynamic reconfiguration of replicas. This legacy protocol managed configurations in a logless fashion, i.e., each server only stored its latest configuration. It decoupled reconfiguration processing from the main database operation log. The legacy protocol, however, was known to be unsafe in certain cases. </p><p style="text-align: left;">Revising that legacy protocol, this paper presents a redesigned safe reconfiguration protocol, <b>MongoRaftReconfig</b>, with rigorous safety guarantees. A primary goal of the protocol was to keep design and implementation complexity low.</p><p>Why didn't MongoDB use Raft's reconfiguration protocol? </p><p>The Raft consensus protocol (2014) provided a dynamic reconfiguration algorithm (<a href="https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J">a critical safety bug was found later, showing that reconfiguration protocols are tricky</a>). Raft uses the main operation log (oplog) for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.</p><p>MongoRaftReconfig avoids this by separating the oplog and "config state machine" (CSM), allowing reconfigurations to bypass the oplog SMR. We will revisit this in the evaluation section. Decoupling the CSM from the main operation log SMR also allows for a logless optimization: it is sufficient to store only the latest version of the config state. This allows the CSM to avoid complexities related to garbage collection of old log entries and simplifies the mechanism for state propagation between servers.</p><p>Below, we present the MongoRaftReconfig protocol and discuss its correctness. 
The paper includes <a href="https://zenodo.org/records/5715511">TLA+ model checking</a> and a manual proof, which are used for verifying MongoRaftReconfig’s key safety properties.</p><p><br /></p><h1 style="text-align: left;">MongoRaftReconfig protocol</h1><p>Raft reconfiguration consists of two alternate algorithms: single server membership change and joint consensus. This paper focuses exclusively on the single server membership change protocol. The single server change approach aims to simplify reconfiguration by allowing only reconfigurations that add or remove a single server.</p><p>As in Raft-reconfig, by restricting to a single server change at a time, MongoRaftReconfig ensures that all quorums of two adjacent configurations (C to C') overlap with each other. MongoRaftReconfig also imposes additional restrictions to ensure</p><p></p><ul style="text-align: left;"><li>deactivation of old configurations (to prevent them from executing disruptive operations --e.g. electing a primary or committing a write), and</li><li>state transfer from the old configuration to the new configuration before the new one becomes active.</li></ul><p></p><p>Let's formalize this a bit. </p><p>A configuration is defined as a tuple (m,v,t), where m is a member set, v is a numeric configuration version, and t is the numeric term of the configuration. The v and t together tie the reconfiguration protocol, CSM, to the replicated state machine protocol for the oplog as we discuss below in the algorithm description. This is achieved by totally ordering configurations by their (version, term) pair, where term is compared first, followed by version.</p><p>Reconfigurations can only be executed on primary servers, and they update the primary's current local configuration C to the specified configuration C'. As in RaftReconfig, in MongoRaftReconfig any reconfiguration that moves from C to C' is required to satisfy the quorum overlap condition i.e. QuorumsOverlap(C.m,C'.m). 
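The configuration order and the quorum-overlap condition can be sketched as executable checks. QuorumsOverlap is the paper's name; the rest of the names below are mine, and the brute-force quorum enumeration is purely illustrative:

```python
# Configurations are (member_set, version, term), totally ordered by
# (term, version) -- term compared first, then version.
from itertools import combinations

def config_lt(c1, c2):
    (_, v1, t1), (_, v2, t2) = c1, c2
    return (t1, v1) < (t2, v2)

def quorums(members):
    """All majority subsets of a member set (brute force, for illustration)."""
    n = len(members)
    return [set(q) for q in combinations(sorted(members), n // 2 + 1)]

def quorums_overlap(m1, m2):
    """QuorumsOverlap: every majority of m1 intersects every majority of m2."""
    return all(q1 & q2 for q1 in quorums(m1) for q2 in quorums(m2))

old = ({"a", "b", "c"}, 3, 1)
new = ({"a", "b", "c", "d"}, 4, 1)           # a single-server addition
assert config_lt(old, new)                   # same term, higher version
assert quorums_overlap(old[0], new[0])       # majorities of C and C' intersect
# removing several servers at once can break the overlap guarantee:
assert not quorums_overlap({"a", "b", "c", "d", "e"}, {"a", "b"})
```

Single-server changes keep adjacent configurations' quorums overlapping by construction; the Q1/Q2/P1 preconditions that follow do the rest of the safety work.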
The following conditions must also be satisfied before a primary server in term T can execute a reconfiguration out of its current configuration C.</p><p></p><ul style="text-align: left;"><li>Q1. Config Quorum Check: There must be a quorum of servers in C.m that are currently in configuration C.</li><li>Q2. Term Quorum Check: There must be a quorum of servers in C.m that are currently in term T.</li><li>P1. Oplog Commitment: All oplog entries committed in terms <= T must be committed on some quorum of servers in C.m.</li></ul><p></p><p>Q1, when coupled with the election restrictions (as we discuss below in the elections section), achieves deactivation by ensuring that configurations earlier than C can no longer elect a primary.</p><p>Q2 ensures that term information from older configurations is correctly propagated to newer configurations, while P1 ensures that previously committed oplog entries are properly transferred to the current configuration, ensuring that any primary in a current or later configuration will contain these entries.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjfsRLxxceahzCzHCHm4KZMoarhOoPGe4t_KWC6aBr4mddOpyyCOWTeFo_qgg5hwe-BrMVXeGDXEBEcLwMwbnp8VRvPX6C69c3J2nmcqJqcy2oQOk1w8ilcTK31ycPCzuM3IRYhgPOm0gzq5uVHxfxbgv1ur3ozbue-WAp6dI7C3qWPJ5OFPXPu3RKmXJU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="938" data-original-width="1292" src="https://blogger.googleusercontent.com/img/a/AVvXsEjfsRLxxceahzCzHCHm4KZMoarhOoPGe4t_KWC6aBr4mddOpyyCOWTeFo_qgg5hwe-BrMVXeGDXEBEcLwMwbnp8VRvPX6C69c3J2nmcqJqcy2oQOk1w8ilcTK31ycPCzuM3IRYhgPOm0gzq5uVHxfxbgv1ur3ozbue-WAp6dI7C3qWPJ5OFPXPu3RKmXJU=s16000" /></a></div><p>A big insight in the algorithm was to realize that the CSM and oplog protocols need to be combined/pinned together for safety, and this is done by using (v,t) as a pair, and requiring that the CSM commits the config on the most recent term, and the RSM 
follows the committed configs in sequential order. </p><p>After a reconfiguration has occurred on a primary, the updated configuration needs to be communicated to secondaries. In MongoRaftReconfig, config state propagation is implemented by the SendConfig action, which transfers configuration state from one server to another. Secondaries receive information about the configurations of other servers via periodic heartbeats. They determine whether one configuration is newer than another using the total lexicographical ordering on the (version, term) pair. A secondary can update its configuration to any that is newer than its current configuration.</p><p>When a node runs for election in MongoStaticRaft, it must ensure its log is appropriately up to date and that it can garner a quorum of votes in its term. In MongoRaftReconfig, there is an additional restriction on voting behavior that depends on configuration ordering. If a replica set server is a candidate for election in configuration Ci, then a prospective voter in configuration Cj may only cast a vote for the candidate if Cj is less than or equal to Ci .</p><p>Furthermore, when a node wins an election, it must update its current configuration with its new term before it is allowed to execute subsequent reconfigurations. That is, if a node with current configuration (m, v, t) wins election in term t', it will update its configuration to (m, v, t') before allowing any reconfigurations to be processed. This behavior is necessary to deactivate concurrent reconfigurations that may occur on primaries in a different term.</p><p><br /></p><h1 style="text-align: left;">Correctness</h1><p>LeaderCompleteness property states that if a log entry has been committed in term T, then it must be present in the logs of all primary servers in terms > T.</p><p>ElectionSafety is a key, auxiliary lemma that is required in order to show LeaderCompleteness. 
ElectionSafety states: For all s, t ∈ Server such that s ≠ t, it is not the case that both s and t are primary and have the same term.</p><p>In order not to violate the property that all quorums of any two configurations overlap (which MongoStaticRaft relies on for safety), MongoRaftReconfig must appropriately deactivate past configurations before creating new configurations. Deactivated configurations cannot elect a new leader or execute a reconfiguration. Otherwise, the old primary (which still thinks it is primary) can institute a reconfiguration, and pull the rug (one crucial node needed for majority intersection) from under the new primary, violating the quorum-overlap property.</p><p>In addition to deactivation of configurations, MongoRaftReconfig must also ensure that term information from one configuration is properly transferred to subsequent configurations, so that later configurations know about elections that occurred in earlier configurations. For example, if an election occurred in term T in configuration C, even if C is deactivated by the time C' is created, the protocol must also ensure that C' is aware of the fact that an election in T occurred in C.</p><p>Moreover, MongoRaftReconfig also ensures that newer configurations appropriately disable commitment of log entries in older terms. The CSM only moves ahead through committed configs sequentially: the CSM can choose the next config and commit it only if its current one is committed. The primary must write the current config again with its latest term and wait for it to be propagated to a majority.</p><p><a href="https://zenodo.org/records/5715511">The paper is accompanied by TLA+ models,</a> which seem really nice. I will start playing with them. I think the people who worked on the TLA+ models had a deep understanding of the protocol. 
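In the spirit of those TLA+ invariants, the two safety properties can be stated as executable predicates over a toy global state (all names below are mine, and the state model is deliberately minimal):

```python
# Toy state: server name -> {"role", "term", "log"}; not the paper's TLA+ spec.
def election_safety(states):
    """No two distinct servers are primary in the same term."""
    terms = [st["term"] for st in states.values() if st["role"] == "primary"]
    return len(terms) == len(set(terms))

def leader_completeness(states, committed):
    """Every entry committed in term T is in the log of every primary of a
    term > T. `committed` is a list of (entry, term-it-committed-in) pairs."""
    return all(entry in st["log"]
               for entry, t in committed
               for st in states.values()
               if st["role"] == "primary" and st["term"] > t)

states = {
    "A": {"role": "primary",   "term": 2, "log": ["x", "y"]},
    "B": {"role": "secondary", "term": 3, "log": ["x"]},
    "C": {"role": "primary",   "term": 3, "log": ["x", "y"]},
}
assert election_safety(states)                   # terms 2 and 3 differ: OK
assert leader_completeness(states, [("x", 1), ("y", 2)])
states["B"]["role"] = "primary"                  # a second primary in term 3
assert not election_safety(states)
```

A model checker explores reachable states and evaluates exactly such predicates in each one; these two are the invariants the paper's TLA+ models check for MongoRaftReconfig.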
This reminds me of this quote from <a href="https://www.youtube.com/watch?v=zoE3DqglcgM&t=3055s">Byron Cook's recent talk (recommended watch).</a></p><blockquote style="border: medium; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">You (the formal methods person) become the only one who actually understands the system right. They don't understand it... There's fantasy, the documentation, the code, and the individuals. And no one agrees on what's going on for any complex system.</p></blockquote><p><br /></p><h1 style="text-align: left;">Evaluation</h1><p>As we mentioned in the introduction, in standard Raft, the main operation log is used for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.</p><p>Raft's behavior here is stronger than necessary for safety: it is not strictly necessary to commit the log entries that precede a reconfiguration entry Cj before executing the reconfiguration. The only fundamental requirements are that previously committed log entries are committed by the rules of the current configuration, and that the current configuration has satisfied the necessary safety preconditions. Raft achieves this goal implicitly, but more conservatively than necessary, by committing the entry Cj and all entries before it in the log. This ensures that all previously committed log entries, in addition to the uncommitted operations U, are now committed in Cj, but it is not strictly necessary to pipeline a reconfiguration behind commitment of U.</p><p>MongoRaftReconfig avoids this by separating the oplog and config state machine and their rules for commitment and reconfiguration, allowing reconfigurations to bypass the oplog if necessary. Note that Oplog Commitment (P1) is easier to satisfy (if not already satisfied) than Raft's insistence on committing all the entries that happened to fall before Cj in the oplog. </p><p>P1. 
Oplog Commitment: All oplog entries committed in terms <= T must be committed on some quorum of servers in C.m.</p><p><br /></p><p>The evaluation section simulates a degraded disk scenario to highlight the benefit of the decoupled CSM execution. It argues that decoupling CSM execution allows MongoRaftReconfig to successfully reconfigure the system in such a degraded state, restoring oplog write availability by removing the failed nodes and adding new, healthy nodes.</p><p>The paper examines the degraded disk scenario to mention a caveat, and to argue that even under that caveat MongoRaftReconfig provides an advantage. "Note that if a replica set server experiences a period of degradation (e.g. a slow disk), both the oplog and reconfiguration channels will be affected, which would seem to nullify the benefits of decoupling the reconfiguration and oplog replication channels. In practice, however, the operations handled by the oplog are likely orders of magnitude more resource intensive than reconfigurations, which typically involve writing a negligible amount of data. 
So, even on a degraded server, reconfigurations should be able to complete successfully when more intensive oplog operations become prohibitively slow, since the resource requirements of reconfigurations are extremely lightweight."</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjjqbyJxAfjGmtYRVbgx1BOhpxzOSbLXd6Y6jTUBPjNczeKgu7piF4C9pMzeAaK_58ABCmP4b8EHdEI0xbkpXNeyRwtEle2_FFAw87EwY2blCFtDSb3BOJyW6A50cK4tEKlFmI9_NuyuN_9MXyuamvB1Lm2tB7M98ranWMEjrFGfGmoR0gzSARCD3u4_nk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="846" data-original-width="1292" src="https://blogger.googleusercontent.com/img/a/AVvXsEjjqbyJxAfjGmtYRVbgx1BOhpxzOSbLXd6Y6jTUBPjNczeKgu7piF4C9pMzeAaK_58ABCmP4b8EHdEI0xbkpXNeyRwtEle2_FFAw87EwY2blCFtDSb3BOJyW6A50cK4tEKlFmI9_NuyuN_9MXyuamvB1Lm2tB7M98ranWMEjrFGfGmoR0gzSARCD3u4_nk=s16000" /></a></div><br /><div><h1 style="text-align: left;">Related work</h1><div>When presenting Paxos, for reconfiguration, Lamport proposed limiting the length of the command pipeline window to α > 0 and only activating the new config chosen at slot i after slot i + α. Depending on the value of α, this approach limits either the throughput or the latency of the system. </div><div><br /></div><div>In contrast, in MongoDB, the wait on command commit is only done on-demand when reconfiguration is happening.</div><div><br /></div><div><br /></div><div>I had written about reconfiguration in SMR a couple of times before. 
<a href="https://muratbuffalo.blogspot.com/2022/12/vertical-paxos-and-primary-backup.html">Vertical Paxos delves into the leader handover and reconfiguration, and shows a practical application of these in the context of primary-backup replication protocols.</a></div><div><br /></div><div><a href="https://muratbuffalo.blogspot.com/2020/05/matchmaker-paxos-reconfigurable.html">Matchmaker Paxos</a> is a realization/implementation of Vertical Paxos with a deployment more tightly integrated with the Paxos protocol. Vertical Paxos requires an external master, which is itself implemented using state machine replication. The matchmakers in Matchmaker Paxos are analogous to that external master/Paxos-box and show that such a reconfiguration does not require a nested invocation of state machine replication. Matchmaker Paxos uses and generalizes the approach in Vertical Paxos for reconfiguration and is OK with lazy/concurrent state transfer.</div><div><br /></div><div>I had also <a href="https://muratbuffalo.blogspot.com/2023/01/reconfiguring-replicated-atomic-storage.html">written about reconfiguration for atomic storage,</a> but that is an easier problem than reconfiguration on a state machine replication system.</div></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-18146837079147118122024-01-30T14:53:00.009-05:002024-01-30T14:55:28.602-05:00 Fault-Tolerant Replication with Pull-Based Consensus in MongoDB<p><a href="https://www.usenix.org/system/files/nsdi21-zhou.pdf">This paper, from NSDI 2021,</a> presents the design and implementation of strongly consistent replication in MongoDB using a consensus protocol derived from Raft.</p><p><a href="https://raft.github.io/">Raft</a> provides fault-tolerant state-machine-replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication. 
But MongoDB uses a pull-based replication scheme, so integrating MongoDB's SMR with Raft posed challenges. The paper focuses on examining and solving these challenges, and explaining the resulting MongoSMR protocol (my term, not the paper's). </p><p>The paper restricts itself to the strongest consistency level, linearizability, but it also talks about how serving weaker consistency models shapes the decisions made in MongoDB's replication protocol. The paper talks about extensions/optimizations of the MongoDB SMR protocol, but I skip those for brevity. I also skip the evaluation section, and just focus on the core of the SMR protocol.</p><h1 style="text-align: left;">Design</h1><h2 style="text-align: left;">Background</h2><p>Unlike conventional primary-backup replication schemes where updates are usually pushed from the primary to the secondaries, in MongoDB a secondary pulls updates from other servers, and not necessarily from the primary.</p><p>The pull-based approach provides more control of how data is transmitted over the network. Depending on users' needs, the data transmission can be in a star topology, a chaining topology, or a hybrid one. This has big performance and monetary cost implications. For example, when deployed in clouds like Amazon EC2, data transmission inside a datacenter is free and fast, but is expensive and subject to limited bandwidth across datacenters. Using a linked topology, rather than a star topology, a secondary can sync from another secondary in the same datacenter, rather than use up another costly data-transmission link to the primary in the other datacenter.</p><p>In earlier releases, MongoDB assumed a semi-synchronous network: either there is manual control of failover, or all messages are bound to arrive within 30 seconds for failure detection. Starting from 2015, the MongoDB replication scheme was remodeled based on the Raft protocol. 
This new protocol (MongoSMR, which is the topic of this paper) guarantees safety in an asynchronous network (i.e., messages can be arbitrarily delayed or lost) and supports fully autonomous failure recovery with a smaller failover time. Same as before, MongoSMR is still pull-based. </p><h2 style="text-align: left;">Oplog</h2><p>An oplog is a sequence of log entries that feeds the SMR. Each log entry contains a database operation. Figure 1 shows an example of oplog entry. Notice that each entry is a JSON document. The oplog is stored in the oplog collection, which behaves in almost all regards as an ordinary collection of documents. The oplog collection automatically deletes its oldest documents when they are no longer needed and appends new entries at the other end.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgUtFCtgum2YaWITRWKUCuQtKoQY2gTjuIBHKoVw0V9pcM8DlzeRER8ZWZMyQ7RYrCxAvj-GBsvjaQ_aONSVTjBlMCMjBZMxa7Alzj_wuq6an954mjQNNq9KEue37rL6Tju9VdAj-XeN_rNruE8gWKAtbud58iqVm4NqOMhrqwm6Oawqzl-il8D5LS2Ky4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="882" data-original-width="1068" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgUtFCtgum2YaWITRWKUCuQtKoQY2gTjuIBHKoVw0V9pcM8DlzeRER8ZWZMyQ7RYrCxAvj-GBsvjaQ_aONSVTjBlMCMjBZMxa7Alzj_wuq6an954mjQNNq9KEue37rL6Tju9VdAj-XeN_rNruE8gWKAtbud58iqVm4NqOMhrqwm6Oawqzl-il8D5LS2Ky4" width="291" /></a></div><p>Each slot of the SMR (i.e., each oplog entry) is timestamp based, not sequence number based. Each oplog entry is assigned a timestamp and annotated with the term of the primary. The timestamp is a monotonically increasing logical clock that exists in the system before this work. 
A pair of term and timestamp, referred to as an OpTime, can identify an oplog entry uniquely in a replica set and give a total order of all oplog entries among all replicas.</p><h2 style="text-align: left;">Data replication</h2><p>In Raft the primary initiates AppendEntries RPCs to secondaries to replicate new log entries. In MongoSMR, the primary waits for the secondaries to pull the new entries that are to be replicated.</p><p>The principle is to decouple data synchronization via AppendEntries in Raft into two parts: replicas pulling new data from the peers, and replicas reporting their latest replication status so that a request can commit after it reaches a majority of replicas.</p><p>The primary processes two types of RPCs from secondaries: PullEntries and UpdatePosition. A secondary will use PullEntries to fetch new logs, and use UpdatePosition to report its status so that the primary can determine which oplog entries have been safely replicated to a majority of servers and commit them. Similar to Raft, once an entry is committed, all prior entries are committed indirectly.</p><h3 style="text-align: left;">PullEntries</h3><p>A secondary continuously sends PullEntries to the selected sync source (which may not be the primary) to retrieve new log entries. The PullEntries RPC includes the latest oplog timestamp (prevLogTimestamp) of the syncing server as an argument.</p><p>When receiving PullEntries, a server will reply with its oplog entries after and including that timestamp if it has a longer or the same log, or the server could reply with an empty array if its log sequence is shorter. 
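A toy sketch of the OpTime order and a sync source's PullEntries response follows. I assume term is compared first (as in Raft's log-comparison rule), entries are keyed by timestamp, and all names are mine; the real PullEntries is an RPC over the oplog, not a list filter:

```python
# Toy model: an oplog entry is (timestamp, term, op); an OpTime is (ts, term).
def optime_lt(a, b):
    """Total order on OpTimes: term compared first, then timestamp."""
    (ts1, t1), (ts2, t2) = a, b
    return (t1, ts1) < (t2, ts2)

def pull_entries(source_log, prev_log_timestamp):
    """Return the source's entries at and after the syncing server's latest
    timestamp, or an empty list if the source's log is shorter."""
    if not source_log or source_log[-1][0] < prev_log_timestamp:
        return []                      # source has less data than the syncer
    return [e for e in source_log if e[0] >= prev_log_timestamp]

log = [(10, 1, "w1"), (11, 1, "w2"), (15, 2, "w3")]
assert optime_lt((11, 1), (15, 2))     # the term-2 entry orders last
assert pull_entries(log, 11) == [(11, 1, "w2"), (15, 2, "w3")]
assert pull_entries(log, 20) == []     # the syncer is ahead of this source
```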
When the logs are the same, PullEntries waits for new data for a given timeout (5 seconds by default) before returning a response, to avoid busy looping.</p><h3 style="text-align: left;">UpdatePosition</h3><p>After retrieving new entries into its local oplog with PullEntries, the secondary sends UpdatePosition to its sync source to report on its latest log entry's OpTime.</p><p>When receiving the UpdatePosition, the server will forward the message to its sync source, and so forth, until the UpdatePosition reaches the primary.</p><p>The primary maintains a non-persistent map in memory that records the latest known log entry's OpTime on every replica, including its own, as their log positions. When receiving a new UpdatePosition, if the received one is newer, the primary replaces its local record with the received OpTime. Then, the primary will do a count on the log positions of all replicas: If a majority of replicas have the same term and the same or greater timestamp, the primary will update its lastCommitted to that OpTime and notify secondaries of the new lastCommitted by piggybacking onto other messages, such as heartbeats and the responses to PullEntries. lastCommitted is also referred to as the commit point.</p><h2 style="text-align: left;">Oplog replication</h2><p>Recall that each oplog entry is a document, and oplog is a collection. MongoSMR leverages this to implement oplog replication as a streaming query. Instead of initiating continuous RPCs on the syncing node, the PullEntries RPC is implemented as a query on the oplog collection with a "greater than or equal to" filter on the timestamp field. The query can be optimized easily since the oplog is naturally ordered by timestamp. 
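The primary's commit-point calculation from those position reports can be sketched as follows. This is a simplification that tracks only timestamps for lastCommitted and uses my own names:

```python
# Toy model of advancing the commit point from UpdatePosition reports.
def advance_commit_point(positions, primary_term, last_committed):
    """positions: replica -> (timestamp, term) of its newest oplog entry.
    Commit the newest timestamp that a majority of replicas have reached
    in the primary's term."""
    majority = len(positions) // 2 + 1
    candidates = sorted(ts for ts, term in positions.values()
                        if term == primary_term)
    for ts in reversed(candidates):        # try the newest timestamp first
        count = sum(1 for p_ts, p_term in positions.values()
                    if p_term == primary_term and p_ts >= ts)
        if count >= majority:
            return max(last_committed, ts)
    return last_committed                  # nothing new is majority replicated

positions = {"A": (15, 2), "B": (15, 2), "C": (11, 1)}
assert advance_commit_point(positions, 2, 10) == 15  # A and B are a majority
positions = {"A": (15, 2), "B": (11, 1), "C": (11, 1)}
assert advance_commit_point(positions, 2, 10) == 10  # ts=15 not on a majority
```

Note the term filter: only positions in the primary's own term count toward commitment, which foreshadows the stale-primary scenario discussed in the correctness section below.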
Using database cursors allows the syncing node to fetch oplog entries in batches and also allows the RPC to work in a streaming manner, so that a sync source can send new data without waiting for a new request, reducing the latency of replication.</p><h2 style="text-align: left;">Sync source selection</h2><p>MongoSMR introduced the Heartbeats RPC to decouple the heartbeat responsibility from Raft's AppendEntries RPC. Heartbeats are sent among *all* replicas, and are used for liveness monitoring, commit point propagation and sync source selection.</p><p>A server chooses its sync source only if the sync source has newer oplog entries than itself by comparing their log positions (learned via Heartbeat RPC). This total order on log positions guarantees that the replicas can never form a cycle of sync sources.</p><p><br /></p><h1 style="text-align: left;">Correctness </h1><h2 style="text-align: left;">A crucial difference between MongoSMR and Raft</h2><p>In Raft, if a server has voted for a higher term in an election, the server cannot take new log entries sent from an old primary with a lower term. In contrast, in MongoSMR, even if the sync source is a stale primary with a lower term number, the server would still fetch new log entries generated by the stale primary. This is because the PullEntries RPC does not check the term of the sync source (it only checks OpTimes).</p><p>Before we explore the correctness implications of this, let's talk about why MongoDB does not check the term of the sync source, and eagerly replicates entries. This has to do with achieving faster failovers and preserving uncommitted oplog entries.</p><p>In addition to strong consistency considered in this paper, MongoDB supports fast but weak consistency levels that acknowledge writes before they are replicated to a majority. Thus, a failover could cause a large loss of uncommitted writes. 
Though the clients are not promised durability with weak consistency levels, MongoDB still prefers to preserve these uncommitted writes as much as possible.</p><p>For this purpose, it introduced an extra phase for a newly elected primary: the primary catchup phase. The new primary will not accept new writes immediately after winning an election. Instead, it will keep retrieving oplog entries from its sync source until it does not see any newer entries, or a timeout occurs. This timeout is configurable in case users prefer faster failovers to preserving uncommitted oplog entries. This primary catchup design is only possible because in MongoSMR, a server (including the new primary) is allowed to keep syncing oplog entries generated by the old primary after voting for a higher term as long as it hasn't written any entry with its new term. This important difference between MongoDB and Raft allows MongoDB to preserve uncommitted data as much as possible during failovers.</p><h2 style="text-align: left;">Correctness</h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjdxDSR4dEDiXMAaJww3os4KG6UIQXbO1zREG9NsYPJZtgGCaUyhJLEA4opcxy592KQGMcyFPW0dFErZhbIaztdC_2VwqtMDFdNfzjlSr9gk0nogdfJxdVhyj-lMk8GRxV9Wx_Z_KDh7OpNcZtdneU5tqmtdGiBIeyHlA7ImdbZxcEXeoOeBUHyI5qjkMU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="918" data-original-width="1068" height="344" src="https://blogger.googleusercontent.com/img/a/AVvXsEjdxDSR4dEDiXMAaJww3os4KG6UIQXbO1zREG9NsYPJZtgGCaUyhJLEA4opcxy592KQGMcyFPW0dFErZhbIaztdC_2VwqtMDFdNfzjlSr9gk0nogdfJxdVhyj-lMk8GRxV9Wx_Z_KDh7OpNcZtdneU5tqmtdGiBIeyHlA7ImdbZxcEXeoOeBUHyI5qjkMU=w400-h344" width="400" /></a></div><p>Let's explore the correctness implications of this difference. Consider Fig 2.c. There it seems like value 2 (in blue) is anchored and decided, but it is not! 
It is just replicated widely, but some of those replicas have since voted for a newer term; the replication was done only to be able to recover more entries when using weaker consistency levels.</p><p>If we take Raft's rule that "a log entry is committed once the leader that created the entry has replicated it on a majority of the servers" without any qualifiers, indeed value 2 would be counted as committed, only later to be overturned. To prevent cases like this from happening, MongoSMR adds a new argument to the UpdatePosition RPC: the term of the syncing server. The recipient of UpdatePosition will update its local term if the received term is higher. If the recipient is a stale primary, seeing a higher term will make the primary step down before committing anything, thus avoiding any safety issue.</p><p>Therefore, in the above example, when server A receives UpdatePosition from servers C/D, it will see term 3 and step down immediately without updating its lastCommitted. Even though the entry with term 2 is in a majority of servers’ logs, it is not committed.</p><p>This revised UpdatePosition manages to maintain a key invariant of Raft --the Leader Completeness Property. This property states that "if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms."</p><p>To verify that the MongoSMR design and implementation are correct, the team has done extensive verification and testing on the protocol, including model checking using TLA+, unit testing, integration testing, fuzz testing, and fault-injection testing. <a href="https://github.com/mongodb/mongo/blob/master/src/mongo/tla_plus/RaftMongo/RaftMongo.tla">The TLA+ specification of the protocol is available here.</a> </p><p><br /></p><h1 style="text-align: left;">Discussion </h1><h2 style="text-align: left;">Chain replication</h2><p>Reading MongoSMR may lead people to think that the lines between Paxos/Raft SMR and chain replication are somewhat blurred.
If MongoRep uses a chained topology, what would be the differences from chain replication?</p><p>Well, there are big differences. In chain replication, you fall back to the next node in the chain as the new primary. This is a restrictive (inflexible in terms of options) albeit efficient way to have log monotonicity/retainment. In MongoSMR, any node can become the new primary. The log monotonicity/retainment comes from Raft leader election rather than from the topology.</p><p>A bigger difference is of course in the philosophy of the two approaches. Chain replication requires a separate consensus box to maintain the topology of the chain. Having an external consensus box hosted outside the replicaset causes logistics and fatesharing issues about whether what the consensus box agrees on has good fidelity to the field/replicaset. (Well, there are versions of chain replication which put consensus in the chain, and yeah, that blurs the lines a bit.) In Paxos/Raft SMR, the consensus is part of the SMR. So it comes with batteries included for fault-tolerant state machine replication. </p><h2 style="text-align: left;">PigPaxos</h2><p>We talked about the advantages of pull-based replication over push-based replication, and mentioned that it allows more flexible topologies than just the star topology where the primary is in the middle. It is in fact possible to be flexible with push-based replication and solve throughput/performance problems stemming from using the star topology. <a href="https://muratbuffalo.blogspot.com/2020/03/pigpaxos-devouring-communication_18.html">In our 2020 work, PigPaxos, we showed how that is possible using relay nodes.</a> At that time, we did not know of MongoDB's chaining topology/approach, and hence hadn't mentioned it.
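To tie the correctness discussion together, here is a minimal Python sketch of the revised UpdatePosition handling at the primary. This is my own toy model (the Primary class and its method names are invented for illustration, not MongoDB's actual code): the primary tracks replica positions in memory, steps down the moment it sees a higher term, and advances lastCommitted only when a majority of replicas match its own term at the same or greater timestamp.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class OpTime:
    term: int
    timestamp: int

class Primary:
    def __init__(self, name, term, members):
        self.name = name
        self.term = term
        self.stepped_down = False
        self.last_committed = OpTime(0, 0)
        # Non-persistent, in-memory map: replica -> latest known OpTime.
        self.positions = {m: OpTime(0, 0) for m in members}

    def on_update_position(self, replica, optime, sender_term):
        # The revised RPC carries the syncing server's term: a stale
        # primary that sees a higher term steps down before it can
        # commit anything, preserving Leader Completeness.
        if sender_term > self.term:
            self.stepped_down = True
            return
        if optime > self.positions[replica]:
            self.positions[replica] = optime
        self._advance_commit_point()

    def _advance_commit_point(self):
        mine = self.positions[self.name]
        # Count replicas at the primary's own term with >= timestamp.
        acks = sum(1 for o in self.positions.values()
                   if o.term == mine.term and o.timestamp >= mine.timestamp)
        if acks > len(self.positions) // 2:
            self.last_committed = mine

# Replaying the Fig 2.c scenario: A is a stale primary in term 2.
p = Primary("A", term=2, members=["A", "B", "C", "D", "E"])
p.positions["A"] = OpTime(2, 5)
p.on_update_position("B", OpTime(2, 5), sender_term=2)  # only 2 of 5 acks
p.on_update_position("C", OpTime(2, 5), sender_term=3)  # higher term seen
# A steps down without ever advancing last_committed.
```

In this replay, the term-2 entry can still end up in a majority of logs via PullEntries, yet it is never committed, because commitment is gated on the primary's term rather than on replication counts alone.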
</p><p><br /></p><h2 style="text-align: left;">Links</h2><p><a href="https://www.youtube.com/watch?v=04ZI8HpFnCA&ab_channel=USENIX">Here is the NSDI'21 presentation of the paper.</a></p><p>Aleksey has <a href="https://charap.co/reading-group-fault-tolerant-replication-with-pull-based-consensus-in-mongodb/">a review of the paper</a> accompanied by a <a href="https://youtu.be/nY_As3VooB8">presentation video</a>.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-28667058382140507442024-01-19T10:49:00.005-05:002024-01-19T13:20:07.532-05:00 Looking Back at Postgres <p><a href="https://arxiv.org/pdf/1901.01973.pdf">This is a 2019 article</a> by Joe Hellerstein, the Jim Gray Professor of Computer Science at UC Berkeley. Last year at Sigmod, <a href="https://muratbuffalo.blogspot.com/2023/07/sigmod23-industry-talks-and-ted-codd.html">Joe was awarded the Ted Codd innovation award where he gave an awesome overview of his research agenda.</a></p><p>This article, written to be included in Stonebraker’s Turing Award book, provides a retrospective on the Postgres project, which Stonebraker led from the mid-1980’s to the mid-1990’s. I love this article a lot, because as I have written before, <a href="https://muratbuffalo.blogspot.com/2021/12/learning-technical-subject.html">context is my crack</a>: "The more context I know, the better I become able to locate something almost spatially, and the more I can make sense of it. Even reading the history and motivation for the subject can give my understanding a big boost." The entire paper is context, and it even has a section titled "Context", how cool is that? The footnotes in the article are also excellent! Very interesting gems there as well.</p><p>Disclaimer: I use a lot of text from the article to summarize it. The features and the impact sections in this write-up are just text lifted from the article. (Don't taze me bro!)
</p><p><br /></p><h1 style="text-align: left;">Postgres origin story</h1><p>Riding on the success of the Ingres project at Berkeley, and the subsequent start-up Relational Technology, Inc. (RTI), Stonebraker began working on database support for data types beyond the traditional rows and columns of Codd's relational model in the early 1980s. A motivating example was to provide database support for Computer-Aided Design (CAD) tools for the microelectronics industry, including "new data types such as polygons, rectangles, text strings, etc.," "efficient spatial searching," "complex integrity constraints," and "design hierarchies and multiple representations" of the same physical constructions.</p><p>What the hey? I didn't expect this to be the origin story of Postgres!! This is almost exactly the motivation for the MongoDB document database in 2007, a good 25 years later. </p><p>Postgres was "Post-Ingres": a system designed to take what Ingres could do, and go beyond. The signature theme of Postgres was the introduction of what Stonebraker eventually called Object-Relational database features: support for object-oriented programming ideas within the data model and declarative query language of a database system. But Stonebraker also decided to pursue a number of other technical challenges in Postgres that were independent of object-oriented support, including active database rules, versioned data, tertiary storage, and parallelism.</p><p>So Postgres was Stonebraker's grand effort to build a one-size-fits-all database system. This is ironic, because later he (when he joined MIT) published the <a href="https://cs.brown.edu/~ugur/fits_all.pdf">"One size does not fit all" paper</a>. Joe also picks up on this, and he sides with the Berkeley/Stonebraker approach "that a broad majority of database problems can be solved well with a good general-purpose architecture."
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiQyKNDqWwDr5pttQ5syr2-Eamy21Ojf77xs79FR5p4DVvKEIaHtOR-wH6Q6KLeANhHJwdMMtnbtnAJAjfc0b5wJNIw12Qpws0vMUfJBz6smroWCrX2xjMsp8ovEptNtmnzKXXI7wgm3nz-nRBaAl6i7NJMqF4OyMDmtpFD6PRrNlB6tusNxrc0Ll43dyc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1502" data-original-width="1452" src="https://blogger.googleusercontent.com/img/a/AVvXsEiQyKNDqWwDr5pttQ5syr2-Eamy21Ojf77xs79FR5p4DVvKEIaHtOR-wH6Q6KLeANhHJwdMMtnbtnAJAjfc0b5wJNIw12Qpws0vMUfJBz6smroWCrX2xjMsp8ovEptNtmnzKXXI7wgm3nz-nRBaAl6i7NJMqF4OyMDmtpFD6PRrNlB6tusNxrc0Ll43dyc=s16000" /></a></div><p>The article reviews these features. What intrigued me was that only a couple of these were hits, and most were misses. Joe says: "Many of these topics were addressed in Postgres well before they were studied or reinvented by others; in many cases Postgres was too far ahead of its time and the ideas caught fire later, with a contemporary twist."</p><p>So, then, what made Postgres so successful? What was the big hit? It may be the flexible opensource code base and the dynamic group behind it that kept this going. Berkeley had been a hotbed of software development with <a href="https://en.wikipedia.org/wiki/Berkeley_Software_Distribution">BSD</a> and other projects coming out at the time.</p><p>Stonebraker acknowledges this very humbly, and Joe summarizes the lesson here as "do something important and set it free." </p><p></p><blockquote>[A] pick-up team of volunteers, none of whom have anything to do with me or Berkeley, have been shepherding that open source system ever since 1995. The system that you get off the web for Postgres comes from this pick-up team.
It is open source at its best and I want to just mention that I have nothing to do with that and that collection of folks we all owe a huge debt of gratitude to.</blockquote><p></p><p>In the lessons section at the end of the article, Joe also talks about the second system effect as follows. I agree with what he says. I think another reason this worked is the incremental delivery of the system, with piecewise student projects. </p><p>The highest-order lesson I draw comes from the fact that Postgres defied Fred Brooks’ “Second System Effect” (1975). Brooks argued that designers often follow up on a successful first system with a second system that fails due to being overburdened with features and ideas. Postgres was Stonebraker’s second system, and it was certainly chock full of features and ideas. Yet the system succeeded in prototyping many of the ideas, while delivering a software infrastructure that carried a number of the ideas to a successful conclusion. This was not an accident --at base, Postgres was designed for extensibility, and that design was sound. With extensibility as an architectural core, it is possible to be creative and stop worrying so much about discipline: you can try many extensions and let the strong succeed. Done well, the “second system” is not doomed; it benefits from the confidence, pet projects, and ambitions developed during the first system. This is an early architectural lesson from the more “server-oriented” database school of software engineering, which defies conventional wisdom from the “component-oriented” operating systems school of software engineering.</p><p><br /></p><h1 style="text-align: left;">Features</h1><h2 style="text-align: left;">Complex objects</h2><p>Relational modeling religion dictated that data should be restructured and stored in an unnested format, using multiple flat entity tables (orders, products) with flat relationship tables (product_in_order) connecting them.
But in some cases you want to store the nested representation, because it is natural for the application. </p><p>Postgres retained tables as its "outermost" data type, but allowed columns to have "complex" types including nested tuples or tables. One of its more esoteric implementations, first explored in the ADT-Ingres prototype, was to allow a table-typed column to be specified declaratively as a query definition: "Quel as a data type".</p><p>As Postgres has grown over the years (and shifted syntax from Postquel to versions of SQL that reflect many of these goals), it has incorporated support for nested data like XML and JSON into a general-purpose DBMS without requiring any significant rearchitecting. The battle swings back and forth, but the Postgres approach of extending the relational framework with extensions for nested data has shown time and again to be a natural end-state for all parties after the arguments subside.</p><h2 style="text-align: left;">User-defined abstract data types and functions</h2><p>Postgres pioneered the idea of having opaque, extensible Abstract Data Types (ADTs), which are stored in the database but not interpreted by the core database system. To enable queries that interpret and manipulate these objects, an application programmer needs to be able to register User-Defined Functions (UDFs) for these types with the system, and be able to invoke those UDFs in queries. User-Defined Aggregate (UDA) functions are also desirable to summarize collections of these objects in queries. Postgres was the pioneering database system supporting these features in a comprehensive way.</p><p>Why put this functionality into the DBMS, rather than the applications above? 
The classic answer was the significant performance benefit of “pushing code to data,” rather than “pulling data to code.” Postgres showed that this is quite natural within a relational framework: it involved modest changes to a relational metadata catalog, and mechanisms to invoke foreign code, but the query syntax, semantics, and system architecture all worked out simply and elegantly.</p><h2 style="text-align: left;">Extensible access methods for new datatypes</h2><p>This problem was au courant at the time of Postgres, and <a href="https://en.wikipedia.org/wiki/R-tree">the R-tree</a> developed by Antonin Guttman (1984) in Stonebraker’s group was one of the most successful new indexes developed to solve this problem in practice. Still, the invention of an index structure does not solve the end-to-end systems problem of DBMS support for multi-dimensional range queries. Many questions arise. Can you add an access method like R-trees to your DBMS easily? Can you teach your optimizer that said access method will be useful for certain queries? Can you get concurrency and recovery correct?</p><p>R-trees became a powerful driver and the main example of the elegant extensibility of Postgres’ access method layer and its integration into the query optimizer. Postgres demonstrated --in an opaque ADT style-- how to register an abstractly described access method (the R-tree, in this case), and how a query optimizer could recognize an abstract selection predicate (a range selection in this case) and match it to that abstractly described access method.</p><p>PostgreSQL today leverages both the original software architecture of extensible access methods (it has B-tree, GiST, SP-GiST, and Gin indexes) and the extensibility and high concurrency of the Generalized Search Tree (GiST) interface.
GiST indexes power the popular PostgreSQL-based PostGIS geographic information system; Gin indexes power PostgreSQL’s internal text indexing support.</p><h2 style="text-align: left;">Active Databases and Rule Systems</h2><p>Stonebraker’s work on database rules began with Eric Hanson’s Ph.D., which initially targeted Ingres but quickly transitioned to the new Postgres project. It expanded to the Ph.D. work of Spyros Potamianos on PRS2: Postgres Rules System 2. A theme in both implementations was the potential to implement rules in two different ways. One option was to treat rules as query rewrites, reminiscent of the work on rewriting views that Stonebraker pioneered in Ingres. In this scenario, a rule logic of "on condition then action" is recast as "on query then rewrite to a modified query and execute it instead." For example, a query like "append a new row to Mike’s list of awards" might be rewritten as "raise Mike’s salary by 10%." The other option was to implement a more physical "on condition then action," checking conditions at a row level by using locks inside the database. When such locks were encountered, the result was not to wait (as in traditional concurrency control), but to execute the associated action.</p><p>In the end, neither the query rewriting scheme nor the row-level locking scheme was declared a "winner" for implementing rules in Postgres—both were kept in the released system. Eventually all of the rules code was scrapped and rewritten in PostgreSQL, but the current source still retains both the notions of per-statement and per-row triggers.</p><h2 style="text-align: left;">Log-centric Storage and Recovery</h2><p>Stonebraker described his design for the Postgres storage system this way:</p><p>When considering the POSTGRES storage system, we were guided by a missionary zeal to do something different. 
All current commercial systems use a storage manager with <a href="https://muratbuffalo.blogspot.com/2023/04/aries-transaction-recovery-method.html">a write-ahead log (WAL)</a>, and we felt that this technology was well understood. Moreover, the original Ingres prototype from the 1970s used a similar storage manager, and we had no desire to do another implementation. </p><p>Over the years, Stonebraker repeatedly expressed distaste for the complex write-ahead logging schemes pioneered at IBM and Tandem for database recovery. One of his core objections was based on a software engineering intuition that nobody should rely upon something that complicated--especially for functionality that would only be exercised in rare, critical scenarios after a crash.</p><p>In the end, the Postgres storage system never excelled on performance; versioning and time-travel were removed from PostgreSQL over time and replaced by write-ahead logging. This is because, once the commercial vendors had write-ahead logs working well, they had innovated on follow-on ideas such as transactional replication based on log shipping, which would be difficult in the Postgres scheme.</p><h2 style="text-align: left;">Support for Multiprocessors: XPRS</h2><p>Stonebraker never architected a large parallel database system, but he led many of the motivating discussions in the field. 
His “Case for Shared Nothing” paper (1986) documented the coarse-grained architectural choices in the area; it popularized the terminology used by the industry, and threw support behind shared-nothing architectures like those of Gamma and Teradata, which were rediscovered by the Big Data crowd in the 2000s.</p><p>The basic idea of what Stonebraker called “The Wei Hong Optimizer” was to cut the problem in two: run a traditional single-node query optimizer in the style of System R, and then “parallelize” the resulting single-node query plan by scheduling the degree of parallelism and placement of each operator based on data layouts and system configuration. This approach is heuristic, but it makes parallelism an additive cost to traditional query optimization, rather than a multiplicative cost. Although “The Wei Hong Optimizer” was designed in the context of Postgres, it became the standard approach for many of the parallel query optimizers in industry.</p><h2 style="text-align: left;">Support for a Variety of Language Models</h2><p><b><u>One of Stonebraker’s recurring interests since the days of Ingres was the programmer API to a database system. </u></b>The OODB idea was to make programming language objects be optionally marked “persistent,” and handled automatically by an embedded DBMS. Postgres supported storing nested objects and ADTs, but its relational-style declarative query interface meant that each roundtrip to the database was unnatural for the programmer (requiring a shift to declarative queries) and expensive to execute (requiring query parsing and optimization). To compete with the OODB vendors, Postgres exposed a so-called “Fast Path” interface: basically a C/C++ API to the storage internals of the database. This enabled Postgres to be moderately performant in academic OODB benchmarks, but never really addressed the challenge of allowing programmers in multiple languages to avoid the impedance mismatch problem. 
Instead, Stonebraker branded the Postgres model as “Object-Relational,” and simply sidestepped the OODB workloads as a “zero-billion dollar” market. Today, essentially all commercial relational database systems are “Object-Relational” database systems.</p><p>This application-level approach is different than both OODBs and Stonebraker’s definition of Object-Relational DBs. In addition, lightweight persistent key-value stores have succeeded as well, in both non-transactional and transactional forms. These were pioneered by Stonebraker’s Ph.D. student Margo Seltzer, who wrote BerkeleyDB as part of her Ph.D. thesis at the same time as the Postgres group, which presaged the rise of distributed “NoSQL” key-value stores like Dynamo, MongoDB, and Cassandra.</p><p> </p><h1 style="text-align: left;">Impact</h1><h2 style="text-align: left;">Opensource Impact</h2><p>As the Postgres research project was winding down, two students in Stonebraker’s group—Andrew Yu and Jolly Chen—modified the system’s parser to accept an extensible variant of SQL rather than the original Postquel language. The first Postgres release supporting SQL was Postgres95; the next was dubbed PostgreSQL.</p><p>A set of open-source developers became interested in PostgreSQL and “adopted” it even as the rest of the Berkeley team was moving on to other interests. Over time the core developers for PostgreSQL have remained fairly stable, and the open-source project has matured enormously. 
Early efforts focused on code stability and user-facing features, but over time the open source community made significant modifications and improvements to the core of the system as well, from the optimizer to the access methods and the core transaction and storage system.</p><p>While many things have changed in 25 years, the basic architecture of PostgreSQL remains quite similar to the university releases of Postgres in the early 1990s, and developers familiar with the current PostgreSQL source code would have little trouble wandering through the Postgres 3.1 source code (c. 1991). Everything from source code directory structures to process structures to data structures remains remarkably similar. <b><u>The code from the Berkeley Postgres team had excellent bones.</u></b></p><p>PostgreSQL today is without question the most high-function open-source DBMS, supporting features that are often missing from commercial products. It is also (according to one influential rankings site) the most popular independent open-source database in the world, and its impact continues to grow: in both 2017 and 2018 it was the fastest-growing database system in the world in popularity. PostgreSQL is used across a wide variety of industries and applications, which is perhaps not surprising given its ambition of broad functionality.</p><p>Heroku is a cloud SaaS provider that is now part of Salesforce. Postgres was adopted by Heroku in 2010 as the default database for its platform. Heroku chose Postgres because of its operational reliability. With Heroku’s support, more major application frameworks such as Ruby on Rails and Python for Django began to recommend Postgres as their default database.</p><h2 style="text-align: left;">Commercial adaptations</h2><p>Many of the commercial efforts that built on PostgreSQL have addressed what is probably its key limitation: the ability to scale out to a parallel, shared-nothing architecture.
These include Illustra, Netezza, Greenplum, EnterpriseDB, AsterData, ParAccel (acquired by Amazon and forming a basis for AWS Redshift), and Citus.</p><p>Although the article doesn't mention it, AWS RDS and AWS Aurora also provide managed Postgres services and are big.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-70216990565982192212024-01-17T20:21:00.007-05:002024-01-17T20:21:56.332-05:00Scalable OLTP in the Cloud: What’s the BIG DEAL?<p><a href="https://www.cidrdb.org/cidr2024/papers/p63-helland.pdf">This paper</a> is from Pat Helland, the apostate philosopher of database systems, overall a superb person, and a good friend of mine. The paper appeared this week at CIDR'24. (Check out <a href="https://www.cidrdb.org/cidr2024/program.html">the program</a> for other interesting papers.) The motivating question behind this work is: "<b>What are the asymptotic limits to scale for cloud OLTP (OnLine Transaction Processing) systems?</b>" Pat says that the CIDR 2023 paper <a href="https://muratbuffalo.blogspot.com/2023/01/is-scalable-oltp-in-cloud-solved.html">"Is Scalable OLTP in the Cloud a Solved Problem?"</a> prompted this question. </p><p>The answer to the question? Pat says that the answer lies in the joint responsibility of the database and the application. If you know of Pat's work (I have summarized <a href="https://muratbuffalo.blogspot.com/search?q=Helland">several of his papers in this blog</a>), you would know that Pat has been advocating along these lines before. But this paper provides a very crisp, specific, concrete answer. Read on for my summary of the paper.</p><p>Disclaimer: This is a wisdom- and technical-detail-packed 13-page paper, so I will try my best to summarize the salient points. I will be using text from the paper to explain/summarize it. (Don't taze me bro!)
</p><p><br /></p><h1 style="text-align: left;">Snapshot Isolation (SI) is a BIG DEAL</h1><p>The database and the application have a BIG DEAL: their isolation semantics! In particular, snapshot isolation (SI) is the sweet spot. At this point, I got a nice database history lesson on how the isolation semantics evolved. I would have guessed the semantics had become more strict over time. No, on the contrary, they evolved to be more relaxed to meet performance and scalability expectations. And SI does hit a sweet spot in that it still provides the user good isolation guarantees without jeopardizing the scaling behavior of the database by requiring it to serialize everything. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEixuW5RhqakvtCihi8JM91NYzNa7Ud6z3G2SM7D2fS4z1B60hF8w8pZOtnL7TW6n5HJHu9Uap4RNJO9lBCCy5LlXR2FaLterr1EOyZo7cNy7G2hvI0Z82BNQFdPywPLidWFFa76aPyaiVGg5G7ZdpneCtiDASM2uDZZZcLYmMhOW-tlJJVi4FdTMs7tMBI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="722" data-original-width="676" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEixuW5RhqakvtCihi8JM91NYzNa7Ud6z3G2SM7D2fS4z1B60hF8w8pZOtnL7TW6n5HJHu9Uap4RNJO9lBCCy5LlXR2FaLterr1EOyZo7cNy7G2hvI0Z82BNQFdPywPLidWFFa76aPyaiVGg5G7ZdpneCtiDASM2uDZZZcLYmMhOW-tlJJVi4FdTMs7tMBI=w375-h400" width="375" /></a></div><p>In the rest of the paper, keep in mind that an OLTP system is defined as a domain-specific application using an <b>RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL database</b> to provide transactions across many concurrent users.</p><p>The BIG DEAL splits the scaling responsibilities between the database and the application.</p><p></p><ul style="text-align: left;"><li><b><u>Scalable DBs don’t coordinate across disjoint TXs updating different keys.</u></b></li><li><b><u>Scalable apps don’t concurrently update the same key.</u></b></li></ul><p></p><p>The big deal provides guarantees from the
DB to the App. A scalable application can read all it wants. Updates to disjoint records don’t coordinate across TXs. Row-locks on disjoint records don’t coordinate across TXs.</p><p>Applications must tolerate these big deal disclaimers. Reads return snapshots: records have no "current" value. There is no NOW in a BIG DEAL database! Transactions may abort at any time, but not too often. SELECT with SKIP LOCKED may return only a subset of the qualifying records.</p><p>This means applications should change business behavior in order to scale. They can only provide a fuzzy/blurry view of the "current" state/changes. So, apps introduce ambiguity in biz-domain-specific ways: online retail makes ambiguous promises such as "Usually ships in 24 hours". And apps provide delayed truth: the finances of a large company may take days to summarize. Many OLTP apps aggregate values synchronously as they interact with humans. Public TPC benchmarks (e.g., TPC-A, TPC-B, and TPC-C) mandated synchronous aggregations. But as applications scale, they should rethink concentrating the aggregated values of business state in dedicated records.
By slowly and asynchronously aggregating this business state, the application can scale in a domain-specific manner.</p><p><br /></p><h1 style="text-align: left;">Today's OLTP databases don't scale</h1><p>Before suggesting a hypothetical scalable database that satisfies the database side of the big deal, Pat shows us why today’s databases don’t scale!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEha3LKPdGbzAgf0z8Gr8UwOfZIE5gviSpIYJl7gan52cd7k-tmgXKuzJbXt_je7-vnRpsi_TNc_61xbOZuP5hlRFPPyYdawO4ldh4tXQdCAZH_qH7ehfd-C9Y_t2WJbnrRgLIaNO11wTXITw-yvVn03mjkqR92P3qgcfQ56BRi-5yEAG9EvEs-8oSjyRKs" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="306" data-original-width="1528" src="https://blogger.googleusercontent.com/img/a/AVvXsEha3LKPdGbzAgf0z8Gr8UwOfZIE5gviSpIYJl7gan52cd7k-tmgXKuzJbXt_je7-vnRpsi_TNc_61xbOZuP5hlRFPPyYdawO4ldh4tXQdCAZH_qH7ehfd-C9Y_t2WJbnrRgLIaNO11wTXITw-yvVn03mjkqR92P3qgcfQ56BRi-5yEAG9EvEs-8oSjyRKs=s16000" /></a></div><p>In today's MVCC databases, reads & writes fight to access the "current" value of a record. The current version has a <b>home</b> location (a partition, server, or a B+ tree) holding the most recently committed version of the record or perhaps an uncommitted version. To update a record, exclusive access to the record's home is required. This causes infighting, contention, and coordination between the updating TX and any concurrent reading TXs.</p><p>Even reads contend with each other, since these implementations force MVCC readers to start out looking at the latest version of a key first. Coordination may also be needed to access neighboring records. Accessing key-ranges in B+Trees or similar data structures that may be changing needs cross-transaction coordination.</p><p>Readers coordinate with writers. Writers coordinate with readers.
Readers coordinate with other readers!</p><p>Having a home for a record also makes online repartitioning/sharding (which is required for scalability) very difficult. Moving record keys from one partition to another is complex and impacts application availability.</p><p>To address these challenges, Pat proposes a prototype design. The database is structured so that there is no pre-assigned home for a record's key. Unlike partitioned DBs, this design allows the database to seamlessly adapt to workload changes.</p><p>I liken this to the <a href="https://en.wikipedia.org/wiki/Everything_Is_Miscellaneous">miscellaneous manifesto</a>: instead of neatly allocating everything its place (which inevitably fails, requiring incessant re-orgs), embrace the messiness and use a search engine to get to information quickly.</p><p><br /></p><h1 style="text-align: left;">Rethinking OLTP databases</h1><p>The architecture is based on <a href="https://muratbuffalo.blogspot.com/2022/01/decoupled-transactions-low-tail-latency.html">a design Pat explored in a previous work</a>. 
That work is very technical, and I had missed its nuances and contributions because I didn't read through the appendix covering the details.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgBmeGMqsLQWCyBjr7ikas0mIkFdZ-6nVculH07Da5GPn2-TZ7B2XNXNvM_7Ux11JAAAWOR_nbOV98pZr72ILraXKqnhoIGV9n8Kr_rXX8F4Hge4Qj5Z5-GZowCyH6Fg25yoejZlQNgUTfQcDVGL-ELNaeE4eATUrT-3Go1hAPgzlnFsEBXkQE3ZBiRngo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="502" data-original-width="832" src="https://blogger.googleusercontent.com/img/a/AVvXsEgBmeGMqsLQWCyBjr7ikas0mIkFdZ-6nVculH07Da5GPn2-TZ7B2XNXNvM_7Ux11JAAAWOR_nbOV98pZr72ILraXKqnhoIGV9n8Kr_rXX8F4Hge4Qj5Z5-GZowCyH6Fg25yoejZlQNgUTfQcDVGL-ELNaeE4eATUrT-3Go1hAPgzlnFsEBXkQE3ZBiRngo=s16000" /></a></div><p><b>Owner servers</b> verify that concurrent transactions have not created any conflicting updates for each key row-locked or updated by the TX that optimistically hopes to commit. Owner servers are partitioned by both key-range and time-range. Repartitioning happens dynamically to accommodate scale. </p><p><b>Worker servers</b> are also horizontally scalable, and each has its own transaction log. As TX load increases, workers are added. Each TX happens at a single worker server. The worker servers accept connections from app servers, perform transactions & their queries, commit transactions to their per-worker log, and periodically flush committed new record-versions to the <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">LSM (log structured merge tree)</a>.</p><p><b>LSM servers</b> accept flushes from workers and incorporate them into the orderly past stored in the LSM. Record-versions are organized first by time, second by key. Each LSM layer contains record-versions for a band of time. 
With an LSM, the past scales without coordinating across disjoint transactions reading and updating!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhyhoxQSRPZsc1QWeYCex7Gm209SlW0kQMH53tGgHyt5YwnwXwOdDyPbOgYCk8dtiVP-pSxdK_xCI1tS4JCdDukqtg_-TvgwRCHoRFIcHWBs7mITLcdpkmqsx5fwMuOAU9xdweo7NHRS6V11VNBnLVlU_fH6iD1DBLVCSr6bk6oI9E23xoqwLe_FtNzAos" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="354" data-original-width="832" height="170" src="https://blogger.googleusercontent.com/img/a/AVvXsEhyhoxQSRPZsc1QWeYCex7Gm209SlW0kQMH53tGgHyt5YwnwXwOdDyPbOgYCk8dtiVP-pSxdK_xCI1tS4JCdDukqtg_-TvgwRCHoRFIcHWBs7mITLcdpkmqsx5fwMuOAU9xdweo7NHRS6V11VNBnLVlU_fH6iD1DBLVCSr6bk6oI9E23xoqwLe_FtNzAos=w400-h170" width="400" /></a></div><p><br /></p><h1 style="text-align: left;">Transaction execution and commit</h1><p>We now deep dive into workers and owners, as they are the most significant components in this OLTP architecture. The owners do the concurrency control (the adjudication of transactions with respect to other concurrent transactions), but the workers do the actual work of the transaction. The transaction is centralized in the worker's log. The workers' logs are ingested by LSM servers for later consumption and durability.</p><p>The worker will accept incoming connections from application servers, and plan/execute SQL statements: Reading with snapshots by key or key-range, acquiring row-locks using their unique record key, and updating records by their unique key. The worker will guess a future commit time, by which its updates and row-locks will hopefully have been verified not to conflict with concurrent TXs. 
The worker will then log the transaction’s updates & commit record in its local transaction log, which will then be fed into LSM servers.</p><p>Since the commit-time for a transaction is only a guess by the worker, the owner servers must verify that every update and row-lock sees no conflicting updates from snapshot to commit. As incoming proposed-updates and verify-locks arrive, they include a proposed-commit-time. Owner-servers align commit-time for records & workers. An incoming request from a worker hopefully arrives at the owner-server before its local clock has reached the proposed-commit-time. If it arrives after commit-time, the owner-server returns an error and the TX aborts. If it arrives before commit-time, the owner-server waits until its local clock reaches commit-time.</p><p>What are row-locks you ask?</p><p>Row-locks allow the application to ask the database for help with concurrency across transactions. Acting as traffic cops, they provide pessimistic concurrency control. They stall later transactions that try to acquire a row-lock held by an earlier transaction. This pessimistic ordering of transactions may be violated when failures happen. Competing transactions usually wait to allow the lock holder to go first, but that may be flawed. So correctness will be enforced by OCC prior to commit. Of course, row-locks are moot when scalable apps avoid concurrent updates to the same records. But if the app experiences concurrent updates to the same records, row-locks can help with the liveness of transactions when the DB uses them to function as a traffic cop.</p><p>Ok, let's wrap up the transaction execution discussion by talking about how owners can be horizontally scaled. Owners can close for new business and direct new proposed-updates elsewhere. An owner closed for new business only accepts worker requests for snapshot reads in its rectangle of key & time ranges, proposed updates, and notifications of transaction outcome. 
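To make the commit-time alignment concrete, here is a toy sketch of the owner-side check (my own simplification in Python; the `OwnerServer.propose_update` name and API are made up, not from the paper, and the actual conflict verification is elided):

```python
import time

class OwnerServer:
    """Toy model of an owner server aligning a worker's proposed
    commit-time with its local clock. (A hypothetical sketch: the real
    design also verifies conflicting updates from snapshot to commit.)"""

    def propose_update(self, key, new_version, proposed_commit_time):
        now = time.time()
        if now > proposed_commit_time:
            # The request arrived after the guessed commit-time already
            # passed on the owner's clock: return an error, the TX aborts.
            return "ABORT"
        # Otherwise wait until the local clock reaches the proposed
        # commit-time, then (after conflict checks, elided here) accept.
        time.sleep(proposed_commit_time - now)
        return "ACCEPT"
```

A worker that guesses its commit-time too aggressively gets aborted; a conservative guess merely makes the owner wait a bit longer on its local clock.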
In contrast, an owner open for new business also allows new proposed-updates and new verify-locks.</p><p><br /></p><h1 style="text-align: left;">Massive Scale: It’s About Time!</h1><p>As we have seen, the DB leverages time to provide snapshots, commits, and external consistency. External Consistency ensures new incoming requests see all previously exposed data, even by other database connections. That means snapshot reads from new incoming work must be after all committed work previously visible outside the database.</p><p>By using current time, T-now, as the snapshot time, this is easy. But this would get trickier and more complex as the geographic scope of a DB grows past a single datacenter.</p><p>Overall, this prototype database architecture is a big vindication for using time in systems. (Some of these ideas have been explored in Pat's earlier paper, under seniority and retirement.) Everything in the database is versioned by the record-version commit time. The database organizes data by its creation time to achieve scaling. Reads are old record-versions as of a past snapshot. Row-locks ensure locked records remain unchanged until commit time. And updates materialize as new record-versions for later snapshots.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com5tag:blogger.com,1999:blog-8436330762136344379.post-54537242224832711042024-01-11T21:31:00.001-05:002024-01-11T21:31:11.755-05:00 Oblivious Paxos: Privacy-Preserving Consensus Over Secret-Shares<p><a href="https://fadhil.id/papers/opaxos-techreport.pdf">This paper appeared in SOCC'23.</a> The paper presents a primary-backup secret-shared state machine (PBSSM) architecture and the associated consensus protocol, Oblivious Paxos (OPaxos). 
OPaxos enables privacy-preserving consensus by allowing acceptors to safely and consistently agree on a secret-shared value without untrusted acceptors knowing the value.</p><p><br /></p><h2 style="text-align: left;">OPaxos protocol overview</h2><p>OPaxos uses <b>(t, n)</b> threshold secret-sharing. This means generating <b>n</b> secret-shares from a single secret value such that it is possible to reconstruct the secret with just <b>t</b> shares.</p><p>In order to make (t, n) threshold secret-sharing play well with Paxos, the protocol requires that the cardinality of the intersection of any phase1 quorum and phase2 quorum is at least t. </p><p>This can be achieved by choosing the two quorum sizes to sum to n+t. More concretely, one quorum (say phase1) would have cardinality the ceiling of (n+t)/2, and the other quorum (phase2) the floor of (n+t)/2. The justification is quite simple. The cardinalities of the two quorums sum to n+t, but there are only n <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeon-holes</a> (erm, total nodes), so we know that these two quorums intersect in at least t nodes.</p><p>In the setup below n=5 and t=2. That means p1=4 nodes, and p2=3 nodes. 
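The pigeonhole arithmetic is easy to sanity-check with a short sketch (my own illustration in Python; `opaxos_quorum_sizes` is a made-up name, not code from the paper):

```python
from math import ceil, floor

def opaxos_quorum_sizes(n: int, t: int):
    """Pick phase1/phase2 quorum sizes summing to n + t, so that any
    phase1 quorum and any phase2 quorum overlap in at least t of the
    n acceptors (pigeonhole: q1 + q2 - n >= t)."""
    q1 = ceil((n + t) / 2)   # phase1 quorum size
    q2 = floor((n + t) / 2)  # phase2 quorum size
    assert q1 + q2 - n >= t  # worst-case intersection holds >= t shares
    return q1, q2

print(opaxos_quorum_sizes(5, 2))  # the setup discussed here: prints (4, 3)
```

For n=5 and t=2 this yields quorum sizes 4 and 3, matching p1 and p2 above.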
Note that leaders are only allowed to be located in the trusted sites, and an acceptor in an untrusted site is not allowed to be a leader.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgvrxW-TiVctbRBIFCfTwloBXpy2siVA4OCxq1cBC0h2-KqVnjuF3ZPdnPeu-pcRH163neNGSeh2bTcmKU2aLGRHFmRKBdgqjyRRZ0T7P_IsY_lr-CStk0PsAbqzTeNWhWMkJFUw0wq6J1Danf4nCekV3D8vZ9qhEP9y_40fqicYg5sEWUxisZ3ckiglvQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="442" data-original-width="642" height="275" src="https://blogger.googleusercontent.com/img/a/AVvXsEgvrxW-TiVctbRBIFCfTwloBXpy2siVA4OCxq1cBC0h2-KqVnjuF3ZPdnPeu-pcRH163neNGSeh2bTcmKU2aLGRHFmRKBdgqjyRRZ0T7P_IsY_lr-CStk0PsAbqzTeNWhWMkJFUw0wq6J1Danf4nCekV3D8vZ9qhEP9y_40fqicYg5sEWUxisZ3ckiglvQ=w400-h275" width="400" /></a></div><br /><p></p><p></p><p>In the appendix, the paper shows how the (t,n) idea can be extended to Fast Paxos. Since the idea is simple and applies at the quorum intersection level, I think the idea also applies for other flavors, even for <a href="http://muratbuffalo.blogspot.com/2023/12/nezha-deployable-and-high-performance.html">Nezha, which we reviewed recently.</a></p><p><br /></p><h2 style="text-align: left;">What problem is the paper solving?</h2><p>Integrating the (t,n) threshold secret-sharing to Paxos is cute, but the motivation is not well justified. The paper gives hybrid cloud deployments as an application, where the cloud sites are designated as untrusted and leaders are run only at the trusted sites. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiN9mrWYdWFf4XJcet1fZFGvywZLHGi9LnLvCGJdz0nDJx1u_l-lLKP7l3hRWSrmeC9y8eMZhXvEdfIcKZmb6PIfv8umxtIECu9pOBlGgVn937i0LR5LNjagMjvPd7xUmBnN_th_Z5HEc16lDA2bq6-nB-W6C0JP6Rz1lhDGDzO27UlfCy7Wk99NHbuVjA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="442" data-original-width="642" height="275" src="https://blogger.googleusercontent.com/img/a/AVvXsEiN9mrWYdWFf4XJcet1fZFGvywZLHGi9LnLvCGJdz0nDJx1u_l-lLKP7l3hRWSrmeC9y8eMZhXvEdfIcKZmb6PIfv8umxtIECu9pOBlGgVn937i0LR5LNjagMjvPd7xUmBnN_th_Z5HEc16lDA2bq6-nB-W6C0JP6Rz1lhDGDzO27UlfCy7Wk99NHbuVjA=w400-h275" width="400" /></a></div><p>If the leader on the trusted site uses encryption for the content of the messages, and uses acceptors at the untrusted sites to accept and replicate encrypted content, we would achieve the same objective using vanilla Paxos. The untrusted sites would still be oblivious to the content of the messages they accept and logs they replicate, since they don't have the encryption keys.</p><p>The paper mentions that encryption does not offer information-theoretic privacy because a computationally rich adversary may break it eventually. This is not a practical concern. Breaking the encryption is in fact much more unlikely than the cloud providers used in (t,n) secret sharing colluding with each other to reveal the secret.</p><p>The paper does not justify the need for coupling secrecy with Paxos-consensus, when it is easy to achieve both benefits in a decoupled manner. Decoupled components are better for managing complexity and keeping flexible options for software/system evolution.</p><p>Note that, in OPaxos, none of those acceptors in the untrusted sites are allowed to become a leader anyway, because if they did they would have access to information from other nodes and could reconstruct the secrets, as they would obtain t or more shares. 
Actually, this opens an interesting attack vector for OPaxos. What stops them from becoming leaders?</p><p>So what is the benefit we gain from the untrusted sites, if availability of the system is limited to availability of at least one node in the trusted sites? The paper suggests that, if the nodes at the trusted sites become unavailable, the untrusted nodes can fall back to using <a href="https://en.wikipedia.org/wiki/Secure_multi-party_computation">secure multi-party computation (SMPC)</a> to preserve availability of the system while keeping secrecy. There are no details in the paper (except for a paragraph) as to how this can be achieved. SMPC is an involved topic and is currently not available for general computation required for state machine replication. So I don't think we can count this as an argument in favor of using OPaxos and (t,n) secret sharing. </p><p>But, as a rebound, the paper does mention an advantage to using OPaxos and (t,n) sharing if we consider the simple stateless problem of key-value store maintenance with just GET and PUT. Then thanks to (t,n) secret sharing, even when no trusted site is available, the client can reach out to the untrusted nodes, and get service in a secrecy preserving manner as follows: "To execute a request, a client dealer fetches the state partition it manages, e.g., a key-value pair, locally computes the result and re-distributes secret shares of any state modifications back to the untrusted servers (shown in Figure 8). 
It is straightforward to ensure request privacy by performing both read and write operations in a manner so as to appear identical, e.g., by always updating a nonce in the key-value pair even for reads to make them indistinguishable from writes."</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEivOgeaYx-NBxNFdxpqonrYgO5tY1VpwCG3j2ziU6QUK2Xbeu4xs5B_ZYAX40pXAAjQPRfsp99W1A-pQ644HXJ5FsY-Mu7ZgRmoWaX3KDiiEfb074UeJOehhdBHg1kYjwysXtRl1ChOPbePFWfu4xA0AmuQa2aSovrLR1dkohKO5x1SpWihqGRoMG20fJw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="282" data-original-width="492" height="183" src="https://blogger.googleusercontent.com/img/a/AVvXsEivOgeaYx-NBxNFdxpqonrYgO5tY1VpwCG3j2ziU6QUK2Xbeu4xs5B_ZYAX40pXAAjQPRfsp99W1A-pQ644HXJ5FsY-Mu7ZgRmoWaX3KDiiEfb074UeJOehhdBHg1kYjwysXtRl1ChOPbePFWfu4xA0AmuQa2aSovrLR1dkohKO5x1SpWihqGRoMG20fJw" width="320" /></a></div><br /><p></p><p>The paper has Go implementation for <a href="https://github.com/opaxos/opaxos/blob/main/untrustedopaxos/untrustedopaxos.go">this untrusted execution mode of key-value store maintenance in its Github repo</a>. (And of course, no implementation for the SMPC mode.) 
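The client-dealer trick quoted above can be sketched as follows (my own illustration in Python: it uses naive n-out-of-n XOR sharing instead of the paper's (t, n) threshold scheme, and re-shares with fresh randomness in place of an explicit nonce, but it shows how a GET becomes indistinguishable from a PUT):

```python
import os
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(value: bytes, n: int):
    """Naive n-out-of-n XOR secret sharing (illustration only; OPaxos
    uses a (t, n) threshold scheme)."""
    rnd = [os.urandom(len(value)) for _ in range(n - 1)]
    return rnd + [reduce(_xor, rnd, value)]  # XOR of all shares == value

def combine(shares) -> bytes:
    return reduce(_xor, shares)

class ClientDealer:
    """Toy client dealer over untrusted share-holding servers (plain
    dicts here). Both GET and PUT end by distributing fresh shares, so
    the two operations look identical to each server."""

    def __init__(self, servers):
        self.servers = servers

    def put(self, key, value: bytes) -> None:
        for srv, share in zip(self.servers, split(value, len(self.servers))):
            srv[key] = share

    def get(self, key) -> bytes:
        value = combine([srv[key] for srv in self.servers])
        self.put(key, value)  # re-share: makes the read look like a write
        return value
```

Because every GET ends by writing fresh shares back, each untrusted server observes the same fetch-then-store pattern for both operations.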
For implementation of OPaxos and Fast-OPaxos, the paper builds on our (Ailidani Ailijiang, Aleksey Charapko, and Murat Demirbas) <a href="https://github.com/ailidani/paxi">Paxi framework implementation</a>.</p><p><br /></p><h2 style="text-align: left;">SMR maintenance in OPaxos</h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgCXMpeNnhsJ893OlLqACid8sBTWWi2Nyc6gidX88UyDs4Nck3wkSh8eN6tBtdIPNacWRbxcCfxsm9OJC9Xna4usulSf5Bly0k18aj4gliubDwi3-trGw6Z96_5bqhfzrrxBfNywvZWQKwbWVNsTogDqOh2nbJxAdxjzst36WULKyCeO96LpKwWyZhjVj0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="411" data-original-width="642" height="256" src="https://blogger.googleusercontent.com/img/a/AVvXsEgCXMpeNnhsJ893OlLqACid8sBTWWi2Nyc6gidX88UyDs4Nck3wkSh8eN6tBtdIPNacWRbxcCfxsm9OJC9Xna4usulSf5Bly0k18aj4gliubDwi3-trGw6Z96_5bqhfzrrxBfNywvZWQKwbWVNsTogDqOh2nbJxAdxjzst36WULKyCeO96LpKwWyZhjVj0=w400-h256" width="400" /></a></div><p>I want to mention another interesting tidbit about the paper. OPaxos shares the state-diff resulting from speculative operation execution at one of the leaders at the trusted sites. The paper argues that this avoids the need to have deterministic op-logs, but they miss the importance of <a href="https://en.wikipedia.org/wiki/Change_data_capture">change data capture (CDC).</a> Oplogs are essential for the CDC-based data/system integration in the cloud ecosystem via capturing and delivering changes made.</p><p>I think OPaxos would still be amenable to using oplogs, by having learners at trusted domains running as hot-swap-host catching up and executing from oplogs.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-22305465387142884232024-01-09T23:49:00.002-05:002024-01-09T23:50:33.627-05:00Dude, where's my Emacs?<p>It had been 3 years since I had to setup a new laptop. 
Past Murat had left me <a href="http://muratbuffalo.blogspot.com/2017/04/setting-up-new-mac-laptop.html">helpful</a> <a href="http://muratbuffalo.blogspot.com/2020/03/my-emacs-setup.html">instructions</a>, but I got a bad surprise while configuring my <a href="https://en.wikipedia.org/wiki/Emacs">Emacs</a> setup. My beloved starter-kits, which worked smoothly under Emacs 27, stopped working under Emacs 29. I tried to fix the errors, but this proved futile due to my limited elisp/emacs skills. </p><p>I explored alternative starter-kits to configure Emacs. Doom emerged as the dominant choice, but it appeared too large, bloated, and complex for my liking. I tried a couple of small starter kits, but I faced other problems and was unable to integrate my customizations and get to a reasonable setup. </p><p>I have been an Emacs user for 25 years, and I know a thing or two, but it seems everyone's now an Emacs pro. I don't get how people are able to accept and work with those complicated config files/directories. I started to speculate. Maybe the ordinary users left, leaving behind the proficient Emacs enthusiasts, who kept writing more and more intricate config files/directories. They got radicalized, man! That is why there are no user-friendly starter kits anymore. </p><p><br /></p><h2 style="text-align: left;">My Emacs needs</h2><p>I hate Emacs yak shaving, so I despised wasting several hours on this. Nevertheless, configuring Emacs to a decent functional state is essential for my productivity. I have two top priorities in my Emacs setup.</p><p>The first is org-mode for todo scheduling/tracking and for hierarchical note-taking. I rely on org-mode for taking meeting notes, writing long-form text like blog posts, and in general thinking through writing. I even use it for<a href="https://en.wikipedia.org/wiki/Literate_programming"> literate programming</a> for developing programs/protocols, and writing TLA+ models. 
Those org-mode files get pretty long, exceeding 10K lines, filled with many hierarchical tasks and todo lists. </p><p>I use global-hi-lock-mode in Emacs to introduce color-highlighting into my notes using special punctuation. For instance, when I use "??", it renders the line yellow, signaling outstanding questions. "@@" transforms the line into green, emphasizing noteworthy ideas or observations. Finally, "!!" marks the line as red, indicating a warning or an important point. This system significantly enhances my writing/thinking workflow.</p><p>My second top priority is the crafting of plain text presentations using org-mode, compiling them with LaTeX into beamer-pdf slides to achieve polished results. This proved invaluable in times when I needed to quickly put together a presentation or customize an existing one. I hate the constant fiddling required by graphical presentation software like Keynote and Powerpoint. Writing presentations in org-mode is significantly more convenient and faster. It provides a frictionless experience, allowing me to concentrate solely on content without being distracted by formatting, yet producing visually appealing presentations. Content is king.</p><p><br /></p><h2 style="text-align: left;">Back to the future</h2><p>After investing additional hours, I got my Emacs setup to function well. I reverted to my approach from eight years ago, using a single bare-knuckle init.el file instead of starter-kits. Surprisingly, Emacs defaults improved in recent years, and encouraged by this, I adopted this minimalist approach. I made minimal customizations in the init.el file, specifically for org-mode, and I also incorporated the "leuven" theme and the "Monaco" font.</p><p>Having installed Emacs via homebrew, I leveraged Elpa for package management. 
With mactex installed through homebrew, I used Elpa to install auctex, and this sufficed for successful LaTeX compilation and smooth beamer-pdf generation from org files.</p><p>I decided to adopt the default mainstream shortcuts in org-mode for task transitions and date manipulation, as they have improved significantly. Although it took time for my fingers to adjust, I am happy to avoid overly customized shortcuts to prevent potential issues in the future. My goal was to get a usable setup, prioritizing practicality over super-optimization. Premature optimization is the root of all evil.</p><p><br /></p><h2 style="text-align: left;">Reflections</h2><p>Losing my Emacs setup would have been a significant blow to my productivity. Without those capabilities, I'd struggle with note-taking, thinking through writing, blog post creation, presentation writing, and task/project management.</p><p>I remember <a href="https://muratbuffalo.blogspot.com/2019/03/book-review-draft-no-4-by-john-mcphee.html">reading about McPhee's experience</a> using an old text-based Unix editor. The discontinuation of the editor led him to continue on old computers and eventually enlist a developer to adapt it to his new computer.</p><p>Similarly, some colleagues heavily reliant on vcc for proofs faced challenges when it wasn't supported in recent OS versions. They had to use docker containers to recreate that environment.</p><p>It's unsettling how fragile systems become and how dependent we become on our tools. 
Losing them almost amounts to a loss of memory.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-40357678344836865452024-01-05T09:19:00.004-05:002024-01-05T09:19:26.093-05:00Recent reads<p>Here are brief reviews for the 6 books I read in the last couple of months.</p><p><br /></p><h2 style="text-align: left;">How Not to Be Wrong: The Power of Mathematical Thinking (Jordan Ellenberg, 2014)</h2><div><div><a href="https://en.wikipedia.org/wiki/How_Not_to_Be_Wrong">This book</a> discusses applying mathematical thinking in everyday situations for decision-making in order to avoid common pitfalls.</div><div><br /></div><div>The book shows how mathematical principles can be applied to real-world scenarios involving statistics, probability, and game theory. There are some fun real world examples discussed including lotteries, election polls, finance, tax schemes, and medicine. The book aims to show readers how embracing mathematical reasoning can guard them against being misled by faulty arguments or misinterpretations of data.</div><div><br /></div><div>The book is written in an engaging and accessible style, but it gets tedious as it is more than 450 pages long. Inevitably, it delves into repetitive explanations, creating monotony and disorganization. The book loses focus as it disperses its attention. Did we really need this long of a book? Couldn't it be half or quarter the size?</div><div><br /></div><div>Personally, I found more enjoyment in the <a href="https://muratbuffalo.blogspot.com/2020/11/mathematics-for-human-flourishing-book.html">"Mathematics for Human Flourishing" by Francis Su (2020).</a></div><div><br /></div><div><br /></div><h2 style="text-align: left;">On grand strategy (John Lewis Gaddis, 2018)</h2><div>This book explores the concept of <a href="https://en.wikipedia.org/wiki/Grand_strategy">grand strategy</a>: how leaders formulate and implement long-term plans. 
Being a historian, Gaddis draws on the writings of Sun Tzu, Clausewitz, and Machiavelli, and military campaigns going as far back as 480 BC to Xerxes's invasion of Greece, to make his points.</div><div><br /></div><div>He employs <a href="https://en.wikipedia.org/wiki/The_Hedgehog_and_the_Fox">the fox versus hedgehog framework</a> (popularized by Berlin in 1953) to discuss different philosophies leaders adopt for grand strategy. "A fox knows many things, but a hedgehog knows one big thing." --Archilochus</div><div><br /></div><div>Foxes are adaptable and versatile. They draw on a wide range of experiences and ideas, avoiding a fixed, singular perspective. Hedgehogs, in contrast, are characterized by a singular focus. They have a specific, central vision or principle that guides their actions. Gaddis argues that an effective grand strategy requires a deep understanding of the geopolitical landscape, clear objectives, yet flexibility in approach, combining strengths from both the fox and hedgehog frameworks. </div><div><br /></div><div>This quote from Gaddis is a stark reminder: "Commonsense is like oxygen: the higher you go, the thinner it gets." In complex situations, commonsense is insufficient and inapplicable. It becomes important to understand things deeply and think strategically. This reminded me of the "Are Right A Lot" and "Earn Trust" leadership principles from Amazon.</div><div><ul style="text-align: left;"><li>Are Right, A Lot: Leaders are right a lot. They have strong judgment and good instincts. <b>They seek diverse perspectives and work to disconfirm their beliefs</b>.</li><li>Earn Trust: Leaders listen attentively, speak candidly, and treat others respectfully. <b>They are vocally self-critical, even when doing so is awkward or embarrassing.</b> Leaders do not believe their or their team’s body odor smells of perfume. They benchmark themselves and their teams against the best. 
</li></ul></div><div>Despite promising brevity in the first chapter, the book extends to nearly 400 pages. This confused and frustrated me -- I guess I need to calibrate my expectations for the work of historians. I found the book to have a dry and academic tone. As the book lingered in its detailed and dry accounting of ancient military campaigns, I got bored and moved on to other things.</div><div><br /></div><h2 style="text-align: left;">American Nations: A History of the Eleven Rival Regional Cultures of North America (Colin Woodard, 2011)</h2><div><a href="https://upload.wikimedia.org/wikipedia/en/5/51/Cover_of_American_Nations.jpg"><img src="https://upload.wikimedia.org/wikipedia/en/5/51/Cover_of_American_Nations.jpg" /></a></div><div><br /></div><div><a href="https://en.wikipedia.org/wiki/American_Nations">This book</a> explores the cultural divisions within the United States through a historical frame of reference. Woodard argues that North America comprises 11 distinct regional cultures, each with its own historical roots, values, and political characteristics. Woodard traces the development of these 11 nations from the colonial period to the present day, discussing how their unique cultural and historical backgrounds continue to influence politics and society.</div><div><br /></div><div>The book offers a thought-provoking exploration of this regional diversity. Of course, the book simplifies and generalizes heavily, as the author briefly acknowledges. But this was an engaging book, and I enjoyed it. Coming to the States as an immigrant, this book felt like the manual I was missing for understanding American history and the current state of affairs.</div><div><br /></div><h2 style="text-align: left;">American Gods (Neil Gaiman, 2001)</h2><div><a href="https://en.wikipedia.org/wiki/American_Gods">This fantasy-fiction book</a> combines intricate world-building, and exploration of contemporary issues within the framework of old myth. 
The central premise is that gods and mythical beings exist because people believe in them, and their power diminishes as belief fades. In the book, the old gods start a campaign to resist being forgotten, and this starts a conflict with the modern gods. The book has superb storytelling, as is the hallmark of Neil Gaiman.</div><div><br /></div><h2 style="text-align: left;">Norse mythology (Neil Gaiman, 2017)</h2><div><a href="https://en.wikipedia.org/wiki/Norse_Mythology_(book)">This book</a> retells the ancient Norse myths with a modern (and superb) narration. Gaiman's short sentences and simple, clear prose weave a captivating narrative of the world of Norse folklore, including stories of Odin, Thor, Loki, Frey, and Freyja. It is funny. I imagined this book was Gaiman's writer's-block therapy book. It is just a retelling of Norse myths, an easy book that helped him exercise prose and prowess in writing. I mistakenly thought this predated American Gods. Anyway, I love reading Neil Gaiman, and I loved the simplicity and elegance of this book.
Creativity is a fundamental aspect of being human. It’s our birthright. And it’s for all of us.</li><li>Living life as an artist is a practice. You are either engaging in the practice or you’re not. It makes no sense to say you’re not good at it. It’s like saying, "I’m not good at being a monk." You are either living as a monk or you’re not. We tend to think of the artist’s work as the output. The real work of the artist is a way of being in the world.</li><li>To live as an artist is a way of being in the world. A way of perceiving. A practice of paying attention. Refining our sensitivity to tune in to the more subtle notes. Looking for what draws us in and what pushes us away. Noticing what feeling tones arise and where they lead.</li><li>All that matters is that you are making something you love, to the best of your ability, here and now.</li><li>To the best of my ability, I’ve followed my intuition to make career turns, and been recommended against doing so every time. It helps to realize that it’s better to follow the universe than those around you.</li><li>Art may only exist, and the artist may only evolve, by completing the work.</li><li>The call of the artist is to follow the excitement. Where there’s excitement, there’s energy. And where there is energy, there is light.</li></ul></div><div>The book is very disorganized, which makes it less engaging compared to previous work on the topic, such as "<a href="http://muratbuffalo.blogspot.com/2016/04/book-review-war-of-art-by-steven.html">The War of Art: Break Through the Blocks and Win Your Inner Creative Battles (Steven Pressfield, 2012)</a>". This is also reminiscent of "<a href="http://muratbuffalo.blogspot.com/2020/07/the-great-work-of-your-life-by-stephen.html">The great work of your life (Stephen Cope, 2012)</a>" and <a href="https://muratbuffalo.blogspot.com/2017/10/what-does-authentic-mean.html">work by Seth Godin</a>. 
</div></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-41077681961388275682024-01-03T12:22:00.001-05:002024-01-03T12:23:52.032-05:00 The checklist manifesto (Dr. Atul Gawande, 2009)<p style="text-align: justify;"><a href="https://en.wikipedia.org/wiki/The_Checklist_Manifesto">This book</a> advocates for integrating checklists as potent safety and fault-tolerance tools across diverse domains. Atul Gawande, a prominent surgeon, enriches the narrative with numerous surgery cases to emphasize the effectiveness of checklists in handling intricate tasks.</p><p>The surgery cases are graphic. The unemotional, matter-of-fact tone of the audiobook paradoxically intensified the emotional impact for me. Listening to those accounts gave me sweaty palms, and I instinctively clenched myself with pain. The graphic details effectively drive home the point. While listening, I couldn't help but lament that a simple checklist could have caught the mistake, averting all the blood and suffering.</p><p>The book also delves into how the aviation industry has successfully embraced checklists to reduce errors and improve communication. Gawande argues that disciplined checklist use in aviation significantly enhances safety and reliability. He details how the aviation industry rigorously operationalizes checklists, vetting them through simulations and real-world tests, ensuring they are brutally succinct, and continually improving them for practicality. This stands in stark contrast to other industries, including medicine and surgery.</p><p>Perhaps the early and eager adoption of checklists in aviation, compared to medicine and surgery, stems from pilots having <a href="https://en.wikipedia.org/wiki/Skin_in_the_Game_(book)">skin in the game</a>. Pilots face the same fate as the plane – assigning blame doesn't change the dire outcome of death due to a mishap. 
In contrast, surgeons don't share the same fate as patients and can shift blame to other factors (as if that matters).</p><p>I loved the focus on the operational aspects of making checklists effective. Gawande strongly resisted making checklists a top-down mandate. Mandating top-down adoption could have backfired; it needed to be a grassroots effort, allowing teams to adopt and customize checklists to make them their own.</p><p>I also loved this point: one of the first items on both flight and surgery checklists is the initial briefing and introduction of team members. Numerous studies highlight the significant positive impact this simple practice has in transforming individuals into a more effective and collaborative team. Don't skip the basics and the human touch. </p><p><br /></p><h2 style="text-align: left;">Construction and beyond</h2><p>The book dedicates a chapter to construction as it builds the case for the widespread benefits of checklists across various domains. Given the complex nature of construction projects (involving numerous tasks and collaborators), construction project management benefits from the use of checklists for enhancing communication and coordination.</p><p>I found the connection between checklists and construction not as direct as in aviation or medicine. The absence of specific, concrete examples of checklists for construction projects left me wanting. While the book provides detailed and concrete checklist examples for medicine, surgery, and aviation, there aren't specific checklist examples for construction.
Instead, the focus is on scheduling meetings between stakeholders, progress tracking, and finalizing decisions.</p><p>Could it be that construction is even more complicated than aviation and surgery due to the involvement of numerous stakeholders, a larger surface area, and an extended duration?</p><p>The discussions on construction project management reminded me strongly of software project management, where the multitude of stakeholders, unknowns, extensive interaction surface, and prolonged duration make it so complex that, in comparison, operating a flight or performing a surgery seems more manageable. </p><p>While we are on this topic, we can also draw parallels to the DevOps field, particularly through practices like runbooks employed by Site Reliability Engineers (SREs). SREs maintain detailed runbooks that serve as systematic checklists for handling routine maintenance tasks as well as for addressing critical incidents. These runbooks ensure that the on-calls or SREs adhere to a well-defined, step-by-step process when dealing with specific issues or tasks.</p><p>The runbooks formulate and capture the operational best practices. Similar to checklists, runbooks enhance communication and reduce error risks. These are often automated using tools/templates like AWS CloudFormation, Terraform, or Google Cloud Deployment Manager for consistent and repeatable infrastructure deployments.</p><p><br /></p><h2 style="text-align: left;">Discussion</h2><p>The book leaves me pondering: why did people overlook such a simple yet powerful tool for so long? And when checklists were introduced, why did their widespread adoption take so much time? Dr. Pronovost recognized their life-saving potential in 2001 and piloted them in hospitals. However, it wasn't until 2007 that Gawande, collaborating with WHO, pushed for broader adoption.
Gawande protests this too, and contrasts it with the swift adoption of new drugs and surgical tools that show far less effectiveness.</p><p>Effecting behavioral change in humans is evidently challenging. Establishing good habits is not easy. Moreover, adopting checklists demands emotional maturity to acknowledge fallibility and the courage to embrace humility. A machismo effect seems to be at play too, with many doctors and surgeons resisting checklists, feeling reduced to automatons. However, the reality is that brainpower and attention are finite resources. Why waste them on routine tasks when checklists can handle them? Instead, we should free up cognitive resources for more challenging aspects of projects! Checklists can make you more creative, because your bottom line is covered. Well-crafted checklists not only reduce errors and omissions but also enhance communication, leading to efficiency and performance improvements.</p><p>Why haven't checklists spread across more domains? Why can't we achieve wider adoption? We should try to mistake-proof routine parts of operations, so we can make progress in tackling complexity in the remaining parts. A colleague I admired once told me that the role of a professor/researcher is to simplify complex things and make them boring.</p><p>In summary, the book's message is clear: don't be a cowboy. Being a cowboy isn't heroic; it's foolish. Instead, be humble, be smart.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-81576917960343731962023-12-17T22:47:00.004-05:002023-12-17T22:47:45.433-05:00Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks<p><a href="https://www.vldb.org/pvldb/vol16/p629-geng.pdf">Nezha (VLDB'23)</a> is a consensus protocol that leverages synchronized clocks to decrease latency and increase throughput.
There is also <a href="https://github.com/Steamgjk/Nezha">a GitHub repo</a> for the implementation of Nezha and <a href="https://github.com/Steamgjk/Nezha/blob/main/docs/Nezha-tla.pdf">a TLA+ model associated with the protocol.<br /></a><br />Nezha's approach is to offload the traditional leader- or sequencer-based ordering to synchronized clocks, achieving decentralized coordination without the need to rely on network routers or sequencers. Here, time synchronization is leveraged on a best-effort basis, with no impact on correctness. You guessed it right: there is a fast path where the best-effort message ordering works, and the client waits for a super-majority quorum of replies ordered consistently. And then there is a slow path that covers the case where that fails.<br /><br />The evaluation suggests that Nezha outperforms previous protocols significantly, including an order-of-magnitude improvement in throughput. But the evaluations are performed under ideal conditions, and overlook <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">the metastability effects</a>. While time synchronization doesn't affect correctness, it does lead to some modality: one misordered entry ruins/invalidates the well-ordered ones following it. That means, when the slow path is hit once, it is possible for the system to be stuck in the slow path for the following requests, as it may not get enough slack to recover back to the fast-path mode.<br /><br /><br /></p><h1 style="text-align: left;">Nezha's contribution</h1><p style="text-align: left;">This paper follows the research trend of utilizing time synchronization to enhance consensus performance. This is a timely trend. We have <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/">very precise time synchronization available in the datacenters</a> thanks to advances in atomic clocks and better time sync infrastructure.
It is no longer unreasonable to assume good time sync in the datacenters. We had talked about <a href="https://muratbuffalo.blogspot.com/2022/02/efficient-replication-via-timestamp.html">Tempo and Accord</a> and their use of time.<br />Nezha seems more practical and more immediately applicable.<br /><br />Nezha makes a couple of simplifications that reduce the big modality gap between the fast and slow paths. In contrast to traditional Fast Paxos protocols, in Nezha there is always a dedicated stable leader. This leader is required to be included/involved in both fast and slow quorums. I love the simplification this stable leader brings. Each replica follows the log of the leader, rather than trying to piece together logs across multiple leaderless nodes themselves. The speculative execution at the leader is a bonus. Watch out for these points in the rest of the summary. <br /><br />Keep in mind that the modality gap is still not entirely closed, as I complained above about the lack of evaluation of metastability. But this is an improvement, and I am happy to take it. <br /><br /><br /></p><h1 style="text-align: left;">The reordering problem and quantifying it</h1><p style="text-align: left;">Ensuring the same consistent order across all receivers is a significant challenge in distributed systems. While TCP maintains order consistency for a single receiver, this cannot be used to guarantee the same message order across different receivers. <br /><br />Reordering across receivers occurs due to the different paths/routers taken by the messages to the receivers.
To quantify the extent of the reordering problem, the paper introduces the reordering score.</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhvlkBJnsA9-tZDM7kBetpXGwD7ZLxvsqiIrzYZZ5ALjY1W9ZhHLgCstsoahwtnUiBET4Mq7QK08pCuQqJFI1EWP1uk5AdzgUcsmkR3ayp-8ldFUDm0SgwebLR1Fgi1_pOVbBU-rx--PT8thUm3z57Q04YJMksIFYVPyqhrQw5qmfBhlLEf_KtfFhnslzM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="152" data-original-width="1238" height="39" src="https://blogger.googleusercontent.com/img/a/AVvXsEhvlkBJnsA9-tZDM7kBetpXGwD7ZLxvsqiIrzYZZ5ALjY1W9ZhHLgCstsoahwtnUiBET4Mq7QK08pCuQqJFI1EWP1uk5AdzgUcsmkR3ayp-8ldFUDm0SgwebLR1Fgi1_pOVbBU-rx--PT8thUm3z57Q04YJMksIFYVPyqhrQw5qmfBhlLEf_KtfFhnslzM" width="320" /></a></div>For two receivers, R1 and R2, receiving multicast messages from various senders, we establish the sequence of messages received by R1 as the reference. Each message receives a sequence number corresponding to its order of arrival at R1. Using these sequence numbers, the reordering score is computed to measure the extent of reordering in R2. 
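As a toy illustration of such a metric, here is a minimal sketch (my own, not the paper's exact formula: I assume the score is 1 minus the fraction of messages in R2's arrival order that form a longest increasing subsequence of R1's sequence numbers):

```python
import bisect

def lis_length(seq):
    # Patience-sorting LIS: tails[i] holds the smallest possible tail
    # of an increasing subsequence of length i+1.
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def reordering_score(r2_seq):
    # r2_seq: R1-assigned sequence numbers, listed in R2's arrival order.
    # 0.0 means R2 saw the same order as R1; values near 1 mean heavy reordering.
    if not r2_seq:
        return 0.0
    return 1.0 - lis_length(r2_seq) / len(r2_seq)
```

Under this (assumed) normalization, an identically ordered R2 scores 0, while a fully reversed sequence of length n scores 1 - 1/n.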
This score leverages the length of the longest increasing subsequence (LIS) in R2's sequence.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhS6EPI5xxpEN15CNyuUb3oeLbHYZH1yjqlpgyuyrhD51Ny0awX0etmZVgutG0dQCxFHoE6xdeVHC-0AVRo-nv7kiZZxSua4JocIqXj9naItNmMo5rnlRyWArj2fe8XKPsBASO9xQjgdYPE1YxNIobzLeSOefpGg0AO1BcHmcP7Pad_FHxYB54kOl8tY9I" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="524" data-original-width="2492" height="135" src="https://blogger.googleusercontent.com/img/a/AVvXsEhS6EPI5xxpEN15CNyuUb3oeLbHYZH1yjqlpgyuyrhD51Ny0awX0etmZVgutG0dQCxFHoE6xdeVHC-0AVRo-nv7kiZZxSua4JocIqXj9naItNmMo5rnlRyWArj2fe8XKPsBASO9xQjgdYPE1YxNIobzLeSOefpGg0AO1BcHmcP7Pad_FHxYB54kOl8tY9I=w640-h135" width="640" /></a></div><p></p><h1 style="text-align: left;">Deadline-Ordered Multicast (DOM)</h1><p style="text-align: left;">Nezha uses the DOM primitive for best effort consistent message ordering across receivers. DOM assigns a deadline timestamp (with global time using synchronized clocks) to each request and only delivers requests after that deadline time is reached (and in the deadline timestamp order). The intuition here is that the deadline acts as a buffer. By holding a message m' until its deadline (instead of immediately delivering it), we get a chance to receive any earlier message m that m' may have over-taken at this receiver, and so we are able to deliver them in the right order: m followed by m'. <br /><br />DOM is a best-effort primitive: a sequence of messages is processed in order at a receiver if they all arrive before their deadlines, but DOM does not guarantee that messages arrive reliably at all receivers either before the deadline or at all. <br /><br />Figure 3 shows different percentiles (i.e., 50th , 75th , 90th , and 95th) for DOM to decide its deadlines. A higher percentile means lower reordering. 
However, a higher percentile also means longer holding delay for messages in DOM, which in turn decreases the latency savings of Nezha. The paper uses the 50th percentile in Nezha to strike a balance.<br /><br /><br /></p><h1 style="text-align: left;">Fast path</h1><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgTjz8DcdyYLHp3G-0u55-n4p35M0AnpM6gXaP8PnlUkLjTMqSQ6hn1R8sqZ61yxSHM7H6oQIw7syzrn5NQju56w34ec9oCYYWthP-0KlqxNalXtRA81OYJqS0lGwHP9VUz1OiV1sWY6tKNEf5S9fwTmZqjvwXxS2zpo40rQyUWfA7Z22m4FU6ZSOd2NY8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1060" data-original-width="1218" height="348" src="https://blogger.googleusercontent.com/img/a/AVvXsEgTjz8DcdyYLHp3G-0u55-n4p35M0AnpM6gXaP8PnlUkLjTMqSQ6hn1R8sqZ61yxSHM7H6oQIw7syzrn5NQju56w34ec9oCYYWthP-0KlqxNalXtRA81OYJqS0lGwHP9VUz1OiV1sWY6tKNEf5S9fwTmZqjvwXxS2zpo40rQyUWfA7Z22m4FU6ZSOd2NY8=w400-h348" width="400" /></a></div>Although the figure shows a single proxy, this is a multiple proxy system. We get scaling because each proxy has access to synchronized clocks, each can send their DOM requests using local clocks without the need to communicate with each other. The time synchronization ensures that the requests that Proxies send have an agreed upon order, and that the replicas will deliver them in that order. <br /><br />If the time synchronization and message delivery works well (that is, if messages are delivered before their deadlines in the leader and fast-quorum number of nodes), then we have 1 RTT consensus. <br /><br />The leader is not an inbound I/O bottleneck because it doesn't need to aggregate messages from the replicas. The leader and each replica does the same work: 1 receive, 1 send. The time synchronization takes care of ordering of messages without a sequencer at the replicas. 
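The DOM delivery rule described earlier (hold each message until its deadline, then release in deadline-timestamp order) can be sketched with a small priority queue. This is a toy model with made-up names, assuming access to a synchronized clock, and is not the paper's implementation:

```python
import heapq

class DomReceiver:
    """Toy DOM receiver: buffers incoming messages and releases them
    in deadline order, only once the synchronized clock has passed
    each message's deadline."""

    def __init__(self):
        self._held = []  # min-heap of (deadline, msg)

    def on_receive(self, deadline, msg):
        # Hold the message instead of delivering it immediately.
        heapq.heappush(self._held, (deadline, msg))

    def deliverable(self, now):
        # Release every held message whose deadline has passed,
        # in deadline-timestamp order.
        out = []
        while self._held and self._held[0][0] <= now:
            out.append(heapq.heappop(self._held)[1])
        return out
```

A message that arrives only after its own deadline has already passed would be released late, and possibly out of order; that is exactly the case the slow path has to cover.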
The proxy collects the replies from the replicas and checks whether the fast quorum is achieved. This requires that the leader's reply is received, alongside another f+f/2 replica responses (out of 2f+1 total nodes, at most f of which can fail), which indicate that they delivered the same message id as the leader did. This is checked through the use of hashing, and the paper also has a nice optimization in Section 7.1 for doing this incrementally (reducing the check to set equivalence rather than list equivalence, because delivery at each replica is done in timestamped order).<br /><br />Only the leader executes the request; the replicas just respond saying that they delivered the same message, and are ready to serve as fault-tolerance agents/replicas. The replicas do not execute the request. Well, at least not immediately. It is ok for replicas to execute requests later, after they confirm the leader's order.<br /><br />The execution at the leader is said to be speculative, because the leader doesn't know that the replicas also delivered this message in order.
But from one perspective, this is not very speculative: the leader knows that its ordering will take effect (unless of course it is dethroned before this ordering is log-replicated by a majority of replicas.)<br /><br /><br /><p></p><h1 style="text-align: left;">Slow path</h1><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhdjLiAfLNg2Y3wPnvVMU54YWrGqDKHM-dTQ4deMW-WzXtldIOdW-13X3yEtHnqYbFRmIoZusHBrpUynJwvKiD0tRbeiI_D_BQdet_4nA8lIJoQ2t4BwjOxiFhoUMzyD3vVQgzPsbroD5qvx9GdixROjMpDG02ItMvnAi9cuAgC_-bdXnUm3Z1WsqwBGT8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1060" data-original-width="1218" height="348" src="https://blogger.googleusercontent.com/img/a/AVvXsEhdjLiAfLNg2Y3wPnvVMU54YWrGqDKHM-dTQ4deMW-WzXtldIOdW-13X3yEtHnqYbFRmIoZusHBrpUynJwvKiD0tRbeiI_D_BQdet_4nA8lIJoQ2t4BwjOxiFhoUMzyD3vVQgzPsbroD5qvx9GdixROjMpDG02ItMvnAi9cuAgC_-bdXnUm3Z1WsqwBGT8=w400-h348" width="400" /></a></div><br />If the fast path condition fails (that is, if the super-majority quorum does not have the same value as the leader), the slow path picks up the slack. <br /><br />This is more asynchronous in nature than the fast path execution. The replicas stream the log from the leader. And they reply back to the proxy. For this, only f+1 replies (one of which must be from the leader) are sufficient.<br /><br /><br /><p></p><h1 style="text-align: left;">Discussion about the protocol </h1><p style="text-align: left;">Nezha increases latency somewhat by requiring that the leader's message is also included in the super-majority quorums of the fast path. But this seems to help bridge the gap between fast and slow paths. Having a dedicated leader, and insisting on following the order of that leader, reduces the pain of dropping to the slow path, and the pain of recovery of proposals.
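For concreteness, the quorum arithmetic as I read it from the summary above can be written down in a few lines. The ceiling in the fast-quorum size is my interpretation of the f+f/2 figure; treat it as an assumption rather than the paper's exact definition:

```python
import math

def quorum_sizes(f):
    """Quorum sizes for n = 2f + 1 replicas, as I read them:
    the fast quorum is the leader plus f + ceil(f/2) matching
    replica replies; the slow path needs f + 1 replies, one of
    which must be the leader's."""
    n = 2 * f + 1
    fast = 1 + f + math.ceil(f / 2)   # includes the leader
    slow = f + 1                      # includes the leader
    return n, fast, slow
```

For f=1 (3 replicas) this gives a fast quorum of all 3 nodes; for f=2 (5 replicas) it gives 4 of 5, matching the familiar 3/4 super-majority of fast-path protocols.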
Recovery of proposals becomes just the recovery of the leader, which is a better-known problem and a more exercised code path. I like this continuity from the fast path to the slow path, thanks to both relying on the same stable leader. This is not as big a modality jump as in leaderless protocols like Tempo and Accord. <br /><br />But, still, there is a modality drop. I don't like that one misordered entry ruins the delivery of the well-ordered ones following it. After going to the slow path once, it would take some time for the system to recover and get back to fast-path replies again. This is because one misordered entry via the slow path invalidates the order of fast-path delivered entries following that one, as we lost the order in the prefix of that log. Those would also likely need to follow the slow path, if the problem is not resolved before their deadlines expire.<br /><br />There is a <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastability risk</a> here, since recovery may not happen at all as the system keeps playing catchup and gets overwhelmed with the busy work of getting back to fast-path deliveries again. I think commutative operations may help cut Nezha some slack, but I don't think there are any guarantees there. I am optimistic here because the slow path does not create extra looped-in traffic to the system and does not overload the system further than normal. So catchup is likely if there is enough idle time in between requests. Unfortunately, the paper does not have experiments on this in the evaluation section. <br /><br />I think for a faster recovery, we can take a cue from <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">the metastability paper</a>. It may be best to shed load fast upon falling to the slow path, recover, and only then accept load again; otherwise it is possible for the system to be stuck grinding on the slow path.
Once the replicas detect that the slow path is used and the system is grinding, they can inform the proxies to initiate some load shedding.<br /><br />Nezha does not prescribe anything new for leader change and reconfiguration (node addition/removal to the node set). They default to existing techniques here, which I think is a good thing. <br /><br />Finally, there are some subtle differences in the leader role from vanilla MultiPaxos. Having the same log entry at a quorum of replicas does not guarantee anchoring that entry for committing. Nezha is very leader-centric. For anchoring the entry for commit, the leader must have also appended the corresponding entry to its log at the same position. Even if all replicas share the same entry at log position K, if the entry differs from the leader's entry at position K, the leader can revert those log entries at those replicas in the slow path. This may happen because the replicas may have delivered the message using DOM, whereas the message was delivered to the leader after its deadline expired, so the leader has it in a different order.<br /><br /><br /></p><h1 style="text-align: left;">Discussion about time synchronization and applications of DOM</h1><p style="text-align: left;">In 2018, the authors proposed a very nice protocol, Huygens, for tight clock synchronization without requiring dedicated networking support. I had also liked the practicality of the approach used in Huygens. For a description of Huygens, <a href="https://muratbuffalo.blogspot.com/2018/04/nsdi-18-first-day.html">see the second session heading in my NSDI'18 post.</a> (For a broader discussion on time synchronization, <a href="https://muratbuffalo.blogspot.com/2021/03/sundial-fault-tolerant-clock.html">see this post</a>.)<br /><br />Nezha builds on that line of work. The paper mentions that, with its use of proxies, Nezha can serve as a drop-in replacement for Raft/Multi-Paxos and metadata stores like etcd and ZooKeeper.
They also list a fair-access financial exchange system for the cloud as an application. <br /><br />Nezha showcased DOM in use with single-conflict-domain consensus. I am excited about the applications of DOM for other consensus deployments and Paxos variants. There are many opportunities here. <br /><br />This is an exciting time. We are <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/">seeing microsecond-accuracy time sync available in the cloud</a>, and database products like <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-limitless-database/">Aurora Limitless</a> making use of it to improve the performance of distributed transaction processing across shards. (Of course, we do not forget <a href="https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html">Spanner's original introduction of TrueTime in distributed transaction processing</a>.) We are likely to see more adoption of tight clock synchronization for distributed systems in the coming years.<br /><br /><br /></p><h1 style="text-align: left;">Implementation and evaluation</h1><p style="text-align: left;">Nezha outperforms the 4 baselines (Multi-Paxos, Fast Paxos, NOPaxos, Raft) by 1.9–20.9x in throughput, and by 1.3–4.0x in latency. The GitHub repo for the code is available at <a href="https://github.com/Steamgjk/Nezha">https://github.com/Steamgjk/Nezha</a>.
<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg9Ar0dTPd52KrF88H7aeNib6NoZOqfL2PoPIcyuXJcH8PwMCSJjKFvZRjlz6fG7SXOALGIFSnTEUYOcKok6qH28x6gWhxKplroo22OUmKbrQLINZnXQkhRc0OQvLXzjuvKml0TFZkXmJE-bKowY1Mwx2XPNN4vfZvqOgOM9VOiPZBVVw9y9qIKgSEKinc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="650" data-original-width="2592" height="160" src="https://blogger.googleusercontent.com/img/a/AVvXsEg9Ar0dTPd52KrF88H7aeNib6NoZOqfL2PoPIcyuXJcH8PwMCSJjKFvZRjlz6fG7SXOALGIFSnTEUYOcKok6qH28x6gWhxKplroo22OUmKbrQLINZnXQkhRc0OQvLXzjuvKml0TFZkXmJE-bKowY1Mwx2XPNN4vfZvqOgOM9VOiPZBVVw9y9qIKgSEKinc=w640-h160" width="640" /></a></div><br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiYxaBYefutZZZK-iBTgbEzO0F2EPZzpbIOpRS8O21w7fZ-Y_SglP3wd8Qp-NCKDCAGSKhpyr3cSj9p51zeIYO2hWVwZBHNUxLuq-ExrJh9nteg6QRtZaFk_42-FMEJ3J5A-Z8ErWzfZ2HtHx2620HG37qbJB0gBdlQT4xZfGC3qHhFA2qAYtdV6ceR5vc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="512" data-original-width="1150" height="178" src="https://blogger.googleusercontent.com/img/a/AVvXsEiYxaBYefutZZZK-iBTgbEzO0F2EPZzpbIOpRS8O21w7fZ-Y_SglP3wd8Qp-NCKDCAGSKhpyr3cSj9p51zeIYO2hWVwZBHNUxLuq-ExrJh9nteg6QRtZaFk_42-FMEJ3J5A-Z8ErWzfZ2HtHx2620HG37qbJB0gBdlQT4xZfGC3qHhFA2qAYtdV6ceR5vc=w400-h178" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEirZI2pBPR_4ZcmyeIEDzGp8LsU9JKaAj5QKtqeKTwVajdxzykwhDSkv4hIVUGNh20R3nz_K1tO1VTh_KvlUzl0tGB4WOU0Ge-a46XhnE64AkHu4xe1jN1AjDJ4NnlnLfnOrFaIjPQ-zE2j1necpQtD8NX8T01xmv0vAQ7xOLrmDzOkxz0vkVCDwFOCsAE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="526" data-original-width="2482" height="136" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEirZI2pBPR_4ZcmyeIEDzGp8LsU9JKaAj5QKtqeKTwVajdxzykwhDSkv4hIVUGNh20R3nz_K1tO1VTh_KvlUzl0tGB4WOU0Ge-a46XhnE64AkHu4xe1jN1AjDJ4NnlnLfnOrFaIjPQ-zE2j1necpQtD8NX8T01xmv0vAQ7xOLrmDzOkxz0vkVCDwFOCsAE=w640-h136" width="640" /></a></div><br /><br /><br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-25452308040696361912023-12-05T13:54:00.006-05:002023-12-05T13:59:41.312-05:00Best of Metadata in 2023<p>It is that most wonderful time of the year again. Time to reflect back on the best posts at Metadata blog in 2023. <br /><br /></p><h1 style="text-align: left;">Distributed systems</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/10/hints-for-distributed-systems-design.html">Hints for Distributed Systems Design</a>:<i> </i>I have seen these hints successfully applied in distributed systems
design throughout my 25 years in the field, starting from the theory of
distributed systems (98-01), immersing myself in the practice of wireless sensor networks (01-11), and working on cloud computing systems in both academia and industry ever since. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">Metastable failures in the wild</a>: Metastable failure is defined as permanent overload with low throughput
even after the fault-trigger is removed. It is an emergent behavior of a
system, and it naturally arises from the optimizations for the common
case that lead to sustained work amplification. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/towards-modern-development-of-cloud.html">Towards Modern Development of Cloud Applications</a>: This is an easy-to-read paper, but it is not an easy-to-agree-with paper. The
message is controversial: Don't do microservices, write a monolith, and
our runtime will take care of deployment and distribution.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis</a>: The paper conducts a comprehensive study of large-scale microservices deployed in Alibaba clusters. They find that the microservice graphs are dynamic at runtime, most graphs are scattered to grow like a tree, and the size of call graphs follows a heavy-tail distribution.<i><br /><br /><br /></i></p><h1 style="text-align: left;">TLA+</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">Beyond the Code: TLA+ and the Art of Abstraction</a>: Abstraction is a powerful tool for avoiding distraction. The etymology of the word abstract comes from the Latin for cut and draw away. With
abstraction, you slice out the protocol from a complex system, omit
unnecessary details, and simplify a complex system into a useful model. In his 2019 talk, Leslie Lamport said: "Abstraction, abstraction, abstraction! That's how you win a Turing Award."<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/model-checking-guided-testing-for.html">Model Checking Guided Testing for Distributed Systems</a>: The paper shows how to generate test-cases from a TLA+ model of a distributed
protocol and apply it to the Java implementation to check for bugs in
the implementation. They applied the technique to Raft, XRaft, and Zab
protocols, and presented the bugs they found. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/going-beyond-incident-report-with-tla.html">Going Beyond an Incident Report with TLA+ </a>: This paper is about the use of TLA+ to explain the root cause of a Microsoft Azure incident. It looks like the incident went undetected/unreported for 26 days, because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted."<i><br /><br /></i><br /><a href="https://muratbuffalo.blogspot.com/2023/09/a-snapshot-isolated-database-modeling.html">A snapshot isolated database modeling in TLA+</a>: This shows a modeling walkthrough (and model checking) of a snapshot isolated database, where each transaction makes a copy of the store, and OCC merges their copies back to the store upon commit. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/cabbage-goat-and-wolf-puzzle-in-tla.html">Cabbage, Goat, and Wolf Puzzle in TLA+ </a>: It is important to emphasize that abstraction is an art, not a science, and it
is best learned through studying examples and practicing hands-on with
modeling. TLA+ excels in providing rapid feedback on your modeling and
designs, which facilitates this learning process significantly. Modeling
the "cabbage, goat, and wolf" puzzle taught me that tackling
real/physical-world scenarios is a great way to practice abstraction and
design -- cutting out the clutter and focusing on the core challenge.<br /><br /><br /></p><h1 style="text-align: left;">Production systems</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/08/distributed-transactions-at-scale-in.html">Distributed Transactions at Scale in Amazon DynamoDB</a>: Aligned with the predictability tenet, when adding transactions to
DynamoDB, the first and primary constraint was to preserve the
predictable high performance of single-key reads/writes at any scale. The
second big constraint was to implement transactions using update
in-place operation without multi-version concurrency control. The reason
for this was that they didn't want to muck with the storage layer, which did
not support multi-versioning. Satisfying both of the above
constraints may seem like a fool's errand, as transactions are infamous
for not
being scalable and reducing performance for normal operations without
MVCC, but the team got creative around these constraints, and managed to
find a saving grace.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/kora-cloud-native-event-streaming.html">Kora: A Cloud-Native Event Streaming Platform For Kafka</a>: Kora combines best practices to deliver cloud features such as high
availability, durability, scalability, elasticity, cost efficiency,
performance, and multi-tenancy. For example, the Kora architecture decouples
its storage and compute tiers to facilitate elasticity, performance,
and cost efficiency. As another example, Kora defines a Logical Kafka
Cluster (LKC) abstraction to serve as the user-visible unit of
provisioning, so it can help customers distance themselves from the
underlying hardware and think in terms of application requirements.<i> </i><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/07/spanner-becoming-sql-system.html">Spanner: Becoming a SQL system</a>: The original Spanner paper, published in 2012, had little
discussion/support for SQL. It was mostly a "transactional NoSQL core".
In the intervening years, though, Spanner has evolved into a relational
database system, and many of the SQL features in F1 got incorporated
directly in Spanner. Spanner got a strongly-typed schema system and a
SQL query processor, among other features. This paper describes
Spanner's evolution to a full-featured SQL system. It focuses mostly on
the distributed query execution (in the presence of resharding of the
underlying Spanner record space), query restarts upon transient
failures, range extraction (which drives query routing and index seeks),
and the improved blockwise-columnar storage format.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/polardb-scc-cloud-native-database.html">PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads</a>: PolarDB adopts the canonical primary-secondary architecture of
relational databases. The primary is a read-write (RW) node, and the
secondaries are read-only (RO) nodes. Having RO nodes helps with executing
queries and scaling out query performance. On top of this, they are interested in being able to <b>serve strong-consistency reads from RO nodes</b>.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/tidb-raft-based-htap-database.html">TiDB: A Raft-based HTAP Database</a>: TiDB is an open-source
Hybrid Transactional and Analytical Processing (HTAP) database,
developed by PingCAP. The TiDB server, written in Go, is the
query/transaction processing component; it is stateless, in the sense
that it does not store data and it is for computing only. The underlying key-value store, TiKV,
is written in Rust, and it uses RocksDB as the storage engine. They add
a columnar store called TiFlash, which gets most of the coverage in
this paper. <br /><br /><br /></p><h1 style="text-align: left;">Databases</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/04/the-end-of-myth-distributed.html">The end of a myth: Distributed transactions can scale</a>: The paper presents NAM-DB, a scalable distributed database system that
uses RDMA (mostly 1-way RDMA) and a novel timestamp oracle to support
snapshot isolation (SI) transactions. NAM stands for
network-attached-memory architecture, which leverages RDMA to enable
compute nodes to talk directly to a pool of memory nodes. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/04/aries-transaction-recovery-method.html">ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks</a>:<i> </i>This is a foundational paper in the databases area. ARIES achieves
long-running transaction recovery in a performant/nonblocking fashion.
It is more complicated than simple write-ahead-log (WAL) based
per-action recovery, as it needs to preserve the Atomicity and
Durability properties for ACID transactions. Any transactional database
worth its salt (including Postgres, Oracle, MySQL) implements recovery
techniques based on the ARIES principles. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/epoxy-acid-transactions-across-diverse.html">Epoxy: ACID Transactions Across Diverse Data Stores</a>: Epoxy leverages the Postgres transactional database as the
primary/coordinator and extends multiversion concurrency control (MVCC)
for cross-data store isolation. It provides isolation as well as
atomicity and durability through its optimistic concurrency control
(OCC) plus two-phase commit (2PC) protocol. Epoxy was implemented as a
bolt-on shim layer for five diverse data stores: Postgres, MySQL,
Elasticsearch, MongoDB, and Google Cloud Storage (GCS). (I guess the
authors had Google Cloud credits to use rather than AWS credits, and so
the experiments were run on Google Cloud.)<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/07/detock-high-performance-multi-region.html">Detock: High Performance Multi-region Transactions at Scale</a>: This is a followup to the deterministic database work that Daniel Abadi has
been doing for more than a decade. I like this type of continuous
research effort rather than people jumping from one branch to another
before exploring the approach in depth. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/polyjuice-high-performance-transactions.html">Polyjuice: High-Performance Transactions via Learned Concurrency Control</a>: This paper shows a practical application of simple machine learning to an important systems problem, concurrency control. Instead of choosing among a small number of known algorithms, Polyjuice
searches the "policy space" of fine-grained actions by using
evolutionary-based reinforcement learning and offline training to
maximize throughput. Under different configurations of TPC-C and TPC-E,
Polyjuice can achieve throughput numbers higher than the best of
existing algorithms by 15% to 56%.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/a-study-of-database-performance.html">A Study of Database Performance Sensitivity to Experiment Settings</a>: The paper investigates the following question: Many articles compare
to prior works under certain settings, but how much of their
conclusions hold under other settings? They find that the evaluations of the sampled work (and conclusions drawn from
them) are sensitive to experiment settings. They make some
recommendations as to how to proceed for evaluation of future systems
work.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/01/tpc-e-vs-tpc-c-characterizing-new-tpc-e.html">TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study</a>: This paper compares the two standard TPC benchmarks for OLTP, <a href="https://en.wikipedia.org/wiki/TPC-C">TPC-C </a>which came in 1992, and the TPC-E which dropped in 2007. TPC-E
is designed to be a more realistic and sophisticated OLTP benchmark
than TPC-C by incorporating realistic data skews and referential
integrity constraints. However, because of its complexity, TPC-E is more
difficult to implement, and it is harder for others to reproduce its
results. As a result, adoption has been slow and limited.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/01/is-scalable-oltp-in-cloud-solved.html">Is Scalable OLTP in the Cloud a Solved Problem?</a> The paper draws attention to the divide between conventional wisdom on
building scalable OLTP databases (shared-nothing architecture) and how
they are built and deployed on the cloud (shared storage architecture).
There are shared-nothing systems like CockroachDB, Yugabyte, and
Spanner, but the overwhelming trend/volume on cloud OLTP is shared
storage, often even with a single writer, as in AWS Aurora, Azure SQL Hyperscale, PolarDB, and AlloyDB. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/take-out-trache-maximizing.html">Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate</a>: This is the main message of this paper: <b>You have been doing caching wrong for your transactional workloads!<br /></b><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/designing-access-methods-rum-conjecture.html">Designing Access Methods: The RUM Conjecture</a>: Algorithms and data structures for organizing and accessing data are
called access methods. The database research/development community has been
playing catch-up, redesigning and tuning access methods to accommodate
changes to hardware and workload. As data generation and workload
diversification grow exponentially, and hardware advances introduce
increased complexity, the effort of redesigning and tuning has
been accumulating as well. The paper suggests it is important to solve
this problem once and for all by identifying tradeoffs access methods
face, and designing access methods that can adapt and autotune the
structures and techniques for the new environment.<br /><br /><br /></p><h1 style="text-align: left;">Miscellaneous</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/07/sigmod-panel-future-of-database-system.html">SIGMOD panel: Future of Database System Architectures</a>: Swami said that when Raghu invited him to be on the panel, he didn't
know he would have to disagree with his PhD advisor Gustavo. He said
that disaggregation has already arrived. He took a customer-focused view
and said that the boundary between analytics, transactional, and ML is
irrelevant for customers, and these are artificial distinctions of the
research community that need to die. He built on the
hardware-software codesign theme Anastasia mentioned. He said that
humans are not good at high cardinality problems, this is where ML
helps, and there is not enough investment in how to use ML for building
DBs. Being on-call at 2am, debugging, makes you appreciate these things.
He said, being known as the NoSQL guy, he would controversially claim
that "SQL is going to die" because LLMs are going to reinvent spec, and
allow natural language based querying.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/06/sigmodpods-day-2.html">SIGMOD/PODS Day 2</a>: Don Chamberlin (IBM fellow retired) is the creator of the SQL language. Why does the title say 49 years and not 50 years of querying? This is because the SQL paper was published 49 years ago at SIGMOD conference. The paper was titled: <a href="https://dl.acm.org/doi/10.1145/800296.811515">"SEQUEL: A structured English query language".</a>
But believe it or not, this was not the main show at that conference,
and it may even have gone by largely unnoticed. The main show was two influential
people debating. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/09/review-performance-modeling-and-design.html">Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action</a>: We are <a href="https://emptysqua.re">A. Jesse Jiryu Davis</a>, <a href="https://ahelwer.ca/">Andrew Helwer</a>, and <a href="https://muratbuffalo.blogspot.com/">Murat Demirbas</a>,
three enthusiasts of distributed systems and formal methods. We’re
looking for rigorous ways to model the performance of distributed
systems, and we had hoped that this book would point the way.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/05/keep-calm-and-crdt-on.html">Keep CALM and CRDT On</a>: This paper focuses on the read/querying problem of conflict-free replicated data
types (CRDTs). To solve this problem, it proposes extending CRDTs with a
SQL API query model, applying the CALM theorem to identify which
queries are safe to execute locally on any replica. The answer comes as no
surprise: monotonic queries can provide consistent observations without
coordination.<br /></p><p style="text-align: left;"><br /></p><h1 style="text-align: left;">Previous years in review</h1><div style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2022/12/best-of-metadata-in-2022.html">Best of metadata in 2022</a></div><div style="text-align: left;"><br /><br /></div><a href="https://muratbuffalo.blogspot.com/2021/12/best-of-metadata-in-2021.html">Best of metadata in 2021</a><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2020/12/year-in-review-best-of-metadata-in-2020.html">Best of metadata in 2020</a><br /><br /><br /><a href="http://muratbuffalo.blogspot.com/2019/12/year-in-review-best-of-metadata.html">Best of metadata in 2019</a><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2018/12/year-in-review.html">Best of metadata in 2018</a><br /><br /><br /><a href="http://muratbuffalo.blogspot.com/2020/06/research-writing-and-career-advice.html">Research, writing, and career advice</a><p style="text-align: left;"><br /><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-20304263073721629512023-12-04T21:52:00.003-05:002023-12-04T21:53:21.907-05:00Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows<p><a href="https://www.usenix.org/conference/atc23/presentation/huye">This paper appeared in USENIX ATC'23</a>. It is about a survey of microservices in Meta (nee Facebook). <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">We had previously reviewed a microservices survey paper from Alibaba.</a> Motivated maybe by the desire for differentiation, the Meta paper spends the first two sections justifying why we need yet another microservices survey paper. I didn't mind reading this paper at all, it is an easy read. 
The paper gives another design point/view from industry on microservices topologies, call graphs, and how they evolve over time. It argues that this information will help build more accurate microservices benchmarks and artificial microservice topology/workflow generators, and also help for future microservices research and development.<br /><br />I did learn some interesting information and statistics about microservices use in Meta from the paper. But I didn't find any immediately applicable insights/takeaways to improve the quality and reliability of the services we build in the cloud. <span><br /></span></p><p><span></span></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/vfsdGdVAwag" width="320" youtube-src-id="vfsdGdVAwag"></iframe></div> <br />The conference presentation video does an excellent job of explaining the paper. I highly recommend watching it. For my brief summary, continue reading.<br /><br /><p></p><h2 style="text-align: left;">Topological characteristics</h2><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEihT_q8RUF6MLjvsIHqFHD5TWKf4gme1AzZbvUf-ThRGyVUPSvoF6spEeULdPdXMfCidf7erWaRaYUr3vuQIHsMhzilUUa5ql4jtPdXAJiHGnv-eZVBWeYTgjFW68CUAQRSoXdPbUEnjC7rTrJRZUUg7m1i3OqiaEv9Bz5AqDto2VtbCyqsJmbStseZf-I" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="708" data-original-width="972" height="233" src="https://blogger.googleusercontent.com/img/a/AVvXsEihT_q8RUF6MLjvsIHqFHD5TWKf4gme1AzZbvUf-ThRGyVUPSvoF6spEeULdPdXMfCidf7erWaRaYUr3vuQIHsMhzilUUa5ql4jtPdXAJiHGnv-eZVBWeYTgjFW68CUAQRSoXdPbUEnjC7rTrJRZUUg7m1i3OqiaEv9Bz5AqDto2VtbCyqsJmbStseZf-I" width="320" /></a></div><br /><br />Figure 1 illustrates Meta’s microservice architecture. 
It is similar to other large-scale microservice architectures, in that there are load balancers hitting frontend services, which in turn call other services. They say that, at Meta, <b>business use case</b> is a sufficient partitioning for defining [micro]services. Endpoints in the figure and the rest of the paper just mean API interfaces.<br /><br />The main findings related to topology are summarized below.<br /><br /><b><u>Finding F1:</u></b> Meta's microservice topology contains three types of software entities that communicate within and amongst one another: (1) those that represent a single, well-scoped business use case; (2) those that serve many different business cases, but which are deployed as a single service (often from a single binary); and (3) those that are ill-suited to the microservice architecture's expectation that business use case is a sufficient partitioning on which to base scheduling, scaling, and routing decisions and to provide observability. <br /><br />What are those ill-fitting services, you say? These have Service IDs of the form inference_platform/ model_type_{random_number}. "Meta’s engineers informed us that these Service IDs are generated by a general-purpose platform for hosting per-tenant machine-learning models (called the Inference Platform). The platform serves a single business use case--i.e., serving ML models-- but many per-tenant use cases. Platform engineers chose to deploy each tenant's model under a separate Service ID so that each can be deployed and scaled independently per the tenant's requirements by the scheduler."<br /><br />ML bros, screwing things up for systems guys since 2018 :-) On the other end of the spectrum, databases and other platforms appear as a single service and provide their own scheduling and observability mechanisms. And the paper points out that both of these extreme types of usage mask the true complexity of the services and skew service- and endpoint-based analyses of microservice topologies.
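As a toy illustration (this is not the paper's actual tooling, and the exact ID pattern is my guess at the inference_platform/model_type_{random_number} form quoted above), separating such platform-generated Service IDs from well-scoped ones might look like:

```python
import re

# Hypothetical pattern for the auto-generated Service IDs the paper calls
# out; the separator and trailing digits are assumptions for illustration.
ILL_FITTING = re.compile(r"^inference_platform/model_type_\d+$")

def partition_services(service_ids):
    """Split Service IDs into well-scoped services and ill-fitting
    platform-generated ones, so topology stats can be computed separately."""
    well_scoped, ill_fitting = [], []
    for sid in service_ids:
        (ill_fitting if ILL_FITTING.match(sid) else well_scoped).append(sid)
    return well_scoped, ill_fitting

services = ["ads_ranking", "inference_platform/model_type_42", "newsfeed"]
ws, ill = partition_services(services)
```

Separating the two populations this way is essentially what the evaluation section does when it reports statistics with and without the ill-fitting services.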
This is an important point, and the paper does a good job of separating these ill-fitting services from others throughout the evaluation section. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgm6I8ZCg5NUywbA-7YcnX4LMfsJtpP84A_AjhWqDah8hPos70Tn5SHu8twd_Tui8esDJq8ot3SZNAizQHDzMxnod6yyjdR_YPnxaKM5lJZuqsCwDitTZr1aaWtLEc2spy_LO3N2LJwTaKvCSz5qPxenNXezY88hK1zQduWvl-IJMVjhKFrKjAKdTaH7r8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="582" data-original-width="972" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgm6I8ZCg5NUywbA-7YcnX4LMfsJtpP84A_AjhWqDah8hPos70Tn5SHu8twd_Tui8esDJq8ot3SZNAizQHDzMxnod6yyjdR_YPnxaKM5lJZuqsCwDitTZr1aaWtLEc2spy_LO3N2LJwTaKvCSz5qPxenNXezY88hK1zQduWvl-IJMVjhKFrKjAKdTaH7r8=w400-h240" width="400" /></a></div><br /><b><u>Finding F2:</u></b> The topology is very complex, containing over 12 million service instances and over 180,000 communication edges between services. Individual services are mostly simple, exposing just a few endpoints, but some are very complex, exposing 1000s of endpoints. The overall topology of connected services does not exhibit a power-law relationship typical of many large-scale networks.
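Whether a distribution like endpoints-per-service follows a power law is typically eyeballed from its complementary CDF, which looks roughly linear on log-log axes for a power law. A minimal sketch with made-up counts (not data from the paper):

```python
from collections import Counter

# Hypothetical endpoints-per-service counts with a heavy tail.
endpoint_counts = [1, 1, 1, 2, 2, 3, 5, 8, 40, 1200]

def ccdf(counts):
    """Complementary CDF: for each distinct count x, the fraction of
    services exposing at least x endpoints."""
    n = len(counts)
    dist = Counter(counts)
    vals = {}
    at_least = n
    for x in sorted(dist):
        vals[x] = at_least / n
        at_least -= dist[x]
    return vals

c = ccdf(endpoint_counts)  # plot log(x) vs log(c[x]) to check for a power law
```

A straight-line fit of log(x) against log(c[x]) would then give the power-law exponent, which is the kind of check behind Finding F2's claims.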
However, the number of endpoints services expose does show a power-law relationship.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh9wi1QtAcWsaCTIxeRY4pbDVE2FJkuNQ2pH8L29ozV4DrMQISlDewPUz4CGAn0tVCIcUzwZHxHKCKWlQOCZPm0PmnN3sZpcqzNjbkC54wx8e8gTh3UTZQArouYeSGw-ntOa-71_QcWL5Qv479JFoGK3SLjTQhbetINTCjOIDntsyLhzm7ddNrGJviBR_s" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="790" data-original-width="972" height="325" src="https://blogger.googleusercontent.com/img/a/AVvXsEh9wi1QtAcWsaCTIxeRY4pbDVE2FJkuNQ2pH8L29ozV4DrMQISlDewPUz4CGAn0tVCIcUzwZHxHKCKWlQOCZPm0PmnN3sZpcqzNjbkC54wx8e8gTh3UTZQArouYeSGw-ntOa-71_QcWL5Qv479JFoGK3SLjTQhbetINTCjOIDntsyLhzm7ddNrGJviBR_s=w400-h325" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiyymQ3mFXP45e9ujeQYj-v1RwWW--fbbSaoHszOkf4rQqAW3iYoydxX1usViWHJ--ARpRZNHAe3gWTmPU2Mgy5GTqVsaeQTrUVBRS6Wc5nm92rtFe1aIJQOW4NBbJDP5-LW_aYE6noLCiShSg4GOr-96w0CfeniC5YAW2nTUaglBgRbhoL7NEX3a31qR0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="490" data-original-width="972" height="201" src="https://blogger.googleusercontent.com/img/a/AVvXsEiyymQ3mFXP45e9ujeQYj-v1RwWW--fbbSaoHszOkf4rQqAW3iYoydxX1usViWHJ--ARpRZNHAe3gWTmPU2Mgy5GTqVsaeQTrUVBRS6Wc5nm92rtFe1aIJQOW4NBbJDP5-LW_aYE6noLCiShSg4GOr-96w0CfeniC5YAW2nTUaglBgRbhoL7NEX3a31qR0=w400-h201" width="400" /></a></div><br /><u><b>Finding F3:</b></u> The topology has scaled rapidly, doubling in number of instances over the past 22 months. The rate of increase is driven by an increase in number of services (i.e., new functionality) rather than increased replication of existing ones (i.e., additional instances). The topology sees daily fluctuations due to service creations and deprecations. 
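The claim that growth is driven by new services rather than increased replication can be made concrete with a toy decomposition of instance-count growth (the numbers and data shape are hypothetical, for illustration only):

```python
def decompose_growth(services_then, services_now):
    """Each argument maps service -> replica count at a point in time.
    Returns (total growth, growth from new services, growth from scaling
    existing services). Deprecated services show up only in the total."""
    inst_then = sum(services_then.values())
    inst_now = sum(services_now.values())
    from_new = sum(r for s, r in services_now.items() if s not in services_then)
    from_scaling = sum(services_now[s] - services_then[s]
                       for s in services_then if s in services_now)
    return inst_now - inst_then, from_new, from_scaling

then = {"a": 10, "b": 5}           # two services, 15 instances
now = {"a": 11, "b": 5, "c": 9}    # one new service, 25 instances
total, from_new, from_scaling = decompose_growth(then, now)
```

In Finding F3's terms, most of the growth attributes to the `from_new` bucket (new business use cases), not the `from_scaling` one.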
The rate of increase of instances is due to new business use cases (i.e., new microservices) rather than increased scale: check the blue line in Figure 7.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi4TEZIoP6HL9mILADmE3lXihidufMOvRQXj-mnv0rfU_LcvQ1HbD-KpoGW1VsnwSdCHBesNuO0N32RDZpP2mqbl66UtfR01W0dOg5U3n_wvMtBNZ_ixMybEyLgqgSrZx2rdD3HnnPo_LknSSe-ZPup3rBDsszGgbOpBTHPSnpiFZF_cugYjoZkqYwtnTI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="946" data-original-width="972" height="389" src="https://blogger.googleusercontent.com/img/a/AVvXsEi4TEZIoP6HL9mILADmE3lXihidufMOvRQXj-mnv0rfU_LcvQ1HbD-KpoGW1VsnwSdCHBesNuO0N32RDZpP2mqbl66UtfR01W0dOg5U3n_wvMtBNZ_ixMybEyLgqgSrZx2rdD3HnnPo_LknSSe-ZPup3rBDsszGgbOpBTHPSnpiFZF_cugYjoZkqYwtnTI=w400-h389" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhD7Im8dEv8Y53afPBDhmma5lw2TxTTBt-l37cyr47lLMYvVwxtnV3n0_PQqmzTP4H9ut1CW6jzeWNwn4mnjv9YJqMYeQFSQb4T3OtvHJTIru9WGgJ8c7qOrv2Ky0RzvnPmFWoV6dUJZcO4HE72ip6pIJiAN6v89mq1B-wAK4NnbVSCRtzWlXa2dpdxAMo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="486" data-original-width="972" height="200" src="https://blogger.googleusercontent.com/img/a/AVvXsEhD7Im8dEv8Y53afPBDhmma5lw2TxTTBt-l37cyr47lLMYvVwxtnV3n0_PQqmzTP4H9ut1CW6jzeWNwn4mnjv9YJqMYeQFSQb4T3OtvHJTIru9WGgJ8c7qOrv2Ky0RzvnPmFWoV6dUJZcO4HE72ip6pIJiAN6v89mq1B-wAK4NnbVSCRtzWlXa2dpdxAMo=w400-h200" width="400" /></a></div><p> </p><h2 style="text-align: left;">Request-workflow characteristics</h2><p style="text-align: left;">This section analyzes service-level properties of individual request workflows using traces collected by three different profiles. 
<br /></p><ul style="text-align: left;"><li><b>Ads:</b> This profile represents a traditional CRUD web application focusing on managing customers’ advertisements, such as getting all advertisements belonging to a customer or updating ad campaign parameters.</li><li><b>Fetch:</b> This profile represents deferred (asynchronous) work triggered by opening the notifications tab in Meta’s client applications. Examples of work include updating the total tab badge count or retrieving the set of notifications shown on the first page of the tab.</li><li><b>RaaS (Ranking-as-a-Service):</b> This profile represents ranking of items, such as posts in a user’s feed.</li></ul><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh6sC4KDI1OaFu_d9VgNcv3wFvTIaV4IqNGTLr5j9zK65bk9g_E2h1nhoYjtCrcMO06XtgeHAbELJLDchqN9s6eEO3NjoW7yma6yHkDGHpf6quFUyRgrRuQ5FIYDrPcZnHm67y1wQvsso7laAPyZCJujrA0rDossGf_Y_3wHyuH-Hbh6MVL6KCsWB0Hhdk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="220" data-original-width="220" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEh6sC4KDI1OaFu_d9VgNcv3wFvTIaV4IqNGTLr5j9zK65bk9g_E2h1nhoYjtCrcMO06XtgeHAbELJLDchqN9s6eEO3NjoW7yma6yHkDGHpf6quFUyRgrRuQ5FIYDrPcZnHm67y1wQvsso7laAPyZCJujrA0rDossGf_Y_3wHyuH-Hbh6MVL6KCsWB0Hhdk" width="240" /></a></div><a href=" https://meangirls.fandom.com/wiki/Fetch">Fetch</a> is actually interesting. The Meta datacenters backend is pretty tightly connected/coordinated with the mobile devices. Of course mobile requests are served by datacenter backend, but I had read in another place that when shedding load, Meta first informs/changes mobile client settings to degrade those services so they won't bog down the datacenters with requests. 
<br /><br />Ok, back to the findings.<br /> <br /><b><u>Finding F4:</u></b> Trace sizes vary depending on workflows' high-level behaviors, but most are small (containing only a few service blocks). Traces are generally wide (services call many other services), and shallow in depth (length of caller/callee branches).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiDOY8Ewtabmww9M2rDzJtfx8YjEv-fCE4Ml_qjBsL53rFJD5CtLGzWc_RKBkdugQHHyDe66A32AqoVBWH1DmwpnJ-9wdxMxtz-9lE6Qr9LvZZbBSNtpCuYnXWWmZwrQJaVsCAg1X50o0VYqW3BFHOAGP1oZPu52Ve8g3mwsqy4w7qYwvJR32HBg30KX3A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="466" data-original-width="972" height="191" src="https://blogger.googleusercontent.com/img/a/AVvXsEiDOY8Ewtabmww9M2rDzJtfx8YjEv-fCE4Ml_qjBsL53rFJD5CtLGzWc_RKBkdugQHHyDe66A32AqoVBWH1DmwpnJ-9wdxMxtz-9lE6Qr9LvZZbBSNtpCuYnXWWmZwrQJaVsCAg1X50o0VYqW3BFHOAGP1oZPu52Ve8g3mwsqy4w7qYwvJR32HBg30KX3A=w400-h191" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgCUZyJSDJeyBmBeh5IWYG3zqnpbZUuERpAg2lVbA0lcUTCphfvEgRcDGb4gYXKTpr8wWZceG-pbiWlU1lxC5CQ3N32Ha_46-VDiOQdDjQqEPaHR8jOkJs6gYJwSX70EI25TURR1FxoUAQZ35Ik3lBbzmt47gNvGDbvn9APAjCsHV5oc_dpauWhiPGYXP8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="960" data-original-width="972" height="395" src="https://blogger.googleusercontent.com/img/a/AVvXsEgCUZyJSDJeyBmBeh5IWYG3zqnpbZUuERpAg2lVbA0lcUTCphfvEgRcDGb4gYXKTpr8wWZceG-pbiWlU1lxC5CQ3N32Ha_46-VDiOQdDjQqEPaHR8jOkJs6gYJwSX70EI25TURR1FxoUAQZ35Ik3lBbzmt47gNvGDbvn9APAjCsHV5oc_dpauWhiPGYXP8=w400-h395" width="400" /></a></div><br /><b><u>Finding F5:</u></b> Root Ingress IDs do not predict trace properties. 
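The "wide and shallow" shape in Finding F4 can be computed from span parent pointers; this is a minimal sketch under an assumed trace representation (child span -> parent span), not the paper's actual trace format:

```python
# Depth is the longest caller/callee chain; width is the largest fan-out.
def trace_shape(parent_of):
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    def depth(span):
        kids = children.get(span, [])
        return 1 + (max(map(depth, kids)) if kids else 0)

    # Roots are spans that appear as parents but never as children.
    roots = [s for s in set(parent_of.values()) if s not in parent_of]
    max_depth = max(depth(r) for r in roots)
    max_width = max(len(kids) for kids in children.values())
    return max_depth, max_width

# A wide, shallow trace: root "a" fans out to three services.
trace = {"b": "a", "c": "a", "d": "a", "e": "b"}
d, w = trace_shape(trace)
```

Here the trace has depth 3 (a calls b calls e) and width 3 (a's fan-out), matching the wide-and-shallow shape the finding describes.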
At the level of parent/child relationships, parents’ Ingress IDs are predictive of the set of children Ingress IDs the parent will call in at least 50% of executions. But, it is not very predictive of parents’ total number of RPC calls or concurrency among RPC calls. Adding children sets’ Ingress IDs to parent Ingress IDs more accurately predicts concurrency of RPC calls.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgrtfKuT78yVn4CZnaotdAllHG5qt0asXfGmVBDyDAP060pqyhoH4P8ReqOQNMfKnJJHJQN3lVL0CEu1M5btJqe5LdykOxfg1iNvwriy6ZeD4siplwMT3FZvewv-0SsBbBsWN2kqbazc5r-uDVJIOyuVbtdCAH6tDkdS1Y4z7490VKBch8hBX5EfHLgYdQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="368" data-original-width="972" height="151" src="https://blogger.googleusercontent.com/img/a/AVvXsEgrtfKuT78yVn4CZnaotdAllHG5qt0asXfGmVBDyDAP060pqyhoH4P8ReqOQNMfKnJJHJQN3lVL0CEu1M5btJqe5LdykOxfg1iNvwriy6ZeD4siplwMT3FZvewv-0SsBbBsWN2kqbazc5r-uDVJIOyuVbtdCAH6tDkdS1Y4z7490VKBch8hBX5EfHLgYdQ=w400-h151" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg0YVu6Q503Fc_1dTSRyiN1JLWEHBz40xifc1SiJ-qe33u8kn_imNMADTkN8QrcfYS3XgG9kCkqsxWVi4uHQY7GgDSTs2VBrrsXrOZpnR3tVurFEeZpuzNiyvZLMQ4-14d28tDXp1xGNIkWBJi9dYvMUoudUcw8C3pWHSobCKMUiTkEBM2WK8mmTkn4EAU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1208" data-original-width="972" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEg0YVu6Q503Fc_1dTSRyiN1JLWEHBz40xifc1SiJ-qe33u8kn_imNMADTkN8QrcfYS3XgG9kCkqsxWVi4uHQY7GgDSTs2VBrrsXrOZpnR3tVurFEeZpuzNiyvZLMQ4-14d28tDXp1xGNIkWBJi9dYvMUoudUcw8C3pWHSobCKMUiTkEBM2WK8mmTkn4EAU=w322-h400" width="322" /></a></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEg2w4ocW_axgOIIeC0P2EwAAuDljePNry_dWrotDdd95J5_4XhmHW8x461Ug_WS0kjgaYPPNwFWrR3Xe6tQ7a_4jKESUhQj-6TcBJn2PI2980MjlNaUXcJtRR0Kp6DIDZf3CAtCXiD2oPZRDmpBpLiGvBkdvsfCG1KmyES71ShuMbHTL9dBCkmf_kMkBpE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1012" data-original-width="972" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEg2w4ocW_axgOIIeC0P2EwAAuDljePNry_dWrotDdd95J5_4XhmHW8x461Ug_WS0kjgaYPPNwFWrR3Xe6tQ7a_4jKESUhQj-6TcBJn2PI2980MjlNaUXcJtRR0Kp6DIDZf3CAtCXiD2oPZRDmpBpLiGvBkdvsfCG1KmyES71ShuMbHTL9dBCkmf_kMkBpE=w385-h400" width="385" /></a></div><br /><b><u>Finding F6:</u></b> Many call paths in the traces are prematurely terminated due to rate limiting, dropped records, or non-instrumented services like databases. Few of these call paths can be reconstructed (those known to terminate at databases) while the majority are unrecoverable. Deeper call paths are disproportionately terminated.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiSZ6lipTHSiUmKlpXu4XePBik_sweXWEF1Wc0I3CT460ocnqeaT2GUSem8-0mMf-IhUg2MgUklqLp9252QbQLTtKbQkcmaW1LV8CkNB4rYcE66bItM-oSPshEWuJ20XGJhlNjhggSEL0JjvsLbpE3p-bLTtl6Ihrxh-bqTKgfA3LXUNKYPd6lJ5qBupk0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="560" data-original-width="972" height="230" src="https://blogger.googleusercontent.com/img/a/AVvXsEiSZ6lipTHSiUmKlpXu4XePBik_sweXWEF1Wc0I3CT460ocnqeaT2GUSem8-0mMf-IhUg2MgUklqLp9252QbQLTtKbQkcmaW1LV8CkNB4rYcE66bItM-oSPshEWuJ20XGJhlNjhggSEL0JjvsLbpE3p-bLTtl6Ihrxh-bqTKgfA3LXUNKYPd6lJ5qBupk0=w400-h230" width="400" /></a></div><br />What intrigued me was the solitary mention of memcache in this paper, which is peculiar given its prominence at Meta. 
In contrast, <a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">the Alibaba trace paper</a> highlighted memcache's significance, revealing that in call graphs with over 40 microservices, approximately 50% of the microservices were Memcacheds (MCs).<br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-18847456484418830652023-11-28T22:12:00.000-05:002023-11-28T22:12:09.516-05:00Our Florida vacation<p>No paper review this week. Instead, I stumbled upon notes buried in my blog.org entries. With Buffalo now cold and snowy, reminiscing about last June's hot Florida vacation seemed fitting.<br /><br /></p><p></p><p>True to our tradition, we drove there from Buffalo, relishing the two-day road trip. Road trips are our love -- in 2018-19 we had crossed the US East-to-West and then West-to-East. <a href="https://muratbuffalo.blogspot.com/2020/02/traveling-across-us.html">I had documented the East-to-West drive here</a>. This time, it was North-to-South, all the way to the southernmost point of the US—the <a href="https://en.wikipedia.org/wiki/Key_West">Key West Islands</a>. <br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiNopPiRzOW1qcuc1GtlG-UarsDZ9OEyn1V0EO4hCfc-IgPSN8_up7Lem-wbUi3rGAOYURfEibcBOmtk6pM7_XtlY7rUtdzfkswCe214puhGlZMwhJn8nYwDZ07Du-NOgQMe8jRt6u_kgGGZChkXQ-Y-ge-44NXVtlDPQH1E4hVJSS_dRnzRFz6YoRbmgQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEiNopPiRzOW1qcuc1GtlG-UarsDZ9OEyn1V0EO4hCfc-IgPSN8_up7Lem-wbUi3rGAOYURfEibcBOmtk6pM7_XtlY7rUtdzfkswCe214puhGlZMwhJn8nYwDZ07Du-NOgQMe8jRt6u_kgGGZChkXQ-Y-ge-44NXVtlDPQH1E4hVJSS_dRnzRFz6YoRbmgQ=w300-h400" width="300" /></a></div><br />Yes, the driving was a bit tiring. 
But with our new SUV, good audiobooks, and sightseeing on the way, it was enjoyable. We Hotwire'd the hotels for the drive in the afternoon of each driving day. That is how the Demirbas family rolls.<br /><br />Our AirBnB in Orlando was at a resort. It was a 5-bedroom rental. We were able to get it cheap, at $200 a day after taxes and everything. It was very comfortable, and we enjoyed the lazy life at the resort. <br /><p></p><p><br />Oh God, everything is big in Orlando! We hit the Walmart close to our rental for some amenities, and it was the biggest Walmart I have ever been to. And the most crowded. It looked like people were raiding the place: the shelves kept emptying, while the staff was perpetually busy restocking.<br /><br /><br />We visited Orlando for DisneyWorld --a quintessential dad-duty in the U.S. and a rite of passage for American kids. DisneyWorld is big (duh!) and complicated. You have to do your homework and know your stuff. It is like applying to college: you have to do a lot of reading. First, decide which park to go to. Then learn the map and the rides. Explore virtual queues; navigate through Genie, Genie+, and individual ride fastlanes. Assess if any are worth the purchase. 
Fortunately, things weren't as daunting after we entered the park.<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi1iycgy5DAzikW3ZjkdCaFr_k76XAWMXDOnuBuyxqfahiYyMXJOXQrXLIzzr7vYyOmcy6tE__b1mtgzT6HZGWDr9UqBCTbaH1BijEUnqOR-mYiBOup3Lby-nDq8VXq-QNNkvu37cR_-FLtWf4ywm4EFRq5vH6C1Fo1Fqxoz5yFJj7EJEJv6Gc8Z8U9iY8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEi1iycgy5DAzikW3ZjkdCaFr_k76XAWMXDOnuBuyxqfahiYyMXJOXQrXLIzzr7vYyOmcy6tE__b1mtgzT6HZGWDr9UqBCTbaH1BijEUnqOR-mYiBOup3Lby-nDq8VXq-QNNkvu37cR_-FLtWf4ywm4EFRq5vH6C1Fo1Fqxoz5yFJj7EJEJv6Gc8Z8U9iY8=w400-h300" width="400" /></a></div><br /><br />We went to <a href="https://disneyworld.disney.go.com/destinations/hollywood-studios/">the Hollywood Studios.</a> We had a swell time. <br /><br />But I had my gripes. Orlando is like a sauna. Sweating became routine during ride waits, even in the shade. When we were in the queue for the stupid Slinky Dog rollercoaster, it got closed due to a thunderstorm, and we waited for it for 2+ hours. Indoor waiting is another story; it is a good covid incubation ground. <br /><br />The Star Wars props were impressive and well-executed. It was an immersive atmosphere. The rides used screens, but they felt real enough. The in-place rides' jolting motion didn't sit well with me. 
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiNEKDsCWt1b2By8S6_bydtjcCkuKo2XgC2OnYhumcl5bC7HL5SD7gEZxzLYTyRSSPNWQ3n1Qu3hVQWTKEX57QqUXCXH0IRM2pRtLAsRGeDZ_co2tbtHWp-DiAYy64X_W0o7KFtBi_6QTceTdx8aKcRN-ogzNddbfeLoDNrsyRa1TfLUEubGII629sCvCM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEiNEKDsCWt1b2By8S6_bydtjcCkuKo2XgC2OnYhumcl5bC7HL5SD7gEZxzLYTyRSSPNWQ3n1Qu3hVQWTKEX57QqUXCXH0IRM2pRtLAsRGeDZ_co2tbtHWp-DiAYy64X_W0o7KFtBi_6QTceTdx8aKcRN-ogzNddbfeLoDNrsyRa1TfLUEubGII629sCvCM=w300-h400" width="300" /></a></div><br />The rides hit me hard. I think I have a sensitive inner ear; even elevators have triggered lightheadedness for me ever since I was a boy. So the rides really took a toll on me. I felt like throwing up after all of them. The Slinky Dog ride's acceleration and sharp turns did me in, worsening my dizziness.<br /><br />Even on the "Runaway Railway" ride, described by the doorperson as a "slow train ride," I struggled. Well, it was far from an easy ride! At the beginning of the ride, the train breaks into wagons, navigating various rooms with jolting motion. In one room, the wagons even waltz under Daisy Duck's instruction. Yeah, right, a slow train ride--the doorperson surely messed with me.<br /><br />Motion sickness folks, steer clear of the "Tower of Terror." The elevator accelerates to the 5th floor, drops freely to the 2nd floor, and repeats this a couple of times. What was I thinking? 
While waiting in the queue, I had been hearing the screams of people on the ride and seeing the windows open at the 5th floor. Why didn't I skip this one? I guess I was embarrassed, because my 7-year-old daughter was very eager to do this. When the ride was over, she screamed "This is awesome, let's do it again!" And all I could think was "let's never do this again."<br /><br />There was a crazy roller coaster ride, called Rock and Roll or something, where the doorperson said the ride goes upside down 3 times. Based on my experience with the other doorperson, I skipped this one. My son and older daughter braved it, while the rest of the family went to the Muppets 3D show -- a delightful choice. Earlier in the day, we had gone to the Frozen show, and after that the Indiana Jones show. We had timed those shows really well, with no wait time in between them. Those shows were my favorite, as they were the only ones where I didn't get motion sickness.<br /><br />After a week in Orlando, we headed back up north towards Buffalo, but after 30 minutes of driving, on a whim, we did a U-turn and continued southwards to Miami, and visited Miami Beach. The next day we drove all the way to Key West and visited the islands. I guess we were feeling spontaneous and adventurous. We really enjoyed the travel, and wanted to do more of it immediately after we returned. <br /><br />Another thing worth mentioning: we ate at the <a href="https://www.alnatourrestaurant.com/">AlNatour restaurant</a> in Fort Lauderdale, and it was very good. <br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-64598895715081007922023-11-22T14:16:00.003-05:002023-11-22T15:16:25.046-05:00Towards Modern Development of Cloud Applications<p><a href="https://dl.acm.org/doi/10.1145/3593856.3595909">This paper is from HotOS'23.</a> At 6 pages, it is an easy-to-read paper, but it is not an easy-to-agree-with paper. 
The message is controversial: Don't do microservices, write a monolith, and our runtime will take care of deployment and distribution. This is a big claim, and we have been burned by ambitious attempts like this many times before. I realize big claims are part of the style of HotOS, where work-in-progress and sometimes provocative papers make a debut to kickstart a discussion. This paper sure does a good job of starting a discussion.<br /><br /></p><h2 style="text-align: left;">Good</h2><p style="text-align: left;"><a href="https://github.com/ServiceWeaver">There is code, and it is open source</a>, so this is not just a speculation paper. A Go framework does exist, which has been under development for some time inside Google. Given Google's expertise in infrastructure and Go, I think this framework will be a big boon to the Google Cloud Platform (GCP), if it gets into production.<br /><br />To evaluate the framework (let's call it ServiceWeaver, after its GitHub name, shall we?), they consider a popular web application: <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Online Boutique.</a> They say that Online Boutique is "representative of the kinds of microservice applications developers write". It consists of about 10K lines of code, implemented as ...(wait for it)... 11 microservices! 
<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh0F4icBfMvJgAQYJLrj7LswMJF6q3so_V7uSrTtq12bc-VOiYOfoYfCQBIj9q_FFLLzpp_2IiAXhTesNoiiCra2tVwGtqnLfUV6xo26UA3Hv015Ut264x05FSOzAtrbPQ61DqMfXbiNZOXY6IsjLYA_EkpYTiN84eSEk2HC2Hg4ETxBa-6-VDTtFzgUMQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="186" data-original-width="574" height="104" src="https://blogger.googleusercontent.com/img/a/AVvXsEh0F4icBfMvJgAQYJLrj7LswMJF6q3so_V7uSrTtq12bc-VOiYOfoYfCQBIj9q_FFLLzpp_2IiAXhTesNoiiCra2tVwGtqnLfUV6xo26UA3Hv015Ut264x05FSOzAtrbPQ61DqMfXbiNZOXY6IsjLYA_EkpYTiN84eSEk2HC2Hg4ETxBa-6-VDTtFzgUMQ" width="320" /></a></div>For evaluation, Table 2 shows the number of CPU cores used and the end-to-end latencies. For 10K queries per second, the 11-microservice Go implementation uses 78 cores, but the monolithic implementation (deployed with their ServiceWeaver runtime) uses 28 cores. These numbers are without colocation of any components. The savings become more impressive if you put all 11 components into a single OS process: the number of cores drops to 9, and the median latency drops to 0.38 ms, both an order of magnitude lower than the baseline. <br /><br />Ok, let's step back for a minute. Did a 10K LOC application need to be implemented as 11 microservices? Did it have to be distributed in the first place? If you start with a distributed-to-a-fault baseline, it is easy to show impressive improvements.<br /><br />Let's remember Frank McSherry's holy war against unnecessarily distributed analytics services. <a href="https://muratbuffalo.blogspot.com/2017/06/scalability-but-at-what-cost.html">I had reviewed the "Scalability, but at what COST?" paper here.</a> Frank had shown that "some single threaded implementations [on Frank's laptop] were found to be more than an order of magnitude faster than published results (at SOSP/OSDI!) for systems using 100s of cores"! 
If you start with "poor baselines and low expectations", it is easy to show impressive improvements. <br /><br />Let's get back to the evaluation section of the paper. They say that most of the performance benefits of the monolithic implementation come from getting rid of versioning and field numbers. Wow! How do you do atomic monolithic deployments? The answer: blue-green deployments! But such one-shot deployments would be particularly hard to coordinate across AZs, let alone across regions. And finally, how do you deal with versions in the database, and with schema changes, when doing these deployments?<br /><br /><br />To conclude the "good" parts, I want to mention the challenges discussion in the introduction. There are remedies to these (such as knowing what you are doing), but there is no denying that these are real challenges. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi6_8k_OAGxHT1CYmDm-xknrjce4uKl2WuM-CgG3OeAPu_JoNW15FlkqIQBUZqXY2iPAZPFHW_79AfDhEJv5Nzt7yThOMFhH3z0uXLryDWnFZQO-gBhr6rGcxoIwi5y7xHpR7WCU5Li7cA9XNc_Akr0FQC_E6vdy5FVbKRVZivwPC-6Wt0eyzgB3BSHFTA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="679" data-original-width="574" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEi6_8k_OAGxHT1CYmDm-xknrjce4uKl2WuM-CgG3OeAPu_JoNW15FlkqIQBUZqXY2iPAZPFHW_79AfDhEJv5Nzt7yThOMFhH3z0uXLryDWnFZQO-gBhr6rGcxoIwi5y7xHpR7WCU5Li7cA9XNc_Akr0FQC_E6vdy5FVbKRVZivwPC-6Wt0eyzgB3BSHFTA=w338-h400" width="338" /></a></div><br />I think the biggest challenge with microservices is the complexity of integration. When you start building with microservices, integration becomes challenging: the longer you delay integration, the bigger the pain. <br /><p></p><p style="text-align: left;"><br /></p><h2 style="text-align: left;">Bad</h2><p style="text-align: left;">The claims are not scoped well. 
I think this framework is good for many web/frontend applications. But the paper makes a general claim. For crying out loud, the paper starts with this sentence: "<b>When writing a distributed application,</b> conventional wisdom says to split your application into separate services that can be rolled out independently." <br /><br />After that sentence, I read the entire paper with distributed applications/systems in mind. The paper doubles down on this claim in the last sentence of the introduction: "Though these challenges and our proposal are discussed in the context of serving applications, we believe that our observations and solutions are broadly useful."<br /><br />But as I kept reading, I realized that this would not apply to general distributed services, and specifically not to backend systems. This is more applicable to a limited domain, like platform-as-a-service (PaaS) applications: say, a web application with limited freedoms. As I mentioned above, this would be a great boon for web services built on GCP, for example. And that looks like the end game here.<br /><br />At the end of the paper, in Section 8.3, there is a very short paragraph talking about distributed systems. After having said so many things about how the ServiceWeaver framework/runtime distributes things and takes care of distributed systems concerns, this paragraph comes across as confusing. 
Too little, too late?<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjGOYCazR2A0EsQ9dYRtbemvhuTYi3aZU2-dCPg2rRH5sY-Zbw_ce3UF8h5zDY57Pbe_2LLMPpou4LsN090qIjW3p5YdLiPmM1DFTqGUrfirlgN6K2S73fRwst9W2OZUIFIsmUVlOPEhcVGgSvay2hv3rJcKqS4GG4ZTAl4qvKKarwofvVaBlidsixzqPM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="223" data-original-width="574" height="124" src="https://blogger.googleusercontent.com/img/a/AVvXsEjGOYCazR2A0EsQ9dYRtbemvhuTYi3aZU2-dCPg2rRH5sY-Zbw_ce3UF8h5zDY57Pbe_2LLMPpou4LsN090qIjW3p5YdLiPmM1DFTqGUrfirlgN6K2S73fRwst9W2OZUIFIsmUVlOPEhcVGgSvay2hv3rJcKqS4GG4ZTAl4qvKKarwofvVaBlidsixzqPM" width="320" /></a></div><p></p><h2 style="text-align: left;"> Ugly</h2><p style="text-align: left;"></p><p style="text-align: left;">By engaging in the microservices versus monolith architecture debate, the paper pokes the beehive without answering the real questions. What do I mean by real questions? Can this alternative approach address the problems that microservices turned into non-problems, problems we have since forgotten were ever problems? <br /></p><blockquote>"Tradition is a set of solutions for which we have forgotten the problems. Throw away the solution and you get the problem back. Sometimes the problem has mutated or disappeared. Often it is still there as strong as it ever was." -- Donald Kingsbury</blockquote>This harkens back to the <a href="https://fs.blog/chestertons-fence/">famous Chesterton's fence principle</a>, which cautions against dismissing established systems without comprehending their original purpose, and reminds us that second-order effects should be considered. <br /><blockquote>There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. 
The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”</blockquote>One big benefit of microservices is that they serve as a technical patch to a social problem: organizing development across two-pizza teams. You assign each team of 5-10 a microservice, and this reduces the communication overhead. Microservices are not without their challenges. Integration is a challenge, but at least it gets you going. <br /><br />When developing the application as a monolith, where do you start coding, and how do you grow the code? Scaling needs to be thought out. Every 10X in size requires a different design. How would growing a monolith for scaling work? Does it start as an unscalable system first? But then what is the design path to make it scalable?<br /><br />Sure, you can do separation-of-work and reduction of coordination with the monolith approach if you know what you are doing. But if you know what you are doing, you would avoid many problems with microservices as well.<br /><br /><br />The paper didn't cite or address the classic <a href="https://scholar.harvard.edu/files/waldo/files/waldo-94.pdf">"A note on distributed computing" by Jim Waldo in 1994.</a> That paper has a section titled "Dejavu all over again". In 1994! That was before DCOM and CORBA. Looks like enough time has passed, and the pendulum swings yet again, one more time. 
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEikDwTOqHxhAiDgW8mwjG_w3wmY4dHWtftqWdTF9iX0Ze9frxTUJbZG-HuYyRumOOocL8s8lxb5aT3WAkzPO5ADjsPvG_-snAciVHzrItlq8DnM-YN_7pJJDogrHpRXG5QP_uwpwmP-lyRg6ak_Lgr_stGz1ADULbJG4HOOuTs1Zcx6NRQngEEcQsMbBTg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="409" data-original-width="575" height="228" src="https://blogger.googleusercontent.com/img/a/AVvXsEikDwTOqHxhAiDgW8mwjG_w3wmY4dHWtftqWdTF9iX0Ze9frxTUJbZG-HuYyRumOOocL8s8lxb5aT3WAkzPO5ADjsPvG_-snAciVHzrItlq8DnM-YN_7pJJDogrHpRXG5QP_uwpwmP-lyRg6ak_Lgr_stGz1ADULbJG4HOOuTs1Zcx6NRQngEEcQsMbBTg" width="320" /></a></div><br />Figure 1 sounds great on paper. ServiceWeaver can put components all together for efficiency. But what about blast radius, bursty/coordinated traffic, and <a href="http://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastable failures</a>?<br /><p></p><p><br />"Did Google invent AGI? These questions are <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI complete.</a>" If I were reading this paper 5 years ago, I would not have hesitated to write that as a counter-argument. With the recent advancements in ML, maybe it is time to reconsider this smart-middleware approach again. I don't know. Still, this is a tall order. How do you automate design, especially distributed systems design? 
<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi4SngNRpDaitCxPUAC6MVcQl4PE9acrce-ppE52qa6unpH2qC3BqFEami0XW-j0ZBgxpe_ivXbhwX54SRZZG0a8NixzH0XSh8zXv0T7F_a_sQp3e_MwifMMA1m86G9PaD-E_ckvPSQJUuKi29ecuV6_4qxJ6O_pHqgcJG8hxZ9DgD13lgfO_7i4gNFux8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="472" data-original-width="575" height="526" src="https://blogger.googleusercontent.com/img/a/AVvXsEi4SngNRpDaitCxPUAC6MVcQl4PE9acrce-ppE52qa6unpH2qC3BqFEami0XW-j0ZBgxpe_ivXbhwX54SRZZG0a8NixzH0XSh8zXv0T7F_a_sQp3e_MwifMMA1m86G9PaD-E_ckvPSQJUuKi29ecuV6_4qxJ6O_pHqgcJG8hxZ9DgD13lgfO_7i4gNFux8=w640-h526" width="640" /></a></div><br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1