Sunday, January 28, 2018

Paxos derived

Lamport's fault-intolerant state machine replication algorithm

In 1978, Lamport published his classic "Time, Clocks, and the Ordering of Events in a Distributed System". As an application of logical clocks, he presented a distributed replicated state machine algorithm (and then he instantiated that algorithm to solve mutual exclusion as an example). Lamport complains that no one seemed to be aware of the distributed replicated state machine algorithm introduced in the paper:
"This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I’ve even had to go back and reread it to convince myself that I really did remember what I had written."
I had talked about this distributed replicated state machine algorithm earlier. This algorithm is decentralized to a fault. It cannot tolerate even a single node failure: it assumes failure-free nodes.

The idea of the algorithm is as follows. To ensure that processes do not have different views of the order of updates, logical clocks are used to impose a total ordering on the updates. Each process keeps as part of its state: a copy of the replicated state, a logical clock, a queue of "modify requests" (with their logical timestamps), and a list of "known-times", one for every other process. Each process executes update requests on its copy of the state in increasing order of timestamps. For safety, all "known-times" from other processes should be later than the timestamp of the request being executed.

The algorithm works as follows:
  1. Push your request into your own queue (timestamped with your logical clock).
  2. Broadcast your request to every node, timestamp included.
  3. Wait for replies from all other nodes.
  4. If your request is now at the head of your queue and the known-times for all other processes are ahead of your request timestamp (known-times are updated as processes send replies to the update request), enter the critical section (where the update to the state is done).
  5. Upon exiting the critical section, remove your request from the queue and send a release message to every process.
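To make the five steps concrete, here is a minimal single-threaded Python simulation. This is a loose sketch of my own, not the paper's exact protocol: message delivery is mimicked with a shared FIFO list, and the class and field names are hypothetical. Replicas exchange request/ack/release messages, and a replica applies an update only once the request is safely ordered by logical clocks:

```python
import heapq

class Replica:
    """One node in Lamport's 1978 algorithm, following the five steps above
    (illustrative sketch; names and message encoding are my own)."""
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.clock = 0
        self.queue = []                      # min-heap of (timestamp, pid, op)
        self.known = {p: 0 for p in peers}   # latest timestamp heard from each peer
        self.state = 0                       # the replicated state: a counter

    def _tick(self, seen=0):
        self.clock = max(self.clock, seen) + 1
        return self.clock

    def request(self, op, net):              # steps 1-2
        ts = self._tick()
        heapq.heappush(self.queue, (ts, self.pid, op))
        for p in self.peers:
            net.append(('req', p, ts, self.pid, op))

    def on_message(self, kind, ts, sender, op, net):
        self._tick(ts)
        self.known[sender] = max(self.known[sender], ts)
        if kind == 'req':                    # step 3: reply so known-times advance
            heapq.heappush(self.queue, (ts, sender, op))
            net.append(('ack', sender, self._tick(), self.pid, None))
        elif kind == 'rel':                  # step 5, at the receivers:
            entry = min(e for e in self.queue if e[1] == sender)
            self.queue.remove(entry)
            heapq.heapify(self.queue)
            self.state = entry[2](self.state)  # apply the released update

    def try_update(self, net):               # step 4: the "critical section"
        if self.queue:
            ts, owner, op = self.queue[0]
            if owner == self.pid and all(k > ts for k in self.known.values()):
                self.state = op(self.state)
                heapq.heappop(self.queue)
                for p in self.peers:         # step 5, at the owner
                    net.append(('rel', p, self._tick(), self.pid, None))

# Drive three replicas through one update request.
reps = {i: Replica(i, [j for j in range(3) if j != i]) for i in range(3)}
net = []
reps[0].request(lambda s: s + 1, net)        # replica 0 wants to increment
while net or any(r.queue for r in reps.values()):
    while net:                               # deliver messages in FIFO order
        kind, dest, ts, sender, op = net.pop(0)
        reps[dest].on_message(kind, ts, sender, op, net)
    for r in reps.values():
        r.try_update(net)
print([r.state for r in reps.values()])      # all three copies converge to 1
```

Note that `try_update` waits on known-times from every peer, so if any single node stops replying, nothing ever executes. That is the fault-intolerance discussed above, made visible in code.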

A fault-intolerant version of Paxos

I recently realized that the algorithm above (from the 1978 paper) constitutes a fault-intolerant instance of Paxos!

This occurred to me after thinking about it in the context of the flexible quorums result. The flexible quorums idea (2016) states that we can weaken Paxos' "all quorums should intersect" assertion to "only quorums from different phases should intersect". That is, majority quorums are not necessary for Paxos, provided that phase-1 quorums (Q1) intersect with phase-2 quorums (Q2).

This result allows trading off Q1 and Q2 sizes to improve performance (to the detriment of fault-tolerance). Assuming failures and the resulting leader changes are rare, phase-2 (where the leader tells the acceptors to decide values) is run more often than phase-1 (where a new leader is elected). Thus it is possible to improve the performance of Paxos by reducing the size of Q2 at the expense of making the infrequently used Q1 larger. For example, in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in phase-2, provided that we require 8 acceptors to participate in phase-1. Note that majority quorums (Q1=Q2=6) would be able to mask up to 4 node failures (f=4), whereas the Q1=8 configuration can only withstand up to 2 node failures (f=2), as it needs 8 live nodes to be able to perform phase-1 if needed.
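The quorum arithmetic here is easy to check mechanically. Below is a small helper of my own (not from the FPaxos paper): the intersection requirement is |Q1| + |Q2| > N, and progress survives f crashes only if both some phase-1 quorum and some phase-2 quorum remain alive, i.e., f = N - max(|Q1|, |Q2|):

```python
def flexible_quorum(n, q1, q2):
    """Validate flexible-quorum sizes for n acceptors and return the number
    of crash failures f that still leave both a Q1 and a Q2 available."""
    # Any phase-1 quorum must intersect any phase-2 quorum:
    assert q1 + q2 > n, "every Q1 must intersect every Q2"
    # To make progress we must still be able to form the larger quorum:
    return n - max(q1, q2)

print(flexible_quorum(10, 6, 6))  # classic majorities for N=10: f = 4
print(flexible_quorum(10, 8, 3))  # fast 3-node phase-2: f drops to 2
```

The second call is exactly the example above: shrinking Q2 to 3 speeds up the common path, but the large Q1 caps fault-tolerance at f=2.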

So, if you take Q1=N and Q2=1, the Paxos algorithm simplifies to Lamport's distributed state machine replication algorithm above. Note that Q1=N implies the algorithm cannot tolerate any node failures, i.e., f=0. On the other hand, with this setup, you can combine phase 2 and phase 3, because you are writing to only one node: yourself. So phase 3 is non-existent in that algorithm.

The road from f=0 to Paxos

Ok, let's approach our claim from the other side as well. How do we take that f=0 protocol and strengthen it so that it doesn't block (lose progress) with one node failure?

This is where phase 3 comes into play as we add fault-tolerance. In order to tolerate one node crash (in a fault-masking manner), you need Q2 to be 2. Then things suddenly get complicated: you are not just writing to yourself, you also need to write to another node in a guaranteed manner to persist the state. But another leader may steal your turn before you can write your decision to your other Q2 node in phase 2, so it is not safe to commit the update request! Therefore phase-2 clearing, which is phase 3, is needed to make this check, and it helps you replicate your state so it is preserved in the face of one node failure.

There is a point of objection here, though. In Lamport's f=0 algorithm, logical clocks (LC) are used for reservation; every node respects LC and puts requests into its queue ordered by LC. If one node needs to get its update done, it eventually will, because the system is making progress. In Paxos, on the other hand, using ballot numbers (for whose implementation LC could be used), a leader steals the previous leader's turn instead of patiently waiting for the previous round to complete. So what gives?

Well... In Lamport's f=0 algorithm, you could afford to be nice and patiently wait for each node to finish its turn, because f=0 and you are guaranteed to reach what you wait for. But when f>0 and a node can fail, you can't afford to wait for it to finish its turn (otherwise you would have to wait for an eternity in an asynchronous system model), and that is why Paxos is happy to change leaders, and dueling leaders can arise (even to the point of violating progress).

In sum, something "fundamental" changes when you want to go fault-tolerant and tolerate node failure in an asynchronous system. When you combine faults and full asynchrony, you get the FLP impossibility result. That means you lose progress! That is why Paxos does not guarantee progress under a fully asynchronous model with a crash failure. However, it preserves safety thanks to its balloting and anchoring system, and it will provide progress as soon as partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors become implementable (i.e., when we are out of the realm of the FLP result). So, yes, there is a phase transition going from no faults to faults in an asynchronous system.

I thank my PhD students, Ailidani Ailijiang and Aleksey Charapko, for discussion on this idea.

MAD questions

Was this actually how Leslie Lamport came up with the Paxos protocol? Does the 1978 fault-intolerant distributed state machine replication algorithm form a basis from which to evolve a fault-tolerant version?

I am not aware of any paper that makes this connection. Was this connection noticed and mentioned before?

Friday, January 26, 2018

Modeling the DAO attack in PlusCal

Maurice Herlihy's paper "Blockchains from a distributed computing perspective" explains the DAO attack as follows:

"Figure 1 shows a fragment of a DAO-like contract, illustrating a function that allows an investor to withdraw funds. First, the function extracts the client's address (Line 2), then checks whether the client has enough funds to cover the withdrawal (Line 3). If so, the funds are sent to the client through an external function call (Line 4), and if the transfer is successful, the client’s balance is decremented (Line 5). 
This code is fatally flawed. In June 2016, someone exploited this function to steal about $50 million funds from the DAO. As noted, the expression in Line 3 is a call to a function in the client's contract. Figure 2 shows the client's code. The client's contract immediately calls withdraw() again (Line 4). This re-entrant call again tests whether the client has enough funds to cover the withdrawal (Line 3), and because withdraw() decrements the balance only after the nested call is complete, the test erroneously passes, and the funds are transferred a second time, then a third, and so on, stopping only when the call stack overflows."
(Of course, that is a very simplified description of the DAO attack. More accurate descriptions are provided here and here.)
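The re-entrancy does not need a blockchain to manifest; it is an ordinary concurrency bug. Here is a minimal Python sketch of my own (the class and method names are hypothetical, loosely mirroring Figures 1 and 2 of the paper) in which the "bank" calls out to the client before decrementing the balance:

```python
class Bank:
    def __init__(self, balance):
        self.balances = {"mallory": balance}

    def withdraw(self, client, amount):
        if self.balances[client.name] >= amount:    # the Line-3 check
            client.send_money(amount)               # external call (Line 4)...
            self.balances[client.name] -= amount    # ...balance updated too late (Line 5)

class Mallory:
    name = "mallory"
    def __init__(self, bank):
        self.bank, self.wallet, self.depth = bank, 0, 0

    def send_money(self, amount):
        self.wallet += amount
        if self.depth < 2:            # re-enter withdraw() before it finishes
            self.depth += 1
            self.bank.withdraw(self, amount)

bank = Bank(10)
mallory = Mallory(bank)
bank.withdraw(mallory, 10)
print(mallory.wallet, bank.balances["mallory"])   # 30 -20
```

Starting from a 10-coin balance, Mallory's wallet ends up at 30 and her bank balance at -20: money was created out of thin air, because the balance check at the top of withdraw() keeps passing until the nested calls unwind.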

Even though the code seems sequential (after all the blockchain serializes everything), it has concurrency problems built in. This was a point made in Herlihy's paper as follows:
"In Ethereum, all contracts are recorded on the blockchain, and the ledger includes those contracts' current states. When a miner constructs a block, it fills that block with smart contracts and executes them one-by-one, where each contract's final state is the next contract's initial state. These contract executions occur in order, so it would appear that there is no need to worry about concurrency."
After showing DAO vulnerability and ERC20 token standard vulnerability, the paper says:
"We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion. Sergey and Hobor give an excellent survey of pitfalls and common bugs in smart contracts that are disguised versions of familiar concurrency pitfalls and bugs." 

Enter TLA+/PlusCal

Since TLA+/PlusCal is a great tool for catching concurrency problems, I thought it would be useful to model this DAO attack in PlusCal. After I got the idea, it took me only a short time to model and model-check it in PlusCal. I used procedures in PlusCal (which I don't use often) to match the description of the problem.

TLA+ is all about invariant-based reasoning, so I wrote the invariant first. Writing "SafeWithdrawal == (bankBalance=BALANCE /\ malloryBalance=0) \/ (bankBalance=BALANCE-AMOUNT /\ malloryBalance=AMOUNT)" was too tight, because the updates of the balances do not happen atomically. That is how invariant-based thinking helps us immediately: we can see that the withdrawal is a non-atomic operation, and realize that we should be more careful with the updates.

In the model checking pane, I set BALANCE to 10 and AMOUNT to 10. That is, initially Mallory has 10 coins in her bankBalance and 0 in her wallet, and she wants to transfer her entire bankBalance, so she sets AMOUNT=10. When I run the model checker, it finds the double-withdrawal problem immediately: Mallory's account got to 20 starting from 0! Normally we would expect it to go to 10 (line 27) temporarily, and then her bankBalance to be set to 0 (line 22). But this code managed to do a double withdrawal, and the SafeWithdrawal invariant is violated.

The error trace contains 8 steps: initially BankWithdraw is called, which then calls MallorySendMoney to complete the withdrawal. However, Mallory's SendMoney implementation includes another call to BankWithdraw, and the balance check at line 18 passes because bankBalance has not yet been decremented by amount (that comes at line 22). So the second BankWithdraw executes concurrently, and Mallory manages to do a double (and later triple) withdrawal.

Fixing things

OK, let's check if we can fix this by moving the bankBalance subtraction before MallorySendMoney.
Of course, for that we change SafeWithdrawal to accommodate the new way of updating bankBalance. But it turns out that is still too tight: if I call this with BALANCE=10 and AMOUNT=4, it is OK to have two withdrawals run concurrently, provided that in the final state no new money is produced: Invariant == bankBalance+malloryBalance <= BALANCE. I also model-check for progress and write an EndState temporal formula for it: EndState == <>(bankBalance<=BALANCE-AMOUNT /\ bankBalance+malloryBalance=BALANCE). When we model check it, we see that this solves the problem. So it leaves me puzzled: when the fix was this easy, why was the original BankWithdraw code not written this way, leaving it vulnerable to the attack?
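The fixed ordering can also be seen in a plain Python rendering (again a sketch of my own with hypothetical names): decrement the balance before the external call, so the re-entrant check finds no funds.

```python
class FixedBank:
    def __init__(self, balance):
        self.balances = {"mallory": balance}

    def withdraw(self, client, amount):
        if self.balances[client.name] >= amount:
            self.balances[client.name] -= amount   # subtract BEFORE the external call
            client.send_money(amount)

class Mallory:
    name = "mallory"
    def __init__(self, bank):
        self.bank, self.wallet, self.depth = bank, 0, 0

    def send_money(self, amount):
        self.wallet += amount
        if self.depth < 2:                         # try the same re-entrant trick
            self.depth += 1
            self.bank.withdraw(self, amount)

bank = FixedBank(10)
mallory = Mallory(bank)
bank.withdraw(mallory, 10)
print(mallory.wallet, bank.balances["mallory"])    # 10 0
```

Now the re-entrant call fails the balance check; the wallet ends at 10, the bank balance at 0, and bankBalance+malloryBalance <= BALANCE holds throughout the execution.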

These PlusCal models are available on my Github directory.

MAD questions

Should we come up with a PlusCal framework to facilitate modeling and model-checking of smart-contracts?

I had written about why you should model. Those reasons apply here as well, and here things become even more critical. When money is involved, attackers get smart quickly, and it is easy to have vulnerabilities in concurrent code due to the many corner cases. Let TLA+/PlusCal show you those corner cases and help you design your protocol to achieve correctness guarantees. So if you are writing smart contracts, I think it makes sense to first model-check and verify them. It doesn't take much effort, and it can save you from big problems.

Related links

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus

By clicking on the "tla" label at the end of the post, you can reach all my posts about TLA+.

Wednesday, January 24, 2018

Spring 18 Distributed Systems Seminar

I considered doing a blast from the past edition of the seminar focusing only on classical papers published before 2000. But, nah, I decided, I am doing a blockchain seminar instead. There have been many interesting papers on blockchains, and this will be a good chance to catch up with them and look at the distributed systems, concurrency, and coordination issues there.

Here is the list. I thank Adem Efe Gencer for suggesting several interesting papers for this list.

Blockchain papers:
  1. Blockchains from a Distributed Computing Perspective 
  2. Blockstack: A Global Naming and Storage System Secured by Blockchains 
  3. IPFS 
  4. Bitcoin-NG: A Scalable Blockchain Protocol 
  5. A Secure Sharding Protocol For Open Blockchains
  6. Enhancing Bitcoin Security and Performance with Strong Consistency via Collective Signing 
  7. Service-Oriented Sharding for Blockchains
  8. The stellar consensus protocol: A federated model for internet-level consensus
Smart contracts papers:
  1. Step by Step Towards Creating a Safe Smart Contract: Lessons and Insights from a Cryptocurrency Lab 
  2. A Concurrent Perspective on Smart Contracts
  1. Zerocash: Decentralized Anonymous Payments from Bitcoin 
  2. SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies 
  3. Lightning Network 

Finally, I will also cover some distributed machine learning papers.

Related links

2016 Seminar reading list
2015 Seminar reading list

Monday, January 22, 2018

Erasable pens for editing papers

I recently discovered the Pilot Frixion pens and I like them a lot. (I am not getting any advertisement money from them, I swear :-)

The pens have erasable ink, so they are great for marking your comments/edits on a paper while reading. They erase via heat: each pen comes with a plastic nub at the top, and if you rub the page with the nub, the friction erases the writing --mostly cleanly. A word of caution, though: this means if you leave your writing in a hot car, you will find it erased, which you can remedy by putting the paper in a freezer. I am not kidding.

So, don't use these pens for writing you want to keep permanently, but they are great for writing comments and marking up a paper while you read.

I print the research paper I am reading and do a lot of marking on paper. If I use a regular pen, I cross out some of my guesswork, nonsensical questions, or misinformed comments, and it messes up the paper. But with Frixion pens, I can erase and modify my comments without creating a mess. Well, at least, less of a mess.

I am still not an online shopping junkie. Although these pens are available on Amazon, I still like to buy them from a store. I was happy to see that Walmart has the 3 colors (black, red, blue) for $5. I use the red pen for questions, the blue pen for important comments, and the black pen for all other comments and doodling.

MAD questions

I once prematurely advocated going back to the fountain pen.

I still switch between the two. Mostly I live in Emacs happily. But I print and edit with erasable pens when I need to edit. And I switch to a fountain pen (a cheap Pilot Metropolitan combined with a refillable ink pump --- again, not advertising) when I am badly stuck and need to think on paper, or when I want to break the monotony. I sign things with that fountain pen, and the pen surprises students and even my colleagues. Fountain pens are now considered antiques.

I wonder if it would be possible to combine the perfect pen on paper feel with all the ease/convenience of text wrangling in Emacs within the next 10 years. I feel that would be the ultimate productivity product for me.  

No, I don't think the MS Surface is there yet; it is still lacking on both fronts: writing feel and editing convenience. I was hoping Apple would get there, but after Steve Jobs passed away, I am not counting on it anymore.

Thursday, January 18, 2018

Remember peer-to-peer systems?

Traditionally, computer systems use the client-server model. This is a centralized approach: the server sits there and responds to clients' requests. If one server is not enough for the computation/analysis, a "hierarchical" organization of servers is adopted, as in datacenter and cloud computing: one node becomes the master, and the other nodes act as workers. This is called the master-worker model. This simple model makes sense if you have an infrastructure. A centralized control architecture is simple, so you can keep the coordination simple and efficient.

The peer-to-peer model is at the other end of the spectrum: it calls for a fully decentralized system. There is no distinguished master; each node acts as both server and client --each node is a peer. This model does not require stable infrastructure, and it can self-organize with what is presently available. As such, peer-to-peer systems are great for circumventing laws, bans, and censorship.
In the 2000s, peer-to-peer systems were all the craze. Peer-to-peer music sharing applications --Napster, Gnutella, and Kazaa-- were very popular and successful. There were distributed hash table systems called CAN, Chord, Pastry, and Tapestry. BitTorrent, a peer-to-peer communication protocol, was also very popular: "In November 2004, BitTorrent was responsible for 25% of all Internet traffic."
Then, peer-to-peer systems disappeared from the scene over the next 5 years or so. The peer-to-peer architecture was abandoned, but the best ideas from that work found their way into traditional datacenter computing. The distributed hash table idea was adopted for distributed key-value stores, and BitTorrent saw some use in datacenter application-layer networking.

Today, there seems to be a revival of peer-to-peer systems with blockchains, decentralized attestation, distributed ledgers, and ipfs applications. 

As a distributed systems professor, I should be exuberant like everybody else, but I am cautious. (As I wrote in 2014, distributed is not necessarily more scalable than centralized.) The centralized coordination architectures (today's cloud computing and datacenter architectures) have a strong attraction point: they are simple and efficient to coordinate. Even then, we mess up building those systems. So we don't stand much chance of building, scaling, and maintaining fully-decentralized systems, let alone leveraging them as scaffolding to build/control/coordinate more sophisticated and scalable applications.

So, most likely, this will play out similarly to the peer-to-peer systems of the 2000s. The blockchain architectures will fade away, but the best ideas of blockchain systems will be adopted for adding attestation, authentication, and smart contracts to cloud computing and datacenter applications. Hopefully those ideas can also be used to fix the problems with social networks and enable collaboration to make/build things, where individual effort/labor/contribution can be tracked and rewarded appropriately.

Tuesday, January 16, 2018

Paper summary. A Berkeley view of systems challenges for AI

This position paper from Berkeley identifies an agenda for systems research in AI for the next 10 years. The paper also serves to publicize/showcase their research and steer interest toward these directions, which is really why position papers get written.

The paper motivates the systems agenda by discussing how systems research/development played a crucial role in fueling AI’s recent success. It says that the remarkable progress in AI has been made possible by a "perfect storm" emerging over the past two decades, bringing together: (1) massive amounts of data, (2) scalable computer and software systems, and (3) the broad accessibility of these technologies.

The rest of the paper talks about the trends in AI and how those map to their systems research agenda for AI.

Trends and challenges

The paper identifies 4 basic trends in the AI area:
  • Mission-critical AI: Design AI systems that learn continually by interacting with a dynamic environment in a timely, robust, and secure manner.
  • Personalized AI: Design AI systems that enable personalized applications and services while respecting users’ privacy and security.
  • AI across organizations: Design AI systems that can train on datasets owned by different organizations without compromising their confidentiality. (I think the presentation could have been simplified by combining this with Personalized AI.)
  • AI demands outpacing the Moore’s Law: Develop domain-specific architectures and distributed software systems to address the performance needs of future AI applications in the post-Moore’s Law era.
To enable progress on these fronts, the paper then identifies 9 research topics, across 3 main areas: Acting in dynamic environments, Secure AI, and AI specific architectures.

Acting in dynamic environments

R1: Continual learning

Despite the successes of Reinforcement Learning (RL) (e.g., Atari games, AlphaGo in Go), RL has not seen wide-scale real-world application. The paper argues that coupling advances in RL algorithms with innovations in systems design will drive new RL applications.

Research: (1) Build systems for RL that fully exploit parallelism, while allowing dynamic task graphs, providing millisecond-level latencies, and running on heterogeneous hardware under stringent deadlines. (2) Build systems that can faithfully simulate the real-world environment, as the environment changes continually and unexpectedly, and run faster than real time.
Of course, the second part here refers to research described in "Real-Time Machine Learning: The Missing Pieces". Simulated Reality (SR) focuses on continually simulating the physical world with which the agent is interacting. Trying to simulate multiple possible futures of a physical environment in high fidelity within a couple milliseconds is a very ambitious goal. But research here can also help other fields, so this is interesting.

R2: Robust decisions

The challenges here are: (1) robust learning in the presence of noisy and adversarial feedback, and (2) robust decision-making in the presence of unforeseen and adversarial inputs.

Research: (1) Build fine-grained provenance support into AI systems to connect outcome changes (e.g., reward or state) to the data sources that caused these changes, and automatically learn causal, source-specific noise models. (2) Design API and language support for developing systems that maintain confidence intervals for decision-making, and in particular can flag unforeseen inputs.

R3: Explainable decisions

Here we are in the domain of causal inference, a field "which will be essential in many future AI applications, and one which has natural connections to diagnostics and provenance ideas in databases."

Research: Build AI systems that can support interactive diagnostic analysis, that faithfully replay past executions, and that can help to determine the features of the input that are responsible for a particular decision, possibly by replaying the decision task against past perturbed inputs. More generally, provide systems support for causal inference.

Secure AI

R4: Secure enclaves

A secure enclave is a secure execution environment, which protects the application running within it from malicious code running outside.

Research: Build AI systems that leverage secure enclaves to ensure data confidentiality, user privacy and decision integrity, possibly by splitting the AI system’s code between a minimal code base running within the enclave, and code running outside the enclave. Ensure the code inside the enclave does not leak information, or compromise decision integrity.

R5: Adversarial learning

The adaptive nature of ML algorithms opens the learning systems to new categories of attacks: evasion attacks and data poisoning attacks.

Research: Build AI systems that are robust against adversarial inputs both during training and prediction (e.g., decision making), possibly by designing new machine learning models and network architectures, leveraging provenance to track down fraudulent data sources, and replaying to redo decisions after eliminating the fraudulent sources.

R6: Shared learning on confidential data

The paper observes that, despite the large volume of theoretical research, there are few practical differential privacy systems in use today, and proposes to simplify differential privacy use for real-world applications.

Research: Build AI systems that (1) can learn across multiple data sources without leaking information from a data source during training or serving, and (2) provide incentives to potentially competing organizations to share their data or models.

AI specific architectures

R7: Domain specific hardware

The paper argues that "the one path left to continue the improvements in performance-energy-cost of processors is developing domain-specific processors." It mentions the Berkeley Firebox project, which proposes a multi-rack supercomputer that connects thousands of processor chips with thousands of DRAM chips and nonvolatile storage chips using fiber optics to provide low-latency, high-bandwidth, and long physical distance.

Research: (1) Design domain-specific hardware architectures to improve the performance and reduce power consumption of AI applications by orders of magnitude, or enhance the security of these applications. (2) Design AI software systems to take advantage of these domain-specific architectures, resource disaggregation architectures, and future non-volatile storage technologies.

R8: Composable AI systems

The paper says modularity and composition will be key to increasing development speed and adoption of AI. The paper cites the Clipper project.

Research: Design AI systems and APIs that allow the composition of models and actions in a modular and flexible manner, and develop rich libraries of models and options using these APIs to dramatically simplify the development of AI applications.

R9: Cloud-edge systems

The paper mentions the need to repurpose code to multiple heterogeneous platforms via re-targetable software design and compiler technology. It says "To address the wide heterogeneity of edge devices and the relative difficulty of upgrading the applications running on these devices, we need new software stacks that abstract away the heterogeneity of devices by exposing the hardware capabilities to the application through common APIs."

Research: Design cloud-edge AI systems that (1) leverage the edge to reduce latency, improve safety and security, and implement intelligent data retention techniques, and (2) leverage the cloud to share data and models across edge devices, train sophisticated computation-intensive models, and take high quality decisions.

MAD questions

(The questions that led to these explanations are left as an exercise to the reader.)

1) In 2009, there was a similar position paper from Berkeley called "Above the Clouds: A Berkeley View of Cloud Computing". That paper did a very good job of summarizing, framing, and selling the cloud computing idea to academia. But it looks like the research agenda/directions from that report didn't fare very well after 8 years---which is totally expected. Plans are useless, but planning is indispensable. The areas of interest change after some time, and the research adapts. It is impossible to tightly plan and manage exploratory research in CS areas (maybe this is different in biology and the physical sciences).

I think it is a YES for items 4, 5, and 6, and partial for the rest, with very little progress on items 2 and 9. While that paper's list of opportunities did not include them, the following developments have since reshaped the cloud computing landscape:

  • dominance of machine learning workloads in the cloud
  • the rise of NewSQL systems, the trend for more consistent distributed databases, and the importance of coordination/Paxos/ZooKeeper in the cloud 
  • the development of online in-memory dataflow and stream processing systems, such as Spark, which came out of Berkeley
  • the race towards finer-granularity virtualization via containers and functions as a service
  • the prominence of SLAs (mentioned only once in the paper)

So even though the AI-systems agenda from Berkeley makes a lot of sense, it will be instructive to watch how these pan out and what unexpected big AI-systems areas open up in the coming years.

2) Stanford also released a similar position paper earlier this year, although theirs had a more limited scope: developing a [re]usable infrastructure for ML. Stanford's DAWN project aims to target end-to-end ML workflows, empower domain experts, and optimize end-to-end. A figure in their proposal summarizes their vision for the reusable ML stack.

Of course, again, this inevitably reflects the strengths and biases of the Stanford team; they are more on the database, data science, and production side of things. It looks like this has some commonalities with the AI-specific architectures section of the Berkeley report, but different approaches are proposed for the same questions.

3) For R2: Robust decisions, it seems like formal methods, modeling, invariant-based reasoning, can be useful, especially when concurrency control becomes an issue in distributed ML deployments.

Sunday, January 14, 2018

The Lambda and the Kappa Architectures

This article, by Jimmy Lin, looks at the Lambda and Kappa architectures, and through them considers a larger question: Can one size fit all?

The answer, it concludes, depends on what year you ask! The pendulum swings between one apex of a single tool to rule them all, and the other apex of multiple tools for maximum efficiency. Each apex has its drawbacks: one tool leaves efficiency on the table, while multiple tools spawn integration problems.

In the RDBMS world, we already saw this play out. One-size RDBMS fit all, until it couldn't anymore. Stonebraker declared "one size does not fit all", and we saw a split into dedicated OLTP and OLAP databases connected by extract-transform-load (ETL) pipelines. But in the last couple of years we have seen a lot of one-size-fits-all "Hybrid Transactional/Analytical Processing (HTAP)" solutions being introduced again.

Lambda and Kappa

OK, back to telling the story from the Lambda and Kappa architectures perspective. What are the Lambda and Kappa architectures anyway?

Lambda, from Nathan Marz, is the multitool solution. There is a batch computing layer, and on top of it a fast serving layer. The batch layer provides the "stale" truth; in contrast, the real-time results are fast but approximate and transient. In Twitter's case, the batch layer was the MapReduce framework, and Storm was the serving layer on top. This enabled fast response at the serving layer, but introduced an integration hell: Lambda meant everything must be written twice, once for the batch platform and again for the real-time platform. The two platforms need to be maintained in parallel indefinitely and kept in sync with respect to how each interacts with other components and integrates features.
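The batch/serving split can be caricatured in a few lines of Python (a toy of my own, not Twitter's actual stack): the batch layer periodically recomputes exact results over the full event history, the speed layer covers only the events that arrived since the last batch run, and the serving layer merges the two views:

```python
from collections import Counter

# A toy Lambda pipeline for counting events.
events = ["a", "b", "a", "c", "a", "b"]

batch_view = Counter(events[:4])   # last batch run covered the first 4 events (exact, stale)
speed_view = Counter(events[4:])   # real-time layer covers events since then (fresh)

def query(key):
    # Serving layer merges the stale-but-exact batch view with the fresh view.
    return batch_view[key] + speed_view[key]

print(query("a"))   # 3
```

Even in this toy, the counting logic effectively lives in two places, and any change to the computation must be made (and kept consistent) in both layers; that is the integration burden Lambda imposes.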

Kappa, from Jay Kreps, is the "one tool fits all" solution. The Kafka log streaming platform treats everything as a stream. Batch processing is simply streaming through historic data. A table is merely the cache of the latest value of each key in the log, and the log is a record of each update to the table. Kafka Streams adds the table abstraction as a first-class citizen, implemented as compacted topics. (This is of course already familiar to database people as incremental view maintenance.)
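To make the log/table duality concrete, here is a minimal Python sketch (an illustration of the idea, not Kafka's actual API): the table is just the latest value per key in the log, and compaction keeps only those latest records.

```python
# Sketch of the log/table duality: a table is the latest value per key
# in the log, and compaction discards all but that latest record.
# (Illustrative only; not Kafka's actual API.)

def materialize_table(log):
    """Replay the log; later updates to a key overwrite earlier ones."""
    table = {}
    for key, value in log:
        table[key] = value
    return table

def compact(log):
    """Keep only the last record per key, preserving log order."""
    seen = set()
    compacted = []
    for key, value in reversed(log):
        if key not in seen:
            seen.add(key)
            compacted.append((key, value))
    return list(reversed(compacted))

log = [("user1", "NY"), ("user2", "SF"), ("user1", "LA")]
assert materialize_table(log) == {"user1": "LA", "user2": "SF"}
assert compact(log) == [("user2", "SF"), ("user1", "LA")]
# Replaying the compacted log yields the same table.
assert materialize_table(compact(log)) == materialize_table(log)
```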

Kappa gives you a "one tool fits all" solution, but the drawback is that it can't be as efficient as a batch solution, because it is general and needs to prioritize low-latency response to individual events rather than high-throughput response to batches of events.

What about Spark and Apache Beam?

Spark treats everything as batch; online stream processing is then handled as microbatch processing. So Spark is still a one-tool solution. I had written earlier about Mike Franklin's talk, which compared Spark and the Kappa architecture.

Apache Beam provides abstractions/APIs for big data processing. It is an implementation of the Google Dataflow framework, as explained in the MillWheel paper. It differentiates between event time and processing time, and uses a watermark to capture the relation between the two. Using the watermark, it provides information about the completeness of the observed data with respect to event times, such as 99% complete at the 5-minute mark. Late-arriving messages trigger a makeup procedure to amend previous results. This is of course close to the Kappa solution, because it treats everything, even batch, as a stream.
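As a rough illustration of the watermark idea, the hypothetical sketch below estimates event-time completeness from observed timestamps; real systems like MillWheel/Beam derive watermarks from per-source progress, so treat this purely as a toy.

```python
# Toy sketch of a watermark: estimate the event time below which a
# given fraction of observed events fall. Real systems (MillWheel,
# Beam) compute watermarks from per-source progress, not a percentile.

def percentile_watermark(event_times, frac=0.99):
    """Event time below which `frac` of the observed events fall."""
    ordered = sorted(event_times)
    idx = max(0, int(frac * len(ordered)) - 1)
    return ordered[idx]

# (event_time, processing_time) pairs; processing lags event time by 1.
events = [(t, t + 1) for t in range(100)]
wm = percentile_watermark([e for e, _ in events], frac=0.99)
assert wm == 98  # 99% of the 100 events have event time <= 98

# A message arriving with event time below the watermark is "late"
# and would trigger a makeup procedure to amend previous results.
late_event_time = 5
assert late_event_time <= wm
```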

I would say Naiad, TensorFlow, timely dataflow, and differential dataflow are also "one tool fits all" solutions, using dataflow concepts similar to those in Apache Beam.

Can you have your cake and eat it too? and other MAD questions.

Here are the important claims in the paper:

  • Right now, integration is a bigger pain point, so the pendulum is now on the one-tool solution side.
  • Later, when efficiency becomes a bigger pain point, the pendulum will swing back to the multi-tool solution, again.
  • The pendulum will keep swinging back and forth because there cannot be a best of both worlds solution.

1) The paper emphasizes that there is no free lunch. But why not?
I think the argument is that a one-tool solution cannot be as efficient as a batch solution, because it needs to prioritize low-latency response to individual events rather than high-throughput response to batches of events.

Why can't the one-tool solution be refined and made more efficient? Why can't we have finely-tunable/configurable tools? Not the crude hammer, but a nanotechnology transformable tool, such as the ones in Neal Stephenson's The Diamond Age?

If we had highly parallel I/O and computation flow, would that help achieve a best of both worlds solution?

2) The paper mentions using API abstractions as a compromise solution, but quickly cautions that this will also not be able to achieve best of both worlds, because abstractions leak.

Summingbird at Twitter is an example of an API-based solution: reduced expressiveness (DAG computations) is traded off for simplicity (no need to maintain separate batch and realtime implementations). Summingbird is a domain-specific language (DSL) that allows queries to be automatically translated into MapReduce jobs and Storm topologies.
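The write-once, run-on-two-backends idea behind Summingbird can be sketched in a toy way (Summingbird itself is a Scala DSL targeting Scalding and Storm; all the function names below are made up for illustration): the same declared logic is executed by a batch runner and by an incremental streaming runner.

```python
# Toy illustration of the Summingbird idea: declare the computation
# once, then "compile" it to either a batch run or a streaming run.
# (Hypothetical DSL; real Summingbird is a Scala library that targets
# MapReduce/Scalding and Storm.)

def word_count_plan(record):
    """The user-written logic: flatMap a record into (key, count) pairs."""
    return [(word, 1) for word in record.split()]

def run_batch(plan, records):
    """Batch backend: process the whole corpus at once."""
    counts = {}
    for rec in records:
        for key, n in plan(rec):
            counts[key] = counts.get(key, 0) + n
    return counts

def run_streaming(plan, records):
    """Streaming backend: fold records in one event at a time."""
    counts = {}
    for rec in records:               # each record arrives as an event
        for key, n in plan(rec):
            counts[key] = counts.get(key, 0) + n
        # a real system would serve `counts` after every event
    return counts

data = ["a b a", "b c"]
assert run_batch(word_count_plan, data) == run_streaming(word_count_plan, data)
assert run_batch(word_count_plan, data) == {"a": 2, "b": 2, "c": 1}
```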

3) Are there analogous pendulums for other problems?
The other day, I posted a summary of Google's TFX (TensorFlow Extended) platform. It is a one-tool-fits-all approach, like most ML approaches today. I think the reason is that integration and ease of development are the biggest pain points these days. The efficiency of training is addressed by having parallel training in the backend, and training is already accepted to be a batch solution. When integration/development problems are alleviated, and we start seeing very low-latency training demand for machine learning workloads, we may expect the pendulum to swing to multitool/specialization solutions in this space.

Another example of pendulum thinking is the decentralized versus centralized coordination problem. My take on this is that centralized coordination is simple and efficient, so it has a strong attraction. You go with a decentralized coordination solution only if you have a big pain point with the centralized solution, such as latency induced by geographic separation. But even then, hierarchical solutions or federations can get you close to the best of both worlds.

The presentation of the paper

I respect Jimmy Lin's take on the subject because he has been in the trenches during his Twitter years, and he is also an academic who can evaluate the intrinsic strength of the ideas, abstracted away from the technologies. And I really enjoyed reading the paper in this format. This is a "big data bite" article, so it is written in a relaxed format, and it manages to teach a lot in 6 pages.

However, I was worried when I read the first two paragraphs, as they gave some bad signals. The first paragraph referred to the one-tool solution as a "hammer", which is associated with being crude and rough. The next paragraph said: "My high level message is simple: there is no free lunch." That is a very safe position; it may even be vacuous. And I was concerned that Jimmy Lin was refraining from taking any positions. Well, it turns out, this was indeed his final take all things considered, but he did take some strong positions in the article. His first rant (yes, really, he has a sidebar called "Rant"), about the Lambda architecture, has some strong words.

Friday, January 12, 2018

Paper summary. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

This paper from Google appeared in the KDD 2017 Applied Data Science track. The paper discusses Google's quality assurance extensions to their machine learning (ML) platforms, called TensorFlow Extended (TFX). (Google is not very creative with names; they should take a cue from Facebook.)

TFX supports continuous training and serving pipelines and integrates best practices to achieve production-level reliability and scalability. You can argue that the paper does not have a deep research component or a novel insight/idea. But you can argue the same thing for The Checklist Manifesto by Atul Gawande, which nevertheless does not diminish its effectiveness, usefulness, and impact.

On the other hand, the paper could definitely have been written much more succinctly. In fact, I found this blog post by Martin Zinkevich, the last author of the paper, much easier to follow than the paper. (Are we pushed to make papers artificially obfuscated to be publication-worthy?) This blog post on serving skew, a major topic discussed in the TFX paper, was both very succinct and accessible.

While we are on the topic of related work, the NIPS 2016 paper, "What is your ML score? A rubric for ML production systems", from a subset of the authors of the TFX paper, is also related.   A big motivation for this paper is another previous Google paper, titled: "Hidden technical debt in machine learning systems".

The paper focuses its presentation on the following components of TFX.

Data analysis, transformation, and validation

Data analysis: This component gathers statistics over feature values. For continuous features, the statistics include quantiles, equi-width histograms, and the mean and standard deviation; for discrete features, they include the top-K values by frequency.
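A toy sketch of such per-feature statistics gathering (illustrative only, not TFX's actual implementation; the quantile estimate here is a crude index into the sorted values):

```python
# Sketch of the kind of per-feature statistics a data-analysis
# component gathers. (Illustrative; not TFX's actual implementation.)
from collections import Counter

def continuous_stats(values, quantiles=(0.25, 0.5, 0.75)):
    """Mean, (population) standard deviation, and crude quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n
    std = (sum((v - mean) ** 2 for v in ordered) / n) ** 0.5
    qs = {q: ordered[min(n - 1, int(q * n))] for q in quantiles}
    return {"mean": mean, "std": std, "quantiles": qs}

def discrete_stats(values, k=2):
    """Top-K values by frequency."""
    return Counter(values).most_common(k)

stats = continuous_stats([1.0, 2.0, 3.0, 4.0])
assert stats["mean"] == 2.5
assert discrete_stats(["a", "b", "a", "c", "a"], k=1) == [("a", 3)]
```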

Data Transformation: This component implements a suite of data transformations to allow "feature wrangling" for model training and serving. The paper says: "Representing features in ID space often saves memory and computation time as well. Since there can be a large number (∼1–100B) of unique values per sparse feature, it is a common practice to assign unique IDs only to the most “relevant” values. The less relevant values are either dropped (i.e., no IDs assigned) or are assigned IDs from a fixed set of IDs."
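The ID-assignment scheme described in that quote can be sketched as follows (a hypothetical illustration; here "relevant" simply means most frequent, and TFX's actual policy and OOV handling may differ):

```python
# Sketch of assigning IDs only to the most "relevant" (here: most
# frequent) values of a sparse feature; the remaining values fall
# into a fixed set of out-of-vocabulary (OOV) buckets.
# (Illustrative; TFX's actual relevance policy may differ.)
from collections import Counter

def build_vocab(values, top_k, num_oov_buckets=1):
    vocab = [v for v, _ in Counter(values).most_common(top_k)]
    ids = {v: i for i, v in enumerate(vocab)}
    def lookup(value):
        if value in ids:
            return ids[value]
        # rare/unseen values share a fixed set of OOV IDs
        return top_k + (hash(value) % num_oov_buckets)
    return lookup

lookup = build_vocab(["cat", "dog", "cat", "bird"], top_k=2)
assert lookup("cat") == 0      # most frequent value gets the smallest ID
assert lookup("dog") == 1
assert lookup("zebra") == 2    # rare value -> the single OOV bucket
```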

Data validation: To perform validation, the component relies on a schema that provides a versioned, succinct description of the expected properties of the data.
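A minimal sketch of schema-driven validation (illustrative only; TFX's real schema is a versioned protocol buffer with richer expected properties):

```python
# Sketch of schema-driven data validation: the schema records expected
# types and value domains, and incoming examples are checked against
# it. (Illustrative; not TFX's actual schema format.)

schema = {
    "age":     {"type": int, "min": 0, "max": 130},
    "country": {"type": str, "domain": {"US", "TR", "DE"}},
}

def validate(example, schema):
    """Return a list of human-readable anomalies for one example."""
    errors = []
    for feature, spec in schema.items():
        value = example.get(feature)
        if value is None:
            errors.append(f"{feature}: missing")
        elif not isinstance(value, spec["type"]):
            errors.append(f"{feature}: wrong type")
        elif "domain" in spec and value not in spec["domain"]:
            errors.append(f"{feature}: out of domain")
        elif "min" in spec and not spec["min"] <= value <= spec["max"]:
            errors.append(f"{feature}: out of range")
    return errors

assert validate({"age": 35, "country": "US"}, schema) == []
assert validate({"age": -1, "country": "XX"}, schema) == \
    ["age: out of range", "country: out of domain"]
```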

The other day, I wrote about modeling use cases, which included data modeling. That kind of TLA+/PlusCal modeling may have applications here to design and enforce a rich/sophisticated schema, with high-level specifications of some of the main operations on the data.

Model training

This section talks about warm-starting, which is inspired by transfer learning. The idea is to first train a base network on some base dataset, then use the 'general' parameters from the base network to initialize the target network, and finally train the target network on the target dataset. This cuts down the training time significantly. When applying this to continuous training, TFX helps you identify a few general features of the network being trained (e.g., embeddings of sparse features). When training a new version of the network, TFX initializes (or warm-starts) the parameters corresponding to these features from the previously trained version of the network, and fine-tunes them with the rest of the network.
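Warm-starting can be sketched as copying over the "general" parameters and freshly initializing the rest (a toy dictionary-based illustration with made-up parameter names; TFX operates on TensorFlow checkpoints):

```python
# Sketch of warm-starting: copy "general" parameters (e.g. embeddings
# of sparse features) from the previous model version, initialize the
# remaining parameters fresh, then fine-tune everything on new data.
# (Toy illustration; TFX does this with TensorFlow checkpoints.)
import random

def init_params(names):
    """Fresh random initialization for every named parameter."""
    return {name: random.gauss(0, 0.1) for name in names}

def warm_start(new_names, old_params, general_prefixes=("embedding/",)):
    params = init_params(new_names)
    for name in new_names:
        if name.startswith(general_prefixes) and name in old_params:
            params[name] = old_params[name]   # reuse the trained value
    return params

old = {"embedding/app_id": 0.7, "dense/w1": 0.3}
new = warm_start(["embedding/app_id", "dense/w1", "dense/w2"], old)
assert new["embedding/app_id"] == 0.7   # warm-started from old version
assert set(new) == {"embedding/app_id", "dense/w1", "dense/w2"}
```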

I first wondered whether it would be worthwhile to check when warm-starting would be applicable/beneficial. But then I realized: why bother? ML is empirical and practical; try it and see if warm-starting helps, and if not, don't use it. On the other hand, if the design space becomes very large, this kind of applicability check can help save time and guide the development process.

This section also talks about FeatureColumns which help users focus on which features to use in their machine learning model. These provide a declarative way of defining the input layer of a model.

Model evaluation and validation

A good model meets a desired prediction quality, and is safe to serve.

It turns out the "safe to serve" part is not trivial at all: "The model should not crash or cause errors in the serving system when being loaded, or when sent bad or unexpected inputs, and the model shouldn’t use too many resources (such as CPU or RAM). One specific problem we have encountered is when the model is trained using a newer version of a machine learning library than is used at serving time, resulting in a model representation that cannot be used by the serving system."

Model serving

This component aims to scale serving to varied traffic patterns. They identified interference between the request-processing and model-loading flows of the system, which caused latency peaks during the interval when the system was loading a new model or a new version of an existing model. To solve this, they provide a separate dedicated threadpool for model-loading operations, which reduces the peak latencies by an order of magnitude.

This section first says it is important to use a common data format for standardization, but then backtracks on that: "Non neural network (e.g., linear) models are often more data intensive than CPU intensive. For such models, data input, output, and preprocessing tend to be the bottleneck. Using a generic protocol buffer parser proved to be inefficient. To resolve this, a specialized protocol buffer parser was built based on profiles of various real data distributions in multiple parsing configurations. Lazy parsing was employed, including skipping complete parts of the input protocol buffer that the configuration specified as unnecessary. The application of the specialized protocol buffer parser resulted in a speedup of 2-5 times on benchmarked datasets."
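The lazy-parsing idea, skipping the parts of the input that the configuration marks as unnecessary, can be illustrated with a toy tag-length-value format (a hypothetical encoding for illustration, not actual protocol buffers):

```python
# Sketch of lazy parsing: decode only the fields the serving
# configuration actually needs, skipping the rest without ever
# materializing them. (Toy tag-length-value format, not protobuf.)
import struct

def encode(fields):
    """Encode {tag: bytes} as [tag:1B][length:4B][payload] records."""
    out = b""
    for tag, payload in fields.items():
        out += struct.pack(">BI", tag, len(payload)) + payload
    return out

def lazy_parse(buf, wanted):
    """Decode only tags in `wanted`; skip other payloads entirely."""
    fields, offset = {}, 0
    while offset < len(buf):
        tag, length = struct.unpack_from(">BI", buf, offset)
        offset += 5
        if tag in wanted:
            fields[tag] = buf[offset:offset + length]
        offset += length   # skipping is just advancing the offset
    return fields

buf = encode({1: b"needed", 2: b"x" * 1000})
assert lazy_parse(buf, wanted={1}) == {1: b"needed"}
```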

In NIPS 2017, Google had a more detailed paper on the TensorFlow serving layer.

Case Study: Google Play

One of the first deployments of TFX is the recommender system for the Google Play mobile app store, whose goal is to recommend relevant Android apps to Play users. Wow, talk about scale: Google Play has over one billion active users and over one million apps.

This part was very interesting and is a testament to the usefulness of TFX:
"The data validation and analysis component helped in discovering a harmful training-serving feature skew. By comparing the statistics of serving logs and training data on the same day, Google Play discovered a few features that were always missing from the logs, but always present in training. The results of an online A/B experiment showed that removing this skew improved the app install rate on the main landing page of the app store by 2%."

MAD questions

1) The paper provides best practices for validating the sanity of ML pipelines, in order to avoid the Garbage In, Garbage Out (GIGO) syndrome. How many of these best practices are likely to change over the years? I can already see a paper coming in the next couple of years, titled "One size does not fit all for machine learning".

In fact, this thought sent me down a rabbit hole, where I read about Apache Beam, Google Dataflow, and then the Lambda versus Kappa architectures. Very interesting work, which I will summarize soon.

2) Why do research papers not have a MAD questions section?
(I am not picking on this paper.) I guess research papers have to claim authority and provide a sense that everything is under control. Pointing out unclosed loops and open-ended questions may give a bad impression of the paper. The future work sections often come as one paragraph at the end of the paper, and play it safe. I don't think it should be that way, though. More relaxed venues, such as the Hot-X workshops, can provide a home for papers that raise questions.

Wednesday, January 10, 2018

Why you should use modeling [with TLA+/PlusCal]

I recently gave a two day seminar on "debugging your designs with TLA+/PlusCal" at Dell. So I wanted to write some of the motivation for modeling and debugging your models while this is still fresh in my mind.

You need modeling

No, not that kind of modeling! Actually the naming clash is not accidental after all: fashion designers need models to test/showcase their designs.

You need modeling because:

  • Failing to plan is planning to fail 
  • Everything is a distributed system
  • The corner cases ... they are so many
  • Do it for the development process
  • Being smart does not scale

Failing to plan is planning to fail

This is from the paper, "Use of formal methods at Amazon Web Services, 2014".
"Before launching any complex service, we need to reach extremely high confidence that the core of the system is correct. We have found that the standard verification techniques in industry (deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing, etc.) are necessary but not sufficient.
Human fallibility means that some of the more subtle, dangerous bugs turn out to be errors in design; the code faithfully implements the intended design, but the design fails to correctly handle a particular ‘rare’ scenario. We have found that testing the code is inadequate as a method to find subtle errors in design."

Modeling shows you how sloppy your "design" is. You think you got the design right, but for any complex service worth its salt, you almost always get it wrong (more on this below). You won't find what you got wrong unless you model your design and validate it. And you want to find that out early on, without the sunk cost of correctly implementing your flawed design. Otherwise, even after Jepsen shows you that you screwed up, you are already too invested in this flawed design, so you try to patch it, and you end up with a slow and bloated system.

Everything is a distributed system

There's just no getting around it: You're building a distributed system.

In this process, you are very likely to make an assumption that will bite you back, such as one hop is faster than two hops, zero hops is faster than one hop, and the network is reliable. What you assumed was an atomic block of execution will be violated because another process will execute concurrently and change the system state in a way you didn't anticipate. And don't even get me started on faults, they are in a league of their own, they will collude with your program actions to screw you up.

The corner cases, they are so many

In the 2004 "Consensus on Transaction Commit" paper, Lamport and Gray mention that they could not find a correct three-phase commit protocol in database textbooks/papers, because each one fails to account for a corner case.
"Three-Phase Commit protocols ... have been proposed, and a few have been implemented [3, 4, 19]. They have usually attempted to “fix” the Two-Phase Commit protocol by choosing another TM if the first TM fails. However, we know of none that provides a complete algorithm proven to satisfy a clearly stated correctness condition. For example, the discussion of non-blocking commit in the classic text of Bernstein, Hadzilacos, and Goodman [3] fails to explain what a process should do if it receives messages from two different processes, both claiming to be the current TM. Guaranteeing that this situation cannot arise is a problem that is as difficult as implementing a transaction commit protocol."

Do it for the development process

Modeling is good for achieving clarity of thinking and communication. Lamport used TLA+ without a model checker from the 1990s to 2010. Even without the model checker, he still found value in modeling: it made him nail down the specifications and communicate them to others precisely. When you write things down precisely, it enables your brain to move on and do more with it. Clarity begets more clarity. Focus begets more focus.

Once you abstract away the clutter, come up with a precise model in PlusCal, and validate it with exhaustive model checking, you can focus on the essence of the problem and see alternative ways to implement it. And throughout the development process where you refine/implement the design, the PlusCal model will help a lot for communicating the design to other engineers, and for checking which implementations would work for each subsystem.

Being smart does not scale; exhaustive model checking comes to the rescue

Once you get the design down in pseudocode (but not in TLA+ or PlusCal), couldn't invariant-based design work from there? Can't you just check each action in your pseudocode to see if it preserves your safety/invariant conditions, and be done with it? There is no need to model with TLA+/PlusCal and model-check, right?

Sigh. Did you read the above carefully? Everything is a distributed system, and there are many corner cases. Sloppy pseudocode is not going to cut it. And don't trust your deduction abilities for proving that each action preserves the safety conditions you identify. That works for simple toy examples, but for complicated examples you need to do a lot of extra mental inferencing, linking of concepts, and creative leaps, which is very error-prone.

Consider the hygienic philosophers example I discussed earlier. Your invariant talks about being in the critical section, but the actions talk about ... forks ... per edge ... over a dynamic priority graph. Doing that mental mapping would be very hard. Instead, the TLA+/PlusCal model checker has you covered with exhaustive checking: a breadth-first traversal of all possible permutations of action scheduling, which shows you whether there is any possible execution (including the fault actions you model) that can violate your invariants.

This is why I so happily adopted TLA+/PlusCal for my distributed systems class.  Even for sophisticated algorithms, I can refer the students to the TLA+/PlusCal model to practice and play with the algorithm, so they can internalize what is going on.


This already got long. So in a later post, I will write more about the modeling/abstracting process, the mathematical/invariant-based thinking, and about some shortcomings of modeling.

Drop me a message if you are interested in having me over for a training/talk!

MAD questions

This section is here because of my New Year's commitment.

1) In addition to protocol/system modeling, workflow modeling, and business logic modeling, TLA+/PlusCal has also been used for data modeling. Are there any other uses? If you have interesting use cases, please let me know, as I am curious.

2) Actually, I am aware of another use case for PlusCal model checking, but it seems to be mostly for toy examples so far. You define actors/operators that can act in any order, and you challenge the model checker by claiming that concurrent operation of those actors/operators cannot ever satisfy a condition that you would like to happen. The model checker, being the jerk it is, responds with a trace showing that it is possible, and you adopt this trace as the solution. The Die Hard puzzle is an example of this. This approach could be useful for scheduling, maybe even for cluster job scheduling under constraints, anticipating some statistical failures and still hitting the deadline. But I am not aware of any real-world use of this. Is this used in practice?

3) Is there a bad time for modeling? When should you not model?
Sometimes you may need to go bottom-up to figure out the domain/problems first. After you have an idea of the domain, you can start to model and go top-down. I think it would not make sense to be opinionated and make modeling calls before you are informed about the domain and its issues. I think modeling is just thinking made more rigorous, and you should get the ball rolling on the thinking/understanding part a bit before attempting to model.

Monday, January 8, 2018

Salute to Prof. Mohamed Gouda: Elegance in computing

A couple of months ago, I attended a special half-day workshop organized to honor Prof. Mohamed Gouda's contributions to computer science, and particularly to the self-stabilizing systems community.

Mohamed is the Mike A. Myers Centennial Professor at the University of Texas at Austin. He has been in Austin, Texas since 1980, for almost 40 years. His research contributions to distributed systems have been phenomenal (borrowing a word Mohamed likes to use for things that excite him). I am proud that Mohamed is my academic grandfather; he was the PhD advisor of my PhD advisor, Prof. Anish Arora. I wrote about "how to find your advisor" in my previous post; I hope the elegance/rigor of Mohamed and Anish rubbed off on me a bit.

At the workshop, there were about 10 talks, technical in nature, but at the end of the talks each speaker mentioned how their research and career had been enriched by Mohamed's contributions and help.

I talked about my technical report on "Does the cloud need stabilizing?", and at the end I took some time to mention what I learned from Mohamed, and how I benefited being around him, albeit for a limited time.

The most significant things I learned from Mohamed are not from what he taught on the board or at conferences, but from how he acted. Actions speak louder than words.

Always be passionate about research

When I was a PhD student, I had time to spend with Mohamed at self-stabilizing systems workshops and later as part of our work on the DARPA NEST wireless sensor networks project.

What impressed me most was Mohamed's passion for research. At that time he had been involved in research for 30+ years already, but he was as fresh and enthusiastic about research problems as a PhD student. When someone mentioned a research problem, you could see his ears prick up and his eyes sparkle. And this was often after a long, tiring day of research talks and meetings.

There have been many days in my academic career when I was feeling tired and down, and I motivated myself by remembering Mohamed's passion for research, and asking myself what excuse I have to be bored, unexcited, or unmotivated. (I would be remiss if I didn't add my advisor Anish Arora and postdoc supervisor Nancy Lynch to this mention. I try to imitate their genuine excitement about research.)

Never turn down collaboration opportunities

Mohamed has always been very positive and full of energy and humor. He has been graceful and gentle with everyone.

He jokes that he never says no to any collaboration opportunity, ever since he regretted saying no to collaborating on the then newly starting field of "ad hoc networks" (I am most likely misremembering the name/anecdote, but was that with Charles Perkins?), on the grounds that "ad hoc networks" did not sound like a respectable name for a research area :-)

He helped build the self-stabilizing systems community, as many speakers credited him at the workshop. If you help people, you help yourself as well.

Clarify, Simplify, and Generalize

On the technical side of things, I learned from Mohamed the art of abstraction. He really has a knack for simplifying things to their essentials, contributing there, and generalizing from there. Mohamed is famous for saying that he does not trust any distributed algorithm that has more than 3 actions. You can see this elegance and simplicity at work in his classic papers. He was colleagues for many years with Dijkstra, who was a grandmaster of the art of abstraction.

In a research talk you could find Mohamed listening carefully, asking a couple of trivial-seeming definition questions, and, towards the end of the talk, making a very insightful comment or observation about the work. Knowing how Mohamed works, I can see that he is simplifying and mapping the work to his mental framework, then generalizing and contributing from there.

Mohamed's book "Elements of Network Protocol Design" taught me about networks, after the networking class I took had made me hate networking. (I wrote about it here earlier.) The book also showed how to abstract away details, simplify things to their essence, and focus on that essence.

MAD questions

(This section is here because of my new year's commitment.)

1) Mohamed has always been an exceptionally good storyteller as well. Several successful researchers I know are great storytellers. I wonder if this is an important feature for successful researchers. Could this be correlated with the researchers' ability to frame/present their work better?

2) Shouldn't we create more opportunities for our graduate students to spend quality time (or hanging around time -- or maybe they are the same) with such energizing and inspiring researchers? Conferences these days look far away from fulfilling that goal, maybe very small workshops or summer schools are better places for that. What are other ways to cultivate that interaction? Or to make this scale, how can we capture the thought-processes of these people better? The research papers are finished products and fail miserably at capturing the thought processes of the researchers.

3) I called it the "art of abstraction" because I don't think there are well-defined rules for abstraction. What is the right level of abstraction to treat things at? Should you abstract things so far that most people accuse you of going too far?

In a quote attributed to Einstein, he says: "Everything should be made as simple as possible, but not simpler." Could this pass as a helpful heuristic? What is the limit of possible?

Saturday, January 6, 2018

How to find your advisor

I had tweeted this earlier about "Rocking your PhD":

It is that simple. This is actually a concise version of longer advice I provided earlier.

Since I haven't talked about it before, I would now like to write some suggestions on finding an advisor.

How to find your advisor

Ask around and get advice from senior PhD students in the department about faculty as potential advisors.

In the first semester of your graduate studies, take 3-4 classes you are interested in. This provides a good opportunity to meet and impress your prospective advisor. If there is a class project, go overboard and exceed expectations. Try to improve on an algorithm mentioned in the class, and discuss this with the prospective advisor.

Before you commit to an advisor, make sure it is a good match. As you take a class from your prospective advisor, see if you can work together during the semester on a small, entry-level problem or project. After you commit to working with an advisor, it is messy and costly to switch advisors, since there is a sunk cost of investment from both parties.

You want to find a hardworking advisor, so that you work hard as well. Assistant professors have a lot of incentive to work hard to get tenure. On the other hand, many associate and full professors also work hard, and they may provide additional benefits: more experience and an established research group where you can get mentoring from senior students. In any case, you should check the track record of the faculty, with emphasis on recent publications.

Try to find an advisor from whom you can learn a good toolkit, e.g., formal methods, graph theory, probability theory, or machine learning theory. Application domains come and go following trend cycles; toolkits, on the other hand, are fundamental and can be applied to application domains as needed. To give a personal example: after learning formal methods and reasoning about distributed algorithms as my toolkit, I was able to apply them to the wireless sensor networks domain, later to the cloud computing and distributed databases domains, and hopefully soon to the distributed machine learning domain.

To balance the previous advice, make sure the advisor is also doing work on the practical side of things. Faculty jobs are very scarce, and after your PhD you should have the option of joining industrial labs/positions.

MAD questions

Is a good advisor enough?
I think it is necessary but not sufficient.

Can you succeed with a bad advisor?
First, define "succeed". If it means a good faculty job or a prominent industrial research position, I don't know of examples of this. It may be possible if there is good support from other graduate students and faculty.

Are there bad advisors?
There are definitely advisors that don't care. Or that are too busy and don't help enough.

Is a PhD largely an apprenticeship?
Yes, I think so. You learn, by diffusion, your advisor's taste in finding problems and devising solutions. In CS there are distinct categories of theory, systems, metrics, engineering, and algorithms people. I think the advisor imprints on the student not just the research category but also a perspective on things, such as openness to new ideas, having a home conference versus being more promiscuous, etc.

Is there a personality match thing?
Even though this is a professional relationship, personality clashes between the advisor and student may impede progress, or sometimes even lead to a blow up. A student that is very stubborn/uncoachable is not good, but the student shouldn't just follow or wait for instructions either. I like the student to push back and defend what he/she thinks is true, but also to be open-minded/receptive to alternative opinions/perspectives.

I knew of a faculty member who asked students for a Myers-Briggs test (the one where we all get the INTJ category). He was a very smart professor, so he probably found utility in it. There is also the Big Five test. Personality styles may be useful to gauge how a student would fare with different advisement styles: the socratic method, the sink-or-swim method, the trainer method, the coaching method, and the manager method. As for me, I don't care to know the personality types of my PhD students; I just want to see if we can work well, discuss well, be productive, and grow together.

Related links

I enjoy yakking about academic advice.

Thursday, January 4, 2018

Logical clocks and Vector clocks modeling in TLA+/PlusCal

In a distributed system, there is no shared state, counter, or any other kind of global clock. So we cannot implicitly associate an event with its time, because one node may have a different clock than another. Time synchronization is not easy to achieve, and failures complicate things.

It turns out we care about time because of its utility in ordering of the events. Using this observation, in 1978, Leslie Lamport offered a time-free definition for "happens before": Event A happens before event B (denoted as A hb B) if and only if A can causally affect B.

In the context of distributed systems, A hb B iff
1. A and B are on the same node and A is earlier in computation than B
2. A is the send of a message and B is the receive event for that message
3. There is some third event C such that A hb C and C hb B.

This also suggests the definition of the "concurrent" relation. Events A and B are concurrent iff $\neg( A ~hb~ B) \land \neg( B ~hb~ A)$

To capture the hb relation, logical clocks and vector clocks are proposed. I modeled these in PlusCal, and I wanted to share that in this post.

Logical clocks

A logical clock is a way of assigning a number to an event. Assume each node $j$ has a logical clock, $lc.j$, that assigns a number to any event in that node. $lc.j$ can simply be a counter.

We want the following property from $lc$: $A ~hb~ B \Rightarrow lc.A < lc.B$
Note that the converse might not hold.

The algorithm is simple. For a send event at $j$: The clock is incremented and this updated value is the timestamp of the new event and of the message being sent.

For a receive event at $j$: The clock is updated by taking the maximum of the current clock at $j$ and the timestamp of the message (time of the corresponding send event). This maximum must be incremented to ensure that the new clock value is strictly larger than both previous events.
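These two update rules can be sketched in a few lines of Python. This is just an illustration of the rules above, not the PlusCal model; the class and method names are my own.

```python
class LogicalClock:
    """Scalar Lamport logical clock for one node (illustrative sketch)."""

    def __init__(self):
        self.lc = 0

    def local_event(self):
        # Any internal event increments the clock.
        self.lc += 1
        return self.lc

    def send(self):
        # Send: increment the clock; the new value timestamps the message.
        self.lc += 1
        return self.lc

    def receive(self, msg_ts):
        # Receive: take the max of the local clock and the message timestamp,
        # then increment so the receive event is strictly later than both.
        self.lc = max(self.lc, msg_ts) + 1
        return self.lc
```

For example, if node $k$ is at clock 2 and receives a message timestamped 1 from node $j$, the receive event gets timestamp max(2, 1) + 1 = 3, strictly larger than both the send event and $k$'s previous event.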

Here is the PlusCal modeling of the algorithm.

I use STOP to limit the scope of model checking, since the interesting behavior often presents itself within a limited time; we don't need to go to logical clock values in the hundreds. Another way to do this would be to go to the model checker's advanced options tab, use depth-first execution, and limit the depth.

I don't have interesting invariants to check for logical clocks, because I don't know of a good way to create and maintain events with timestamps while also capturing the events' causality relationships and crosschecking those against the event timestamps. Maybe if I had created a time graph and run the LC model over that time graph, it might work.

Instead, I use a bait invariant to get a trace that violates the invariant, so I can observe that the model indeed operates as I intended.
BaitInv == (\A k \in Procs: lc[k] < STOP /\ msg[k] < STOP) \* violated!

Vector clocks

With logical clocks, we have $A ~hb~ B \Rightarrow lc.A < lc.B$, but we cannot determine whether $A ~hb~ B$ simply by examining the timestamps. For example, for $lc.A_j=3$ and $lc.B_k=9$, what can we say about the relationship between these two events? We know that $\neg (B ~hb~ A)$ (by the contrapositive of the property above). But do we know that $A ~hb~ B$?

The answer is no. A and B may be events on different processes, and B's timestamp might have reached 9 without any message chain (causality link) from A. Or not: the process where B occurs might indeed have received a message from A (or from a later event), and that might have driven up the lc at that process. The problem with logical clocks is that these chains are not captured, because there is only one scalar to keep the counter/clock.

In order to keep tabs on which values were last learned from which processes, we need to extend the scalar counter to a vector of size N, the number of processes.

The algorithm is then similar to the logical clocks algorithm, except that it maintains a vector of "logical clocks".

Send event at $j$: The clock entry of $j$ in the vector clock at $j$ is incremented and the resulting vector is assigned as the timestamp of the event and message being sent.

Receive event at $j$: The vector clock is updated by taking the pair-wise maximum of the current vector and the timestamp of the received message. Then clock entry for $j$ in the vector clock must be incremented to ensure that the new clock value is strictly larger than both previous events.

Here is the PlusCal model of vector clocks algorithm.

The PairMax function is neat for defining the pairwise maximum of two vectors of size N. Otherwise, you will notice the model is similar to that of LC.

Again, I am unable to check with respect to events. But here I use this invariant to check that the VC structural property holds. It says that, at any point in the execution, process $k$'s knowledge of its own clock, vc[k][k], is greater than or equal to process $l$'s knowledge of $k$'s clock, vc[l][k]. This is because $l$ can only learn of $k$'s clock from a communication chain originating at $k$ and cannot advance $k$'s clock itself.
VCOK == (\A k,l \in Procs: vc[k][k] >= vc[l][k])
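As a cross-check outside TLA+, here is a small Python sketch of the vector clock update rules together with the hb comparison. The names and the choice of N=3 processes are illustrative; this is not the PlusCal model.

```python
N = 3  # number of processes (an assumption for this sketch)

def fresh_vc():
    # Each process starts with an all-zero vector of size N.
    return [0] * N

def send(vc, j):
    # Send at j: increment j's own entry; a copy of the vector
    # timestamps the event and the message being sent.
    vc[j] += 1
    return list(vc)

def receive(vc, j, msg_ts):
    # Receive at j: pair-wise maximum with the message timestamp,
    # then increment j's own entry so the event is strictly later than both.
    for i in range(N):
        vc[i] = max(vc[i], msg_ts[i])
    vc[j] += 1

def happened_before(ts_a, ts_b):
    # A hb B iff ts_a <= ts_b pointwise and ts_a != ts_b.
    return all(a <= b for a, b in zip(ts_a, ts_b)) and ts_a != ts_b
```

The happened_before test is the payoff over scalar logical clocks: comparing two vector timestamps tells us whether the events are causally related or concurrent, and after any run the VCOK property above (vc[k][k] >= vc[l][k] for all k, l) holds for these vectors as well.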

MAD questions

(This section is here due to my new year's resolution.)

Of course, there are also matrix clocks. While vector clocks maintain what a process knows of the clocks of other processes, matrix clocks maintain what a process knows of what other processes know of other processes as well. "A matrix clock maintains a vector of the vector clocks for each communicating host. Every time a message is exchanged, the sending host sends not only what it knows about the global state of time, but also the state of time that it received from other hosts. This allows establishing a lower bound on what other hosts know, and is useful in applications such as checkpointing and garbage collection." (Frankly I am not aware of matrix clocks being used in practice.)

There are also version vectors, an application of the vector clocks idea to distributed replica maintenance: instead of N being the number of clients/processes, N becomes the number of replicas, which is a smaller number, typically 3-5. Version vectors enable causality tracking among data replicas and are a basic mechanism for optimistic replication.
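The operation that makes version vectors useful for optimistic replication is comparing two replicas' vectors to decide whether one state descends from the other or the two conflict. A minimal Python sketch, with a hypothetical compare function and return labels of my own choosing:

```python
def compare(v1, v2):
    """Compare two version vectors of equal length.
    Returns "equal", "precedes", "descends", or "concurrent"."""
    le = all(a <= b for a, b in zip(v1, v2))  # v1 pointwise <= v2
    ge = all(a >= b for a, b in zip(v1, v2))  # v1 pointwise >= v2
    if le and ge:
        return "equal"
    if le:
        return "precedes"   # v1's replica state is older
    if ge:
        return "descends"   # v1's replica state is newer
    return "concurrent"     # neither dominates: a conflict to reconcile
```

When neither vector dominates, the replicas hold concurrent updates and the system must reconcile them, e.g., by merging or asking the application to resolve the conflict.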

I wonder if we could have version matrices, extending version vectors analogously to how matrix clocks extend vector clocks. Would that have any utility for distributed replicas?

Related links

Earlier I had provided a PlusCal model of hybrid logical clocks that augment logical clocks with physical clocks.

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.

There is a vibrant Google Groups forum for TLA+: !forum/tlaplus
By clicking on the label "tla" at the end of the post, you can reach all my posts about TLA+.

Tuesday, January 2, 2018

Mad questions

I am a very curious-natured person. Every child starts asking lots of questions around age 3-4, but according to my mom, I took that to another level, constantly asking "but why?" and driving her crazy. On the other hand, I believe I owe my curiosity to my mom. She was an elementary school teacher (a damn good one), and was instrumental in my development. She was (and still is) a very curious person, and she taught me how to ask more and better questions. For example, while traveling, she would notice different plants and ask me why the landscape was different there, and we would make guesses.

The Turkish education system was not big on asking questions (these days it is waaaay waaaaay worse). Since the lazy path is to memorize and regurgitate answers, that is what it demanded from the students. But I think my questioning skills mostly survived. Among my friends, I was famous for replying to questions with questions of my own, and if not, my answer was often "I don't know", because the question led me to ask more questions internally and I didn't really know what I thought about it yet.

For my PhD studies, I got a boost from my advisor Prof. Anish Arora, because as any great researcher he knew how to ask good questions and I learned from his example. Questions are the fuel needed for doing research. You can argue that the hard part of research is to figure out the right questions to ask!

Questions are a good way to test your sanity and mindfulness as well. Questioning helps you think meta and jump out of the system. Therefore I try to incorporate questioning in my studying as I wrote earlier: "For studying, writing, or researching, my process is also to take on the task in units of 30 minutes. I use the pomodoro technique and customized it over time a lot. I am right now on my 3rd version of pomodoro process. I will write about my recent pomodoro setup later on. It involves detaching and going meta for 4 minutes in the beginning and end of pomodoro. In these slots, I step back and force myself to think meta: Is this the right approach? What else can I try? Am I making good progress? Can this be made transformative?"

Unfortunately I feel like I have gotten into a routine with this "trying to ask questions", and I think the questions often come out as shallow/predictable. Those are still useful---at least I am double-checking my sanity/approach---but not really transformative. But yeah, I cannot be sure whether I have become compliant/docile recently, can I? Maybe I am losing my edge, because I have been in the system long enough. Maybe I am starting to see myself as experienced now, and it is affecting me.

So my New Year's resolution is to ask more/better/crazier questions.

To ensure that, I will try to install a system. One part of that system is to ask at least a couple questions in each blog post. Another part is to add a new tag to my blog called "mad questions".

Mad questions go beyond normal/expected questions; they are out-of-left-field questions. Asking mad questions will require a lot of effort and will drive me out of my comfort zone. I will try to give this my best shot, and I think the blogging format is a good medium for it.

The questions for this post

1. Where does the question mark come from? It turns out there is no definite answer. I like its shape, though. It is like a hook.

2. Wikipedia says that some experts say animals can't ask questions:
"Enculturated apes Kanzi, Washoe, Sarah and a few others who underwent extensive language training programs (with the use of gestures and other visual forms of communications) successfully learned to answer quite complex questions and requests (including question words "who" what", "where"), although so far they failed to learn how to ask questions themselves. For example, David and Anne Premack wrote: "Though she [Sarah] understood the question, she did not herself ask any questions — unlike the child who asks interminable questions, such as What that? Who making noise? When Daddy come home? Me go Granny's house? Where puppy? Sarah never delayed the departure of her trainer after her lessons by asking where the trainer was going, when she was returning, or anything else".[10] The ability to ask questions is often assessed in relation to comprehension of syntactic structures. It is widely accepted, that the first questions are asked by humans during their early infancy, at the pre-syntactic, one word stage of language development, with the use of question intonation.[11]"
That is interesting, but also not very believable to me. It is known that dogs, monkeys, and even cats can be surprised by illusion tricks. So, why is being surprised not acceptable as a question?
Maybe we don't understand the apes' questioning. If the ape phrases a question as surprise/frustration, or states its desire to confirm/learn something, shouldn't that maybe count as a question? What are the criteria used here? Taking this further, could hypothetical aliens say humans are unable to ask "questions" and are therefore not intelligent?

3. What makes a question a "good" question? Can you stumble upon an insightful question by chance? What are some heuristics that help with asking good questions? Does a good question involve a perspective-change? (In science, it often seems to be so. When it is very hard to directly confront the issue, the skilled researcher looks at the problem sideways, formulates the question with a new perspective, and then finding a solution becomes feasible.)

4. When I ask a lot of questions in a blog post, how does that make you feel? Charged? Tired? Irritated?
