Monday, February 15, 2016

Paper review: Implementing Linearizability at Large Scale and Low Latency (SOSP'15)

Motivation for the paper

Remember the 2-phase commit problem?  The 2-phase commit blocks when the initiator crashes. Or, after a server crash the client/initiator may not be able to determine whether the transaction is completed. And transaction blocks. Because if we retry we may end up giving inconsistent result or redoing the transaction which messes up linearizability.

You need to have 3-phase commit to fix this. The new-leader/recovery-agent comes and tries to recommit things. Unfortunately, 3-phase commit solutions are complicated, there are a lot of corner cases. Lamport and Gray recommended that the Paxos consensus box can be used to remember the initiator's abort commit decision to achieve consistency, or more precisely they recommended the Paxos box remembers each participants decision  for the sake of shaving off a message in critical latency path.

What this paper recommends is an alternative approach. Don't use Paxos box to remember the decision, instead use a durable log at the participant to remember the decision. At this log the participant stores a completion record, which includes any results that are returned to the client. So if the initiator is confused and retries, or if the client retries, or a recovery-agent from one participating server comes and retries, this querying party is not going to get an inconsistent answer/decision from what is committed/returned earlier from the transaction.

How is the log at the participant durable against the crash of the participant? In other words, how do we ensure that the completion record is preserved? This is where this assumption about fast log-based recovery and RAMCloud specific features comes into play. RAMCloud maintains a log-structured replication and quick recovery, that ensures the completion record is not lost.

The paper presents this durable log-based transaction serializability idea with single participant, i.e., single object transaction, and then shows that it can be extended to multiple participant transactions.

That was my plot line for motivating the approach in the paper. The paper used, what I think is an indirect way to motivate the problem, by first pointing a problem with linearizability: exactly-once semantics. The figure illustrates that "at least once + idempotence != exactly-once". The paper then presents the completion record idea to achieve exactly-once semantics, and then builds linearizability on top of it, and in turn builds transactions on top of it.

Another idea in the paper is "implementing transactions on top of a lightweight linearizability layer". The paper argues that after having a lightweight linearizability layer in place, transations in fact become easier to implement. We will revisit this idea at the end of the review to see how it holds up.

RIFL architecture

The lightweight linearizability layer the paper suggests is named RIFL (Reusable Infrastructure for Linearizability).
In order to implement exactly-once semantics, RIFL must solve 4 problems: RPC identification, completion record durability, retry rendezvous, and garbage collection.
1) In order to detect redundant RPCs, each RPC must have a unique identifier, which is present in all invocations of that RPC.
2) RIFL assumes that the underlying system provides durable storage for completion records keyed with the RPC identifier.
3) If an RPC completes and is then retried at a later time, the retries must find the completion record to avoid re-executing the operation. To achieve this  each operation is associated with a particular object in the underlying system, and the completion record is stored wherever that object is stored.
4) For garbage collection, a completion record should not be reclaimed until it is certain that the corresponding request will never be retried. This can happen in two ways. First, once the client has received a response, it will never retry the request. Clients provide acknowledgments to the servers about which requests have successfully completed, and RIFL uses the acknowledgments to delete completion records.
RIFL appears to the rest of the system as three modules. The first, RequestTracker, runs on client machines to manage sequence numbers for outstanding RPCs (Figure 3). The second module, LeaseManager, runs on both clients and servers to manage client leases (Figure 4). On clients, LeaseManager creates and renews the client’s lease, which also yields a unique identifier for the client. On servers, LeaseManager detects the expiration of client leases. The third module, ResultTracker, runs only on servers: it keeps track of currently executing RPCs and manages the completion records for RPCs that have finished (Figure 5).

Implementing transactions over RIFL

The paper shows how Sinfonia minitransactions can be implemented over RIFL layer. You can read a summary of Sinfonia minitransactions here. The implementation of Sinfonia transactions over RIFL requires a long description, so I will avoid summarizing it myself, and instead point to a couple paragraphs verbatim from the paper to give you an idea about this.
"No side-effects" is the key idea when implementing transactions. The Transaction object defers all updates to the key-value store until commit is invoked. Commit must atomically verify that each object has the required version number, then apply all of the write and delete operations. If any of the version checks fail, the commit aborts and no updates occur.
Commit is implemented using a two-phase protocol where the client serves as coordinator. In the first phase, the client issues one prepare RPC for each object involved in the transaction (see Figure 6). The server storing the object (called a participant) locks the object and checks its version number. If it doesn't match the desired version then the participant unlocks the object and returns ABORT; it also returns ABORT if the object was already locked by another transaction. Otherwise the participant stores information about the lock in a transaction lock table and creates a durable record of the lock in its log. It then returns PREPARED to the client. The client issues all of the prepare RPCs concurrently and it batches requests to the same participant. If all of the prepare RPCs return PREPARED, then the commit will succeed; if any of the prepare RPCs return ABORT, then the transaction will abort. In either case, the client then enters the second phase, where it issues a decision RPC for each object. The participant for each object checks whether the RPC indicates “commit” or “abort”. If the decision was to commit, it applies the mutation for that object, if any. Then, whether committing or aborting, it removes the lock table entry and adds a tombstone record to the RAMCloud log to nullify the lock record. The transaction is effectively committed once a durable lock record has been written for each of the objects. At this point, regardless of the client's actions, the mutations will eventually be applied in the crash recovery.

If the client is suspected to have crashed, the participant for the transaction's first object acts as recovery coordinator. The recovery coordinator executes a two-phase protocol similar to that of the client, except that its goal is to abort the transaction unless it has already committed (in general, there may not be enough information to complete an incomplete transaction, since the client may have crashed before issuing all of the prepare RPCs). In the first phase the recovery coordinator issues a requestAbort RPC for each object, whose participant will agree to abort unless it has already accepted a prepare for the object. If requestAbort returns PREPARED for every object in the transaction, then the transaction has already committed. Otherwise, the recovery coordinator will abort the transaction. In either case, the recovery coordinator then sends decision RPCs for each object. In order to provide enough information for crash recovery, the client includes identifiers for all objects in the transaction as part of each prepare. This information serves two purposes. First, it allows each participant to identify the recovery coordinator (the server for the first object in the list). Second, the object list is needed by the recovery coordinator to identify participants for its requestAbort and decision RPCs. The recovery coordinator may not have received a prepare if the client crashed, so when a participant invokes startRecovery it includes the list of objects that it received in its prepare.

Transactions over Linearizability vs. Linearizability over Transactions

Implementing transactions over linearizability certainly provided a lot of benefit in terms of simplifying the complex recovery protocol in the original Sinfonia system. By implementing Sinfonia over RIFL, the paper did not need to implement the recovery protocol of Sinfonia. On the other hand we don't have results from the paper to see if implementing Sinfonia over RIFL helped improve performance.  The paper does not include any comparison of performance with base Sinfonia transactions.  Or the paper could have at least measured throughput of base RAMCloud transaction and RAMCloud with RIFL transactions to see if there is a throughput increase. That is also missing sorely in the paper.

As far as general transactions and general linearizability is concerned: the RIFL paper doesn't compare with Calvin, a system that made the case for transactions over linearizability. This is a big omission as Calvin system set out to do just that, implementing transactions over a linearized log: “By first writing transaction requests to a durable, replicated log, and then using a concurrency control mechanism that emulates a deterministic serial execution of the log's transaction requests, Calvin supports strongly consistent replication and fully ACID distributed transactions, and manages to incur lower inter-partition transaction coordination costs than traditional distributed database systems.”

There is also the question of throughput vs. latency for transactions. The Calvin paper suggests that transactions over linearizability improves throughput but latency suffers. I think the reason is as follows: linearizability-first avoids coordination headache/contentions. So eventhough it may add to latency, it should help improve throughput overall because coordination contention is avoided. Things are all predecided by the linearizability layer, aka the log.

Evaluation section

The evalution section shows the practical implications of the shim lightweight linearizability layer and demonstrates that this provides low overhead transactions even as the number of participants increase.
The RIFL+RAMCloud stack is compared/contrasted with H-Store main-memory database system using the TPC-C benchmark. The RAMCloud implementation of RIFL exhibits high performance: it adds less than 4% to the 13.5 μs base cost for writes, and simple distributed transactions execute in about 20 μs. RAMCloud transactions outperform H-Store on the TPC-C benchmark, providing at least 10x lower latency and 1.35x–7x as much throughput.
To contextualize these comparison results,  keep in mind that H-Store is a totally different beast than RAMCloud. H-Store is a  row-store-based main-memory relational database management system for OLTP applications. So H-Store is definitely more heavyweight to deal with general transactions and relational databases.

Figure 12 graphs the throughput of RIFL-based transactions. For transactions involving a single participant, a cluster with 10 servers can support about 440k transactions per second. With more participants per transaction, throughput drops roughly in proportion to the number of participants: with five participants in each transaction, the cluster throughput is about 70k transactions per second. Figure 12 also shows that the single-server optimization improves throughput by about 40%.

Related links

The paper:
Conference video:

The case for RAMClouds:
RAMCloud Reloaded:

Thursday, February 4, 2016

Holistic Configuration Management at Facebook

Move fast and break things

That was Facebook's famous mantra for developers. Facebook believes in getting early feedback and iterating rapidly, so it releases software early and frequently: Three times a day for frontend code, three times a day for for backend code. I am amazed that Facebook is able to maintain such an agile deployment process at that scale. I have heard other software companies, even relatively young ones, develop problems with agility, even to the point that deploying a trivial app would take couple months due to reviews, filing tickets, routing, permissions, etc.

Of course when deploying that frequently at that scale you need discipline and good processes in order to prevent chaos. In his F8 Developers conference in 2014, Zuckerberg announced the new Facebook motto "Move Fast With Stable Infra."

I think the configerator tool discussed in this paper is a big part of the "Stable Infra". (By the way, why is it configerator but not configurator? Another Facebook peculiarity like the spelling of Übertrace?)

Here is a link to the paper in SOSP'15 proceedings.
Here is a link to the conference presentation video.

Configuration management

What is even more surprising than daily Facebook code deployment is this: Facebook's various configurations are changed even more frequently, currently thousands of times a day. And hold fast: every single engineer can make live configuration changes! This is certainly exceptional especially considering that even a minor mistake could potentially cause a site-wide outage (due to complex interdependencies). How is this possible without incurring chaos?

The answer is: Discipline sets you free. By being disciplined about the deployment process, by having built the configerator, Facebook lowers the risks for deployments and can give freedom to its developers to deploy frequently.

Ok, before reviewing this cool system configerator, let's get this clarified first: what does configuration management involve and where is it needed? It turns out it is essential for many and diverse set of systems at Facebook. These include: gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing at TAO, performing monitoring alerts/remediation, updating machine learning models (which varies from KBs to GBs), controlling applications' behaviors (e.g., how much memory is reserved for caching, how many writes to batch before writing to the disk, how much data to prefetch on a read).

Essentially configuration management provides the knobs that enable tuning, adjusting, and controlling Facebook's systems. No wonder configuration changes keep growing in frequency and outdo code changes by orders of magnitudes.

Configuration as code approach

The configerator philosophy is treating configuration as code, that is compiling and generating configs from high-level source code. Configerator stores the config programs and the generated configs in the git version control.

There can be complex dependencies across systems services in Facebook: after one subsystem/service config is updated to enable a new feature, the configs of all other systems might need be updated accordingly. By taking a configuration as code approach, configerator automatically extracts dependencies from source code without the need to manually edit a makefile. Furthermore, Configerator provides many other foundational functions, including version control, authoring, code review, automated canary testing, and config distribution. We will review these next as part of the Configerator architecture discussion.

While configerator is the main tool, there are other configuration support tools in the suite.
Gatekeeper controls the rollouts of new product features. Moreover, it can also run A/B testing experiments to find the best config parameters. In addition to Gatekeeper, Facebook has other A/B testing tools built on top of Configerator, but we omit them in this paper due to the space limitation. PackageVessel uses peer-to-peer file transfer to assist the distribution of large configs (e.g., GBs of machine learning models), without sacrificing the consistency guarantee. Sitevars is a shim layer that provides an easy-to-use configuration API for the frontend PHP products. MobileConfig manages mobile apps' configs on Android and iOS, and bridges them to the backend systems such as Configerator and Gatekeeper. MobileConfig is not bridged to Sitevars because Sitevars is for PHP only. MobileConfig is not bridged to PackageVessel because currently there is no need to transfer very large configs to mobile devices.

The P2P file transfer mentioned as part of PackageVessel is none other than BitTorrent. Yes, BitTorrent finds many applications in the datacenter. This example from Twitter in 2010.

The Configerator architecture

The Configerator application is designed to defend against configuration errors using many phases. "First, the configuration compiler automatically runs developer-provided validators to verify invariants defined for configs. Second, a config change is treated the same as a code change and goes though the same rigorous code review process. Third, a config change that affects the frontend products automatically goes through continuous integration tests in a sandbox. Lastly, the automated canary testing tool rolls out a config change to production in a staged fashion, monitors the system health, and rolls back automatically in case of problems."

I think this architecture is actually quite simple, even though it may look complex.  Both control and data are flowing the same direction: top to down. There are no cyclic dependencies which can make recovery hard. This is a soft-state architecture. New and correct information pushed from top, will cleans old and bad information.

Canary testing: The proof is in the pudding

The paper has this to say on their canary testing:
The canary service automatically tests a new config on a subset of production machines that serve live traffic. It com- plements manual testing and automated integration tests. Manual testing can execute tests that are hard to automate, but may miss config errors due to oversight or shortcut under time pressure. Continuous integration tests in a sandbox can have broad coverage, but may miss config errors due to the small-scale setup or other environment differences. A config is associated with a canary spec that describes how to automate testing the config in production. The spec defines multiple testing phases. For example, in phase 1, test on 20 servers; in phase 2, test in a full cluster with thousands of servers. For each phase, it specifies the testing target servers, the healthcheck metrics, and the predicates that decide whether the test passes or fails. For example, the click-through rate (CTR) collected from the servers using the new config should not be more than x% lower than the CTR collected from the servers still using the old config.
Canary testing is an end-to-end test, and it somewhat overrides trying to build more exhaustive static tests on configs. Of course the validation, review, and sandbox tests are important precautions to try to make sure the config is sane before it is tried in small amount in production. However, given that Facebook already has canary testing, it is a good end proof for correctness of the config, and this somewhat obviates the need for heavyweight correctness checking mechanisms. The paper gives couple examples of problems caught during canary testing.

On the other hand, the paper does not make it clear how conclusive/exhaustive are the canary tests. What if canary tests don't catch slowly manifesting errors, like memory leaks. Also, how does Facebook detect whether there are  abnormality during a canary test? Yes, Facebook has monitoring tools (ubertrace and mystery machine) but are they sufficient for abnormality detection and subtle bug detection? Maybe we don't see adverse effect of configuration change for this application, but what if it adversely affected other applications, or backend services. It seems like an exhaustive monitoring, log collection, and log analysis may need to be done to detect more subtle errors.

Performance of the Configerator

Here are approximate latencies for configerator phases:
When an engineer saves a config change, it takes about ten minutes to go through automated canary tests. This long testing time is needed in order to reliably determine whether the application is healthy under the new config. After ca- nary tests, how long does it take to commit the change and propagate it to all servers subscribing to the config? This la- tency can be broken down into three parts: 1) It takes about 5 seconds to commit the change into the shared git repository, because git is slow on a large repository; 2) The git tailer (see Figure 3) takes about 5 seconds to fetch config changes from the shared git repository; 3) The git tailer writes the change to Zeus, which propagates the change to all subscribing servers through a distribution tree. The last step takes about 4.5 seconds to reach hundreds of thousands of servers distributed across multiple continents.

This figure from the paper show that git is the bottleneck for configuration distribution. "The commit throughput is not scalable with respect to the repository size, because the execution time of many git operations increases with the number of files in the repository and the depth of the git history.  Configerator is in the process of migration to multiple smaller git repositories that collectively serve a partitioned global name space."

Where is the research?

Configerator is an impressive engineering effort, and I want to  focus on what are the important research take aways from this. Going forward, what are the core research ideas and findings? How can we push the envelope for future-facing improvements?

How consistent should the configuration rollouts be?
There can be couplings/conflicts between  code and configuration. Facebook solves this cleverly. They deploy code first, much earlier than the config, and enable the hidden/latent code later with the config change. There can also be couplings/conflicts between old and new configs. The configuration change arrives at production servers at different times, albeit within 5-10 seconds of each other. Would it cause problems to have some servers run old configuration, some new configuration? Facebook punts this responsibility to the developers, they need to make sure that new config can coexist with old config in peace. After all they use canary testing where fraction of machines use new config, remaining the old config. So, in sum, Facebook does not try to have a strong consistent reset to the new config. I don't know the details of their system, but for backend servers config changes may need stronger consistency than that.

Push versus Pull debate.
The paper claims push is more advantageous than pull in the datacenter for config deployment. I am not convinced because the arguments do not look strong.
Configerator uses the push model. How does it compare with the pull model? The biggest advantage of the pull model is its simplicity in implementation, because the server side can be stateless, without storing any hard state about individual clients, e.g., the set of configs needed by each client (note that different machines may run different applications and hence need different configs). However, the pull model is less efficient for two reasons. First, some polls return no new data and hence are pure overhead. It is hard to determine the optimal poll frequency. Second, since the server side is stateless, the client has to include in each poll the full list of configs needed by the client, which is not scalable as the number of configs grows. In our environment, many servers need tens of thousands of configs to run. We opt for the push model in our environment.
This may be worth revisiting and investigating in more detail. Pull is simple and stateless as they also mention, and it is unclear why it couldn't be adopted.

How do we extend to WAN?
All coordination mentioned is single master (i.e., single producer/writer). Would there be a need for multi master solution, a master at each region/continent that can start a config update? Then the system shall need to deal with concurrent and potentially conflicting configuration changes.  However, given that canary testing is on the order of minutes, there would not be a practical need for multi-master deployment in the near future.

Reviews of other Facebook papers

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

Facebook's software architecture

Scaling Memcache at Facebook

Finding a Needle in Haystack: Facebook's Photo Storage

Finding a Needle in Haystack: Facebook's Photo Storage