Sunday, October 31, 2010

Paxos taught

In my previous posts I have alluded a couple of times to how I teach Paxos. Here I will explain how I go about teaching Paxos.

These are the slides I use in class. (In my slides I reuse a lot of material from the lecture slides of Jeff Chase (Duke). He shared his slides with me under the Creative Commons license, where I get to remix/change the contents with proper credit to him.)

Paxos is a protocol for solving distributed consensus. The problem is easy to state. Consensus requires the following three properties. The first two are safety properties, the last one is a liveness property.
  • Agreement: No two processes can commit different decisions.
  • Validity (Non-triviality): If all initial values are the same, nodes must commit that value.
  • Termination: Nodes commit eventually.
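
As a minimal sketch, the two safety properties can be written as checks over the outcome of a run. The function names and the dictionary representation (process id mapped to a value, with None meaning "not decided yet") are assumptions made for illustration. Termination is a liveness property, so it cannot be checked on a finite snapshot like this.

```python
def agreement(decisions):
    """No two processes commit different values.

    `decisions` maps process id -> committed value (None if undecided).
    """
    committed = {v for v in decisions.values() if v is not None}
    return len(committed) <= 1


def validity(initial, decisions):
    """If all initial values are the same, only that value may be committed."""
    if len(set(initial.values())) != 1:
        return True  # the property says nothing when the inputs differ
    (v,) = set(initial.values())
    return all(d is None or d == v for d in decisions.values())
```

Note that an undecided process never violates either check; only a wrong commit does.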

I first start with a discussion of impossibility results in consensus. A major impossibility result is the coordinated attack (aka two generals) paradox. The result states that there is no deterministic algorithm for reaching consensus in a model where an arbitrary number of messages can be lost undetectably (to the sender). The result is strong, it applies to both asynchronous and synchronous models. Teaching this result is fun. Before I finally explain the proof of impossibility, I challenge the students to come up with a solution, and we shoot each "solution" down together.

Even after assuming no message loss, another impossibility result haunts consensus. The Fischer-Lynch-Paterson (FLP) impossibility result says that there is no deterministic algorithm for reaching consensus under the asynchronous system model in the presence of just 1 crash failure. Unlike the coordinated attack result, FLP applies only to asynchronous systems; partially-synchronous and synchronous systems are safe from this impossibility result. However, it is important to note that even for the most synchronous cluster of computers, a heavy load of requests can break all the timeliness assumptions and turn the system into an asynchronous one in effect.

Next I discuss another related impossibility result, the CAP theorem, which had a lot of impact in the context of large scale web services development. CAP theorem says that your system can only have two of the three properties: Consistency, Availability, Partition-Tolerance. CAP theorem has been highly influential in large-scale web services as it succinctly captured the tradeoffs that follow from the above impossibility results in that domain. The reason I teach CAP theorem is because it provides a map for the distributed system protocols I teach later, including Paxos and Dynamo. Paxos provides CP because the system is always consistent, even in a partition, but a reachable replica may deny service without agreement of the others. (Actually Paxos provides availability even in a partition if the server the request hits is part of a majority quorum. The requests that hit a minority island are the ones that get rejected.) Amazon Dynamo, in contrast, provides AP because the system is always available if any replica is up and reachable, even in a partition, but might not be consistent, even without a partition. Dynamo provides eventual consistency approximation instead of a consistency guarantee. (Footnote: In Daniel Abadi's alternative proposal to CAP theorem, Paxos is PCEC, and Dynamo is PAEL.)

Then I start discussing Paxos. First I motivate Paxos by pointing out how important Paxos is, and how Paxos got widely adopted in datacenters for the tasks that require consistency. The importance of Paxos is that it satisfies the safety properties even under asynchrony and arbitrary message loss. This does not conflict with the impossibility results discussed above. Those impossibility results said that you can't achieve consensus (both safety and liveness at the same time); they did not say you have to sacrifice safety under message loss or asynchrony. So Paxos preserves safety under all conditions and achieves liveness when conditions improve outside the impossibility realm (fewer message losses, some timing assumptions start to hold). Another reason Paxos is important is that it is simple and comes with a formal proof.

I caution my students that although Paxos is a simple protocol, it may take a lot of time and effort to really internalize and grok the protocol. After all, this protocol has been mostly elusive to distributed systems people for about 20 years. Leslie Lamport first formulated Paxos in the 1980s, and submitted his first paper on it in 1990. The paper got rejected :-), and finally appeared in print in 1998. He gave some talks about the protocol, casting it as a historical part-time parliament protocol from the Greek island of Paxos (hence the name of the protocol). Nobody got it, except a few (Butler Lampson, Nancy Lynch). The protocol stayed underground mostly until the 2000s. Here is Lamport's discussion about the interesting history behind the Paxos protocol.

This post has already gotten long, so I will not go into an explanation of how/why Paxos works. Maybe I will have a "Paxos Taught 2" post for this later. You can see slides 10-40 for an explanation of Paxos. While teaching, I bring 5-6 students to the board to perform live reenactments of the Paxos consensus algorithm under several fault scenarios. So the Paxos classes are generally the most fun ones in my distributed systems course.

After describing and reiterating Paxos, which takes 2-3 classes at least, I show how Paxos is used in the real world. I discuss the Paxos Made Live paper, which describes the Paxos implementation behind the Google Chubby system. That work provides some optimizations (which do not violate safety and Paxos's guarantees) to improve the efficiency and performance of the system.

Finally, I give another application of Paxos in the transaction commit problem. 2PC is blocking if the leader dies. Supposedly the 3PC protocol takes care of the blocking, but it has many special cases to consider, and is unproven. Leslie Lamport and Jim Gray, two giants of distributed systems, have proposed using Paxos to solve this transaction commit problem. The obvious solution is to use Paxos to record the transaction manager (TM)'s decision in 2PC, so that if the TM fails, the new TM learns this decision unambiguously from that Paxos box. However, Lamport and Gray suggest using Paxos to record each resource manager (RM)'s decision, so the TM (original or new TM) can learn each RM's vote unambiguously from their corresponding Paxos boxes. Although this approach requires more messages, its advantage is that it shaves off another message latency from the commit compared to the obvious solution. At the end, this Paxos Commit requires 5 message delays, which is very reasonable given that 2PC requires 4 message delays.

Thursday, October 28, 2010

Speculative Execution in a Distributed File System

This SOSP'05 paper is about optimization techniques for the client-server distributed computing model, especially focusing on distributed file systems. In distributed file systems, providing strong cache consistency among concurrently editing clients requires several synchronization messages to be exchanged between the clients and the server. Network delays, especially in wide area networks, make this infeasibly slow.
In fact, AFS and NFS sacrifice consistency guarantees for speed, and provide weaker consistency guarantees such as close-to-open consistency, where a client that opens the file sees modifications by clients that have previously closed the file. Even these weaker consistency guarantees require synchronizing with the server and introduce latencies.

Can distributed file systems be safe, consistent, yet fast? The paper states that it is possible to achieve this by using operating system support for lightweight checkpointing and speculative execution.

The idea is simple. The main reason for latency is that, traditionally, when a client contacts the server with a request, the client blocks until it gets a reply. To avoid blocking, the paper proposes to use speculation. In this approach, the client contacts the server with a request, checkpoints, and speculatively continues execution using its guess for the result of the communication. When the server replies, if the client's guess was right, the client saves time. Otherwise, the client restores its state back to the checkpoint and continues from there. So no savings over the traditional approach are achieved when the guess is wrong.
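
The checkpoint / guess / rollback cycle described above can be sketched in a few lines. This is only an illustration under loud assumptions: the function and parameter names are hypothetical, a dict copied with `copy.deepcopy` stands in for an operating-system checkpoint of the process, and the server reply is fetched sequentially here, whereas the whole point of Speculator is that the reply arrives while the speculative work is already running.

```python
import copy


def run_speculatively(state, guess, fetch_actual, continue_with):
    """Checkpoint, continue on a guessed reply, roll back on a misprediction.

    state         -- mutable dict standing in for process state (assumption)
    guess         -- predicted server reply
    fetch_actual  -- callable returning the real server reply
    continue_with -- callable(state, reply) that performs the next steps
    Returns (state, guess_was_right).
    """
    checkpoint = copy.deepcopy(state)   # stand-in for an OS-level checkpoint
    continue_with(state, guess)         # speculative execution on the guess
    actual = fetch_actual()             # the real reply eventually arrives
    if actual == guess:
        return state, True              # speculation commits; latency was hidden
    state.clear()
    state.update(checkpoint)            # roll back to the checkpoint
    continue_with(state, actual)        # re-execute with the real reply
    return state, False
```

When the guess is wrong, the work done between the checkpoint and the reply is simply thrown away, which is why the approach depends on replies being highly predictable.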

Of course this approach works best when 1. operations are highly predictable, 2. checkpointing is cheaper than network I/O, and 3. computers have resources to spare. All these conditions appear to hold in today's computer systems. In particular, with the advent of multicore processors it is possible to concurrently try multiple guesses while speculating. Condition 2 can be suspect in LANs with less than 1ms roundtrip times, though. (The paper mentions that the checkpointing time for a 64 MB process is about 6ms. Maybe checkpointing can be done faster, and also, as the paper mentions, many speculations can share the same checkpoint and amortize costs.)

The challenge is how to manage speculative executions so that they do not contaminate other non-speculative processes and cause system-wide rollbacks. Speculator (the name of the system the authors build) does not allow a process that is executing speculatively to externalize output until that speculation is proven correct. Note that Speculator does not block read-only calls or calls that modify only task state. Despite blocking external output, dependencies between processes can still be introduced through other indirect paths. To ensure correct execution without side effects, Speculator tracks the dependencies passed through fork, exit, signals, pipes, fifos, Unix sockets, and files in local and distributed file systems. All other forms of IPC currently block the speculative process until the speculations on which it depends are resolved. The good news is that since speculation is implemented entirely in the operating system, no application modification is required.

In the evaluation section, results from PostMark and Andrew-style benchmarks show that Speculator improves the performance of NFS by more than a factor of 2 over local-area networks; over networks with 30ms of round-trip latency, speculation makes NFS more than 14 times faster. To recap from the introduction again, speculation improves file system performance because it hides latency: multiple file system operations can be performed concurrently, and computation can be overlapped with I/O.

This speculation approach has been used in other systems as well, including a Byzantine fault-tolerant replicated state machine in "Tolerating Latency in Replicated State Machines Through Client Speculation" (NSDI 2009).

Saturday, October 23, 2010

Live migration of virtual machines

This is a 2005 paper by the same group at Cambridge that developed the Xen virtualization system discussed in the previous post. This review will be short since it is only based on the notes I took while listening to the paper presentation in the seminar.

Live virtual machine (VM) migration is the relocation of a VM from one machine to another while its applications continue to execute. The motivation for live migration is to perform load management across machines and even across data centers. Another motivation is fault-tolerance, if the original host needs to go down due to faults, maintenance, or power outage, migration can provide availability of the applications running on the hosted VMs.

The challenge with live migration of VMs is to minimize the downtime and to provide a seamless relocation of the VM so the VM can continue to operate normally at its new location.

One strawman for migrating the contents of VM memory is the stop-and-copy approach. This approach leads to a long downtime and hence is not compatible with our objective. Another strawman for migrating memory is the on-demand paging approach. Here we first freeze the VM at the source machine, copy minimal execution context to the target machine, restart the VM at the target, and pull memory contents from the source as and when needed. The drawback here is the slow startup of the VM at the target.

The approach that is taken in this paper is a hybrid of the two, and is called the pre-copy migration approach. We DON'T freeze the VM at the source machine, but start copying the VM's pseudo-physical memory contents to the target machine over multiple iterations, where each iteration resends the pages that were dirtied while the previous one was being copied. Then, when there is little dirty memory remaining that is still not copied, we do a short stop-and-copy. This way we get the best of both worlds by avoiding the drawbacks of both.
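
To make the iteration structure concrete, here is a small simulation of pre-copy. It is a sketch, not Xen's implementation: the function name, the `dirtied_per_round` callback that models the running VM's writes, the `threshold` for switching to stop-and-copy, and the `max_rounds` cap are all assumptions introduced for illustration.

```python
def precopy_migrate(num_pages, dirtied_per_round, threshold, max_rounds=30):
    """Simulate iterative pre-copy of a VM's memory.

    num_pages         -- number of pages in the VM's memory
    dirtied_per_round -- callable(round_no, pages_sent) -> set of pages the
                         still-running VM dirtied while that round was copied
    threshold         -- switch to stop-and-copy once this few pages are dirty
    max_rounds        -- give up iterating if the dirty set does not shrink
    Returns (rounds, pages_sent_live, pages_sent_during_downtime).
    """
    to_send = set(range(num_pages))      # the first round sends every page
    sent_live = 0
    rounds = 0
    while len(to_send) > threshold and rounds < max_rounds:
        sent_live += len(to_send)        # copied while the VM keeps running
        to_send = set(dirtied_per_round(rounds, len(to_send)))
        rounds += 1
    # The VM is paused here; the remaining dirty pages determine the downtime.
    return rounds, sent_live, len(to_send)
```

For example, if the VM dirties one page for every ten pages copied, a 1000-page VM converges in three live rounds and only one page has to be copied during the pause. If the VM dirties pages as fast as they are copied, the dirty set never shrinks, which is why some cap on the number of rounds is needed.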

As for the networking of the VM, we need to make sure that the migrating VM will include all protocol state and will carry its IP address. We assume that the source and destination exist behind a single switched LAN, so by generating an unsolicited ARP-reply from the migrated host we can advertise that the IP moved to a new location.

The paper does not address the problem of migrating local-disk storage. One approach suggested for this is to use network-attached storage that is uniformly accessible from all host machines. Another approach can be to replicate the local disk at other machines and choose the target among the machines where the disk is replicated.

Xen and the art of virtualization

This week in the seminar class we discussed the Xen virtualization paper from SOSP 2003. Xen is the first open source virtualization solution; however, VMware was already available in 2000 as a commercial solution for virtualization.

A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes instructions like a physical machine. The biggest benefit of virtualization is in server consolidation: enabling efficient usage of computer server resources in order to reduce the total number of servers that an organization requires. Thanks to the virtualization's ability to separate the OS and application from the hardware, it becomes possible to run multiple applications (in complete isolation from each other) on each server instead of just one application per server. This increases the utilization rate of servers and prevents the problem of "server sprawl", a situation in which multiple, under-utilized servers take up more space and resources than can be justified by their workload. Another motivation for virtualization is that it provides flexibility. We can easily migrate virtual machines across the network, from server to server or datacenter to datacenter, to balance loads and use compute capacity more efficiently. I will discuss live migration of virtual machines in the next paper review post shortly.

From the motivation above, it follows that the biggest goal of virtualization is isolation of the virtual machines from one another. And the important thing here is to achieve virtualization with as little overhead (performance penalty) as possible.

Some basic terms used in the virtualization domain are "guestOS: the OS that Xen hosts", "domain: the virtual machine where the guestOS runs", and "hypervisor: Xen, as it runs at a higher privilege level than the guestOSes it hosts".

Xen provides para-virtualization as opposed to full-virtualization. VMware provided full virtualization (i.e., a complete simulation of the underlying hardware), so there was no need to modify the OS at all. To achieve this, the VMware hypervisor trapped and translated any binary command to mediate access to the resources, but this approach incurs a big performance penalty. Xen avoided this penalty by using paravirtualization. In order to achieve better performance, Xen required modifications to the guestOS (a one-time overhead per guestOS to hack this up). Even then, note that Xen did not require any modifications to the applications running on the guestOS.

Before I can discuss why Xen needed to modify the guestOS and what modifications are needed, here is a brief description of the Xen architecture. There are four distinct privilege levels (described as rings) on x86: level 0 (most privileged) where the OS runs, levels 1 and 2 unused, and level 3 (least privileged) where the application code runs. The Xen hypervisor has to run at the most privileged level, so it runs at level 0 and bumps the OS to run at the previously unused level 1. Then, the hypervisor has to play the role of the mediator between the hardware and the guestOS running at the less privileged level. Running at a less privileged level prevents the guestOS from directly executing privileged instructions. The memory management is trickier than the CPU privilege management. The x86 does not have a software-managed Translation Lookaside Buffer (TLB), so TLB misses are serviced automatically by the processor from the page table structure in the hardware. The x86 also does not have a tagged TLB, so address space switches require a complete TLB flush. This forced Xen to take the paravirtualization route: guestOSes are made responsible for allocating and managing the hardware page tables with minimal involvement from Xen to ensure safety and isolation. Each time a guestOS requires a new page table it allocates and initializes a page from its own memory and registers it with Xen. Each update to the page-table memory should also be validated by Xen. In 2008, Intel Nehalem and AMD (SVM) introduced tags as part of the TLB entry and dedicated hardware which checks the tags during lookup. A ring -1 privilege level is also added for the hypervisor to run while the guestOS runs at ring 0. As a result Xen does not need to modify the guestOS anymore.

This figure shows a machine running Xen hosting different guestOSs, including Domain0 running control software in a XenoLinux environment.
Domain0 is special: it is used for bootup. Only Domain0 has direct access to the physical disks, and the management software running in Domain0 is responsible for mediating access to the hardware devices.

There are two types of control transfer between the domain (VM) and hypervisor. The first one is via a synchronous hypercall from the domain to hypervisor to perform a privileged operation. The second one is an asynchronous event from hypervisor to the domain that replaces the usual delivery mechanism for device interrupts (e.g., new data received over the network). The data transfer to and from each domain is mediated via Xen. Rather than emulating existing hardware devices as in full-virtualization, Xen exposes a set of clean and simple device abstractions. The data transfer is done using shared-memory asynchronous buffer descriptor rings (not to be confused with privilege rings). The I/O descriptor ring implements two queues: request queue and response queue between the domain and Xen.
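
A toy model of one descriptor ring may help picture the two queues sharing a single fixed-size buffer. This is a sketch under stated assumptions, not Xen's data structure: the class and field names are hypothetical, requests are handled strictly in order, and a response simply overwrites the slot of the request it answers.

```python
class IORing:
    """Fixed-size ring shared by a domain and the hypervisor (toy model)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.req_prod = self.req_cons = 0   # advanced by the domain / by Xen
        self.rsp_prod = self.rsp_cons = 0   # advanced by Xen / by the domain

    def put_request(self, req):
        """Domain side: enqueue a request if a slot is free."""
        if self.req_prod - self.rsp_cons == self.size:
            return False                    # full: responses not yet consumed
        self.slots[self.req_prod % self.size] = req
        self.req_prod += 1
        return True

    def handle_one(self, handler):
        """Hypervisor side: consume one request and produce its response."""
        if self.req_cons == self.req_prod:
            return False                    # no pending requests
        i = self.req_cons % self.size
        self.slots[i] = handler(self.slots[i])  # response reuses the slot
        self.req_cons += 1
        self.rsp_prod += 1
        return True

    def get_response(self):
        """Domain side: dequeue the next response, or None if there is none."""
        if self.rsp_cons == self.rsp_prod:
            return None
        rsp = self.slots[self.rsp_cons % self.size]
        self.rsp_cons += 1
        return rsp
```

A slot only becomes free again once the domain has consumed the response written into it, so a slow consumer naturally applies back-pressure on new requests.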

The paper provides extensive evaluation results. The short of it is that Xen can host up to 100 virtual machines simultaneously on a server circa 2003, and Xen's performance overhead is only around 3% compared to the unvirtualized case.

One question asked in the class during the discussion is: would it make sense to perform a co-design of schedulers of hypervisor and guestOS? Hypervisor scheduler can provide hints to the guestOS scheduler about when a device/resource will be available, and guestOS can use this information to schedule things more cleverly. Would this pay off? How can this be done cleanly?

Virtualization approach takes the guestOS/application as a blackbox. By taking the guestOS/application as a graybox (using some contracts, mutual rely-guarantees, or having access to a specification of the application), there can be ways to improve performance. But this is a very fuzzy seed of a thought for now.

Tuesday, October 19, 2010

Mencius: building efficient replicated state machines for Wide-Area-Networks (WANs)

I will write my short summary of the Mencius paper from OSDI08 here. Writing this helped me understand a couple subtle points better, so I hope it will also help others.

State machine replication is a common approach for achieving fault-tolerance. It works by replicating the state and function across multiple servers (replicas) and using consensus to agree on the sequence of the commands the replicas execute. Assuming a deterministic service, since all replicas execute the same sequence of commands, they all keep the same state, and strong consistency guarantees are provided in the face of server failures.

Paxos is a fault-tolerant implementation of replicated state machines. Paxos has become hugely popular because it is the only formally proven protocol that works under the asynchronous model. Paxos preserves safety (no incorrect decisions are taken by replicas) in the face of violations of timing assumptions on the channels and servers. Paxos satisfies progress (a decision is taken by replicas) when timing assumptions are satisfied, channels deliver messages, and a majority of replicas are alive. I will not attempt to explain Paxos here, as that would take at least another post. Although the protocol is simple, significant time and effort is needed to internalize how and why Paxos works.

WANs are becoming a focus area in cloud computing, as services and data need to be replicated among multiple datacenters in different regions of the country and the world. Paxos is unsuitable for WANs due to the single leader requirement. In Paxos, during normal operation only one server acts as the leader: all client requests should be forwarded to that leader, and that leader then clears this with the other replicas (via a one-to-all followed by all-to-one traffic).

This single entry-point requirement leads to three bottlenecks. First is the bandwidth bottleneck due to the unbalanced communication pattern (in one-to-all and all-to-one, only the links adjacent to the leader are used). Second is the computational bottleneck at the leader, as the leader needs to process more messages. The paper calls a deployment/load that reaches the bandwidth bottleneck first network-bound, and one that reaches the CPU bottleneck first CPU-bound. (Somehow I doubt whether CPU-bound load/deployment is possible in practice.) Finally, the third problem is the extra latency imposed on clients that are far away from the leader. These clients need to send requests all the way to a faraway server (even though a nearby replica may exist) and wait for the reply from that server.

Mencius is a multileader version of Paxos and tries to eliminate the single entry-point requirement in Paxos. Mencius achieves load balancing by partitioning consensus sequence numbers (consensus requests/instances) among multiple servers. For example, if we have 3 servers, server 0 is responsible for acting as a leader for consensus instances numbered 0,3,6..., server 1 for 1,4,7..., and server 2 for 2,5,8...
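
The round-robin partitioning in the example above amounts to a modulo rule. A minimal sketch, with hypothetical function names:

```python
def coordinator(instance, num_servers):
    """Server that acts as leader for a given consensus instance."""
    return instance % num_servers


def instances_of(server, num_servers, count):
    """First `count` instance numbers coordinated by `server`."""
    return [server + k * num_servers for k in range(count)]
```

With 3 servers, `instances_of(1, 3, 3)` gives the instances 1, 4, 7 from the example, and `coordinator` maps any instance number back to its leader.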

This load balancing helps distribute the network bandwidth and CPU load better. In contrast to Paxos, where only the channels adjacent to the leader server get utilized, now all channels get utilized since all servers get to act as leader. The network-bound limitation is alleviated this way. Similarly, the CPU-bound limitation is alleviated as all servers pitch in for the leader role. Finally, the clients can send their requests to a nearby server to avoid the extra latency of sending the requests to a faraway server. (The paper assumes that each site has a server and a group of clients, so the clients know that the nearest server to themselves is the server in the same site.)

Mencius adapts to client and network load variance by using simple consensus. In simple consensus only one special server per consensus instance (let's call this the coordinator) can propose any command (including a no-op); the others can only propose no-op for that instance. A benefit of using simple consensus is that servers can learn a skipped no-op without a majority of servers having to agree on it first. This enables servers with low client load to skip their turns cheaply. The servers can piggyback SKIP messages on other messages, e.g., ack messages to other servers.
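
Here is a rough sketch of turn skipping for one server. The trigger used below (a server skips its own unused instances as soon as it sees another server propose a higher-numbered instance) is an assumption for illustration, as are the class and method names; the messaging and the consensus itself are left out entirely.

```python
NOOP = "no-op"


class MenciusServer:
    def __init__(self, sid, num_servers):
        self.sid = sid
        self.n = num_servers
        self.next_own = sid     # next instance this server coordinates
        self.log = {}           # own instance number -> proposed value

    def propose(self, command):
        """Use the next own instance for a client command."""
        inst = self.next_own
        self.log[inst] = command
        self.next_own += self.n
        return inst

    def on_suggest(self, instance):
        """Another server proposed `instance`; skip own turns that fell behind.

        Returns the skipped instance numbers, which would be piggybacked
        on the ack as SKIP messages.
        """
        skipped = []
        while self.next_own < instance:
            self.log[self.next_own] = NOOP   # coordinator fills its turn with no-op
            skipped.append(self.next_own)
            self.next_own += self.n
        return skipped
```

For instance, an idle server 1 (out of 3) that sees instance 6 being proposed skips its instances 1 and 4, and its next client command then goes into instance 7.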

By using multiple leaders, however, Mencius loses out from the "serve reads locally at the leader" optimization possible in Paxos. Since there is only one leader in Paxos, that leader may use leases with other replicas. While the lease holds the replicas promise they will not choose a new leader, so the leader is guaranteed that it has the latest information and can serve reads locally. In Mencius, since there are several leaders, this leasing cannot be based on leader, so this serve-reads locally optimization is not applicable. It may still be possible to use a more complicated lease based on partitioning of the data and serve reads locally.

Of course, Mencius is not addressing the core question of can you slash down latency by serving read and even the write requests locally? This is a hard question. There are inherent tradeoffs between consistency and availability and low-latency that would rule out Paxos completely. So, I need to touch the CAP theorem a bit to explain this.

In his blog, Dan Abadi gives the most lucid explanation of the CAP theorem I have ever read. Dan proposes that the acronym is not CAP, it should actually be PACELC. If there is currently a partition, the tradeoff is between the availability and consistency (PAC corresponds to this part). Else, if no partition, then the tradeoff is between low-latency and consistency (ELC corresponds to this part).

Paxos is, of course, PCEC since it always chooses consistency, and hence the checking required with other replicas. A truly low-latency protocol should probably be PAEL or PCEL, but then it should also provide some consistency in the E part as well to be useful. Yahoo's PNUTS system, which is PCEL, attempts to answer this question (putting some consistency in the E part as well) by partitioning the data with respect to the owner of the data and providing “timeline consistency” where replicas may not be consistent with each other but updates are applied in the same order at all replicas via asynchronous replication. For example for a social network application, the server in Buffalo is assigned as the leader for Murat's status page. So Murat can access and update his data with little latency in Buffalo, and this data is gradually replicated to other sites in the world. But, PNUTS is just one point in the design space, there should be more choices to explore.

By the way, Yahoo is opening a green datacenter in Buffalo, check it out.

Future of powerpoint presentations

This Ted talk reminded me of a dream I had.

In that dream, hundreds of soldiers are being shipped into a big empty convention center room, and they are being deployed into lines that are only a meter across from other enemy combatant soldiers. Between these two combatant lines there are some soil bags lined up, but the height of these bag blocks is only waist level. Everyone is so serious, this is war, so they take these deployment instructions very seriously. Nobody wants to make a mistake. After the deployment is finished, the commander is supposed to announce start, and soldiers are supposed to start firing at each other. I wake up before this happens. This is pure madness and terror. I felt disturbed by this very visual image for many days. But if you think about it, this was pretty much how WW1 and WW2 were in abstract.

I think that deployment would make a really good performance art project. It is both abstract and concrete (visual) at the same time. It drives the point home. It is abstract because we are cutting the idea of war to its basics: killing. It is concrete because it gives a better image of war than pages and pages of descriptions.

As computational resources grow quickly due to cloud computing, I predict that in the near future (5-10 years) standard powerpoint presentations will have animations of movie quality. Today's movie quality may become the norm for company or lecture presentations soon, who knows. But I am sure that there will still be the same abundance of bad presentations regardless of (or maybe due to) the improvements in technology.

There is Prezi taking a small step in that direction now. Check it out, if you haven't discovered it yet.

Friday, October 8, 2010

Rethinking Enterprise Network Control

This post will cover three closely related papers together:
Rethinking Enterprise Network Control
NOX: Towards an Operating System for Networks
OpenFlow: Enabling Innovation in Campus Networks

These papers are all about simplifying the job of the network administrators, that is, that of monitoring and enforcing policies over a large network.

Existing approaches for network administration are hard to implement and maintain. One of these approaches is using middleboxes at network chokepoints (e.g., firewalls). Another is adding functionality to networks (e.g., using diagnostic tools, controls for VLANs, access-control lists, filters).

These papers advocate the use of an omniscient central controller to simplify the implementation and enforcing of policies and simplifying network equipment. The papers argue that despite its centralized nature, the controller has good scalability in this approach.

The three objectives of this approach are:
1) Network should be governed by policies declared over high-level names (user, server, host instead of IPs)
2) Network routing should be policy aware (network policies declare what each device and network can connect to)
3) Network should enforce a strong binding between a packet and its origin (this is difficult in traditional systems, since IP addresses can be spoofed easily; the controller manages to overcome these spoofing attempts by taking over the authentication)

The controller is the brain of the system and is responsible for everything. All switches, users, hosts register with controller with credentials for authentication. Each new flow is sent to the controller, and the controller denies or accepts and provides the routing instructions. Finally, the controller tracks bindings to enable describing policies in terms of users and hosts.

The switches are similar to Ethernet switches, but simpler, because the controller takes over many of the features of the switches. The OpenFlow paper describes the switches and their interface to the controller in detail. OpenFlow makes a pragmatic compromise: it allows researchers to run experiments without requiring vendors to expose the internal workings of their switches. The interface is basically the flow table, managed by the controller. Flow entries contain a Header (to match packets against), an Action (to tell the switch what to do with the packet), and Per-Flow Data (counters to collect stats). The switch forwards the first packet of a flow to the controller if the packet does not match any active entry in its flow table. The controller makes a decision; if the flow is allowed, the controller computes the flow's route and adds a new entry to the flow tables of all switches along the path.
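
The flow-table lookup with a fallback to the controller can be sketched as follows. This is a deliberately simplified model and not the OpenFlow wire protocol: the class, the `(source, destination)` header tuple, the action strings, and the example policy are all hypothetical, and the controller installs an entry only in this one switch (including a drop entry for denied flows) rather than along a computed route.

```python
class Switch:
    def __init__(self, controller):
        self.controller = controller   # callable(header) -> action string
        self.flow_table = {}           # header -> {"action": ..., "packets": n}
        self.misses = 0                # how many packets went to the controller

    def receive(self, header):
        entry = self.flow_table.get(header)
        if entry is None:
            # Table miss: hand the first packet of the flow to the controller,
            # which decides and installs a flow entry.
            self.misses += 1
            action = self.controller(header)
            entry = self.flow_table[header] = {"action": action, "packets": 0}
        entry["packets"] += 1          # per-flow counter
        return entry["action"]


def example_policy(header):
    """Hypothetical policy declared over high-level names, not IP addresses."""
    allowed = {("alice-laptop", "mail-server")}
    return "forward" if header in allowed else "drop"
```

Only the first packet of each flow pays the round trip to the controller; later packets of the same flow match the installed entry and are handled by the switch alone, which is the basis of the scalability argument.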

Going into more detail, there are two types of entries in the flow table: 1) per-flow entries describing flows to be forwarded, and 2) per-host entries describing misbehaving hosts whose packets should be dropped. Obviously, doing experiments at the flow level is simple with this architecture. You can also do experiments at the packet level; for this the paper recommends that the switch route these packets to a NetFPGA for line-rate packet inspection and processing.

The NOX paper focuses on providing a programmatic interface over the architecture discussed above. The goal is to enable writing centralized programs that will be deployed on the controller to observe and control the network.

Question: Can this system handle spoofing of Ethernet addresses?
Answer: Not very easily; it is not clear whether it can do that.

Question: Can the controller use flow patterns to detect BitTorrent activity?
Answer: Yes, but this may lead to false positives; deep packet inspection is needed.

Tuesday, October 5, 2010

The Chubby Lock Service for Loosely-Coupled Distributed Systems, Burrows, OSDI 2006

I didn't read this paper before class, because I thought I had read it earlier. It turns out I hadn't, and I was confusing it with the "Paxos Made Live" PODC 2007 paper. I realized this only towards the middle of the class :-) The "Paxos Made Live" paper focuses on the Paxos consensus implementation that makes the state replication and availability work in Chubby. In contrast, this Chubby Lock Service paper focuses only on the lock service part and is very client-centric. It is the needs and use cases of the clients (i.e., the programmers in Google) that have dictated the design of the service and the interfaces, and even the presentation of the paper.

Chubby serves coarse-grained locks rather than fine-grained locks to keep the load light. The paper mentions that the primary goal is to provide reliability/availability, and that throughput/storage are secondary goals.

Chubby uses Paxos for consistent state replication among the five machines in the Chubby cell. The details of this process are best left to the "Paxos Made Live" paper. Paxos is simple, yet it can be hard to internalize. When I am teaching the distributed systems class, I dedicate one solid week to the Paxos algorithm so that students have time to understand and internalize it. As far as this paper is concerned, the thing to keep in mind from the Paxos algorithm is that the master performs a write (for installing a lock) by contacting all the replicas and getting an acknowledgement from at least a majority of the replicas that the write is performed. So, writing takes time: the master has to clear everything with the other replicas first before responding to the client that the write is performed. The read is quicker, because the master can serve reads locally. This is achieved by using a lease on the replicas; while the lease holds, the replicas promise they will not choose a new master, so the master is guaranteed that it has the latest information and can serve reads locally. (This lease should not be confused with the lease to clients that will be mentioned below.)
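The majority rule for writes and the lease rule for reads can be sketched in a few lines of Python. This is deliberately not Paxos: there are no proposal numbers, no phases, and no recovery of partially applied writes. The Replica class and both function names are hypothetical, and time is just a number passed in by the caller.

```python
class Replica:
    """One machine in a five-machine cell; `up` models whether it is reachable."""

    def __init__(self, up=True):
        self.up = up
        self.log = []

    def accept(self, value):
        if not self.up:
            return False          # no acknowledgement from an unreachable replica
        self.log.append(value)
        return True


def master_write(replicas, value):
    """Contact all replicas; the write succeeds only if a majority acknowledge."""
    acks = sum(1 for replica in replicas if replica.accept(value))
    return acks > len(replicas) // 2


def can_serve_local_read(now, lease_expiry):
    """While the master lease holds, no new master can have been chosen,
    so the master's local state is the latest and reads need no round trip."""
    return now < lease_expiry
```

For a cell of five, the write still goes through with two replicas down (three acknowledgements), but not with three down.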

OK, let's proceed to discuss the meat of the paper. I have to confess that there are several things I could not understand in this paper. Maybe this is because I have more of an academic background and little background as a programmer, the intended audience of the paper. The students in the seminar were not any more successful than I was in answering the questions I raise below.

Chubby provides an interface much like a distributed file system with advisory locks. A typical name is "/ls/foo/wombat/pouch", where ls stands for lock service, foo is the name of a Chubby cell, and the remaining path is the file that holds the metadata for the lock. "Locks are advisory" means that locks conflict only with other attempts to acquire the same lock: holding a lock called F neither is necessary to access the file F, nor prevents other clients from doing so. The author states the reason for this decision as "Chubby didn't want to break other clients", "We rejected mandatory locks, which make locked objects inaccessible to clients not holding their locks." But I don't understand how you can have an application that uses locks and one that ignores locks run in the same datacenter without giving up consistency, and perhaps sanity. Is there an assumption that the filesystem will check with Chubby if a client without a lock is trying to access a file with a lock?

A lot of the paper is devoted to explaining the lease to clients and client-side caching of the locks to reduce the traffic to the master. Caching is all about improving performance via leases, so that the client doesn't need to read-request the lock from the master every time it wants to check it. And it turns out the programmers were really in the habit of busy-waiting on the locks. This is from page 13 of the paper: "Originally, we did not appreciate the critical need to cache the absence of files, nor to reuse open file handles. Despite attempts at education, our developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once."

So, what the Chubby system does is to solve this problem for the clients. The Chubby master makes the clients cache the locks, and promises to contact the client when that cache needs to be invalidated. So, the client does the busy-waiting on the lock in its cache without overwhelming the master with read-requests. This is how the paper puts it: "Chubby clients cache file data and node meta-data in a consistent write-through cache held in memory. The cache is maintained by a lease, and kept consistent by invalidations sent by the master, which keeps a list of what each client may be caching."
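Here is a rough Python sketch of that caching arrangement, under simplifying assumptions: there are no leases or timeouts, invalidations are delivered synchronously, and LockMaster and CachingClient are made-up names rather than anything from the Chubby client library. What it does show is the master's list of what each client may be caching, the invalidation on write, and the caching of absent files.

```python
class LockMaster:
    def __init__(self):
        self.files = {}
        self.cachers = {}   # file name -> set of clients that may be caching it

    def read(self, client, name):
        # Remember that this client may now hold a cached copy of `name`.
        self.cachers.setdefault(name, set()).add(client)
        return self.files.get(name)   # None models "file not present"

    def write(self, name, data):
        # Invalidate every cached copy before the new value becomes visible.
        for client in self.cachers.pop(name, set()):
            client.invalidate(name)
        self.files[name] = data


class CachingClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}
        self.master_reads = 0   # counts round trips to the master

    def read(self, name):
        if name not in self.cache:
            self.master_reads += 1
            # A None result is cached too, so polling for a missing file
            # does not reach the master on every iteration.
            self.cache[name] = self.master.read(self, name)
        return self.cache[name]

    def invalidate(self, name):
        self.cache.pop(name, None)
```

A client that polls the same file in a loop reaches the master once, and again only after the master invalidates its copy.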

Keep-alive messages are employed to extend the client leases with the master. I really like this idea. The client asks for a lease extension via an RPC, and the master, upon receiving this RPC, blocks it and answers (granting the lease) only close to the end of the current lease. This allows the master to extend the lease for the maximum time with little overlap. I guess the request could have been answered by the master immediately on receiving it if the reply referenced a physical time rather than a period. But replying early is a disadvantage for the master: what if it grants the lease too early (so that it is promising for a long interval in the future) and dies right after this reply? Then availability suffers, because first the gap and then the lease need to be waited out. There is a reason the client does not send the request toward the end of the lease and instead sends it early on: if the master is down, the client can find this out early with this check and plan accordingly. Another advantage of this RPC mechanism is that notifications from the master to the client are piggybacked on the replies to the keep-alive RPC. This way firewalls are not an issue for the master.

I found Section 2.9 on failovers very complicated. This section talks about how the caches and leases can survive a master failover (new master selection in Chubby) with as little interruption as possible. The paper also acknowledges this: "Readers will be unsurprised to learn that the fail-over code, which is exercised far less often than other parts of the system, has been a rich source of interesting bugs." There are all sorts of things to deal with in a failover. The new master tries to reconstruct the leases the old master has distributed to the clients by observing the old handles the clients use to refer to the files (handles include a sequence-number that allows a master to tell whether a handle was generated by it or by a previous master), and tries to keep these sessions alive as seamlessly as possible.

Monday, October 4, 2010

The Google File System, Ghemawat et al., SOSP 2003

The motivation for GFS arose because traditional filesystems didn't fit the bill for Google's use cases. Firstly, failures are continual (i.e., always happening) since Google has thousands of machines. Secondly, multi-GB files are common at Google due to mapreduce jobs. (The paper does not mention mapreduce at all, though; the mapreduce paper appeared the next year, in 2004.) And finally, again due to mapreduce, in Google's use cases file access is mostly read/append; random writes are very rare. (For mapreduce jobs, it was more important to have a high sustained bandwidth than low-latency atomic appends.)

The GFS architecture is very simple (to the credit of the designers). Clients talk to a single master to get a handle on the chunk to read/write, and then using the handle clients pull/push the data to the chunkservers. Chunks are replicated over (by default) three chunkservers, one of which is designated as a primary replica by the master. So the master in fact provides three handles to the client for a requested chunk. The client can read from any of the three replicas, for example, it may choose the closest replica to read from. The client has to write to all three replicas, and the designated primary coordinates the 2-phase-commit of the write process on behalf of the client. (In sum, the write quorum is 3 out of 3, and the read quorum is 1 out of 3.)

While a single master keeps things very simple, there can be some disadvantages to this decision. The single master can be a traffic bottleneck, but GFS prevents this by not involving the master in the actual read/write; the master just provides the handle and gets out of the way. Leasing handles to clients also helps reduce traffic to the master further. The single master is also a single point of failure and availability could suffer, but GFS prevents this by using Chubby (which, again, was not mentioned in the paper, and appeared later in 2006).

As I mentioned above, the primary replica is used for coordinating the write to the other two replicas. The client first pushes the data to all replicas, and then contacts the primary with the write request. The primary forwards the write request to the other replicas, and waits for their acknowledgements. The primary then informs the client about the success or failure of the write.

I was concerned at this point about why a replica is being overloaded by acting as the primary and coordinating a commit on behalf of the client. This only makes the replicas complicated, which could otherwise just have been hard disks basically. Why can't the client do this instead? The answer is that a chunk may be getting accessed by multiple clients simultaneously, and if the clients coordinate the writes, the order may be very different at each replica; the replicas diverge. On the other hand, by using a primary replica these simultaneous accesses are effectively serialized with respect to the order in which the primary receives these requests, and the same order is dictated to all the replicas. So, the best place to do this is at the primary replica.
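The serialization argument can be sketched as follows. This is a toy model in Python, not GFS code: ChunkReplica and Primary are hypothetical names, data transfer and failures are ignored, and the sequence number is simply the position in the chunk's list of mutations.

```python
class ChunkReplica:
    def __init__(self):
        self.chunk = []   # mutations applied so far, in order

    def apply(self, seq, data):
        # A replica applies mutations strictly in the order the primary chose.
        if seq != len(self.chunk):
            raise ValueError(f"expected seq {len(self.chunk)}, got {seq}")
        self.chunk.append(data)


class Primary(ChunkReplica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, data):
        # The order is whatever order requests arrive at the primary,
        # no matter which client sent them.
        seq = len(self.chunk)
        self.apply(seq, data)
        for secondary in self.secondaries:
            secondary.apply(seq, data)   # same order dictated to every replica
        return seq
```

Because every mutation passes through one place that assigns its position, two clients writing at the same time cannot leave the replicas with different orders.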

Another concern about the write process is that it would have large overheads for writing small files because of all the coordination the primary has to do. (Yes, GFS is designed for batch reads/writes of large files.) I posited that an optimistic write approach could help alleviate this problem. Instead of doing a guaranteed write to all 3 replicas, the client may be given, say, 5 handles, and does an optimistic write (without any coordination, no primary, no 2-phase commit) to all 5 replicas at once. The assumption here is that at least 3 out of 5 of these replicas will be written. Then the read quorum should be chosen to be 3 for the small files (instead of the original 1 in GFS). By reading 3 out of 5 replicas, the client is guaranteed to see the most up-to-date version of the file.
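The quorum argument is easy to check by brute force. The short Python sketch below enumerates every possible write set and read set and tests whether they always share a replica, which is the same as the condition w + r > n. The function name is made up for this post, and the sketch assumes the client can tell which of the copies it reads is the newest (for example by a version number, which is my addition and not something discussed above).

```python
from itertools import combinations


def quorums_always_intersect(n, w, r):
    """True if every set of w written replicas overlaps every set of r read
    replicas, out of n replicas in total. Equivalent to w + r > n."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )
```

For the optimistic scheme, quorums_always_intersect(5, 3, 3) holds, so any 3 replicas read include at least one that was written. With a read quorum of 2 it fails, since the reader could pick exactly the two replicas the write missed. The original GFS setting, quorums_always_intersect(3, 3, 1), also holds.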