Tuesday, January 29, 2019

Paper review: Probabilistically Bounded Staleness for Practical Partial Quorums

There is a fundamental trade-off between operation latency and data consistency in distributed database replication. The PBS paper (VLDB'12) examines this trade-off for partial quorum replicated data stores.

Quorum systems

We can categorize quorum systems into strict versus partial quorums. Strict quorum systems provide strong consistency by ensuring that the read and write replica sets overlap: $R + W > N$. Here N is the total number of replicas in the quorum, R is the number of replicas that need to reply to a read query, and W is the number of replicas that need to acknowledge a write.

Employing partial quorums can lower latency by requiring fewer replicas to respond, but then the read and write replica sets need not overlap: $R+W \leq N$. Such partial quorums offer only eventual consistency.


Here is a visual representation of an expanding quorum system. The coordinator forwards a write request to all N replicas and waits for W acknowledgements before reporting the write as completed to the client. The quorum system is called expanding because the third replica will also receive the write soon, even though the coordinator waits for only W=2 acknowledgements to confirm the write as completed. Similarly, the coordinator forwards the read request to all N replicas, and responds to the client with the highest-versioned value once responses from R replicas are received.

Many quorum-replicated data stores, such as Apache Cassandra, Basho Riak, and Project Voldemort, offer a choice between strict quorums with strong consistency and partial quorums with eventual consistency. Cassandra often defaults to a partial/non-strict quorum system with N=3, R=W=1, for maximum performance. Riak defaults to a strict quorum system with N=3 and R=W=2, although users suggest R=W=1, N=2 for low-value data. Finally, for applications requiring very low latency and high availability, LinkedIn deploys Voldemort with N=3 and R=W=1. (Unlike Dynamo-style systems, Voldemort sends read requests to R of the N replicas, not all N; this decreases load per replica and network traffic at the expense of read latency and potential availability.)

In Cosmos DB, inside a region, we offer quorums with N=4, W=3, and allow the user to choose R=1 for session-consistency, consistent-prefix, and eventual-consistency reads, and R=2 for strong-consistency and bounded-staleness consistency reads. Cosmos DB also provides these 5 well-defined consistency levels for global replication and across region reads.

Quorum staleness
How consistent is eventual? For the average case, can we offer staleness bounds with respect to version history and time? The Probabilistically Bounded Staleness (PBS) paper investigates this question and quantifies the probability of staleness for partial quorums: across versions via the k-staleness metric, and across time via the t-visibility metric.

Let's start with some basic math to quantify staleness. What is staleness probability? It is the probability that the read quorum does not contain the last written version. We can obtain this by dividing the number of quorums of size $R$ composed of nodes that were not written to in the write quorum by the number of all possible read quorums:

$p_s = \frac{{{N-W} \choose R}}{{N \choose R}}$

For N=3, R=W=1, this probability comes to $p_s = 2/3$, that is, about 0.67. But this is staleness with respect to the latest written version and for a read issued immediately after the write. We can generalize this staleness formula in both dimensions: with respect to k versions (in lieu of the latest version), and with respect to a read delayed by t time units (in lieu of an immediate, t=0, read).
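
As a quick sanity check, here is a small Python helper (mine, not from the paper) that evaluates this formula:

```python
# Evaluate the immediate-read staleness probability p_s, ignoring expanding quorums.
from math import comb

def p_stale(N, R, W):
    return comb(N - W, R) / comb(N, R)

print(p_stale(3, 1, 1))   # 0.666... for N=3, R=W=1
print(p_stale(3, 2, 1))   # 0.333... for N=3, R=2, W=1
print(p_stale(3, 2, 2))   # 0.0 for a strict quorum, since R+W > N
```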

k-staleness

A system obeys *k-staleness* consistency with probability $1-p_{sk}$ if at least one value in any read quorum has been committed within k versions of the latest committed version when the read begins.

Given the probability p of a single quorum non-intersection, the probability of non-intersection with all of the last $k$ independent write quorums is $p^k$. (Note that this calculation ignores the effects of expanding quorums and constitutes a lower bound.)

$p_{sk} = \left( \frac{{{N-W} \choose R}}{{N \choose R}} \right)^k$
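
Continuing the small helper from above, the k-staleness probability for N=3, R=W=1 drops geometrically with k (still ignoring expanding quorums, as noted):

```python
# p_sk = p_s^k for N=3, R=W=1.
for k in range(1, 6):
    print(k, round(p_stale(3, 1, 1) ** k, 2))   # 0.67, 0.44, 0.3, 0.2, 0.13
```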

t-visibility

A system obeys t-visibility with probability $1-p_{st}$ if any read quorum started at least t units of time after a write commits returns at least one value that is at least as recent as that write.

Let $P_w(W_r, t)$ denote the cumulative distribution function describing the number of replicas $W_r$ that have received version v at time t after v commits. For expanding quorums, W replicas have the value with certainty, and we can model t-visibility by summing the conditional probabilities of each possible $W_r$:

$p_{st} = \frac{{{N-W} \choose R}}{{N \choose R}} + \sum\limits_{c \in (W,N]} (\frac{{{N-c} \choose R}}{{N \choose R}} *  [P_w(c+1,t)-P_w(c,t)])$

<k,t>-staleness

A system obeys <k,t>-staleness consistency with probability $1-p_{skt}$ if at least one value in any read quorum is within k versions of the latest committed version when the read begins, provided the read begins t units of time after the previous k versions commit.

$p_{skt} = \left( \frac{{{N-W} \choose R}}{{N \choose R}} + \sum\limits_{c \in (W,N]} (\frac{{{N-c} \choose R}}{{N \choose R}} *  [P_w(c+1,t)-P_w(c,t)]) \right)^k$

Note that k-staleness equals  <k,0>-staleness consistency, and t-visibility equals <1,t>-staleness consistency.

Monte Carlo modeling of t-visibility

Since the t-visibility formula depends on $P_w$, the cumulative distribution function describing how the write quorum expands (i.e., anti-entropy), it is easier to analyze t-visibility using Monte Carlo simulations. So we first model the quorum system's operations using the *WARS* latency distributions (*W*: write request from coordinator to replica, *A*: write acknowledgment from replica back to the coordinator, *R*: read request from coordinator to replica, *S*: read response from replica back to the coordinator), and then quantify t-visibility.


The read coordinator will return stale data if the first R responses received reached their replicas before the replicas received the latest version (which is delayed by *W*). Note that for a strict quorum, where R+W>N, returning stale data is impossible, because the read quorum will intersect a replica that has seen the latest write. For partial quorum systems, the probability of staleness depends on the latency distributions of *W*, *A*, and *R*, and also indirectly on *S*.

Let wt denote the commit time (i.e., when the coordinator received W acks). A single replica's response is stale if wt + t + r' < w', for w' drawn from the *W* and r' drawn from the *R* latency distributions. Of course, writes expand to additional replicas during *A* and *R*, and that helps.

We can see from this formulation that longer *W* tails and faster reads increase the chance of staleness due to reordering. Dually, for improved consistency, it helps to:

  • reduce variance for *W* write-req from coord to replicas
  • increase *A* write-reply from replicas to coord
  • increase *R* read-request from coord to replicas
  • reduce variance for *S* read-respond from replicas to coord

(The effect of *S* is indirect and very small. If *S* has very high variance, a stale read-reply may get returned ahead of a fresher one. By reducing the variance of *S*, you reduce the chance of such reordering, so fresher replies are more likely to be among the first returned.)

Monte Carlo simulations

Calculating t-visibility for a given value of t is straightforward using Monte Carlo simulations.

  1. Draw N samples each from *W*, *A*, *R*, and *S*,
  2. Compute wt as the Wth smallest value of {*W[i] + A[i]*, i \in [0, N)},
  3. Check whether the first R responses, ordered by *R[i] + S[i]*, all satisfy $wt + t + R[i] \leq W[i]$ (i.e., all of them are stale); the fraction of trials in which this happens estimates $p_{st}$.

The paper uses exponential latency distributions for some Monte Carlo simulations because they are simple. An exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. The cumulative distribution function (CDF) is $F(x;\lambda) = 1- e^{-\lambda x}$ for $x \geq 0$, which gives mean $\frac{1}{\lambda}$ and variance $\frac{1}{\lambda^2}$.
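
Here is a minimal Monte Carlo sketch of this procedure in Python with exponential latencies, following the three steps above. The function name, parameter defaults, and trial count are mine, not from the paper or its demo:

```python
import numpy as np

def p_consistent(N=3, R=1, W=1, t=0.0,
                 lam_w=1.0, lam_a=1.0, lam_r=1.0, lam_s=1.0,
                 trials=100_000, seed=0):
    """Estimate the probability (1 - p_st) that a read started t ms after
    the write commits returns the latest version, under the WARS model."""
    rng = np.random.default_rng(seed)
    consistent = 0
    for _ in range(trials):
        w = rng.exponential(1.0 / lam_w, N)   # W: write requests, coord -> replicas
        a = rng.exponential(1.0 / lam_a, N)   # A: write acks, replicas -> coord
        r = rng.exponential(1.0 / lam_r, N)   # R: read requests, coord -> replicas
        s = rng.exponential(1.0 / lam_s, N)   # S: read responses, replicas -> coord
        wt = np.sort(w + a)[W - 1]            # commit time: Wth smallest W[i]+A[i]
        first_r = np.argsort(r + s)[:R]       # the first R replicas to respond
        # a responder is fresh if the write reached it before its read request did
        if (w[first_r] <= wt + t + r[first_r]).any():
            consistent += 1
    return consistent / trials

print(p_consistent(N=3, R=1, W=1, t=0.0))   # partial quorum, immediate read
print(p_consistent(N=3, R=2, W=2, t=0.0))   # strict quorum: should be 1.0
```

The second call is a sanity check: with R+W>N, the first R responders always include at least one replica that had the write by commit time, so the estimate should come out as 1.0.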

The PBS webpage provides an interactive demonstration of Monte Carlo simulations using the *WARS* model with exponential distributions. The demo gives you a better understanding of the effects of *WARS* distribution and t on consistency.

Write-latency distribution effects

In order to illustrate the effect of the write-latency distribution *W*, the paper fixes *A=R=S* as exponential with $\lambda=1$ and sweeps a range of *W* distributions by varying its $\lambda$.


As expected, we find that high variance in *W* increases staleness:

  • When the variance of *W* is 0.0625ms² ($\lambda$ = 4, mean 0.25ms, one-fourth the mean of *A=R=S*), we observe a 94% chance of consistency immediately after the write and a 99.9% chance after 1ms
  • When the variance of *W* is 100ms² ($\lambda$ = 0.1, mean 10ms, ten times the mean of *A=R=S*), we observe a 41% chance of consistency immediately after the write and a 99.9% chance of consistency only after 65ms

As the variance and mean of *W* increase, so does inconsistency. Looking at distributions with fixed means and variable variances (uniform, normal), the paper finds that the mean of *W* is less important than its variance when *W* is strictly greater than *A=R=S*.
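
For a rough, qualitative reproduction of this sweep, the p_consistent sketch from the previous section can be run over the same λ values (exact percentages will differ with the model details, trial count, and seed):

```python
# Sweep the write-latency rate lam_w while keeping A=R=S exponential with lambda=1.
for lam_w in (4.0, 1.0, 0.1):        # mean W latency = 0.25ms, 1ms, 10ms
    for t in (0.0, 1.0, 10.0):       # read delay after commit, in ms
        print(f"lam_w={lam_w:<4} t={t:<5} P(consistent)={p_consistent(t=t, lam_w=lam_w):.3f}")
```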

Using production traces

Instead of only simulating with exponential distributions, the paper also maps the simulations to production deployments by fitting production latency data (obtained from LinkedIn and Yammer deployments) to standard distributions. Pareto distributions turn out to fit the observed latencies best in most cases.

LNKD-SSD versus LNKD-DISK comparisons provide a good validation of the PBS finding that reducing *W* variance contributes most to consistency. The figure shows that SSDs improve consistency immensely due to their reduced write variance. Immediately after write commit, LNKD-SSD had a 97.4% probability of consistent reads, reaching over a 99.999% probability of consistent reads after 5ms.


Another thing to notice is that R=2 & W=1 (the blue square plot) blows the socks off of R=1 & W=2 (the green circle plot). Why? Aren't these supposed to be symmetrical? My explanation is this. By increasing R by one rather than W, you incur only a little extra read latency, and in return you get to hit two replicas with every read, which exponentially decreases the probability that all of the contacted replicas miss the latest version (assuming the variance on *W* is not large, so the write quickly expands to the remaining replicas).

Why is this not more widely adopted in partial quorum systems? W=1 makes you vulnerable to data loss, but only if the replica and the coordinator crash at the same time, before the coordinator has had a chance to forward the already-acknowledged write to the other replicas (not very plausible). And even with W>1, reading from more replicas improves consistency quickly, so it is low-hanging fruit to reap when the performance requirements don't forbid it.

Figure 7 shows how varying N affects t-visibility while maintaining R=W=1. As expected, the probability of consistency immediately after write commit decreases as N increases. But you can see that SSDs totally rock! Even with increased N, they keep a very high consistency probability thanks to the very low variance on *W* write latency across replicas.


Finally, Table 4 compares the t-visibility required for a 99.9% probability of consistent reads to the 99.9th percentile read and write latencies. The table shows that lowering R and W can greatly improve operation latency, and that t-visibility can stay low even when we require a high probability of consistent reads. Again, note how much of an improvement W=1 & R=2 provides over W=2 & R=1! That makes a big difference.



MAD questions

1. Is it possible to change the read API to capture sweet spots in the trade-off between consistency and read/write latency??

There is a knee in the consistency-versus-time curve for large $\lambda$ (i.e., small mean and variance). Waiting until the knee gives you the most bang for the buck in consistency; waiting beyond the knee helps less.

What if we fix the time waited on a read, instead of fixing R, the number of replicas to read from? This would prevent the coordinator from accepting an early stale read-reply as sufficient. The coordinator can wait for the period its SLA allows (or cut that short as additional read replies arrive), and this avoids falling for an early stale read reply.
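
Here is a hypothetical sketch of such a deadline-based read path; the names (Reply, deadline_read, budget_ms) and the reply-iterator interface are illustrative, not from any existing system:

```python
import time
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Reply:
    version: int
    value: object

def deadline_read(replies: Iterable[Reply], budget_ms: float,
                  min_replies: int = 1) -> Optional[Reply]:
    """Keep collecting replica replies (assumed to be yielded as they arrive)
    until the time budget expires, then return the freshest value seen."""
    deadline = time.monotonic() + budget_ms / 1000.0
    best, count = None, 0
    for reply in replies:
        count += 1
        if best is None or reply.version > best.version:
            best = reply
        if count >= min_replies and time.monotonic() >= deadline:
            break
    return best

# Toy usage: three replies, the second one is the freshest.
print(deadline_read(iter([Reply(2, "a"), Reply(5, "b"), Reply(3, "c")]), budget_ms=5))
```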


2. By working from first principles, can we find an unexplored part of the state space to improve consistency??

We saw that for improved consistency, it helps to

  • reduce variance for *W* write-req from coord to replicas
  • increase *A* write-reply from replicas to coord
  • increase *R* read-request from coord to replicas

What are some alternative ways to satisfy these conditions?

If we take this to its logical extreme, it is better to keep the replicas close to each other (in the same cluster on a LAN, or in nearby regions on a WAN) and away from the client. This setup reduces *W* variance and increases the *A* and *R* durations.

I wonder if we can find other unexplored points in the space.


3. Why don't we use PBS analysis in WAN to help cloud clients decide on which regions to deploy??

I had mentioned that Azure Cosmos DB lets clients configure the default consistency level on their database account (and later override it on a specific read request). For all four relaxed consistency levels (bounded staleness, session, consistent prefix, and eventual), among other metrics, Cosmos DB also tracks and reports the probabilistic bounded staleness (PBS) metric, which I think is unique among available solutions.

I think this use of PBS can be extended to help customers decide which regions to deploy in. In the Azure cloud, customers can deploy across 50+ regions, and the selection of regions has latency and consistency implications when relaxed consistency levels are chosen. Moreover, Cosmos DB does not restrict the client to a single write region; it allows multiple write regions and resolves conflicts by way of an Arbiter and anti-entropy mechanism. So PBS metrics can also be used to help clients get the most out of this by choosing deployment regions that suit their access patterns. I will be looking at this in the coming weeks.

Wednesday, January 9, 2019

Paper review. An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems

Crashes happen. In fact they occur so commonly that they are classified as anticipated faults. In a large cluster of several hundred machines, you will have one node crashing every couple of hours. Unfortunately, as this paper shows, we are still not very competent at handling crash failures.

This paper from 2018 presents a comprehensive empirical study of 103 crash recovery bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For all the studied bugs, the authors analyze the root causes, triggering conditions, impacts, and fixes.

Summary of the findings

Crash recovery bugs are caused by five types of bug patterns:
  • incorrect backup (17%)
  • incorrect crash/reboot detection (18%)
  • incorrect state identification (16%)
  • incorrect state recovery (28%) 
  • concurrency (21%)

Almost all (97%) of crash recovery bugs involve no more than four nodes. This finding indicates that we can detect crash recovery bugs in a small set of nodes, rather than thousands.

A majority (87%) of crash recovery bugs require a combination of no more than three crashes and no more than one reboot. This suggests that we can systematically test almost all node crash scenarios with very limited crashes and reboots.

Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence. This indicates that new approaches to validate crash recovery bug fixes are necessary.

A generic crash recovery model 

The study uses a generic crash recovery model, with components such as backing up, crash detection, crash handling, and local recovery, to categorize the parts of a system's crash recovery.


The study leverages the existing cloud bug study database, CBS. CBS contains 3,655 vital issues from six distributed systems (ZooKeeper, Hadoop MapReduce, Cassandra, HBase, HDFS and Flume), reported from January 2011 to January 2014. The dataset of the 103 crash recovery bugs this paper analyzes is available at http://www.tcse.cn/~wsdou/project/CREB.


Root cause


Finding 1: In 17/103 crash recovery bugs, in-memory data are not backed up, or backups are not properly managed.

Finding 2: In 18/103 crash recovery bugs, crashes and reboots are not detected, or not detected in a timely manner. (E.g., when a node crashes, some other relevant nodes may access the crashed node without perceiving that it has crashed, and they may then hang or throw errors. Or, if a crashed node reboots very quickly, the crash may be overlooked by a timeout-based crash detection component. Thus, the crashed node may contain corrupted state, and the system has nodes whose states are out of sync, violating invariants.)

Finding 3: In 17/103 crash recovery bugs, the states after crashes/reboots are incorrectly identified. (E.g., the recovery process mistakenly treats wrong states as correct: the node may think it is still the leader after recovery.)

Finding 4: The states after crashes/reboots are incorrectly recovered in 29/103 crash recovery bugs. Among them, 14 bugs are caused by missing or untimely handling of certain leftovers.

Finding 5: The concurrency introduced by crash recovery processes is responsible for 22/103 crash recovery bugs. (Concurrency between a recovery process and normal execution accounts for 13 bugs, concurrency between two recovery processes accounts for 4 bugs, and concurrency within one recovery process accounts for 5 bugs.)

Finding 6: All seven recovery components in the crash recovery model can be incorrect and introduce bugs. About one third of the bugs arise in the crash handling component. (Overall, the crash handling component is the most error-prone (34%). The next three components, backing up (19%), local recovery (17%), and crash detection (13%), also account for a large portion.)

Finding 7: In 15/103 crash recovery bugs, new relevant crashes occur during the crash recovery process and thus trigger failures.


So it seems that for 85 out of 103 bugs (excluding the 17 bugs from Finding 1), state inconsistency across nodes is the culprit. I don't want to be the guy who always plugs "invariant-based design" but, come on... Invariant-based design is what we need here to prevent the problems that arise from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what could go wrong?" and how to prevent it. You inevitably fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures.

In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure these properties hold as invariants of your system at all times. Invariant-based reasoning takes a principled, state-based rather than operation/execution-based view of your system. To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do" and ensure that "nothing bad happens"; for example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do" and ensure that "something good eventually happens"; for example, whenever the system receives a request, it must eventually respond to that request.

Here is another argument for using invariant-based design; as an example, you can check my post on the two-phase commit protocol.

Triggering conditions

Finding 8: Almost all (97%) crash recovery bugs involve four nodes or fewer.

Finding 9: No more than three crashes can trigger almost all (99%) crash recovery bugs. No more than one reboot can trigger 87% of the bugs. In total, a combination of no more than three crashes and no more than one reboot can trigger 87% (90 out of 103) of the bugs.

Finding 10: 63% of crash recovery bugs require at least one client request, but 92% of the bugs require no more than three client requests.

Finding 11: 38% of crash recovery bugs require complicated input conditions, e.g., special configurations or background services.

Finding 12: The timing of crashes/reboots is important for reproducing crash recovery bugs.

Bug impact

Finding 13: Crash recovery bugs always have severe impacts on reliability and availability of distributed systems. 38% of the bugs can cause node downtimes, including cluster out of service and unavailable nodes.

Finding 14: Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence.

Finding 15: Crash recovery bug fixing is complicated. Considerable developer effort was spent on these fixes.


MAD questions

1. Why is this not a solved problem?
Crash faults have been with us for a long time. They have become even more relevant with the advent of containers in cloud computing, which may be shut down or migrated for resource-management purposes. If so, why are we still not very competent at handling crash recovery?

It even seems like we have some tool support. There are plenty of write-ahead-logging implementations available to deal with the bugs in Finding 1 related to backing up in-memory data. (Well, that is assuming you have a good grasp of which data are important to back up via write-ahead logging.) We can use ZooKeeper for keeping configuration information, so that the nodes involved don't have differing opinions about which node is down. While keeping the configuration in ZooKeeper helps alleviate some of the state-inconsistency problems, we still need invariant-based design (and model checking of the protocol) to make sure the protocol does not suffer from state-inconsistency problems.

This is a sincere question, not a covert way to suggest that developers are being sloppy. I know better than that, and I have a lot of respect for the developers. That means the problem is stickier and hairier at the implementation level than it looks, and our high-level abstractions are leaking. There has been relevant work, "On the complexity of crafting crash-consistent applications", in OSDI'14. There is also recent follow-up work on "Protocol-Aware Recovery for Consensus-Based Distributed Storage", which I would like to read soon.

Finally, John Ousterhout's group worked on write-ahead logging and recovery in in-memory systems as part of the RAMCloud project, and I should check recent work from that group.


2. How does this relate to crash-only software?
It is unfortunate that the crash-only software paper is not cited and discussed by this paper, because I think crash-only software suggested a good, albeit radical, way to handle crash recovery. As the findings in this paper show, a big part of the reason these bugs occur is that the crash-recovery paths are not exercised/tested enough during development, or even during normal use. The "crash-only software" work had the insight to exercise crashes as part of normal use: "Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A crash-only system makes it affordable to transform every detected failure into component-level crashes; this leads to a simple fault model, and components only need to know how to recover from one type of failure."


3. How does this compare with TaxDC paper? 
The TaxDC paper studied distributed coordination (DC) bugs in the cloud and showed that more than 60% of DC bugs are triggered by a single untimely message delivery that commits order violation or atomicity violation, with regard to other messages or computation. (Figure 1 shows possible triggering patterns.) While this claim sounds like a very striking and surprising finding, it is actually straightforward. What is a DC bug? It is a manifestation of state inconsistency across processes. What makes the state inconsistency into a bug? A communication/message-exchange between the two inconsistent processes.

Compared to the TaxDC paper, this paper focuses on a small set of bugs, only the crash-recovery bugs. In comparison to the TaxDC paper, the paper states that crash recovery bugs are more likely to cause fatal failures than DC bugs. In contrast to 17% of DC bugs, a whopping 38% of the crash-recovery bugs caused additional node downtimes, including cluster out of service and unavailable nodes.

Saturday, January 5, 2019

Paper review. Serverless computing: One step forward, two steps back

Serverless computing offers the potential to program the cloud in an autoscaling and pay-only-per-invocation manner. This paper from UC Berkeley (to appear at CIDR 19) discusses limitations in the first-generation serverless computing, and argues that its autoscaling potential is at odds with data-centric and distributed computing. I think the paper is written to ignite debate on the topic, so here I am writing some of my takes on these arguments.

Overall, I think the paper could have been written in a more constructive tone. After you read the entire paper, you get the sense that the authors want to open a constructive dialogue for improving the state of serverless rather than to crucify it. However, the overly critical tone of the paper leads to some unfair claims, not only about serverless but also about cloud computing in general. This paragraph in the introduction is a good example:
New computing platforms have typically fostered innovation in programming languages and environments. Yet a decade later, it is difficult to identify the new programming environments for the cloud. And whether cause or effect, the results are clearly visible in practice: the majority of cloud services are simply multi-tenant, easier-to-administer clones of legacy enterprise data services like object storage, databases, queueing systems, and web/app servers. Multitenancy and administrative simplicity are admirable and desirable goals, and some of the new services have interesting internals in their own right. But this is, at best, only a hint of the potential offered by millions of cores and exabytes of data.

I think this paragraph downplays a lot of technologies developed for the cloud (from both a programming-languages and an environments perspective): MapReduce, Resilient Distributed Datasets, the Hadoop and Spark environments, real-time data processing and streaming systems, distributed machine learning systems, large-scale caching, scalable globe-spanning databases from NoSQL to NewSQL, integration frameworks, RESTful architectures, microservices frameworks, and Lambda and Kappa architectures. In addition to these, cloud computing also gave rise to supporting systems (VMs, containers), scheduling frameworks (Mesos, Kubernetes), SLAs, and observability, debugging, and devops tools.

Some background on Serverless aka Function as a Service (FaaS)

I had reviewed a paper on serverless/FaaS earlier on this blog. It is worthwhile to spend a couple of minutes on that summary to refresh your background on FaaS.

FaaS offerings today support a variety of languages (e.g., Python, Java, Javascript, Go) for shipping the code to a common runtime, and allow developers to register functions and declare events that trigger each function. The FaaS infrastructure monitors the triggering events, allocates a runtime for the function, executes it, and persists the results.  FaaS requires data management in both persistent and temporary storage, and in the case of AWS, this includes S3 (large object storage), DynamoDB (key-value storage), SQS (queuing services), SNS (notification services), and more. The user is billed only for the computing resources used during function invocation.
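
For concreteness, a Python function registered with AWS Lambda is just a handler of the following shape; the event contents and the expected return format depend on the trigger, and the body below is purely illustrative:

```python
import json

def lambda_handler(event, context):
    # The platform invokes this with the triggering event (e.g., an S3
    # notification or an API Gateway request) and a context object.
    # Illustrative body: echo back the top-level keys of the event.
    return {
        "statusCode": 200,
        "body": json.dumps({"received_keys": sorted(event.keys())}),
    }
```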

The shortcomings of current FaaS offerings

1) Limited Lifetimes. After 15 minutes, function invocations are shut down by the Lambda infrastructure. Lambda may keep the function's state cached in the hosting VM to support "warm start", but there is no way to ensure that subsequent invocations are run on the same VM.

2) I/O Bottlenecks. Lambdas connect to cloud services—notably, shared storage—across a network interface. This means moving data across nodes or racks. Recent studies show that a single Lambda function can achieve on average only 538Mbps network bandwidth.

3)  Communication Through Slow Storage. While Lambda functions can initiate outbound network connections, they themselves are not directly network-addressable in any way. A client of Lambda cannot address the particular function instance that handled the client's previous request: there is no "stickiness" for client connections. Hence maintaining state across client calls requires writing the state out to slow storage, and reading it back on every subsequent call.

4) No Specialized Hardware. FaaS offerings today only allow users to provision a timeslice of a CPU hyperthread and some amount of RAM; in the case of AWS Lambda, one determines the other. There is no API or mechanism to access specialized hardware.

All of these are factual and fair assessments of current FaaS offerings. I think "no specialized hardware" is not inherent and can change as applications require it. It can be argued that the other three shortcomings are actually deliberate design decisions for FaaS offerings based on its primary goal: providing pay-only-per-invocation autoscaling computation. FaaS does not require reservation and does not charge you when your code/app is not being used, yet it allows the code/app to deploy quickly, within a second, in contrast to the roughly 20 seconds required for a container and the minutes required for a VM.

Forward but also backward

The paper acknowledges autoscaling as a big step forward: "By providing autoscaling, where the workload automatically drives the allocation and deallocation of resources, current FaaS offerings take a big step forward for cloud programming, offering a practically manageable, seemingly unlimited compute platform."

But it calls out the following two points as major steps backward:
They ignore the importance of efficient data processing. Serverless functions are run on isolated VMs, separate from data. Moreover their capacity to cache state internally to service repeated requests is limited. Hence FaaS routinely "ships data to code" rather than "shipping code to data." This is a recurring architectural anti-pattern among system designers, which database aficionados seem to need to point out each generation.  
They stymie the development of distributed systems. Because there is no network addressability of serverless functions, two functions can work together serverlessly only by passing data through slow and expensive storage. This stymies basic distributed computing. That field is founded on protocols performing  fine-grained communication between agents, including basics like leader election, membership, data consistency, and transaction commit.
...
In short, with all communication transiting through storage, there is no real way for thousands (much less millions) of cores in the cloud to work together efficiently using current FaaS platforms other than via largely uncoordinated (embarrassing) parallelism.

Again, I agree with the desirability of efficient data processing, shipping code to data, and network-addressable processes/functions. On the other hand, current FaaS offerings are making a trade-off between efficiency and easy-to-program, pay-only-per-invocation autoscaling functionality. Going from on-prem to IaaS to PaaS to FaaS, you relinquish control, and in return you get higher developer productivity. Systems design is all about trade-offs.

Is it possible to have both very high control and a very easy-to-program, high-productivity system? If you restrict yourself to a constrained domain, it may be possible to have both at their fullest. Otherwise, I doubt you can have both at their full extents for general, unconstrained computation. However, there may be better sweet spots in the design space. Azure Durable Functions provide stateful FaaS and partially address the two backward steps discussed above, about efficient data processing and addressable serverless functions.

Case studies

The paper reports on three case studies implemented using AWS Lambda and compares them to implementations on EC2.

1. Model training for star prediction from Amazon product review. This algorithm on Lambda is 21× slower and 7.3× more expensive than running on EC2.

2. Low-latency prediction serving via batching. This uses SQS (I am not sure whether it could be done another way). The corresponding EC2 implementation has a per-batch latency of 2.8ms and is 127× faster than the optimized Lambda implementation. The EC2 instance, on the other hand, has a throughput of about 3,500 requests per second, so 1 million messages per second would require 290 EC2 instances, with a total cost of $27.84 per hour. This is still a 57× cost saving compared to the Lambda implementation.

3. Distributed computing. This implements Garcia-Molina's bully leader election in Python. Using Lambda, all communication between the functions was done in blackboard fashion via DynamoDB. With each Lambda polling four times a second, they found that each round of leader election took 16.7 seconds. Communicating via cloud storage is not a reasonable replacement for directly-addressed networking--it is at least one order of magnitude too slow.

Again, I think this is not an apples-to-apples comparison. As noted above, FaaS offerings provide pay-only-per-invocation autoscaling computation with no reservation, no charge while your code/app sits idle, and sub-second deployment. When you compare three hours of continuous use, FaaS will come out costing more, but over a longer period, say many days or weeks of sporadic use, FaaS will be the cheaper option because of its pay-only-per-invocation billing.

Stepping forward to the future

The paper mentions the following two as particularly important directions to address for providing stronger FaaS offerings.

Fluid Code and Data Placement. To achieve good performance, the infrastructure should be able and willing to physically colocate certain code and data. This is often best achieved by shipping code to data, rather than the current FaaS approach of pulling data to code. At the same time, elasticity requires that code and data be logically separated, to allow infrastructure to adapt placement: sometimes data needs to be replicated or repartitioned to match code needs. In essence, this is the traditional challenge of data independence, but at extreme and varying scale, with multi-tenanted usage and fine-grained adaptivity in time. High-level, data-centric DSLs--e.g., SQL+UDFs, MapReduce, TensorFlow-- can make this more tractable, by exposing at a high level how data  flows through computation. The more declarative the language, the more logical separation (and optimization search space) is offered to the infrastructure. 
Long-Running, Addressable Virtual Agents. Affinities between code, data and/or hardware tend to recur over time. If the platform pays a cost to create an affinity (e.g. moving data), it should recoup that cost across multiple requests. This motivates the ability for programmers to establish software agents--call them functions, actors, services, etc.-- that persist over time in the cloud, with known identities. Such agents should be addressable with performance comparable to standard networks. However, elasticity requirements dictate that these agents be virtual and dynamically remapped across physical resources. Hence we need virtual alternatives to traditional operating system constructs like "threads" and "ports": nameable endpoints in the network. Various ideas from the literature have provided aspects of this idea: actors, tuplespaces pub/sub and DHTs are all examples. Chosen solutions need to incur minimal overhead on raw network performance.

MAD questions 


1. How much of the claims/arguments against FaaS also apply to PaaS?
PaaS sits between IaaS and FaaS in the spectrum, and it is prone to the same arguments made here against the limitations of FaaS. For PaaS you can also argue that it ignores the importance of efficient data processing and stymies the development of distributed systems. PaaS does come out a bit better than FaaS on closer analysis, though: it doesn't have limited lifetimes, it can maintain sessions for clients, and it can support specialized hardware better. On the other hand, FaaS can scale to more invocations much faster, in a couple of seconds, compared to 20+ seconds for PaaS platforms that use containers.

2. Is ease-of-development more important than performance?
Around 2010, Hadoop MapReduce got a lot of traction/adoption, even though the platform was very inefficient/wasteful and not in the "distributed systems spirit" (the same charge this paper levels at FaaS). This was pretty surprising for me to witness. It seemed like nobody cared how inefficient the platform was; people were OK waiting hours for the inefficient MapReduce jobs to finish. They did this because it beats the alternative of spending even more hours (and developer cost) on building an efficient implementation that does the job.

It seems like ease-of-development takes precedence over performance/cost. In other words, worse is better!

FaaS is all about optimizing developer productivity and time-to-market. FaaS enables developers to prototype something in a week rather than trying to build the same functionality "the right way" in a couple of months. It is not worth investing many weeks into developing an optimized system before you can test whether there is a good product fit. But after you prototype using FaaS, test the product, and decide on ways to improve and tune the system, if you still need more efficiency and lower cloud costs, you can get that by leveraging the experience gained from your prototype and implementing an optimized version of the system "the right way" over IaaS or PaaS.

FaaS provides ease of development and quick on-demand scaling to handle flash floods. Also, despite the disadvantages in efficiency, being stateless is a big plus for fault tolerance. So I think current FaaS offerings will not have much competition for at least another five years. We are still at the start of the FaaS technology curve. In the meantime, more efficient, more reactive, fluid versions will mature, and they will hopefully improve on the ease-of-programming aspect. The early prototypes of these fluid systems, even for constrained domains, still require a lot of programmer skill and knowledge of distributed systems internals, and we need to simplify programming for those systems.

3. How does this play with disaggregated storage/computing trend?
As a cloud-service provider your biggest challenge is to find cost-effective ways to host your clients in a multitenant environment while providing them isolation. It is very hard to rightsize the clients. If you overestimate, you lose money. If you underestimate, the clients get upset, take their business elsewhere, and you lose money.

Enter disaggregation. Disaggregation gives you flexibility for multitenancy and keeps costs down. It simplifies resource management and client rightsizing, and it enables cloud providers to offer performant yet cost-efficient multitenancy. Yes, there is an overhead to disaggregation, namely shipping data to code/compute, but that overhead is balanced by the flexibility of multitenancy you get in return.


Acknowledgment.
I thank Chris Anderson, Matias Quaranta, Ailidani Ailijiang for insightful discussion on Azure functions and Azure durable functions.
