Saturday, January 5, 2019

Paper review. Serverless computing: One step forward, two steps back

Serverless computing offers the potential to program the cloud in an autoscaling and pay-only-per-invocation manner. This paper from UC Berkeley (to appear at CIDR 19) discusses limitations in the first-generation serverless computing, and argues that its autoscaling potential is at odds with data-centric and distributed computing. I think the paper is written to ignite debate on the topic, so here I am writing some of my takes on these arguments.

Overall, I think the paper could have been written in a more constructive tone. After you read the entire paper, you get a sense that the authors want to open constructive dialogue for improving the state of serverless rather than to crucify it. However, the overly-critical tone of the paper leads to some unfair claims,  not only about serverless, but also about cloud computing as well. This paragraph in the introduction is a good example.
New computing platforms have typically fostered innovation in programming languages and environments. Yet a decade later, it is difficult to identify the new programming environments for the cloud. And whether cause or effect, the results are clearly visible in practice: the majority of cloud services are simply multi-tenant, easier-to-administer clones of legacy enterprise data services like object storage, databases, queueing systems, and web/app servers. Multitenancy and administrative simplicity are admirable and desirable goals, and some of the new services have interesting internals in their own right. But this is, at best, only a hint of the potential offered by millions of cores and exabytes of data.

I think this paragraph downplays a lot of technologies developed for the cloud (from both programming languages and environments perspective): MapReduce, Resilient Distributed Datasets, Hadoop environment, Spark environment, real time data processing and streaming systems, distributed machine learning systems, large-scale caching, scalable globe-spanning databases from NoSQL to NewSQL, integration frameworks, RESTful architectures, microservices frameworks,  and Lambda and Kappa architectures. In additon to these, cloud computing also gave rise to supporting systems (VMs, containers), scheduling frameworks (Mesos, Kubernetes), SLAs, and observability, debugging, and devops tools.

Some background on Serverless aka Function as a Service (FaaS)

I had reviewed a paper on serverless/FaaS earlier on this blog. It is worthwhile to check that summary for a couple minutes, for refreshing your background on FaaS. 

FaaS offerings today support a variety of languages (e.g., Python, Java, Javascript, Go) for shipping the code to a common runtime, and allow developers to register functions and declare events that trigger each function. The FaaS infrastructure monitors the triggering events, allocates a runtime for the function, executes it, and persists the results.  FaaS requires data management in both persistent and temporary storage, and in the case of AWS, this includes S3 (large object storage), DynamoDB (key-value storage), SQS (queuing services), SNS (notification services), and more. The user is billed only for the computing resources used during function invocation.

The shortcomings of current FaaS offerings

1) Limited Lifetimes. After 15 minutes, function invocations are shut down by the Lambda infrastructure. Lambda may keep the function's state cached in the hosting VM to support "warm start", but there is no way to ensure that subsequent invocations are run on the same VM.

2) I/O Bottlenecks. Lambdas connect to cloud services—notably, shared storage—across a network interface. This means moving data across nodes or racks. Recent studies show that a single Lambda function can achieve on average only 538Mbps network bandwidth.

3)  Communication Through Slow Storage. While Lambda functions can initiate outbound network connections, they themselves are not directly network-addressable in any way. A client of Lambda cannot address the particular function instance that handled the client's previous request: there is no "stickiness" for client connections. Hence maintaining state across client calls requires writing the state out to slow storage, and reading it back on every subsequent call.

4) No Specialized Hardware. FaaS offerings today only allow users to provision a timeslice of a CPU hyperthread and some amount of RAM; in the case of AWS Lambda, one determines the other. There is no API or mechanism to access specialized hardware.

All of these are factual and fair assessment of current FaaS offerings. I think "no specialized hardware" is not inherent, and can change as applications require it. It can be argued that the other three shortcomings are actually deliberate design decisions for FaaS offerings based on its primary goals: providing pay-only-per-invocation autoscaling computation. FaaS does not require reservation, and does not charge you when your code/app is not being used, yet allows the code/app to deploy quickly within a second in contrast to 20 seconds required for a container and on the order of minutes required for a VM.

Forward but also backward

The paper acknowledges autoscaling as a big step forward: "By providing autoscaling, where the workload automatically drives the allocation and deallocation of resources, current FaaS offerings take a big step forward for cloud programming, offering a practically manageable, seemingly unlimited compute platform."

But it calls out the following two points as major steps backward:
They ignore the importance of efficient data processing. Serverless functions are run on isolated VMs, separate from data. Moreover their capacity to cache state internally to service repeated requests is limited. Hence FaaS routinely "ships data to code" rather than "shipping code to data." This is a recurring architectural anti-pattern among system designers, which database aficionados seem to need to point out each generation.  
They stymie the development of distributed systems. Because there is no network addressability of serverless functions, two functions can work together serverlessly only by passing data through slow and expensive storage. This stymies basic distributed computing. That field is founded on protocols performing  fine-grained communication between agents, including basics like leader election, membership, data consistency, and transaction commit.
...
In short, with all communication transiting through storage, there is no real way for thousands (much less millions) of cores in the cloud to work together efficiently using current FaaS platforms other than via largely uncoordinated (embarrassing) parallelism.

Again I agree with the desirability of efficient data processing, shipping code to data, and network addressable processes/functions. On the other hand, current FaaS offerings are making a trade-off between efficiency and easy-to-program pay-only-per-invocation autoscaling functionality. Going from on-prem to IaaS to PaaS to FaaS, you relinquish control yet in return you get higher productivity for the developer. Systems design is all about tradeoffs.

Is it possible to have both very high control and very easy-to-program/high-productivity system? If you restrict yourself to a constrained domains, it may be able to have both features at their full-high-point together. Otherwise I doubt you can have both at their full extents for general unconstrained computation domains. However, there may be better sweet-points in the design space. Azure Durable Functions provide stateful FaaS and partially addresses the two backwardness concerns above about efficient data-processing and addressable serverless functions.

Case studies

The paper reports from 3 case studies they implemented using AWS Lambda and compare them to implementations on EC2.

1. Model training for star prediction from Amazon product review. This algorithm on Lambda is 21× slower and 7.3× more expensive than running on EC2.

2. Low-latency prediction serving via batching.  This uses SQS (I am not sure if it could be done another way). The corresponding EC2 implementation has a  per batch latency of 2.8ms and is 127× faster than the optimized Lambda implementation. The EC2 instance on the other hand has a throughput of about 3,500 requests per second, so 1 million messages per second would require 290 EC2 instances, with a total cost of $27.84 per hour. This is still a 57× cost savings compared to the Lambda implementation.

3. Distributed computing. This implements Garcia-Molina's bully leader election in Python. Using Lambda, all communication between the functions was done in blackboard fashion via DynamoDB. With each Lambda polling four times a second, they found that each round of leader election took 16.7 seconds. Communicating via cloud storage is not a reasonable replacement for directly-addressed networking--it is at least one order of magnitude too slow.

Again, I think this is not an apples to apples comparison. FaaS offerings  provide pay-only-per-invocation autoscaling computation. FaaS does not require reservation, and does not charge you when your code/app is not being used, yet allows the code/app to deploy quickly within a second in contrast to 20 seconds required for a container and on the order of minutes required for a VM. When you compare for a 3 hours continuous use FaaS will come out costing more, but over a longer period, say many days or weeks of sporadic use FaaS will give you the cheaper option because of its pay-only-per-invocation billing.

Stepping forward to the future

The paper mentions the following two as particularly important directions to address for providing stronger FaaS offerings.

Fluid Code and Data Placement. To achieve good performance, the infrastructure should be able and willing to physically colocate certain code and data. This is often best achieved by shipping code to data, rather than the current FaaS approach of pulling data to code. At the same time, elasticity requires that code and data be logically separated, to allow infrastructure to adapt placement: sometimes data needs to be replicated or repartitioned to match code needs. In essence, this is the traditional challenge of data independence, but at extreme and varying scale, with multi-tenanted usage and fine-grained adaptivity in time. High-level, data-centric DSLs--e.g., SQL+UDFs, MapReduce, TensorFlow-- can make this more tractable, by exposing at a high level how data  flows through computation. The more declarative the language, the more logical separation (and optimization search space) is offered to the infrastructure. 
Long-Running, Addressable Virtual Agents. Affinities between code, data and/or hardware tend to recur over time. If the platform pays a cost to create an affinity (e.g. moving data), it should recoup that cost across multiple requests. This motivates the ability for programmers to establish software agents--call them functions, actors, services, etc.-- that persist over time in the cloud, with known identities. Such agents should be addressable with performance comparable to standard networks. However, elasticity requirements dictate that these agents be virtual and dynamically remapped across physical resources. Hence we need virtual alternatives to traditional operating system constructs like "threads" and "ports": nameable endpoints in the network. Various ideas from the literature have provided aspects of this idea: actors, tuplespaces pub/sub and DHTs are all examples. Chosen solutions need to incur minimal overhead on raw network performance.

MAD questions 


1. How much of the claims/arguments against FaaS also apply to PaaS?
PaaS is in between IaaS and FaaS in the spectrum. It is prone to the same arguments made against limitations of FaaS here. For PaaS you can also argue that it ignores the importance of efficient data processing and it stymies the development of distributed systems. PaaS is a bit better than in FaaS in the closer analysis though: It doesn't have limited lifetimes, it can maintain sessions for clients, and it can support specialized hardware better. On the other hand, FaaS can scale to more invocations much faster in in a couple seconds compared to 20+ seconds for PaaS that use containers.

2. Is ease-of-development more important than performance?
Around 2010 Hadoop MapReduce got a lot of traction/adoption, even though the platform was very inefficient/wasteful and not in "distributed systems spirit" (as this paper calls FaaS). This was pretty surprising for me to witness. It seemed like  nobody cared how inefficient the platform was; people were OK waiting hours for the inefficient mapreduce jobs to finish. They did this because this beats the alternative of spending even more hours (and developer cost) in developing an efficient implementation that does the job.

It seems like ease-of-development takes precedence over performance/cost. In other words, worse is better!

FaaS is all about optimizing developer productivity and time-to-market. FaaS enables developers to prototype something in a week rather than trying to build the same functionality "the right way" in a couple months. It is not worth investing many weeks into developing an optimized system, before you could test if there is a good product fit. But after you prototype using FaaS and test the product, decide on ways to improve and tune the system, if you still need more efficiency and lower cloud costs, you can provide that by leveraging on the experience you got from your prototype and implement an optimized version of the system "the right way" over IaaS over PaaS.

FaaS provides ease of development, and quick on-demand scaling to handle flash-floods. Also despite the disadvantages in efficiency, being stateless is a big plus for fault-tolerance. So I think current FaaS offerings will not have much contest at least for another 5 years. We are still at the start of the FaaS technology curve. In the meanwhile more efficient, more reactive, fluid versions will get ready, and they will hopefully slowly improve on the  ease-of-programming aspect. The early prototypes of the fluid systems, even for constrained domains, still require a lot of programmer skill and knowledge of distributed systems internals, and we need to simplify programming for those systems.

3. How does this play with disaggregated storage/computing trend?
As a cloud-service provider your biggest challenge is to find cost-effective ways to host your clients in a multitenant environment while providing them isolation. It is very hard to rightsize the clients. If you overestimate, you lose money. If you underestimate, the clients get upset, take their business elsewhere, and you lose money.

Enter disaggregation. Disaggregation gives you flexibility for multitenancy and enables keeping the costs down. It simplifies the management of resources and rightsizing the clients and enables the cloud providers to provide performant yet cost-efficient multitenancy. Yes, there is an overhead to disaggregation, namely shipping data to code/compute, but that overhead is balanced by the flexibility of multitenancy you get in return.


Acknowledgment.
I thank Chris Anderson, Matias Quaranta, Ailidani Ailijiang for insightful discussion on Azure functions and Azure durable functions.

1 comment:

Unknown said...

On the title: "Some background on Serverless aka Function as a Service (FaaS)"

I think there is some discussion online that Serverless is not just FaaS as there are platforms like Zeit (https://zeit.co/blog/serverless-docker) that allow full applications on docker containers to be serverless as well i.e. pay-as-you-go, infinite scale, etc.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...