Towards Modern Development of Cloud Applications

- November 22, 2023

This paper is from HotOS'23. At 6 pages, it is an easy-to-read paper, but it is not an easy-to-agree-with paper. The message is controversial: Don't do microservices, write a monolith, and our runtime will take care of deployment and distribution. This is a big claim, and we have been burned by ambitious attempts like this many times before. I realize big claims are part of the style of HotOS, where work-in-progress and sometimes provocative papers make a debut to kickstart a discussion. This paper sure does a good job of starting a discussion.
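To make the pitch concrete, here is a minimal Go sketch of the idea as I read it: application code talks to a component through an interface, and something else decides whether the call is a plain method call or a serialized remote call. This is not the ServiceWeaver API. The names Cart, localCart, remoteCart, newCart, and checkout are made up for illustration, and the "network" is just a function call carrying JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Cart is a hypothetical component interface; application code depends only on it.
type Cart interface {
	AddItem(user, item string) (int, error)
}

// localCart is the in-process implementation of the component.
type localCart struct{ items map[string][]string }

func (c *localCart) AddItem(user, item string) (int, error) {
	c.items[user] = append(c.items[user], item)
	return len(c.items[user]), nil
}

// Request and reply messages used only when the call crosses a process boundary.
type addItemReq struct{ User, Item string }
type addItemResp struct{ Count int }

// remoteCart is a client stub: it serializes the call and hands it to send,
// which stands in for a real network transport (an assumption of this sketch).
type remoteCart struct {
	send func(payload []byte) ([]byte, error)
}

func (c *remoteCart) AddItem(user, item string) (int, error) {
	payload, err := json.Marshal(addItemReq{User: user, Item: item})
	if err != nil {
		return 0, err
	}
	reply, err := c.send(payload)
	if err != nil {
		return 0, err
	}
	var resp addItemResp
	if err := json.Unmarshal(reply, &resp); err != nil {
		return 0, err
	}
	return resp.Count, nil
}

// serve is the server-side counterpart: decode, call the real implementation, encode.
func serve(impl Cart) func([]byte) ([]byte, error) {
	return func(payload []byte) ([]byte, error) {
		var req addItemReq
		if err := json.Unmarshal(payload, &req); err != nil {
			return nil, err
		}
		n, err := impl.AddItem(req.User, req.Item)
		if err != nil {
			return nil, err
		}
		return json.Marshal(addItemResp{Count: n})
	}
}

// newCart stands in for the runtime's placement decision.
func newCart(colocated bool) Cart {
	impl := &localCart{items: map[string][]string{}}
	if colocated {
		return impl
	}
	return &remoteCart{send: serve(impl)}
}

// checkout is application code; it is identical under both placements.
func checkout(c Cart) (int, error) {
	if _, err := c.AddItem("alice", "mug"); err != nil {
		return 0, err
	}
	return c.AddItem("alice", "shirt")
}

func main() {
	for _, colocated := range []bool{true, false} {
		n, err := checkout(newCart(colocated))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("colocated=%v items=%d\n", colocated, n) // items=2 in both cases
	}
}
```

The point of the sketch is that checkout never changes. The hard questions in the rest of this review are about everything the sketch leaves out: partial failure, latency, and rollout.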

Good

There is code, and it is open source, so this is not just a speculation paper. A Go framework does exist, which has been under development for some time inside Google. Given Google's expertise in infrastructure and Go, I think this framework will be a big boon to the Google Cloud Platform (GCP), if it gets into production.

To evaluate the framework (let's call it ServiceWeaver, after its GitHub name, shall we?), they consider a popular web application: Online Boutique. They say that Online Boutique is "representative of the kinds of microservice applications developers write". It consists of about 10K lines of code, implemented as ...(wait for it)... 11 microservices!

For evaluation, Table 2 shows the number of CPU cores used and the end-to-end latencies. At 10K queries per second, the Go implementation with 11 microservices uses 78 cores, but the monolithic implementation (deployed with their ServiceWeaver runtime) uses 28 cores. These numbers are without colocation of any components. The savings become more impressive if you put all 11 components into a single OS process: the number of cores drops to 9 and the median latency drops to 0.38 ms, both roughly an order of magnitude lower than the baseline.
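For the core counts, the arithmetic is easy to check. The sketch below uses only the three numbers quoted above; the helper name reduction is mine.

```go
package main

import "fmt"

// reduction reports how many times smaller `after` is than `before`.
func reduction(before, after float64) float64 {
	return before / after
}

func main() {
	// Core counts quoted from Table 2 for 10K queries per second.
	const microservices, monolithSplit, monolithColocated = 78.0, 28.0, 9.0

	fmt.Printf("separate processes: %.1fx fewer cores\n", reduction(microservices, monolithSplit))     // 2.8x
	fmt.Printf("single OS process:  %.1fx fewer cores\n", reduction(microservices, monolithColocated)) // 8.7x
}
```

So on cores, "an order of magnitude" means about 8.7x, and most of that comes from colocation rather than from the framework alone.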

Ok, let's step back for a minute. Did a 10K LOC application need to be implemented as 11 microservices? Did it have to be distributed in the first place? If you start with a distributed-to-a-fault baseline, it is easy to show impressive improvements.

Let's remember Frank McSherry's holy war against unnecessarily distributed analytics services. I had reviewed the "Scalability, but at what COST?" paper here. Frank had shown that "some single threaded implementations [on Frank's laptop] were found to be more than an order of magnitude faster than published results (at SOSP/OSDI!) for systems using 100s of cores"! If you start with "poor baselines and low expectations", it is easy to show impressive improvements.

Let's get back to the evaluation section of the paper. They say that most of the performance benefits of the monolithic implementation come from getting rid of versioning and field numbers. Wow! Dropping versioning is only safe if every component runs the same version of the code, which means the whole application has to be deployed atomically. How do you do atomic monolithic deployments? The answer is blue-green deployments! But such one-shot deployments would be particularly hard to coordinate across AZs, let alone across regions. And finally, how do you deal with versions in the database, and with schema changes, when doing these deployments?
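To see what "getting rid of versioning and field numbers" buys and what it costs, here is a toy comparison in Go. It is not ServiceWeaver's wire format and not Protocol Buffers; the Order type and both encodings are invented for this sketch. The tagged encoding prefixes each value with a field number so a reader can skip fields it does not know. The positional encoding writes values only, so both sides must agree on the exact layout.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Order is a hypothetical message with three integer fields.
type Order struct{ ID, Quantity, PriceCents uint32 }

func appendUint32(buf []byte, v uint32) []byte {
	var tmp [4]byte
	binary.LittleEndian.PutUint32(tmp[:], v)
	return append(buf, tmp[:]...)
}

// encodeTagged writes (field number, value) pairs: 5 bytes per field.
func encodeTagged(o Order) []byte {
	var buf []byte
	for i, v := range []uint32{o.ID, o.Quantity, o.PriceCents} {
		buf = append(buf, byte(i+1)) // one-byte field number
		buf = appendUint32(buf, v)
	}
	return buf
}

// decodeTagged ignores field numbers it does not recognize,
// which is what lets old and new code versions talk to each other.
func decodeTagged(b []byte) (Order, error) {
	var o Order
	for len(b) >= 5 {
		v := binary.LittleEndian.Uint32(b[1:5])
		switch b[0] {
		case 1:
			o.ID = v
		case 2:
			o.Quantity = v
		case 3:
			o.PriceCents = v
		}
		b = b[5:]
	}
	if len(b) != 0 {
		return Order{}, errors.New("truncated tagged message")
	}
	return o, nil
}

// encodePositional writes values only: 4 bytes per field, no metadata.
func encodePositional(o Order) []byte {
	var buf []byte
	for _, v := range []uint32{o.ID, o.Quantity, o.PriceCents} {
		buf = appendUint32(buf, v)
	}
	return buf
}

// decodePositional accepts only the exact layout it was compiled with.
func decodePositional(b []byte) (Order, error) {
	if len(b) != 12 {
		return Order{}, fmt.Errorf("layout mismatch: got %d bytes, want 12", len(b))
	}
	return Order{
		ID:         binary.LittleEndian.Uint32(b[0:4]),
		Quantity:   binary.LittleEndian.Uint32(b[4:8]),
		PriceCents: binary.LittleEndian.Uint32(b[8:12]),
	}, nil
}

func main() {
	o := Order{ID: 7, Quantity: 2, PriceCents: 1999}
	tagged, positional := encodeTagged(o), encodePositional(o)
	fmt.Printf("tagged: %d bytes, positional: %d bytes\n", len(tagged), len(positional)) // 15 vs 12

	// Simulate a newer sender that added a fourth field.
	newerTagged := appendUint32(append(tagged, 4), 42)
	newerPositional := appendUint32(positional, 42)

	_, errTagged := decodeTagged(newerTagged)
	_, errPositional := decodePositional(newerPositional)
	fmt.Println("old reader, tagged:", errTagged)         // <nil>
	fmt.Println("old reader, positional:", errPositional) // layout mismatch
}
```

The positional form is smaller and simpler to decode, but an old reader breaks as soon as a new sender adds a field. That is why the scheme leans so heavily on atomic rollouts.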

To conclude listing the "good" parts, I want to mention the challenges discussion in the introduction. There are remedies to these (such as knowing what you are doing), but there is no denying that these are real challenges.

I think the biggest challenge with microservices is the complexity of integration. When you start building with microservices, integration becomes challenging: the longer you delay the integration, the bigger the pain.

Bad

The claims are not scoped well. I think this framework is good for many web/frontend applications. But the paper has a general claim. For crying out loud, the paper starts with this sentence: "When writing a distributed application, conventional wisdom says to split your application into separate services that can be rolled out independently."

After that sentence, I read the entire paper with distributed applications/systems in mind. The paper doubles down on this claim in the last sentence of the introduction. "Though these challenges and our proposal are discussed in the context of serving applications, we believe that our observations and solutions are broadly useful."

But as I kept reading, I realized that this would not apply to general distributed services, and specifically backend systems. I realized that this is more applicable for a limited domain, like platform as a service (PaaS) applications, such as a web application, with limited freedoms. As I mentioned above, this would be a great boon for web-services built on GCP, for example. And that looks like the end game here.

At the end of the paper, in Section 8.3, there is a very short paragraph talking about distributed systems. After having said so many things about how the ServiceWeaver framework/runtime distributes things and takes care of distributed systems concerns, this paragraph comes across as confusing. Too little, too late?

Ugly

By engaging in the microservices versus monolith architecture discussion, the paper pokes the beehive without answering the real questions. What do I mean by real questions? Can this alternative approach address the problems that microservices have turned into non-problems, to the point that we forgot they were ever problems?

"Tradition is a set of solutions for which we have forgotten the problems. Throw away the solution and you get the problem back. Sometimes the problem has mutated or disappeared. Often it is still there as strong as it ever was." -- Donald Kingsbury

This harkens back to the famous Chesterton's fence principle, which cautions against dismissing established systems without comprehending their original purpose, and which says that second-order effects should be considered.

"There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”" -- G. K. Chesterton

One big benefit of microservices is that they serve as a technical patch to a social problem: organizing development across two-pizza teams. You assign each team of 5-10 a microservice, and this reduces the communication overhead. Microservices are not without their challenges. Integration is a challenge, but at least the approach gets you going.

When developing the work as a monolith, where do you start coding, how do you grow the code? Scaling needs to be thought out. Every 10X in size requires a different design. How would growing a monolith for scaling work? Does it start with an unscalable system first? But then what is the design path to make it scalable?

Sure, you can do separation-of-work and reduction of coordination with the monolith approach if you know what you are doing. But if you know what you are doing, you would avoid many problems with microservices as well.

The paper neither cites nor addresses the classic "A Note on Distributed Computing" by Jim Waldo et al. from 1994. That paper has a section titled "Déjà Vu All Over Again". In 1994! That was before DCOM, when CORBA was only a few years old. Looks like enough time has passed and the pendulum is swinging back yet again.

Figure 1 sounds great on paper. ServiceWeaver can put components all together for efficiency. But what about blast-radius, bursty/coordinated traffic, and metastable failures?

"Did Google invent AGI? These questions are AGI complete." If I had been reading this paper 5 years ago, I would not have hesitated to write that as a counter-argument. With recent advancements in ML, maybe it is time to reconsider this smart middleware approach. I don't know. Still, this is a tall order. How do you automate design, especially distributed systems design?

Comments

Anonymous said…

I think your article misses the point of the paper. It's not about monolith vs microservices; it's a new paradigm to combine the benefits of both. You can certainly organize teams to work with this paradigm without having a lot of communication overhead. It's similar to modular monoliths, but with a more flexible deployment topology. You are right about potential issues with blast-radius, bursty/coordinated traffic, and metastable failures, but that is why ServiceWeaver allows users to choose which deployment style they want to use for each component, hence reducing the downside.

November 22, 2023 at 2:39 PM