Sunday, January 14, 2018

The Lambda and the Kappa Architectures

This article, by Jimmy Lin, looks at the Lambda and Kappa architectures, and through them considers a larger question: Can one size fit all?

The answer, it concludes, is it depends on what year you ask! The pendulum swings between the apex of one tool to rule them all, and the other apex of multiple tools for maximum efficiency. Each apex has its drawbacks: One tool leaves efficiency on the table, multiple tools spawns integration problems.

In the RDBMS world, we already saw this play out. One size RDBMS fitted all, until it couldn't anymore. Stonebraker declared "one size does not fit all", and we have seen a split to dedicated OLTP and OLAP databases connected by extract-transform-load (ETL) pipelines. But these last couple years we are seeing a lot of one size fits all "Hybrid Transactional/Analytical Processing (HTAP)" solutions being introduced again.

Lambda and Kappa

OK, back to telling the story from the Lambda and Kappa architectures perspective. What are the Lambda and Kappa architectures anyway?

Lambda, from Nathan Marz, is the multitool solution. There is a batch computing layer, and on top there is a fast serving layer. The batch layer provides the "stale" truth, in contrast, the realtime results are fast, but approximate and transient. In Twitter's case, the batch layer was the MapReduce framework, and Storm was the serving layer on top. This enabled fast response at the serving layer, but introduced an integration hell. Lambda meant everything must be written twice: once for the batch platform and again for the real-time platform.The two platforms need to be indefinitely maintained in parallel and kept in sync with respect to how each interact with other components and integrates features.

Kappa, from Jay Kreps, is the "one tool fits all" solution. The Kafka log streaming platform considers everything as a stream. Batch processing is simply streaming through historic data. Table is merely the cache of the latest value of each key in the log and the log is a record of each update to the table. Kafka streams adds the table abstraction as a first-class citizen, implemented as compacted topics. (This is of course already familiar/known to database people as incremental view maintenance.)

Kappa gives you a "one tool fits all" solution, but the drawback is it can't be as efficient as a batch solution, because it is general and needs to prioritize low-latency response to individual events than to high-throughput response to batch of events.

What about Spark and Apache Beam?

Spark considers everything as batch. Then, the online stream processing is considered as microbatch processing. So Spark is still a one tool solution. I had written earlier about Mike Franklin's talk which compared Spark and Kappa architecture.

Apache Beam provides abstractions/APIs for big data processing. It is an implementation of Google Dataflow framework, as explained in the Millwheel paper. It differentiates between event time and processing time, and uses a watermark to capture the relation between the two. Using the watermark it provides information about the completeness of observed data with respect to event times, such as 99% complete in 5 minute mark. The late arriving messages trigger a makeup procedure to amend previous results. This is of course close to the Kappa solution, because it treats everything, even batch, as stream.

I would say Naiad, TensorFlow, timely dataflow, differential dataflow are "one tool fits all" solutions, using similar dataflow concepts as in Apache Beam.

Can you have your cake and eat it too? and other MAD questions.

Here are the important claims in the paper:

  • Right now, integration is a bigger pain point, so the pendulum is now on the one-tool solution side.
  • Later, when efficiency becomes a bigger pain point, the pendulum will swing back to the multi-tool solution, again.
  • The pendulum will keep swinging back and forth because there cannot be a best of both worlds solution.

1) The paper emphasizes that there is no free lunch. But why not?
I think the argument is that a one tool solution cannot be as efficient as a batch solution, because it needs to prioritize low-latency response to individual events rather than prioritizing high-throughput response to batch of events.

Why can't the one tool solution be more refined and made more efficient? Why can't we have finely-tunable/configurable tools? Not the crude hammer, but the nanotechnology transformable tool such as the ones in the Diamond Age book by Neal Stephenson?

If we had highly parallel I/O and computation flow, would that help achieve a best of both worlds solution?

2) The paper mentions using API abstractions as a compromise solution, but quickly cautions that this will also not be able to achieve best of both worlds, because abstractions leak.

Summingbird at Twitter is an example of an API based solution: Reduced expressiveness (DAG computations) is traded of for achieving simplicity (no need to maintain separate batch and realtime implementations). Summingbird is a domain specific language (DSL) that allows queries to be automatically translated into MapReduce jobs and Storm topologies.

3) Are there analogous pendulums for other problems?
The other day, I posted a summary of Google's TFX (TensorFlow Extended) platform. It is a one tool to fit all solutions approach, like most ML approaches today. I think the reason is because integration and ease-of-development is the biggest pain point these days. The efficiency for training is addressed by having parallel training in the backend, and training is already accepted to be a batch solution. When integration/development problems are alleviated, and we start seeing very low-latency training demand for machine learning workloads, we may expect to see the pendulum to swing to multitool/specialization solutions in the space.

Another example of the pendulum thinking is in the decentralized versus centralized coordination problem. My take on this is that centralized coordination is simple and efficient, so it has a strong attraction. You go with a decentralized coordination solution only if you have a big pain point with the centralized solution, such as geographic separation induced latency. But even then hierarchical solutions or federations can get you close to best of both worlds.

The presentation of the paper

I respect Jimmy Lin's take on the subject because he has been in the trenches in his Twitter times, and he is also an academic and can evaluate intrinsic strength of the ideas abstracted away from the technologies. And I really enjoyed reading the paper in this format. This is a "big data bite" article, so it is written in a relaxed format, and manages to teach a lot in 6 pages.

However, I was worried when I read the first two paragraphs, as it gave some bad signals.  The first paragraph referred to the one tool solution as a "hammer", which is associated with crude and rough. The next paragraph said: "My high level message is simple: there is no free lunch." That is a very safe position, it may even be vacuous. And I was concerned that Jimmy Lin is refraining to take any positions. Well, it turns out, this was indeed his final take all things considered, and he took some strong positions in the article. His first rant (yes, really, he has a sidebar called "Rant") about Lambda architecture has some strong words.

No comments: