Paper summary. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows

Monitoring is very important for distributed systems, and I wish it would receive more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks are seeing more action, we have been seeing more work on the latter category. For example in ML there have been work on how to figure out what is the best configuration to run. And in the context of general big data processing framework there has been work on identifying performance bottlenecks.

Collecting information and creating statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "making sense of performance" paper (2015) showed that this is not as simple as it seems, and sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks.

This paper (by ETH Zurich and due to appear in NSDI 18) extends the performance bottleneck analysis to more general frameworks and to be supported by an online monitoring tool called SnailTrail. SnailTrail is written in Rust, over the Timely Dataflow framework. It supports monitoring of applications written in Flink, Spark, Timely Dataflow, Tensorflow, and Heron.

SnailTrail overview

The SnailTrail tool operates in 5 stages:
  • it ingests the streaming logs from the monitored distributed application,
  • slices those streams into windows, 
  • constructs a program activity graph (PAG) for the windows, 
  • computes the critical path analysis of the windows, and 
  • outputs the summaries. 

Program activity graph (PAG) is a directed acyclic graph. The vertices denote the start and end of activities, such as: data processing, scheduling, buffer management, serialization, waiting, application data exchange, control messages, or unknown activities. The edges has a type and a weight for the activities. The edges capture the happened-before relationships between the vertices. Figure 2.a shows an example. The figure also shows how to project this PAG to an interval, which is pretty straightforward and as you expect.

As output, SnailTrail provides summaries for operation, worker, and communication bottlenecks. These summaries help identify which machine is overloaded, and which operators should be re-scaled. The figure below shows examples.

Critical path analysis

Critical path analysis (CPA) is a technique originally introduced in the context of project planning. CPA produces a metric which captures the importance of each activity in the transient critical paths of the computation.
The CPA score calculation in SnailTrail centers on the Transient Path Centrality concept which corresponds to the number of paths this activity appears on. In the formula e[w] corresponds to the weight of the edge e, which is taken as the start/end time duration of the activity.


In Figure 3, the activity (u,v) has the highest CPA score because it is involved in all of the 9 paths that appear in this window.

A noteworthy thing in the PAG in Figure 3 is the wait period (denoted as dashed lines) after b in worker 1. Because worker 1 is blocked with a wait after b, that path terminates at b, and does not feed into another path. The paper explains this as follows: "Waiting in our model is always a consequence of other, concurrent, activities, and so is a key element of critical path analysis: a worker does not produce anything useful while waiting, and so waiting activities can never be on the critical path." Because the two wait states terminates in deadend some of the paths, in this interval depicted in Figure 3, there are only 9 possible paths.

The formula above enables us to calculate the CPA scores without enumerating and materializing all the transient critical paths of the computation in each window. But how do you calculate N (which, for Figure 3 is calculated as 9) without enumerating the critical paths? It is simple really. The PAG is a directed acyclic graph (DAG). Everytime there is a split in the path, you copy path_count to both edges. Everytime two edges join, you set the new path_count to be the sum of the path_counts in the two incoming edges. At the end of the interval, look at how many paths are exiting, that is N. Give this a try for Figure 3 yourself.

MAD questions

1) What could be some drawbacks/shortcomings of CPA? One reason CPA may not be representative is because one execution of the application may not be typical of all executions. For example in Fig 3, w3 may take long time in the next execution in c-d activity and that could become the bottleneck in another execution of the application. But it is possible to argue that since SnailTrail produces summaries, this may not be a big issue. Another reason CPA may not be representative is because the execution may be data dependent. And again it is possible to argue that this won't be a big issue if the application uses several data in processing, and things get averaged.

Another criticism of CPA could be that it does not compose. Try combining two adjacent windows; that would lead to changing the CPA scores for the activities. Some previously low scored  activities will jump up to be high, and some highly scored activities will go down. Because of this non-composition, it becomes important to decide what is the best window/interval size to calculate these CPA metrics for summarization purposes. On the other hand, it may be reasonable to expect these metrics to be non-composable, since these metrics (blocked time analysis and critical time analysis) are designed to be complex to capture inherent/deep critical activities in the computations.

Could there be another approach than the CPA scoring presented here to capture the importance of an execution activity in the paths of the computation?  Maybe something that uses the semantics of activity types and their relation to each other?


2) The SnailTrail method uses snapshots to determine the start and end of the windows that the CPA scoring algorithm works on. Does time synchronization need to be perfect for snailtrail snapshot and analysis to work? What are the requirements on the time synchronization for this to work? 

It turns out this currently requires perfect clock synchronization. The evaluation experiments are run in the same machine with 32 cores. Without perfect clock synchronization, the snapshots may not be consistent and that could break the path calculation and CPA scoring. Hybrid Logical Clocks can help deal with the uncertainty periods in NTP clocks and can make the method work. The paper addresses this issue in Appendix F: "In SnailTrail, we assume that the trend toward strong clock synchronization in datacenters means that clock skew is not, in practice, a significant problem for our analysis. If it were to become an issue, we would have to consider adding Lamport clocks and other mechanisms for detecting and correcting for clock skew."


3) The paper doesn't have an example of improvement/optimization based on summary analytics. Even when the summary shows the high scored CPA activities to improve upon, it will be tricky for a developer to go in and change the code to improve things, because there is no indication of how easy it would be to improve the performance on this high CPA score activities. 

One way to address this could be to extend the CPA idea as follows. After ranking the activities based on CPA scores, construct another ranking of activities based on ease of optimizing/improving (I don't know how to do this, but some heuristics and domain knowledge can help). Then start the improvements/optimizations from the easiest to optimize activities that have highest CPAs.
Another way to make the summary analytics more useful is to process it further to provide more easily actionable suggestions, such as add more RAM, more CPU, more network/IO bandwidth. Or the suggestions could also say that you can do with less of resources of one kind, which can helpf for saving money in cloud deployments. If you can run with similar performance but on cheaper configurations, you would like to take that option. Would this extension require adopting SnailTrail monitoring and CPA scoring more towards capturing  resource usage related metrics? 


4) Since CPA first appeared in the project/resource management domain, could there be other techniques there that can apply to performance monitoring of distributed execution?


5) I think SnailTrail breaks new ground in the "context" or "philosophy" of monitoring: it is not trace based, it is not snapshot based, but it is window/interval-based. In our work on Retroscope, we argued it is important to aggregate the paths for both spatially (across the distributed computation) and temporally (as an evolving picture of the system's performance). SnailTrail extends the snapshot view to window views.

Miscellaneous


Frank McSherry recently joined ETH Zurich's systems group and will be helping with the strymon project and the SnailTrail work, so I'll be looking forward to more monitoring work from there.  

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book