Retroscope: Retrospective cut-monitoring of distributed systems (part 3)

- December 22, 2017

This post continues the discussion on monitoring distributed systems with Retroscope. Here we focus on the cut monitoring approach Retroscope uses. (This post is jointly written with Aleksey Charapko and Ailidani Ailijiang.)

Retroscope is a monitoring system for exploring global/nonlocal state history of a distributed system. It differs from other monitoring tools due to the way it inspects the system state. While request tracers inspect the system by following the trace of a request (i.e. request r in the figure), Retroscope performs cut monitoring and examines the system at consistent global cuts, observing the state across many machines and requests. It moves along the system history and scans a progression of states one cut at a time, checking cut Ts1 and then Ts2 and so on.

Retroscope’s cut monitoring approach is complementary to the request tracing solutions, and brings a number of advantages. First, by exposing the nonlocal state, Retroscope enables users to examine nonlocal properties of distributed applications. Using Retroscope you can inspect state distributed across many machines and can reason about the execution of a complex distributed application through invariant checking. Furthermore, by sifting through many past nonlocal states, you can perform root-cause analysis and use the across-node context to diagnose race conditions, nonlocal state inconsistencies, and nonlocal invariant violations.

To illustrate some of these benefits, we use Retroscope and the Retroscope Query Language (RQL) to study the data staleness of replica nodes in a ZooKeeper cluster. Staleness is a nonlocal property that cannot be easily observed by other monitoring techniques. To our surprise, we found that even a normally operating cluster can exhibit large staleness. In one of our observations in AWS EC2, some ZooKeeper replicas were lagging by as much as 22 versions behind the rest of the cluster, as we discuss at the end of this post.

Feasibility of Cut Monitoring

OK, if cut monitoring is so useful, why was this not done before? The answer is that cut monitoring was not very feasible. A standard way to do cut monitoring is with vector clocks (VC), but VCs do not scale well for large systems due to their O(N) space complexity. Moreover, using VCs results in identifying an excessive number of concurrent cuts for a given point, many of which are false positives that do not occur in the actual system execution.

Retroscope employs hybrid logical clocks (HLC) and a scalable stream processing architecture to provide a feasible end-to-end solution for cut monitoring. The NTP-synchronized physical clock component of HLC shrinks the number of consistent cuts at a given point to only 1. (It may be argued that this reduces the theoretical coverage compared to VC, but this is a good tradeoff to make to improve performance and avoid the false positives resulting from VC.) Using HLC also allows us to construct consistent cuts without the need to coordinate across nodes. Finally, the HLC size is constant, and this reduces the communication overhead. We talked about these advantages in Part 1.
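To make the constant-size timestamp concrete, here is a minimal Python sketch of the standard HLC update rules. It is an illustration only, not Retroscope's code: the class name, method names, and the injectable `physical_clock` parameter are assumptions made for this example.

```python
import time
from typing import Callable, Tuple

# An HLC timestamp is a fixed-size pair: (l, c).
# l tracks the largest physical time seen so far, c breaks ties.
Timestamp = Tuple[int, int]


class HybridLogicalClock:
    """Illustrative HLC; names here are hypothetical, not Retroscope's API."""

    def __init__(self, physical_clock: Callable[[], int] = lambda: int(time.time() * 1000)):
        self._now = physical_clock  # e.g. an NTP-synchronized clock in ms
        self.l = 0
        self.c = 0

    def tick(self) -> Timestamp:
        """Advance the clock for a local event or a message send."""
        prev = self.l
        self.l = max(prev, self._now())
        # Physical time did not move past l: bump the counter instead.
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried by a received message."""
        m_l, m_c = remote
        prev = self.l
        self.l = max(prev, m_l, self._now())
        if self.l == prev and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

Timestamps compare as ordinary tuples, so a send always orders before the matching receive even when the receiver's physical clock lags. The timestamp stays two integers regardless of cluster size, in contrast to the O(N) entries of a vector clock.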

To achieve a scalable implementation of Retroscope, we leveraged Apache Ignite for stream processing, computation, and storage. We arranged the log ingestion to minimize data movement, improve data locality, and achieve maximal parallelism when searching. We covered these issues in Part 2.

In our prototype, Retroscope deployed on one quad-core server processed over 150,000 consistent cuts per second. Horizontal scalability is one of the strengths of Retroscope’s architecture. Adding more compute power allows Retroscope to redistribute the tasks evenly across all available servers and achieve a nearly perfect speedup (93% going from 4 to 8 servers).

OK, now back to the ZooKeeper case study to show the advantages of the cut monitoring approach.

The ZooKeeper Case Study

Users interact with Retroscope via the declarative Retroscope Query Language (RQL). The users only need to specify the nonlocal predicates to search for, and leave the rest for the system to figure out.

To illustrate Retroscope and RQL, we considered replica staleness monitoring in Apache ZooKeeper. In ZooKeeper, a client can read data from any single replica, and if the replica is not fully up-to-date, the client will read stale data. Staleness is a nonlocal property, because it is defined by considering the states of other replicas at that same point in time. Using a simple RQL query, we can find the cuts that violate normal (less than 2 versions) staleness behavior of a cluster:
SELECT r1 FROM zklog
WHEN Max(r1) - Min(r1) > 1;
In this query, r1 is the version of a node’s state. The system retrospectively looks at past application states and searches for the ones that satisfy this staleness predicate.
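The predicate itself is simple once a cut is in hand. As a rough sketch of what it checks, assume each cut is available as a timestamp plus a mapping from node name to that node's r1 version at the cut. This representation, the function name, and the sample values are assumptions for illustration; they are not how Retroscope stores or evaluates cuts.

```python
from typing import Dict, List, Tuple

# Hypothetical cut representation: (cut timestamp, {node name: r1 version}).
Cut = Tuple[int, Dict[str, int]]


def stale_cuts(cuts: List[Cut], threshold: int = 1) -> List[Cut]:
    """Return the cuts where Max(r1) - Min(r1) exceeds the threshold."""
    result = []
    for ts, versions in cuts:
        if not versions:
            continue  # no node state recorded at this cut
        staleness = max(versions.values()) - min(versions.values())
        if staleness > threshold:
            result.append((ts, versions))
    return result


# Made-up sample history: three cuts over a three-node cluster.
history = [
    (1, {"n1": 5, "n2": 5, "n3": 4}),    # staleness 1: normal
    (2, {"n1": 7, "n2": 5, "n3": 5}),    # staleness 2: violation
    (3, {"n1": 30, "n2": 8, "n3": 29}),  # staleness 22: violation
]
violations = stale_cuts(history)
```

The point of the sketch is that the check needs every node's version at the same cut, which is why a per-request tracer cannot evaluate it.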

We observed many cuts having the staleness problem, with a few larger spikes (up to 22 versions stale!) that captured our attention. To investigate the causes of these excessive staleness cases, we need to inspect the message exchange in the system at those points. Here is the query we use for that:
SELECT r1, sentCount, recvCount, diff, staleness
FROM zklog
COMPUTE
GLOBAL diff
AND GLOBAL staleness
AND (staleness := Max(r1) - Min(r1))
AND (diff := NodeSum(sentCount) - NodeSum(recvCount))
AT TIME t1 TO t2

In this query we included another nonlocal property: the number of messages in transit between nodes. The query scans through past cuts around the time of the staleness we observed earlier. This allows us to visualize both the staleness and the number of messages in transit between nodes in the cluster. We see that the staleness spikes at the same time as the number of “in-flight” messages increases.

The number of messages “stuck” in the network still tells us only a little about the communication patterns in the cluster. To gain more insight into the message exchanges, we look at the in-flight messages more rigorously and examine the sets of sent and received messages at each node with this query:
SELECT sentM, recvM, inFlight, r1, staleness
FROM zklog
COMPUTE
GLOBAL staleness
AND (staleness := Max(r1) - Min(r1))
AND GLOBAL inFlight
AND (inFlight := Flatten(sentM) \ Flatten(recvM))
AT TIME x TO y

We run this query with a custom query processor that visualizes the results as a “heat-map” of message exchange. Here is an example of how messages were flowing in the system right before and at the peak of the staleness event. A deeper blue color represents a greater number of messages in the network between nodes. We see more messages in flight in both directions between node #3 (the leader) and node #4, suggesting that the staleness is caused by messages being stuck in transit between these nodes for longer than usual. This indicates the possibility of a momentary millibottleneck in the network between node #3 and node #4.
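The in-flight computation and the per-link counts behind such a heat-map can be sketched in a few lines of Python. This is only an approximation of the idea under stated assumptions: messages are modeled as hypothetical (msg_id, src, dst) tuples, and the function names are invented for this example rather than taken from Retroscope or its query processor.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical message representation: (msg_id, src node, dst node).
Message = Tuple[int, int, int]


def in_flight(sent_by_node: Dict[int, Set[Message]],
              recv_by_node: Dict[int, Set[Message]]) -> Set[Message]:
    """Messages sent by some node but not yet received at the cut.

    Mirrors Flatten(sentM) \\ Flatten(recvM): flatten the per-node sets,
    then take the set difference.
    """
    sent = set().union(*sent_by_node.values())
    received = set().union(*recv_by_node.values())
    return sent - received


def link_counts(messages: Iterable[Message]) -> Counter:
    """Count in-flight messages per directed (src, dst) link.

    These counts are the values a heat-map cell would be colored by.
    """
    return Counter((src, dst) for _, src, dst in messages)


# Made-up sample cut: node 3 has two unreceived messages to node 4,
# and node 4 has one unreceived message to node 3.
sent = {3: {(1, 3, 4), (2, 3, 4), (3, 3, 4), (4, 3, 5)}, 4: {(5, 4, 3)}}
recv = {3: set(), 4: {(1, 3, 4)}, 5: {(4, 3, 5)}}
pending = in_flight(sent, recv)
heat = link_counts(pending)
```

Running this over a sequence of cuts would give one matrix of link counts per cut, which is the data a heat-map like the one described above needs.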