Saturday, March 29, 2014

How to write your research paper

The legend of Köroğlu

I will start with a story of oppression and an uprising, involving a mythical horse in the process. Way to start a post about writing research papers, huh?

In 16th century Anatolia, there was a corrupt and oppressor mayor in the Bolu state. The mayor one day decided that he should find and gift the best horse in the world to the Sultan. He contacted a very skilled horse breeder. The breeder said the horse that is deserving of the Sultan should be very special, and said none of his horses is worthy of this. He went on a quest for this horse himself. One day he saw some street kids abusing a feeble and awkward-looking foal. He immediately recognized the potential in this foal and bought the foal, and headed for the mayor's palace. The mayor got outraged, being the ignorant oppressor he is, he thought the breeder is mocking him by offering this weak awkward foal. The mayor immediately ordered the breeder to be blinded.

The breeder had a young son, who became devastated by his father's situation. His father was now blind ("kör" in Turkish), and the son later got nicknamed "Köroğlu", the blind's son. The breeder, instead of worrying about his eyes was more worried about the foal, and instructed his son to build a pitch-black stable for the foal. He then instructed his son to constantly tend to the foal and fatten the foal as much as possible. For many months, the foal was made to stay in this pitch-black cave to eat and fatten up. The breeder did not start any training at all. Many months later, the breeder instructed Köroğlu to get this fat horse out and started a strict training regimen for the horse. The fat quickly turned to muscle, and the horse got very lean in a short time.

The legend is that the horse got so fast that it would run over a mud field and would not get any mud on its feet. Köroğlu used this horse to get his father's revenge from the mayor and became a Robin Hood like figure. Here is a link to the 1968 Turkish movie made for commemorating this legend.

Back to writing!

And I claim that a legend about a horse and an outlaw gives great lessons about writing your paper?! I must be nuts!

Give your idea a chance to grow and thrive 

All excellent ideas/papers/design start in a feeble fragile form. Very much like that foal. Don't judge too soon, otherwise you will misjudge. Instead if you can glimpse a sliver of potential, give your idea a chance to grow.

(With experience, you will be able to tell which feeble ideas are promising, which are not. You will get better at it, like the old breeder.)

In this initial phase (the cave phase), don't listen to any critics. Keep this feeble idea close to your chest. You would need to guard it even from your own criticisms early on. Suppress the criticisms for a while. Feed the idea to see what it can become.

Here Jony Ive talks about Steve Jobs' approach to creative work:
"And just as Steve loved ideas, and loved making stuff, he treated the process of creativity with a rare and a wonderful reverence. You see, I think he better than anyone understood that while ideas ultimately can be so powerful, they begin as fragile, barely formed thoughts, so easily missed, so easily compromised, so easily just squished."
Good thing the reviewers don't get to see the first drafts of any idea/paper, otherwise nothing would get published in the conferences or journals.

Fatten it up

In the cave phase, you need to greedily feed and build up your manuscript.

And this is how you do it: Start writing as soon as possible, and write while you do the work. That means, you keep a lab notebook. This doesn't need to be physical notebook. Open a directory for your new research idea, and create a notes.txt file to act as your lab notebook. In this lab notebook, you will be able to explore each sub idea and produce in bulk without any pressure of good/presentable writing. Since nobody is going to see this writing, you won't have restraints and you can go fast. You should come up with new tangential ideas, and explore all potential directions the original idea can grow. See my post about free writing for more information.

(I use the file as my lab notebook. Org-mode in Emacs is great to outline a project, keep track of the progress of each sub-idea, and manage and review ToDo items for the project.)

So feed it, build it up. Fatten it up. At the end of this you will have a fat mess in your hand. Don't feel ashamed about it. Instead, feel proud.

(Warning: If you have to keep twisting and spinning the same idea too many times just to squeeze out a small contribution, that is bad. There should be potential in the idea. Don't try to resuscitate an idea, if it refuses to grow despite your nourishing.)

Train hard: turn fat into muscles! 

This is the coming out of the cave phase. After finding your purpose and voice, you should now try to present it coherently and clearly. Now, you should be ruthless about getting your paper back in shape. Cut ruthlessly to make it lean. Cut the fluffy parts, the unnecessary tangents, and even the parts that can give the wrong vibe and that may lead an unsuspecting reader to a dead-end. Make it succinct and as simple as possible.

Editing is much much easier than starting with nothing and having to write from scratch, especially when the conference deadline is looming. If you haven't tried this approach to writing a paper before, you will be surprised how much easier it is to edit a fluffy mess into a coherent draft than writing from scratch. I have witnessed many times how quick a 20 pages of mess can be edited to form a 10 page good looking draft.

Read the Elements of Style to learn more about how to edit and produce a coherent presentable manuscript.


Don't take horse breeding advice from me, I haven't bred/trained any horses in my life. But you can take the writing advice. I use it every time I write, including this post.

Other related posts

Here are some of my related/nontechnical posts.
How I write 
How I read a research paper
My Advice To My Graduate Students
One Pomodoro, two pomodoro, three pomodoro, four
Black sheep 
Energy approach to life, the universe, and everything
Antifragility from an engineering perspective 

Monday, March 10, 2014

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

This paper is from Google. This is a refreshingly honest and humble paper. The paper is not pretending to be sophisticated and it doesn't have the "we have it all, we know it all" attitude. The paper presents the Dapper tool which is trying to solve a real problem, and it honestly represents how this simple straightforward solution fares and where it can be improved. This is the attitude of genuine researchers and seekers of truth.

It is sad to see that this paper did not get published in any conferences and is still listed as a Google Technical Report since April 2010. What was the problem? Not enough novelty? Not enough graphs?

Use case: Performance monitoring tail at scale

Dapper is Google's production distributed systems tracing infrastructure. The primary application for Dapper is performance monitoring to identify the sources of latency tails at scale. A front-end service may distribute a web query to many hundreds of query servers. An engineer looking only at the overall latency may know there is a problem, but may not be able to guess which of the dozens/hundreds of services is at fault, nor why it is behaving poorly. (See Jeff Dean and Barraso paper for learning more about the latency tails at scale).

It seems like performance monitoring was not the intended/primary use case for Dapper from the start though. Section 1.1 says this: The value of Dapper as a platform for development of performance analysis tools, as much as a monitoring tool in itself, is one of a few unexpected outcomes we can identify in a retrospective assessment.

Design goals and overview

Dapper has three design goals:

  • Low overhead: the tracing system should have negligible performance impact on running services. 
  • Application-level transparency: programmers should not need to be aware of (write code for /instrument for) the tracing system. 
  • Scalability: Tracing and trace collection needs to handle the size of Google's services and clusters.

Application-level transparency was achieved by restricting Dapper's core tracing instrumentation to a small corpus of ubiquitous threading, control flow, and RPC library code. In Google environment, since all applications use the same threading model, control flow and RPC system, it was possible to restrict instrumentation to a small set of common libraries, and achieve a monitoring system that is effectively transparent to application developers.

Making the system scalable and reducing performance overhead was facilitated by the use of adaptive sampling. The team found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

Trace trees and spans

Dapper explicitly tags every record with a global identifier that links the reports for generated messages/calls back to the originating request. In a Dapper trace tree, the tree nodes are basic units of work and are referred to as spans. The edges indicate a casual relationship between a span and its parent span. Span start and end times are timestamped with physical clocks, likely NTP time (or TrueTime?).

Trace sampling and collection

The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for the high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured.

Dapper performs trace logging and collection out-of-band with the request tree itself. Thus it is unintrusive on performance, and not paired to the application strongly.

The trace collection is asynchronous, and the trace is finally laid out as a single Bigtable row, with each column corresponding to a span. Bigtable's support for sparse table layouts is useful here since individual traces can have an arbitrary number of spans. In BigTable, it seems that the columns correspond to the "span names" in Figure 3, i.e., the name of the method called. The median latency for trace data collection is less than 15 seconds. The 98th percentile latency is itself bimodal over time; approximately 75% of the time, 98th percentile collection latency is less than two minutes, but the other approximately 25% of the time it can grow to be many hours. The paper does not mention about the reason of this very long tail, but this may be due to the batching fashion that the Dapper collectors work.

Experiences and Applications of Dapper in Google

Dapper's daemon is part of Google's basic machine image and so Dapper is deployed across virtually all of Google's systems, and has allowed the vast majority of our largest workloads to be traced without need for any application-level modifications, and with no noticeable performance impact.

The paper lists the following Dapper use cases in Google:

  • Using Dapper during development (for the Google AdWords system)
  • Addressing long tail latency
  • Inferring service dependencies
  • Network usage of different services
  • Layered and shared storage services  (for user billing and accounting for Google App Engine)
  • Firefighting (trying to quickly-fix a distributed system in peril) with Dapper

Dapper is not intended to catch bugs in codes and track root causes of problems. It is useful for identifying which parts of a system is experiencing slowdowns.

Tuesday, March 4, 2014

Naiad: A timely dataflow system

What is in a name?

Naiad is from Microsoft Research. Dryad, a general purpose runtime for execution of data parallel applications, was also from Microsoft Research. An application written for Dryad is modeled as a directed acyclic graph (DAG) and Dryad is the "tree nymph" in Greek mythology. Naiad is a stream processing platform and Naiad is the "stream nymph" in Greek mythology.

Naiad is an opensource project that has been receiving a lot of attention recently. I expect we will hear more about Naiad, because it is very useful for low-latency real-time querying and high-throughput incremental-processing of streaming big data. What is not to like?

Naiad is useful especially in incremental processing of graphs. As has been observed before, MapReduce is inappropriate for graph processing because of the large number of iterations needed in graph applications. MapReduce is a functional language, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient. And for real-time applications batch processing delays of MapReduce becomes unacceptable.

Dataflow graph

The developer supplies a a logical graph to Naiad to describe the dataflow of computation. (Don't confuse this with the large scale input graph that Naiad computes on). The edges in this graph show dataflow. The vertices are stateful computation stages.

Figure 1 shows the overall architecture, with two main separate components: (1) incremental processing of incoming updates and (2) low-latency realtime querying of the application state. The query path is cleanly separated from the update path. Queries are done separately on a slightly stale version of the current application state. This way, the query path does not get blocked or incur delays due to the update processing. This also avoids complexity: If queries shared the same path with updates, the queries could be accessing partially-processed/incomplete/inconsistent states, and we would have to worry about how to prevent this.

With this architecture, Naiad is able to provide <100ms interactive query processing, <1s batch updates, and <1ms loop iterations.

(The separate query path is not a new idea. In big data processing, there is a batch layer that does occasional/periodic batch processing. This batch processing would output indexed state (new graph) and queries were performed over this output state in the serving layer in a read-only and quick manner.)

Loops in dataflow graph

The logical dataflow graph can have loops and even nested loops. (Note that, in contrast, a MapReduce computation dataflow does not have any loops, it is a chain of stages; at each stage you progress forward using output of previous stage and producing input for the next stage.)

The loop concept in the dataflow graph is very useful as it enables new applications that may not be possible to compute with MapReduce like frameworks (at least in a reasonably efficient manner). Loops are natural way of dealing with iterative graph processing as in PageRank and machine learning applications.

Naiad even allows nested loops. But, as useful as they are, loops complicate the job of a stream processing system significantly. How do you keep track of when data is purged out, and that data doesn't keep looping in the system? How do you keep differentiate between older data looping in the system versus new data that is just entering the loop? To deal with these issues the paper introduces an epoch based timestamp to label data that is being processed. These timestamps make the model suitable for tracking global progress in iterative algorithms in a local manner. The progress tracking looks like a deep topic, the Naiad paper refers the readers to a separate 2013 paper for the full explanation of the progress tracking algorithm.

The paper calls the resulting model, the timely dataflow model. In sum, the timely dataflow model supports:
+ structured loops allowing feedback in the dataflow,
+ stateful data flow vertices capable of consuming and producing records without global coordination, and
+ notifications for vertices once they have received all records for a given round of input or loop iteration.

Naiad runtime

The logical dataflow graph, which denotes the stages of computation and the dataflow between these stages, is mapped on the physical worker machines in many-to-1 fashion. Each worker may host several stages of the dataflow graph. The workers are data nodes. They keep a portion of the input data (usually a large-scale input graph, such as Twitter follower graph) in memory. So it makes sense to move computation (dataflow graph stages) to the data (the partitioned input graph).

Writing programs with Naiad

A vertex (of the logical dataflow graph) may invoke two system-provided methods:
this.SendBy(e : Edge, m : Message, t : Timestamp)
this.NotifyAt(t : Timestamp).

Each call to u.SendBy(e,m,t) results in a corresponding invocation of v.OnRecv(e,m,t), where e is an edge from u to v, and each call to  v.NotifyAt(t) results in a corresponding invocation of v.OnNotify(t).

The OnRecv method may send elements on the first output as soon as they are first observed, allowing for low latency, but to ensure correctness the vertex must use OnNotify to delay sending a final synopsis until all inputs have been observed. In other words, SendBy and OnRecv are more suitable for streaming, and NotifyAt and OnNotify are more suitable for batching.

As such, Naiad provides tunable consistency. The developer can use loose-consistency operation like OnReceive or a strong consistency operation (that requires waiting) like OnNotify.

A prototypical Naiad program is given in the paper as follows.

Evaluation results

The paper has extensive evaluation results. Naiad was deployed on up to 64 computers and scalability results are shown for throughput, global barrier latency, progress tracking and speedup. PageRank (on Twitter follower graph), logistic regression (as an example of batch iterative machine learning) and k-Exposure algorithms (for Twitter topics) are used as examples.


A feedback first: It would have been very useful if the paper used different words for edges/vertices in logical dataflow graph versus those in the input graph that workers compute and modify. This gets very confusing at places. (See it even became confusing as I wrote the above.)

The paper is 16 pages long, and packed with information. But several things remain unclear to me after reading.

How does Naiad do rate control? Within a loop at each epoch a larger neighborhood of a vertex may get affected/triggered (e.g., think of a PageRank like spreading application). How does this not cause an input avalanche? How does Naiad do rate control to send/initiate only as much as it can consume on the worker nodes?

It is not clear if we can implement tightly coordinated applications in Naiad. By tightly coordinated applications I mean applications that require multihop transactions on input graph, such as graph coloring and graph subcoloring.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...