Friday, July 4, 2014

Distributed is not necessarily more scalable than centralized

Centralized is not necessarily unscalable! 

Many people automatically associate centralized with unscalable, and distributed with scalable. And this is getting ridiculous.

In the Spring semester, in my seminar class, a PhD student was pitching me a project for distributed storage: syncing from phone to work/home computers and other phones. The pitch started with the sentence "Dropbox is unscalable, because it is centralized". I was flabbergasted, and I asked a couple of times "Really? Do you actually claim that Dropbox is unscalable?". The student persisted and kept repeating that "Dropbox has a bottleneck because it is a centralized storage solution, and the distributed solution doesn't have that bottleneck". I couldn't believe my ears.

Dropbox already proved it is scalable: it serves files for more than 200 million users, who store 1 billion files every 24 hours. That it has a centralized architecture hosted in the cloud doesn't make it unscalable. As far as I can see, there is no bottleneck caused by Dropbox having a more centralized architecture.

(For those who want to nitpick, I know Dropbox is not fully centralized; it uses AWS S3 for storage and Dropbox-company servers for metadata management. Also, it employs data parallelism in the backend for scalability, but, on the spectrum, it is closer to a centralized architecture than a fully decentralized one.)

Distributed is not necessarily scalable!

Some people when faced with a problem think, I know, I'll use distributed computing. Now they have N^2 problems. -- @jamesiry
Here is the second part. Distributing a system does not necessarily make it scalable. In fact, a fully decentralized architecture can sometimes be a disadvantage for scaling.

Consider Lamport's mutual exclusion (ME) algorithm presented in his seminal "Time, Clocks, and the Ordering of Events in a Distributed System" paper. This ME algorithm is fully decentralized and requires O(N) messages to be exchanged in response to one ME request. The Lamport ME algorithm employs broadcasts to keep all the nodes informed of all requests and to keep them in (more or less) the same state.

Now consider a centralized algorithm for ME: there is a centralized coordinator; the nodes send their requests to the coordinator, and the coordinator grants ME accordingly. (For the literalist: You can still have causal ordering in the centralized algorithm. Just use vector clocks (VC) when nodes communicate and include the VC in the request messages.) The centralized ME algorithm is more scalable: only a constant number of messages (a request and a grant, plus a release afterwards) is exchanged per ME request, independent of the number of nodes. It has less drama and it is easier to maintain and build on.
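To make the contrast concrete, here is a minimal sketch of such a centralized coordinator (the handler names and the send() transport are my own assumptions, just for illustration):

from collections import deque

class Coordinator:
    def __init__(self, send):
        self.send = send          # send(node, message); the transport is assumed
        self.holder = None        # node currently holding the mutual exclusion
        self.waiting = deque()    # FIFO queue of pending requesters

    def on_request(self, node):   # one request message arrives from node
        if self.holder is None:
            self.holder = node
            self.send(node, "GRANT")   # one grant message goes back
        else:
            self.waiting.append(node)

    def on_release(self, node):   # one release message when node exits the critical section
        self.holder = None
        if self.waiting:
            self.holder = self.waiting.popleft()
            self.send(self.holder, "GRANT")

In Lamport's algorithm, by contrast, each request is broadcast to and acknowledged by all N nodes.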

Single point of failure?

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
A common reflex argument against centralized solutions is that they constitute a single point of failure (SPOF). But if a distributed solution is not designed carefully, it will have multiple points of failure (MPOF). Which one would you rather have?

Let's reconsider the Lamport ME and the centralized ME algorithms. The distributed algorithm does not offer any fault-tolerance advantages: both algorithms are prone to getting stuck after a single crash failure.

In fact, we can argue that it is easier to design fault-tolerance into a centralized solution: You can employ Paxos to replicate the centralized server. In contrast, it is often much harder to design and add fault-tolerance to a distributed system. Since a distributed system is complex, it is more prone to corner cases that jeopardize fault-tolerance.

Conclusion

Distributed is not necessarily more scalable than centralized;
And centralized is not necessarily a scalability bottleneck.

As a distributed systems professor, I wouldn't imagine myself defending centralized solutions. But there it is.

To avoid potential misunderstandings, I am not saying fully distributed/decentralized solutions are bad and to be avoided. There are advantages to decentralization, like latency reduction. And some conditions necessitate decentralization, like geographic/political/corporate isolation. In the real world, what we get is a mix: centralized up to the point where that remains manageable and has reasonable cost, and distributed beyond that. This also depends very much on the application/task.

PS: Maybe we should do an XtraNormal animation movie about this "centralized unscalable and distributed scalable" mania. Any takers?

PS2: I thank @tedherman for improvements to the 1st draft.

PS3: "Optimistic Replication" is a great survey of more decentralized replication protocols, their advantages, and challenges.

Bonus Section: Paxos is a relatively centralized approach to distributed consensus

Consensus is usually not an all-hands-on-deck process; that would be hard to scale. Consider our democratic system: it is pretty centralized; we only elect leaders to rule for us.

In the same sense, you can think of Paxos as the more centralized approach to distributed consensus and state machine replication. In Paxos, the participants do not interact with all other participants to decide the order of requests to be accepted; instead, the leader dictates the order of requests and the participants just accept it. (A fully decentralized consensus algorithm would be like the synchronous rounds consensus algorithm, where in every round each participant communicates with all other participants so that they can converge on the same state.)
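As a back-of-the-envelope illustration (this ignores Paxos's prepare phase, acknowledgments, and failure handling, so it is a simplification, not the protocol itself), here is the difference in messages needed to order a single request:

def leader_based_messages(n):
    # The leader broadcasts the chosen order to the other n-1 participants: O(n).
    return n - 1

def all_to_all_messages(n):
    # In a synchronous round, every participant sends its value to every other: O(n^2).
    return n * (n - 1)

for n in (5, 50, 500):
    print(n, leader_based_messages(n), all_to_all_messages(n))
# 5 -> 4 vs 20;  50 -> 49 vs 2450;  500 -> 499 vs 249500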

Monday, June 30, 2014

Management

Managing your resources (energy, time, and students) is a nontechnical topic, but it is nevertheless essential for your success in academia. There isn't much talk or guidance on these topics in graduate school. You are expected to attain these skills on your own, or maybe acquire them by osmosis from professors and colleagues.

Here I will keep it short, and just post my summary slides of 3 great books I read on management.

The first one is the seven habits book. This book is about managing yourself as an effective person. I first read this book when I was around 18 and found it long, tedious, and boring. Reading it again at 38, I think the book has great advice.
Link to my summary slides on the seven habits book.

Getting Things Done (GTD) is the best book on time and project management with low stress. This summer, do yourself a favor: Read the book, and adopt the GTD system ASAP. You will thank me later.
Link to my summary slides on the GTD book.

I am not aware of much work on student/postdoc management, although this is very important for project management. The One Minute Manager provides a minimalist method for managing people. There is also some advice on the web (this scrum method is worthy of note), but if you can suggest well-researched, time-tested approaches for student management, I am eagerly waiting. This is an area I should learn more about.

Friday, June 27, 2014

Targeted crowdsourcing using app/interest categories of users

Part of my research is on crowdsourcing. Basically, crowdsourcing means performing micro-collaborations with many people to complete a task. You divide the task into microtasks and outsource them to people. They provide solutions to your microtasks, and you aggregate those to obtain the solutions to the microtasks, and ultimately to your task.

Aggregating the responses from the crowd is a challenge in itself. If the questions are asked as open-ended questions, the answers come in a variety of forms, and you cannot aggregate them automatically with a computer. (You may use human intelligence again to aggregate them, but how are you going to aggregate/validate these next-level aggregators?)

To simplify the aggregation process, we use multiple-choice question answering (MCQA). When the answers are provided as choices, a, b, c, or d, they become unambiguous and easier to aggregate with a computer. The simplest solution for aggregating MCQA is majority voting: whichever option is chosen most is returned as the ultimate answer.
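For concreteness, majority voting over the collected answers is just a few lines (a sketch; the input format is my assumption):

from collections import Counter

def majority_vote(answers):
    # answers: a list of choices such as ['a', 'c', 'a', 'b', 'a']
    counts = Counter(answers)
    choice, _ = counts.most_common(1)[0]
    return choice

majority_vote(['a', 'c', 'a', 'b', 'a'])   # -> 'a'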

Recently, we started investigating MCQA-based crowdsourcing in more depth. What are the dynamics of MCQA? Is majority voting good enough for all questions? If not, how can we do better?

To investigate these questions, we designed a gamified experiment. We developed an Android app to let the crowd answer questions with their smartphones as they watch the Who Wants To Be A Millionaire (WWTBAM) quiz show on a Turkish TV channel. When the show is on air in Turkey, our smartphone app signals the participants to pick up their phones. When a question is read by the show host, my PhD students type in the question and the answer choices, which are transmitted via Google Cloud Messaging (GCM) to the app users. The users play the game and enjoy competing with other app users, and we get a chance to collect precious data about MCQA dynamics in crowdsourcing.

Our WWTBAM app has been downloaded and installed more than 300,000 times and has enabled us to collect large-scale real data about MCQA dynamics. Over the period of 9 months, we have collected over 3 GB of MCQA data. In our dataset, there are about 2000 live quiz-show questions and more than 200,000 answers to those questions from the participants.

When we analyzed the data we collected, we found that majority voting is not enough for all questions. Although majority voting does well on the simple questions (the first 5) and achieves more than 90% accuracy, as the questions get harder, its accuracy plummets quickly to 40%. (There are 12 questions in WWTBAM. The question difficulty increases with each question. Questions 10, 11, and 12 are seldom reached by the quiz contestants.)

We then focused on how to improve the accuracy of aggregation. How can we weigh the options so that correct answers get more weight and can win even when they are in the minority?

As expected, we found that a participant's previous correct answers indicate a higher likelihood that their current answer is correct. By collaborating with colleagues in data mining, we came up with a PageRank-like solution for history-based aggregation. This solution was able to raise the accuracy of answers to 90%, even for the harder questions.
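The post doesn't spell out the PageRank-like algorithm, but the core idea of history-based aggregation can be sketched by weighting each vote by the participant's past accuracy (a simplified stand-in, not the actual method):

from collections import defaultdict

def history_weighted_vote(answers, accuracy):
    # answers:  {participant_id: choice}
    # accuracy: {participant_id: fraction of that participant's past answers that were correct}
    scores = defaultdict(float)
    for pid, choice in answers.items():
        scores[choice] += accuracy.get(pid, 0.5)   # unknown history gets a neutral weight
    return max(scores, key=scores.get)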

We also observed some unexpected findings in the data collected by our app. The app records the response time of the participants, and we saw that response time has some correlation with correct responses. But the relation is funny. For the easier questions (the first 5), earlier responses are more likely to be correct. But for the harder questions, delayed responses are more likely to be correct. We are still trying to figure out how to put this observation to good use.

Another surprising result came recently. One of my PhD students, Yavuz Selim Yilmaz, proposed a simple approach, which in the end proved as effective as the sophisticated history-based solution. This approach does not even use the history of participants, which makes it more widely applicable. Yavuz's approach was to use the interests of the participants to weigh their answers.

In order to obtain the interests of the participants, Yavuz had a very nice idea. He proposed to use the categories of the apps installed on each participant's phone. Surprised, I asked him how he planned to learn which other apps are installed on the participants' phones. It turns out this is one of the basic permissions Android gives to an installed app (like our WWTBAM app): it can query and learn about the other apps installed on the user's phone. (That it is this easy is telling about Android privacy and security. We didn't collect/maintain any identifying information on users, but this permission can potentially be used for harm.)

Yavuz assigned interest categories to participants using the Google Play Store's 32 predefined app categories (e.g., Books and Reference, Business, Comics, Communication, Education, Entertainment, Finance). If a participant had more than 5 apps installed from one of these categories, the participant was marked as having an interest in that category. We used half of the data as the training set to find which interest category produces the highest accuracy for a given question number. Then, on the test set, the algorithm simply takes the majority vote among the participants in the category deemed most successful for that question number. Is this too simplistic an approach?
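Here is a rough sketch of that aggregation (the function and field names are mine, for illustration only):

from collections import Counter

def mark_interests(installed_apps, app_category, threshold=5):
    # Mark a participant as interested in a Play Store category if they have
    # more than `threshold` apps installed from that category.
    counts = Counter(app_category[a] for a in installed_apps if a in app_category)
    return {cat for cat, n in counts.items() if n > threshold}

def category_vote(answers, interests, best_category):
    # Majority vote restricted to participants interested in the category that
    # the training set deemed most accurate for this question number.
    votes = [choice for pid, choice in answers.items()
             if best_category in interests.get(pid, set())]
    if not votes:                        # fall back to plain majority voting
        votes = list(answers.values())
    return Counter(votes).most_common(1)[0][0]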

Lo and behold, this approach lifted the accuracy to around 90% across all levels of questions. (This paper received the outstanding paper award at the Collaboration Technologies and Systems (CTS 2014) conference.)


Ultimately we want to adopt the MCQA-crowdsourcing lessons we learned from WWTBAM in order to build crowdsourcing apps in location-based recommendation services.

Another application area for MCQA-crowdsourcing would be market research. A lot of people in industry, consumer goods, music, and politics are interested in market research. But market research is difficult to get right, because you are trying to predict whether a product can get traction by asking a small subset of people about it, and that subset may not be very relevant or representative. The context and interests of the people surveyed are important in weighing their responses. (I hope this blog post will be used in the future to kill some stupid patents proposed on this topic ;-)

Monday, June 23, 2014

Writing versus Typing

Recently, there have been several high-profile articles on how writing with pens is much better for the brain than typing. One article presented a study that found that if you write rather than type, you will learn and recall more of the lecture.

For full disclosure, I am a fountain-pen fan and I enjoy the elegance and beauty of writing with a pen. I like writing so much that I have been thinking about converting to a tablet solution (MS Surface Pro 3).

But, as I weigh my options, I cannot get myself to go for a tablet solution or a dual (laptop+tablet) solution. Typing simply knocks the socks off writing when it comes to productivity.

Writing with a fountain-pen has many drawbacks. First of all, it is not digital. It cannot be easily stored and archived. It is not searchable, and so it is not easily accessible later. Most importantly, the writing produced by the fountain-pen is not easily editable. This forces you to be extra careful while writing and to self-censor, and that kills creativity.

Let me reiterate this point. The fundamental rule of constructing prose is that you keep the creating (drafting) and editing functions separate. Since editing is hard when using a fountain-pen, you get cautious and blend editing into creating/drafting. And that is not kosher.

Writing with a tablet also has many drawbacks. It is digital alright, but editing handwritten text is still very clumsy. Copying and moving your handwriting around is harder than simply wrangling text in a text editor. Transposing words, inserting a word in between, deleting a sentence, etc., are all hard. Moreover, the tablet is not refined enough yet to give the fountain-pen experience and simplicity. Even small inconveniences/bumps can make your experience unbearable and keep you away from writing. The tablet is a compromise between handwriting and typing. And instead of offering the best of both worlds, it tends to offer the worst of both worlds.

Typing does not suffer from these drawbacks. The only drawback to typing is that the writing looks too uniform. But you can avoid this by using a special font. I use the Apple Chalkboard font in my Emacs, and I like it a lot. The Apple Chalkboard font provides some visual differences between different parts of the text. Furthermore, Emacs makes editing text, searching, replacing, etc., so fast that I don't get bogged down when revising my writing.

In Emacs, Org-mode offers extra benefits for me. The outline mode is useful for brainstorming and organizing my thoughts and writing. Org-mode is also my GTD tool. I can easily track issues and ToDo lists inside my projects using Org-mode.

So, for me, the choice is clear: Using Emacs Org-mode on my Macbook Air. I am not even considering a dual (laptop+tablet) solution, because using two separate systems for writing inevitably leads to integration problems and complexity. However, I occasionally use my fountain-pen for brainstorming, which I enjoy a lot.

Sunday, June 15, 2014

Singularity

Singularity is a term coined to describe the merging of human and computer intelligence and, as a result, the rise of a meta-intelligence. Proponents of singularity try to posit it as the next step in human progression, where humans cease to exist and transcend into a hybrid computer/human race. The singularity idea has been portrayed in popular culture in several movies, the most popular of which are the Terminator and Matrix movies.

History of discussion on singularity

Vernor Vinge, science fiction writer, first wrote about the vision of technological singularity and coined the term in 1993. He wrote "Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended."

Ray Kurzweil, inventor and futurist, is a fervent proponent of technological singularity. Kurzweil puts the timeline of singularity at around 2040. In 2011, Kurzweil sponsored a movie/documentary on singularity, titled "Transcendent Man", which was screened in five major cities in the U.S., as well as London. In December 2012, Kurzweil was hired by Google as a director of engineering to "work on new projects involving machine learning and language processing".

In 2000, Bill Joy, a very famous computer scientist (a primary figure behind the BSD operating system and the widely used Java programming language), joined this discussion. In his Wired Magazine article, "Why the future doesn't need us", Joy said he was convinced that growing advances in genetic engineering and nanotechnology would bring severe risks and catastrophes to humanity.

Arguments and counterarguments about the feasibility of singularity

Proponents of singularity often cite Moore's law in electronics to support their claim. Moore's law states, crudely, that the capacity of computer chips doubles every two years; that is, the speed and capability of computers grow at an exponential rate. Such exponential growth is a powerful enabler. Consider the series 1, 2, 4, 8, 16, 32,... The small increments in the beginning may be misleading about how fast this series grows. After about 20 doublings the series exceeds 1 million, and after about 266 doublings it reaches 10^80, which is more than the number of atoms in the observable universe.

The argument the proponents of singularity make is that, thanks to this exponential growth, the processing power of computers will reach such high levels in the next couple of decades that it will become possible to simulate the human brain in high fidelity: the working of each neuron in the brain will be simulated in real time. At that point, the computer will essentially have the equivalent of human intelligence. In the succeeding years, with further increases in capacity, computer intelligence will be several folds ahead of human intelligence.

Opponents of the feasibility of singularity point out that exponential growth is hard to sustain. Exponential growth is seen at the beginning of a series, but then, due to limitations and adversities, the series levels off and stays constant. An example is a population of rabbits: initially the increase is exponential, but then, due to scarcity of food and due to predators, the population stabilizes around a constant. Similarly, it is argued that the exponential progress in computer processing speeds will hit a brick wall. At the chip level, physical issues such as heating will make exponential speedup unsustainable. At the multicore or cluster level, latency, consistency, and scalability issues will prevent exponential growth.

Kurzweil's argument is a bit more involved than simple exponential growth, however. Underlying all of Kurzweil's ideas regarding the progress of technology and the singularity is the Law of Accelerating Returns. The law states that technological progress occurs exponentially instead of linearly, meaning that each new advancement enables several further advancements instead of just one, and accordingly, every year more useful inventions and discoveries are made than were made in the last. The first-generation artificial intelligence (AI) approaches failed, but simulating the brain may work if we learn the workings of the brain in excruciating detail.

On the other hand, the opponents like to point out that the workings of the brain as a whole are still a big mystery. We have information about the rough mechanism by which a neuron works: an excited neuron can transmit a signal to a neighboring neuron through its synapses. But there is no clear explanation of how thought arises from this process. Brain-scanning techniques are improving as they are based on computers, but the brain may throw more complexity surprises at us as we learn more about it. The brain may owe much of its power to its organic material and to very low-level analog physical interactions. These physical phenomena could be close to impossible to model/simulate in a digital environment. Henry Markram, lead researcher of the "Blue Brain Project" for simulating a mammalian brain at the molecular level, has stated that "it is not [their] goal to build an intelligent neural network". "[That would] be very difficult because, in the brain, every molecule is a powerful computer and we would need to simulate the structure and function of trillions upon trillions of molecules as well as all the rules that govern how they interact. You would literally need computers that are trillions of times bigger and faster than anything existing today."

Another relevant question is whether we can build the parallel processing architectures to support the kind of parallel processing that goes on in the brain. The brain uses far more parallelism than exists in most classical computing designs. Moreover, even if the computer simulates the human brain successfully, what makes the proponents think that human brain computation scales to two folds, ten folds, or 10^10 folds? Human brain computation may be inherently unscalable. Also, if the computer models the human brain, human emotions are also modeled. Would the resulting computer be stable? As it scales, would it go existential/suicidal, or become an arrogant killer?

Aftermath of singularity

Several questions are raised about the aftermath of singularity. Can a downloaded personality replace the spirit? How does this amount to living forever? One of singularity's promises is similar to claiming that you can live forever by cloning yourself: one copy dies but another copy survives. But it is clear that the copies are different entities. I think that is cheating; that is not true immortality. If we take singularity's approach to immortality a little further, we can argue that humans can already achieve immortality through their work/art. As Woody Allen said: "I don't want to achieve immortality through my work. I want to achieve it by not dying."

Saturday, March 29, 2014

How to write your research paper

The legend of Köroğlu

I will start with a story of oppression and an uprising, involving a mythical horse in the process. Way to start a post about writing research papers, huh?

In 16th-century Anatolia, there was a corrupt and oppressive mayor in the Bolu province. One day the mayor decided that he should find and gift the best horse in the world to the Sultan. He contacted a very skilled horse breeder. The breeder said that a horse deserving of the Sultan should be very special, and that none of his horses was worthy. He went on a quest for this horse himself. One day he saw some street kids abusing a feeble and awkward-looking foal. He immediately recognized the potential in this foal, bought it, and headed for the mayor's palace. The mayor was outraged; being the ignorant oppressor he was, he thought the breeder was mocking him by offering this weak, awkward foal. The mayor immediately ordered the breeder to be blinded.

The breeder had a young son, who was devastated by his father's situation. His father was now blind ("kör" in Turkish), and the son later got nicknamed "Köroğlu", the blind man's son. The breeder, instead of worrying about his eyes, was more worried about the foal, and instructed his son to build a pitch-black stable for it. He then instructed his son to constantly tend to the foal and fatten it as much as possible. For many months, the foal was kept in this pitch-black stable to eat and fatten up. The breeder did not start any training at all. Many months later, the breeder instructed Köroğlu to take this fat horse out, and started a strict training regimen for it. The fat quickly turned to muscle, and the horse got very lean in a short time.

The legend is that the horse got so fast that it could run over a mud field without getting any mud on its feet. Köroğlu used this horse to take his father's revenge on the mayor and became a Robin Hood-like figure. Here is a link to the 1968 Turkish movie made to commemorate this legend.

Back to writing!

And I claim that a legend about a horse and an outlaw gives great lessons about writing your paper?! I must be nuts!

Give your idea a chance to grow and thrive 

All excellent ideas/papers/designs start in a feeble, fragile form. Very much like that foal. Don't judge too soon, or you will misjudge. Instead, if you can glimpse a sliver of potential, give your idea a chance to grow.

(With experience, you will be able to tell which feeble ideas are promising, which are not. You will get better at it, like the old breeder.)

In this initial phase (the cave phase), don't listen to any critics. Keep this feeble idea close to your chest. You would need to guard it even from your own criticisms early on. Suppress the criticisms for a while. Feed the idea to see what it can become.

Here Jony Ive talks about Steve Jobs' approach to creative work:
"And just as Steve loved ideas, and loved making stuff, he treated the process of creativity with a rare and a wonderful reverence. You see, I think he better than anyone understood that while ideas ultimately can be so powerful, they begin as fragile, barely formed thoughts, so easily missed, so easily compromised, so easily just squished."
It is a good thing the reviewers don't get to see the first drafts of any idea/paper; otherwise nothing would get published in conferences or journals.

Fatten it up

In the cave phase, you need to greedily feed and build up your manuscript.

And this is how you do it: Start writing as soon as possible, and write while you do the work. That means you keep a lab notebook. This doesn't need to be a physical notebook. Open a directory for your new research idea, and create a notes.txt file to act as your lab notebook. In this lab notebook, you will be able to explore each sub-idea and produce in bulk, without any pressure to produce good/presentable writing. Since nobody is going to see this writing, you won't have restraints and you can go fast. You should come up with new tangential ideas and explore all the potential directions in which the original idea can grow. See my post about free writing for more information.

(I use the notes.org file as my lab notebook. Org-mode in Emacs is great to outline a project, keep track of the progress of each sub-idea, and manage and review ToDo items for the project.)

So feed it, build it up. Fatten it up. At the end of this you will have a fat mess in your hand. Don't feel ashamed about it. Instead, feel proud.

(Warning: If you have to keep twisting and spinning the same idea too many times just to squeeze out a small contribution, that is bad. There should be potential in the idea. Don't try to resuscitate an idea if it refuses to grow despite your nourishing.)

Train hard: turn fat into muscles! 

This is the coming out of the cave phase. After finding your purpose and voice, you should now try to present it coherently and clearly. Now, you should be ruthless about getting your paper back in shape. Cut ruthlessly to make it lean. Cut the fluffy parts, the unnecessary tangents, and even the parts that can give the wrong vibe and that may lead an unsuspecting reader to a dead-end. Make it succinct and as simple as possible.

Editing is much, much easier than starting with nothing and having to write from scratch, especially when the conference deadline is looming. If you haven't tried this approach to writing a paper before, you will be surprised how much easier it is to edit a fluffy mess into a coherent draft than to write from scratch. I have witnessed many times how quickly 20 pages of mess can be edited into a good-looking 10-page draft.

Read The Elements of Style to learn more about how to edit and produce a coherent, presentable manuscript.

Conclusion

Don't take horse-breeding advice from me; I haven't bred or trained any horses in my life. But you can take the writing advice. I use it every time I write, including for this post.

Other related posts

Here are some of my related/nontechnical posts.
How I write 
How I read a research paper
My Advice To My Graduate Students
One Pomodoro, two pomodoro, three pomodoro, four
Black sheep 
Energy approach to life, the universe, and everything
Antifragility from an engineering perspective 

Monday, March 10, 2014

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

This paper is from Google. This is a refreshingly honest and humble paper. It does not pretend to be sophisticated and it doesn't have the "we have it all, we know it all" attitude. The paper presents Dapper, a tool that tries to solve a real problem, and it honestly reports how this simple, straightforward solution fares and where it can be improved. This is the attitude of genuine researchers and seekers of truth.

It is sad to see that this paper did not get published in any conference and has been listed as a Google Technical Report since April 2010. What was the problem? Not enough novelty? Not enough graphs?

Use case: Performance monitoring tail at scale

Dapper is Google's production distributed systems tracing infrastructure. The primary application for Dapper is performance monitoring to identify the sources of latency tails at scale. A front-end service may distribute a web query to many hundreds of query servers. An engineer looking only at the overall latency may know there is a problem, but may not be able to guess which of the dozens/hundreds of services is at fault, nor why it is behaving poorly. (See the Dean and Barroso paper to learn more about latency tails at scale.)

It seems that performance monitoring was not the intended/primary use case for Dapper from the start, though. Section 1.1 says this: "The value of Dapper as a platform for development of performance analysis tools, as much as a monitoring tool in itself, is one of a few unexpected outcomes we can identify in a retrospective assessment."

Design goals and overview

Dapper has three design goals:

  • Low overhead: the tracing system should have negligible performance impact on running services. 
  • Application-level transparency: programmers should not need to be aware of (i.e., write code or instrument for) the tracing system.
  • Scalability: Tracing and trace collection needs to handle the size of Google's services and clusters.

Application-level transparency was achieved by restricting Dapper's core tracing instrumentation to a small corpus of ubiquitous threading, control flow, and RPC library code. In the Google environment, since all applications use the same threading model, control flow, and RPC system, it was possible to restrict instrumentation to a small set of common libraries and achieve a monitoring system that is effectively transparent to application developers.

Making the system scalable and reducing performance overhead was facilitated by the use of adaptive sampling. The team found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

Trace trees and spans

Dapper explicitly tags every record with a global identifier that links the reports for generated messages/calls back to the originating request. In a Dapper trace tree, the tree nodes are basic units of work and are referred to as spans. The edges indicate a causal relationship between a span and its parent span. Span start and end times are timestamped with physical clocks, likely NTP time (or TrueTime?).
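Roughly, a span record might carry fields like the following (the field names here are illustrative, not Dapper's actual schema):

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Span:
    trace_id: int                      # shared by every span in one trace tree
    span_id: int
    parent_span_id: Optional[int]      # None for the root span of the trace
    name: str                          # human-readable span name, e.g. the RPC method
    start_ms: float                    # physical-clock timestamps
    end_ms: float
    annotations: List[Tuple[float, str]] = field(default_factory=list)  # optional app notes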

Trace sampling and collection

The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for Google's high-throughput online services, since the vast majority of events of interest were still very likely to appear often enough to be captured.

Dapper performs trace logging and collection out-of-band with the request tree itself. Thus it is unintrusive on performance, and not paired to the application strongly.

The trace collection is asynchronous, and the trace is finally laid out as a single Bigtable row, with each column corresponding to a span. Bigtable's support for sparse table layouts is useful here, since individual traces can have an arbitrary number of spans. In Bigtable, it seems that the columns correspond to the "span names" in Figure 3, i.e., the names of the methods called. The median latency for trace data collection is less than 15 seconds. The 98th percentile latency is itself bimodal over time: approximately 75% of the time, the 98th percentile collection latency is less than two minutes, but the other approximately 25% of the time it can grow to many hours. The paper does not explain the reason for this very long tail, but it may be due to the batching fashion in which the Dapper collectors work.

Experiences and Applications of Dapper in Google

Dapper's daemon is part of Google's basic machine image, so Dapper is deployed across virtually all of Google's systems. It has allowed the vast majority of Google's largest workloads to be traced without any application-level modifications, and with no noticeable performance impact.

The paper lists the following Dapper use cases in Google:

  • Using Dapper during development (for the Google AdWords system)
  • Addressing long tail latency
  • Inferring service dependencies
  • Network usage of different services
  • Layered and shared storage services  (for user billing and accounting for Google App Engine)
  • Firefighting (trying to quickly fix a distributed system in peril) with Dapper

Dapper is not intended to catch bugs in code or track down the root causes of problems. It is useful for identifying which parts of a system are experiencing slowdowns.

Tuesday, March 4, 2014

Naiad: A timely dataflow system

What is in a name?

Naiad is from Microsoft Research. Dryad, a general purpose runtime for execution of data parallel applications, was also from Microsoft Research. An application written for Dryad is modeled as a directed acyclic graph (DAG) and Dryad is the "tree nymph" in Greek mythology. Naiad is a stream processing platform and Naiad is the "stream nymph" in Greek mythology.

Naiad is an open-source project that has been receiving a lot of attention recently. I expect we will hear more about Naiad, because it is very useful for low-latency real-time querying and high-throughput incremental processing of streaming big data. What is not to like?

Naiad is especially useful for incremental processing of graphs. As has been observed before, MapReduce is inappropriate for graph processing because of the large number of iterations needed in graph applications. MapReduce has a functional programming model, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient. And for real-time applications, the batch processing delays of MapReduce become unacceptable.

Dataflow graph

The developer supplies a logical graph to Naiad to describe the dataflow of the computation. (Don't confuse this with the large-scale input graph that Naiad computes on.) The edges in this graph show dataflow. The vertices are stateful computation stages.

Figure 1 shows the overall architecture, with two main separate components: (1) incremental processing of incoming updates and (2) low-latency realtime querying of the application state. The query path is cleanly separated from the update path. Queries are done separately on a slightly stale version of the current application state. This way, the query path does not get blocked or incur delays due to the update processing. This also avoids complexity: If queries shared the same path with updates, the queries could be accessing partially-processed/incomplete/inconsistent states, and we would have to worry about how to prevent this.

With this architecture, Naiad is able to provide <100ms interactive query processing, <1s batch updates, and <1ms loop iterations.

(The separate query path is not a new idea. In big data processing, there is a batch layer that does occasional/periodic batch processing. This batch processing would output indexed state (new graph) and queries were performed over this output state in the serving layer in a read-only and quick manner.)

Loops in dataflow graph

The logical dataflow graph can have loops and even nested loops. (Note that, in contrast, a MapReduce computation dataflow does not have any loops; it is a chain of stages, where each stage progresses forward using the output of the previous stage and producing the input for the next stage.)

The loop concept in the dataflow graph is very useful, as it enables new applications that may not be possible to compute with MapReduce-like frameworks (at least not in a reasonably efficient manner). Loops are a natural way of dealing with iterative graph processing, as in PageRank and machine learning applications.

Naiad even allows nested loops. But, as useful as they are, loops complicate the job of a stream processing system significantly. How do you keep track of when data should be purged out, so that it doesn't keep looping in the system forever? How do you differentiate between older data looping in the system and new data that is just entering the loop? To deal with these issues, the paper introduces epoch-based timestamps to label the data being processed. These timestamps make the model suitable for tracking global progress in iterative algorithms in a local manner. Progress tracking looks like a deep topic; the Naiad paper refers readers to a separate 2013 paper for the full explanation of the progress tracking algorithm.
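My rough reading of these timestamps (a sketch of the structure the paper describes, not Naiad's actual code) is an input epoch plus one counter per nested loop:

# A timestamp is (epoch, (loop_counter_1, ..., loop_counter_k)).

def ingress(t):
    # Entering a loop adds a new innermost loop counter initialized to 0.
    epoch, counters = t
    return (epoch, counters + (0,))

def feedback(t):
    # Going around the loop's feedback edge increments the innermost counter,
    # which distinguishes older circulating data from data just entering the loop.
    epoch, counters = t
    return (epoch, counters[:-1] + (counters[-1] + 1,))

def egress(t):
    # Leaving the loop strips the innermost counter.
    epoch, counters = t
    return (epoch, counters[:-1])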

The paper calls the resulting model the timely dataflow model. In sum, the timely dataflow model supports:
+ structured loops allowing feedback in the dataflow,
+ stateful dataflow vertices capable of consuming and producing records without global coordination, and
+ notifications for vertices once they have received all records for a given round of input or loop iteration.

Naiad runtime


The logical dataflow graph, which denotes the stages of computation and the dataflow between these stages, is mapped onto the physical worker machines in a many-to-one fashion. Each worker may host several stages of the dataflow graph. The workers are data nodes: they keep a portion of the input data (usually a large-scale input graph, such as the Twitter follower graph) in memory. So it makes sense to move the computation (the dataflow graph stages) to the data (the partitioned input graph).

Writing programs with Naiad

A vertex (of the logical dataflow graph) may invoke two system-provided methods:
this.SendBy(e : Edge, m : Message, t : Timestamp)
this.NotifyAt(t : Timestamp).

Each call to u.SendBy(e,m,t) results in a corresponding invocation of v.OnRecv(e,m,t), where e is an edge from u to v, and each call to  v.NotifyAt(t) results in a corresponding invocation of v.OnNotify(t).

The OnRecv method may send elements on the first output as soon as they are first observed, allowing for low latency, but to ensure correctness the vertex must use OnNotify to delay sending a final synopsis until all inputs have been observed. In other words, SendBy and OnRecv are more suitable for streaming, and NotifyAt and OnNotify are more suitable for batching.
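To make this concrete, here is a rough Python rendition of that pattern (the paper's API is C#; send_by and notify_at are stand-ins for the runtime-provided SendBy and NotifyAt):

from collections import defaultdict

class CountingVertex:
    def __init__(self, output_edge, send_by, notify_at):
        self.output = output_edge
        self.send_by = send_by                 # stand-in for this.SendBy(e, m, t)
        self.notify_at = notify_at             # stand-in for this.NotifyAt(t)
        self.counts = defaultdict(lambda: defaultdict(int))   # timestamp -> key -> count

    def on_recv(self, edge, message, timestamp):
        # Streaming path: react to records immediately for low latency.
        for key in message:                    # message is assumed to be a batch of keys
            if self.counts[timestamp][key] == 0:
                self.send_by(self.output, ("first-seen", key), timestamp)
            self.counts[timestamp][key] += 1
        self.notify_at(timestamp)              # ask to be told when this round completes

    def on_notify(self, timestamp):
        # Batching path: all inputs for this timestamp have arrived; emit the final synopsis.
        final = dict(self.counts.pop(timestamp, {}))
        self.send_by(self.output, ("final-counts", final), timestamp)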

As such, Naiad provides tunable consistency. The developer can use a loose-consistency operation like OnRecv, or a strong-consistency operation (that requires waiting) like OnNotify.

A prototypical Naiad program is given in the paper.


Evaluation results

The paper has extensive evaluation results. Naiad was deployed on up to 64 computers, and scalability results are shown for throughput, global barrier latency, progress tracking, and speedup. PageRank (on the Twitter follower graph), logistic regression (as an example of batch iterative machine learning), and the k-Exposure algorithm (for Twitter topics) are used as examples.

Discussion

One piece of feedback first: It would have been very useful if the paper had used different words for the edges/vertices of the logical dataflow graph versus those of the input graph that the workers compute on and modify. This gets very confusing in places. (See, it even became confusing as I wrote the above.)

The paper is 16 pages long, and packed with information. But several things remain unclear to me after reading.

How does Naiad do rate control? Within a loop, at each epoch a larger neighborhood of a vertex may get affected/triggered (e.g., think of a PageRank-like spreading application). How does this not cause an input avalanche? How does Naiad do rate control to send/initiate only as much as the worker nodes can consume?

It is not clear whether we can implement tightly coordinated applications in Naiad. By tightly coordinated applications, I mean applications that require multihop transactions on the input graph, such as graph coloring and graph subcoloring.

Wednesday, February 26, 2014

Energy approach to life, the universe, and everything

I recently read Scott Adams's new book, titled "How to Fail at Almost Everything and Still Win Big". I enjoyed reading the book a lot; Scott discusses a lot of intriguing ideas, as usual. The main idea in the book is that you need systems instead of goals. Goals are for losers, because motivation is ephemeral. In contrast, systems/processes are persistent and durable (by definition). As Bezos says, "Good intentions don't work, mechanisms work!"

Another main idea in the book is the importance of monitoring and managing your energy in order to be successful. The book advises you to "watch what you eat, exercise, match mental states to activity and attack tasks when you have the appropriate energy level".

I think this energy approach to personal productivity and life is promising. So I wanted to collect my thoughts on this and see what I can add on this topic.

Things that increase energy and decrease energy

It isn't the mountains ahead to climb that wears you down.
It is the pebble in your shoe.
– Muhammad Ali
This quote from Muhammad Ali is extremely insightful and nicely summarizes everything that needs to be said on this topic. Here are my observations on what energizes me and what drains my energy.

| Increasing | Decreasing |
| believing | doubting |
| starting | procrastinating |
| planning | worrying |
| curiosity | boredom |
| making/writing | consuming/reading |
| small victories | big victories |
| meditation | stress |
| exercising | sitting |
| tea/healthy food | eating a lot |
| happy mood | bad mood |
| waking up early | waking up late |
| reading a good book | browsing junk/news |

Conserving energy

To keep your energy level high, you need to find tricks to conserve energy. Instilling useful "habits" is a great trick for conserving energy. When you make something a habit, you don't need to waste your energy on remembering to do it, and more importantly, on finding the willpower to do it. Habits make inertia work for you. The key to instilling habits is to start with baby steps. See Tiny Habits by Dr. Fogg to learn more.

Keeping things simple in your life helps conserve energy. Being an organized person (Getting Things Done) avoids ongoing chaos in your life and conserves energy. I use Emacs Org-mode to track all my tasks so I can clear my mind. Even by cleaning/organizing your room or office you can notice a boost in your energy.

On this tangent, being a principled/virtuous/religious person can help a lot with success. Once you make your decision and commit to it, the temptations you need to fight become fewer; you can ignore a lot of distractions because they are out of bounds (designated as sin) for you. This is a very pragmatic (very Benjamin Franklin) way to approach the issue of character/virtue/religion, but there it is.

Growing your energy potential

Maintaining high energy levels is good, but you know what is better: growing your energy potential.

The antifragility book tells us that organic things grow by exploring and occasionally pushing/straining their limits. Related to this topic is the "growth mindset" concept, which has been put forth by the Stanford psychologist Carol Dweck (I highly recommend her book). The growth mindset describes an antifragile approach to personality and life. Growth-mindset people like challenges and failures, as these make them learn, improve, and grow.

Following this line of thought, in order to grow your energy potential, you need to strain it and totally deplete it from time to time. Occasionally, you should take on more than you can handle, exercise vigorously and push your physical limits, pull some all-nighters working on a project, and fail at some of your projects. Pushing your limits and failing is good for you. It will make you grow. If nothing else, it will teach you some humility and throw you out of the fixed mindset you may have acquired (e.g., I know it all, I have it all, I am naturally successful). You need some occasional resets to be able to experience the beginner's mind.

Focusing/intensifying your energy

It is also important to learn how to control and focus your energy to get things done. I doubt there is an easy or universal way to achieve this.

I think intensifying your emotions helps focus your energy. Being curious and asking questions focuses your energy, because you will really want to get answers to your unresolved questions. Being very determined and motivated (you need to find ways to motivate yourself) will also help in focusing your energy. Finally, even anger helps. I occasionally argue and pick fights with papers/books in order to focus my energy on understanding them better.

Wednesday, February 12, 2014

Consistency-Based Service Level Agreements for Cloud Storage

This paper is from Microsoft Research and appeared in SOSP'13. It is more of a position and vision paper. The paper introduces a consistency-based SLA concept, which can provide a win-win arrangement for cloud providers and developers.

Problem 

For performing reads from key-value stores, you currently have two options. You can do strongly-consistent reads by increasing the size of your read replica quorum, but this will increase the latency of the responses, and you don't get the flexibility to revert to a quick, dirty (eventually-consistent) read if a strongly-consistent read would take a long time to respond. Or you go with best-effort reads (which are eventually-consistent) from the key-value store, because you insist on low-latency answers.

(Actually there is another option: You can use a strongly consistent multiversion data store like BigTable or Spanner, and relax it by reading slightly stale data and get flexibility. I will revisit this option in the discussion.)

| Rank | Consistency | Latency | Utility |
| 1    | strong      | 150 ms  | 1.0     |
| 2    | eventual    | 150 ms  | 0.5     |
| 3    | strong      | 500 ms  | 0.25    |

Enter the consistency-based SLA concept. The SLA acts as an interface between the application developer and the internals of the cloud. The developer provides a wishlist for their get (i.e., read) operations from the key-value store, as above. Here the developer says: "I want a reply in under 150 ms and prefer strongly consistent data but will accept any data; if no data can be obtained quickly then I am willing to wait up to 500 ms for up-to-date data." The cloud-provider backend is structured such that it keeps track of which of these reads is currently feasible for that location, and it satisfies the highest-ranked one it can, in order to give the best utility to the developer.

Using such an SLA makes good business sense. With this SLA the developers put their money where their mouth is: they agree to pay more for better utility provided to them. The cloud providers can use the SLAs to prioritize the read requests: they can give more priority to the consistency-requiring, higher-paying (higher-utility) customers.



To illustrate, Figure 3 shows some read latencies at a given point in time from given locations. The developer does not have access to all per-region or per-client latencies like this, but in the SLA she can state her ranked preferences for the latency and consistency of the reads that she thinks make the most sense for her application, and through this interface she gets dynamic tuning of her application's performance.

Pileus Architecture

To showcase the SLA, the authors developed a replicated key-value store called Pileus. Pileus is a type of cloud formation: a cap cloud. (Get it? A "CAP" cloud.) Pileus dynamically selects which servers to access in order to deliver the best service given the current configuration and system conditions.

Some storage nodes are designated as primary nodes, which hold the master data, while others are secondary nodes. All Puts (i.e., writes) in Pileus are performed and strictly ordered at a primary site. Secondary nodes eventually receive from the primary site all the updated objects along with their update timestamps. Since all Put operations are assigned increasing update timestamps at the primary site, and the asynchronous replication protocol transfers updated objects in timestamp order, at any point in time each secondary node has received a prefix of the overall sequence of Put operations.

When selecting the node to which a Get operation should be sent, the desired consistency guarantee, along with the previous object versions that have been read or written in the current session and the key being read, determines the minimum acceptable read timestamp. The minimum acceptable read timestamp indicates how far a secondary node can lag behind the primary and still answer the given Get operation with the desired consistency. This decision is made by the Pileus client library.
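Putting the SLA ranking and the minimum acceptable read timestamp together, the client library's node selection might look roughly like this (a sketch with assumed field names, not the actual Pileus API):

def pick_read_target(sla, replicas):
    # sla: list of (min_read_timestamp, latency_bound_ms, utility) tuples, ordered
    #      from highest to lowest rank; stronger consistency maps to a larger
    #      minimum acceptable read timestamp (the primary always qualifies).
    # replicas: monitored view of the nodes, e.g.
    #      [{"node": "us-west", "rtt_ms": 40, "high_timestamp": 1700}, ...]
    for min_ts, latency_bound, utility in sla:
        candidates = [r for r in replicas
                      if r["rtt_ms"] <= latency_bound and r["high_timestamp"] >= min_ts]
        if candidates:
            best = min(candidates, key=lambda r: r["rtt_ms"])   # prefer the closest qualifying node
            return best["node"], utility
    return None, 0.0   # no subSLA can currently be met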

This architecture, by forcing all writes to be performed at a single primary site, limits the problem space and simplifies ensuring consistency for reads across the consistency spectrum. But it also limits the performance of reads (except for eventual-consistency reads). Moreover, with this setup you don't get to specify latency bounds for writes.

Evaluation results show that consistency-based SLAs can indeed improve application-specific levels of service (i.e., utility).

Discussion

Q: How rich is the class of applications that benefit from this SLA?

A: I am still confused about this. It sounds like it can be applicable to a large class of applications, but sometimes I revert to thinking maybe the class is not that big.

For latency-favoring (eventual-consistency happy) applications there are existing solutions: DynamoDB and several other key-value stores. And the target applications here are those that tolerate relaxed consistency but, nevertheless, benefit from improved consistency. It may seem that these are already served to some extent by the eventually-consistent key-value stores. They are just best effort: you don't know what you get, but fresher, more consistent data improves service the same as in Pileus. Pileus gives you tuned performance, but maybe you could have gotten that performance by probabilistic means as well. (Peter Bailis has very nice work on probabilistically bounded staleness, which is a related approach here.)

For consistency-favoring applications, there are existing solutions like Bigtable and Spanner. And you can still do a quick, dirty read from Spanner by giving a slightly past read timestamp. This works because Spanner is a multiversion key-value store. But I guess you still need to manage when you would want to revert to the quick, dirty reads.


Q: How does Pileus change the application code?

A: We do learn from the API when we get back a consistent read and when we do not, but reacting to the type of read may pollute my program with a lot of branches and checks. Maybe programming-languages people have an answer to that. I guess this way is still better than monitoring latencies yourself and implementing this tuning in your application.