Tuesday, April 30, 2013

My Advice To My Graduate Students

My advice to you has 4 parts: Publish or perish, Work hard, Understand the philosophy of research, and Manage your time well. (A nice complement to this post is my advice to my undergraduate students.)


Be goal-oriented

Always know the criteria needed for the next step, and focus on addressing these criteria.

Your next step after PhD is, ideally, to get a faculty job. And this requires you to have both good quality and quantity of publications. Aim to have 10 publications, half of them at top conferences.

So publish or perish. 4.0 GPA is not your goal. Taking a lot of classes is not your goal.

Get your first paper out very fast

This will give you the confidence to write other papers. This will teach you that research is not an exotic thing that only professors do. This will shave off couple of years from your PhD. Your first paper does not need to be very good. Get it out fast.

Write well, write a lot, revise a lot

Writing clearly and communicating your work in an accessible manner to others is as important as doing the work. I will say it: if solving the problem takes 50% of your effort, the other 50% effort should be on writing it clearly.

Writing is nature's way of letting you know how sloppy your thinking is. Writing enables you to analyze your work, make sense of your work, see the big picture, and find several new ideas along the way.

You can be a very good writer by refining/rewriting several times and by following simple rules.

Keep a research notebook and write a lot

Since writing is very important, you should keep a research notebook/folder accessible to you at any time, and note down every idea/insight you get.

I cannot emphasize the importance of writing enough.

When you are working on a project, start writing as early as possible. This will help steer the research/work you are doing. Write down every failure, problem, and solution idea. Use these material to write a research note or technical report as you go along. It should be such that when the research is done, the writing (first draft at least) is also done.

Revisit your research ideas folder periodically to see if there is anything worth pursuing there. You will be pleasantly surprised, if you are doing it right.


Be proactive, take initiatives

Graduate school is very different than undergraduate years. It is unstructured. There is no spoon-feeding. So you should be proactive and take initiatives to get your research going. My job is to facilitate things and act as an advisor (hence my title as "Advisor"). Your job is to get things done and drive the work. You are the locomotive of research.

Don't get stuck, try different things. Progress is more important than perfection. I won't be upset if you are making mistakes; actually making mistakes means you are working, and I will be happy.

I will be upset if you are stuck and are not trying anything, but instead trying to come up with excuses.

Make mistakes

Being a perfectionist does not payoff. Actually, perfection comes from iteration. You start with a rough draft (which is by definition imperfect), and refine and refine.

Trying to be perfect and get it right the first time gets you stuck.

Rejection is part of the deal. It is nothing to worry about or be ashamed about. If you get rejected, you at least got good feedback. Revise your work, and submit to the next good conference.

Don't take pride of how smart you are

Being smart does not matter, working hard does. So don't take pride of how smart you are. Don't get your ego in the way, otherwise you won't be able to question everything. Stay hungry, stay foolish.

Pedagogues tell us that it is wrong to praise children as being smart. This leads to them being very defensive, and not attempting to solve hard problems: because if they fail, they will be exposed as being "not smart". The right way to praise children is for how hardworking, persistent they are. These children then won't be afraid to take on hard problems.
Nothing in this world can take the place of persistence. Talent will not; nothing is more common than unsuccessful people with talent. Genius will not; unrewarded genius is almost a proverb. Education will not; the world is full of educated derelicts. Persistence and determination alone are omnipotent.
Calvin Coolidge
Being smart is a gift to you, the important thing is what you make of it. Be humble. Your duty is to work hard; the results will follow.

Work persistently

Understand the effort versus payback graph. At the start, although you are putting a lot of effort, no commensurate visible output occurs. This is because you haven't accumulated enough critical mass yet to leverage upon. So initially it requires a lot of effort to get a paper out, say 12 months. But after you pass an activation threshold, you start to get accelerated returns. You have built "critical mass" and gained momentum. Now it requires less effort to get a paper out, say 3 months.

PhD is a marathon. Don't get burned out in the first couple of miles. Manage your expectations, and psychological stamina well. Form a reading/support group with your friends. Don't hesitate to seek advice and counseling.


Find the "right" research problem

Research is all about finding the right problem. Once you find the right problem, solving it is easy.

Finding the right problem is an art. Initially, I will supply the problems to you. Later on you should be able to find the right problems yourself. This will be possible through acquiring a "taste for problems" through reading a lot of good papers and writing research notes.

The right problem has two components: you and the problem. Here are my heuristics for guiding myself to find the right problem.
  1. work on what only you can do: leverage on your strengths (see the discussion on toolset below)
  2. work on what you love: you have to be passionate about your work
  3. work on use-inspired problems: this is a precondition to do useful work
  4. work on the most important problems in your domain: address the fundamental/inherent problems, not the accidental problems.
The right problem is not very hard ---in which case you would be stuck for years. The right problem is at the boundary of your limits. You strain to solve the problem, but you know it is solvable.

A very nice reference at this point is to read the "Your research and You" by Richard Hamming.

Use your toolset to attack domain problems

I will explain this with an example. My toolset is distributed systems theory/algorithms, formal methods, and fault-tolerance (self-stabilization) techniques. The toolset is important for solving problems efficiently, and in a manner only you can do.

The domain on the other hand is important for identifying useful problems to attack with your toolset. The domains I work on are sensor/actor networks, crowdsourced sensing/collaboration via smartphones, and distributed/networked systems (data center computing, these days).


Kill distraction 

Good work requires intense focus. A minor distraction will divert your focus, and you may need a long time to recover back to the same level of concentration.

So kill your browser. Use it in batch-mode only when you really need to. Don't have your email open in your browser, that may lead to checking email every 15-20 minutes killing your quality time. Check email only couple times a day.

Success = time * focus. Time Focused time is very precious. Go to the library work for an undistracted 2-3 hours if you want to get quality work done. This is absolutely critical.

Saturday, April 27, 2013

Fault Tolerance via Idempotence (paper summary)

This paper (by Ramalingam and Vaswani) proposes a generic/automatic way to use idempotence---which requires the system to tolerate duplicate requests--- for handling communication and process failures in a software system efficiently.

Using idempotence for dealing with process and communication failures has been investigated also in the "Idempotence is not a medical condition", but there no (generic) solution for achieving idempotence was provided.

Automating idempotence

For automatically ensuring idempotence for a system, the paper makes use of the monad idea, and designs & implements the idempotence monad.

The idea underlying the idempotence monad is simple: "Given a unique identifier associated with a computation, the monad adds logging and checking to each effectful step in the workflow to ensure idempotance". Armed with this idempotence monad, the paper shows that idempotence, when coupled with a simple retry mechanism, provides a solution to the process and communication failures.

Consider a bank example, where the parameter requestId serves to distinguish between different transfer requests and identify duplicate requests. This service can be made idempotent and failfree by (1) using this identifier to log debit and credit operations, and (2) modifying the debit and credit steps to check--using the log-- if the steps have already been performed. This strategy ensures that multiple (potentially partial and concurrent) invocations of transfer with the same identifier have the same effect as a single invocation.

But, manually ensuring idempotence is tedious, error-prone and makes implementation less comprehensible. So the paper describes a monad-based library that realizes idempotence and failure-freedom in a generic way. "A type with a monad structure defines what it means to chain operations, or nest functions of that type together. This allows the programmer to build pipelines that process data in steps, in which each action is decorated with additional processing rules provided by the monad. As such, monads have been described as programmable semicolons." This is also the aspect-oriented programming way.

The Idempotence Monad:

  • Associate every computation instance (that we wish to make idempotent) with a unique identifier.
  • Associate every step in an idempotent computation with a unique number.
  • Whenever an effectful step is executed, persistently record the fact that this step has executed and save the value produced by this step.
  • Every effectful step is modified to first check if this step has already been    executed. If it has, then the previously saved value (for this step) is used instead of executing the step again.

Decentralized Idempotence

The implementation of the idempotence monad is designed to work with key-value data stores, and does not assume a dedicated storage for logs that can be accessed atomically with each transaction. The implementation adopts the key-value datastore to simulate a distinct address space for logging.

This leads to a decentralized implementation of idempotence that does not require any centralized storage or any (distributed) coordination between different stores. Thus, this implementation of idempotance preserves the decentralized nature of the underlying computation.

Evaluation and critique

The idempotence monad has been implemented in C# and F# targeting Windows Azure, which provides a key-value store. The evaluations in the paper shows that the performance overheads of using the idempotence monad over hand-coded implementations of generic idempotence are not significant.

But, this is an incomplete evaluation: The compared baseline of hand-coded implementations is constrained to add logging and checking to the operation ---as it was the case for the monad. So, the only overheads in monad solution over hand-coded are that the compiler generated monad code "tend to capture a lot more state than hand-coded implementations, and may also add unnecessary logging to transactions that are already idempotent". It could be interesting to repeat the evaluations with a non-constrained developer that can implement custom solution to fault-tolerance leveraging more on the application logic and implementation information.

Another thing to compare in future evaluations could be to that of alternative generic fault-tolerance approaches: e.g., using Replicated State Machines (e.g., via Paxos, or primary/secondary replication), or using the CRDT approach (for replicating the node, when it is applicable).

Actually, taking a step back to re-examine the problem, we can notice that leveraging on TCP gets us through most of the way: We won't have message losses if we use TCP, and would get message losses only because of process crashes. So, we only would need a way to tolerate process crashes, say via replication, checkpointing, or a fast-restartable system.

Finally, the trends toward providing better plumbing at datacenters also alleviates the communication/process failures problems. (For example, Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable, hosted queue for storing messages as they travel between computers. By using Amazon SQS, developers can simply move data between distributed components of their applications that perform different tasks, without losing messages or requiring each component to be always available.)

Thursday, April 25, 2013

Attacking Generals and Buridan's Ass, or How did the chicken cross the road?

Our department moved into a brand-new beautiful building last year. My new office at this building has a nice side-view of the lake, and also a bird-view of the pedestrian crosswalk on the road in front of the building.

I spend a good fraction of my time in the office on my makeshift standing desk in front of the window. So I get to watch (from my peripheral vision) the pedestrians crossing the crosswalk. This makes for a surprisingly suspenseful and distracting pastime. It turns out that crossing the crosswalk is not a trivial task, but rather a complex process. Below I talk about the crosswalk scenarios I observed, and how these relate to some fundamental impossibility results in distributed systems.

The crosswalk

This crosswalk does not have any traffic lights as it is on a low-traffic/in-campus road. According to the crosswalk protocol, the pedestrian has the right-of-way on the crosswalk: the car needs to stop if the pedestrian shows "intent to cross" (which is unfortunately not well-defined). In the common case, there is no car approaching and the pedestrian crosses easily. In the ideal execution of the crosswalk protocol, the car sees the pedestrian approaching the crosswalk and slows down in advance, and this (the car's intention to give right-of-way to the pedestrian) becomes clear to the pedestrian upon which she starts crossing.

Now let's look at the non-ideal executions of the crosswalk protocol. Sometimes it is not clear to the pedestrian whether the car is slowing down or not, yet with some leap of faith the pedestrian starts on the crosswalk. (Probably this is the correct thing to do as a pedestrian: After observing that the car is in comfortable stopping distance from the crosswalk, just take that leap of faith and start walking.) Then, the car slows down giving indication that the driver saw the pedestrian. But, sometimes the car may not slow down or may not appear to slow down from the pedestrian's point of view. In this case, the pedestrian may run to complete the crossing, or stand still in the middle of crosswalk looking confused, waiting for the car to make the next move (either cross or slow down).

Another faulty case occurs when the pedestrian hesitates before entering the crosswalk. The car slows down, and the pedestrian motions to the car with his hand to pass. The car hesitates for a short moment (or takes some time to accelerate again), and then both the car and the pedestrian start moving at the same time. Then they both stop. Then the pedestrian insists on vigorously hand-waving the car to continue, and the car driver motions the pedestrian to continue. This is the equivalent of the awkward corridor dance that ensues when two people cannot negotiate which side they should pass on the corridor. However, this is more embarrassing and dangerous. This case occurs surprisingly frequent, and in one case involved an Engineering professor as the pedestrian :-)

So, as a pedestrian, never ever hand-wave the car to pass. You have the right-of-way as a pedestrian, and by signaling the car to pass you are just messing up the crosswalk protocol. In the distributed systems lingo, you are a faulty process---or even a Byzantine process--- as you deviate from the protocol. And if this happens to you as a driver, when a pedestrian motions you to continue, ignore it, follow the protocol: Just stop and let the pedestrian pass. Better violate the progress condition than violating the safety condition.

This is supposed to be a simple protocol; the pedestrian has the right-of-way in the crosswalk. But as I covered a bit above, a lot of things can go awry in this simple protocol. There are a lot of corner cases because this is a distributed system with pedestrian and driver agents and is subject to faults and non-ideal environmental conditions.

The distributed system

As a distributed systems person, I can't help but relate these difficulties to how hard consensus is in a distributed system. A very basic and comprehensive impossibility result in distributed systems states that "Using a communication link that may experience intermittent failures, there is no deterministic algorithm for reaching agreement!" (This is known as the attacking generals problem, and it holds for both the asynchronous and the synchronous system models.)

The interaction of the pedestrian and the car driver is a lossy channel indeed, due to the different understanding/interpretations of intents. So, according to the attacking generals impossibility result, you cannot have a protocol that satisfies both the safety and progress conditions of the crosswalk protocol for all possible executions. We can only hope that safety is not violated (thankfully, that is the case on this crosswalk so far), and the only thing that suffers is the progress condition (and that only temporarily).

There are further bad news for the crosswalk protocol unfortunately. Even with the ideal conditions at the higher level (lossless channels and crash-immune agents), another impossibility result lurks at the lower level. This is known as the arbiter problem, or in its classical formulation as Buridan's ass: "an ass that starves to death because it is placed equidistant between two bales of hay and has no reason to prefer one to the other." This impossibility result states that it is impossible to build an arbiter (a device for making a binary decision based on inputs that may be changing) that is guaranteed to reach a decision in a bounded length of time in all possible executions. ("The basic proof that an arbiter cannot have a bounded response time uses continuity to demonstrate that, if there are two inputs that can drive a flip-flop into two different states, then there must exist an input that makes the flip-flop hang." Another proof of impossibility without appeal to continuity was given here.) Please read Leslie Lamport's fascinating account of the problem and his problems with getting the paper he wrote on this published.

The lessons

There are two lessons to draw here. The first one is that implementing even simple protocols are messy (especially so for distributed systems) due to the many corner cases and low level problems. Abstractions start to leak at the low level. So cut enough slack for safety; don't over-optimize and make your system fragile. Keep backup/recovery strategies for corner cases.

The second lesson is that, practice always finds a way. Despite the attacking generals impossibility result, there are millions of crosswalk crossings and billions of TCP connections every day. The impossibily results say you cannot have both safety and progress in all executions. Practical systems have probabilistic guarantees and backup plans for corner cases: Progress may get a temporarily hit, or safety gets violated but retry/recovery kicks in. And this works for us.
Theory is when you know everything, but nothing works.
Practice is when everything works, but no one knows why.

Wednesday, April 24, 2013

Key-CRDT Stores

This is a master's thesis from the CRDT group, presenting the design and implementation of a key-value CRDT store, named SwiftCloud. SwiftCloud extends the Riak Key-Value store in to a Key-CRDT store by incorporating CRDTs in the system's data-model, namely in the value of the key-value tuple. (By the way, Riak---by Basho inc.--- is a NoSQL database implementing the principles from Amazon's Dynamo paper. Enjoy the brief singing of Riak description here!)

SwiftCloud achieves automatic conflict resolution relying on properties of CRDTs, and provides strong eventual consistency. (This post will make more sense if you read this first) SwiftCloud uses state based replication. Strong eventual consistency between replicas is achieved by merging the states of the replicas. SwiftCloud employs versioned CRDTs to support transactions. Transactions never abort due to write/write conflicts, as the system leverages CRDT properties to merge concurrent transactions.

The Key-Value store interface

A K-V store stores and retrieves byte arrays from the database. In Riak, content is stored as binary data and is identified by (bucket, key) pairs. A (bucket, key) pair has an associated value. Riak provides the following configuration options:

  • # replicas for an object (default is 3)
  • # replicas that reply to a read operation (read quorum)
  • # replicas to write for a write operation (write quorum)
  • strategy for dealing with conflicts (last writer wins or keep multiversions) 

When using the keep multiversions conflict resolution, it is important to always fetch an object before storing it, even when you want to overwrite it. This way the system internally stores a version vector to order the operations; this is to detect that the new value is newer than the previous one and to store both.

SwiftCloud architecture

The interface of the SwiftClient works as a wrapper for Riak Java Client methods. Thus, it allows to fetch and store CRDT objects and do automatic merge of conflicting CRDTs. Fetch and Store operations access stored CRDTs and serialize/deserialize them automatically. To deliver a consistent version of the CRDT to the application, the fetch operation automatically merges the conflicting updates. The clocks associated to CRDTs are used during the merge operation. Clocks are also merged and the resulting clock is associated to the CRDT, before delivering it to the client.


SwiftCloud supports transactions with the following properties: Inside a transaction, the application accesses a snapshot of the database, which includes all CRDTs in the system. All updates of a transaction are executed atomically. Transactions never abort --concurrent updates are merged using CRDT rules.

The transactional system of SwiftCloud builds on the convergence properties of CRDTs and the multi-versioning support provided by the Versioned CRDTs. The unique identifiers stored as part of multiversioned CRDTS enables the system to reconstruct the evolution of a CRDT or undo operations.

The evaluations in the thesis show that the non-transactional SwiftCloud has little impact on performance. The transactional SwiftCloud imposes only a small overhead due to the extra communication steps in the protocol for executing transactions.


Since this works builds heavily on the CRDT idea, it is prone to the same drawback of not providing meaningful invariants across replicas. Current CRDT designs are not suitable to some applications, as they cannot preserve data invariants, such as not allowing an integer to become negative. Future work will try to address that question.

Tuesday, April 23, 2013

Conflict-free Replicated Data Types

Below are my notes summarizing the paper "Conflict-free Replicated Data Types" by Marc Shapiro, Nuno Preguica, Carlos Baquero, and Marek Zawirski. The paper is available here.

Replicated state machines (RSMs) are a basic and important tool for distributed systems. The idea in RSMs is if the replicas start from the same initial state and perform the same updates with the same order, then their end states are the same. The "strong consistency" approach provides this guarantee by serializing updates to the replicas in a global total order. However, the down side to strong consistency approach is that it is a performance & scalability bottleneck, and it also conflicts with availability and partition-tolerance (due to the CAP theorem).

Replicating data under Eventual Consistency (EC) allows any replica to accept updates without remote synchronization. However, published EC approaches are ad-hoc and error-prone (they come without a formal proof of corrrectness). CRDT work tries to address this problem by proposing a simple formally-proven approach to EC, called Strong Eventual Consistency (SEC), which avoids the complexity of conflict-resolution and roll-back.

SEC defines a stronger convergence property than EC. SEC states that correct replicas that have delivered the same updates have equivalent state "immediately", whereas EC states this "eventually" ---since conflict recovery needs to be performed across replicas. (Actually Strong Eventual "Convergence" is a more appropriate and accurate term to describe SEC than Strong Eventual Consistency, because SEC does not really improve the consistency over EC.)

SEC obviates the need for the replicas to coordinate for conflict recovery because it provides conflict-freedom by leveraging on monotonicity in a semi-lattice or commutativity (a trivial example to this is a "set" data type). On the other hand, the downside to SEC and the CRDT approach is that the applicability is limited to simple locally verifiable invariants. In other words, conflict-freedom is traded off with meaningful & useful invariants that span across replicas: When a replica is at a particular state, it is hard to state any predicate about the state of the other replicas. (Note that, in strong consistency---achieved using Paxos for example--- you could state that the state of all replicas are equal. CRDT is PAEL and Paxos is PCEC.)

In the rest of the paper, two sufficient conditions are presented for SEC ---a state-based implementation of CRDT and an operation-based implementation of CRDT--- and a strong equivalence between the two conditions is proven.

State-based replication

Executing an update modifies the state of a single replica. Using gossip protocols, every replica occasionally sends its local state to some other replica, which merges this state into its own state. Every update eventually reaches every replica, either directly or indirectly. (For efficiency, the metadata of the object to be replicated may be gossiped first.)

A semilattice is a partial order equipped with a least upper bound (LUB) for all pairs. LUB is commutative, idempotent, and associative. Monotonic semi-lattice is where (i) The state set S forms a semilattice ordered by . (ii) Merging state s with remote state s' computes the LUB of the two states. (iii) State is monotonically non-decreasing across updates, i.e., s s + u.

Theorem 1 (Convergent Replicated Data Type (CvRDT)). Assuming eventual delivery and termination, any state-based object that satisfies the monotonic semilattice property is SEC.

At this point it is important to note that monotonic semilattice property is not a simple property, because providing a LUB for all state pairs take some pains. For instance, a simple diamond tiling lattice does not satisfy this property.

Operation-based replication

Alternative to the state-based style, a replicated object may be specified in the operation-based style. An op-based object has no merge method; instead an update is split into a pair (t, u), where t is a side-effect-free prepare-update method and u is an effect-update method. The effect-update method executes at all downstream replicas. A sufficient condition for convergence of an op-based object is that "all" its concurrent operations commute. An object satisfying this condition is called a Commutative Replicated Data Type (CmRDT).

Theorem 2 (Commutative Replicated Data Type (CmRDT)). Assuming causal and exactly-once delivery of updates and method termination, any op-based object that satisfies the commutativity property for all concurrent updates, and whose delivery precondition is satisfied by causal delivery, is SEC.

CvRDTs and CmRDTs are equivalent

The equivalence means, we can rewrite a CvRDT as a CmRDT, and vice a versa, with some effort. The proofs in the paper essentially show that this can be done, albeit in an inefficient manner. To give some intuition about this equivalence, let's compare a CmRDT counter implementation versus a CvRDT counter implementation below.

A CmRDT counter is very simple to implement, when you get a new increment operation delivered just increment your counter by 1. If we unequivocally identify every operation and the delivery pre-condition guarantees that all operations are delivered and executed exactly once through the causally-ordered broadcast middleware, replicas will converge regardless of whatever order operations are applied. We have SC.

A CvRDT monotonic counter is not that trivial to implement: Consider a counter that maintains the number of increments. To merge the value (i.e., state) of two replicas with value 1, we could sum the values, or select the maximum value of the merging replicas, but neither would return the correct value in every case: If the replicas had concurrent updates, then the value should be 2. However, if the two replicas are readily synchronized (through gossip), the final value should be 1. This make it impossible to use either max or sum as the merge procedure. Instead, the correct solution is inspired by vector clocks: Store the number of increments for each replica indexed by position in a vector. The query operation retrieves the sum of every vector position and the merge procedure selects the maximum value for each index in the vector.

So what happens is this. In order to avoid state replication, CmRDT assumes an underlying reliable causally-ordered broadcast communication middleware; one that delivers every message to every recipient exactly once and in an order consistent with happened-before. This is actually hiding state under the rug, because often (for most examples) the causally-ordered broadcast communication middleware requires the replicas to keep version vectors, and to buffer/wait operations before delivering (as long as it takes) to ensure that all the operations that causally-precede these operations are delivered first.

Concluding remarks

CRDT provides a limited remedy to CAP theorem. A CRDT replica is always available for both reads and writes, independently of network conditions. Any communicating subset of replicas of a SEC object eventually converges, even if partitioned from the rest of the network. This is not in any way at odds with the CAP theorem, because CRDT weakens consistency to provide more availability. The consistency is weakened so much that it is not possible to provide an invariant that spans acrosss replicas. To address the useful invariants problem, future work prescribes investigating stronger global invariants (using probabilistic or heuristic techniques), and supporting infrequent, non-critical synchronous operations (such as committing a state or performing a global reset).

Sunday, April 21, 2013

AWS Summit NYC, the rest of Day 2

In the afternoon of day 2, there were breakout sessions. Below are summaries from the ones that I found useful. There was also an expo on day 2, and my notes on the expo are towards the end of this post.

Building web-scale applications with AWS

This presentation opened with a customer anectode. As you may know, Reddit runs on AWS. When President Obama did an "Ask Me Anything" on Reddit recently, Reddit called AWS a day in advanced to give heads up. There were 3 million pageviews on Reddit that day, 500K of which was for the President's AMA. Reddit has added 60 dedicated instances and handled the extra traffic smoothly.

The presenter gave the takehome message in advance for his talk: While you scale, architect for failure, and architect with security. The rest of the presentation talked about these (mostly the first point) in more detail.

Architecting for failure. AWS has 9 regions around the world, and supports within each region different availability zones (AZs). An AZ is basically an independent datacenter. Use different AZs and regions to architect for failure.

There are various storage options for architecting for failure. There is S3, which provides durable (everything replicated to 3 nodes by default), scalable, highly-available storage. S3 also has good CloudFront (and also Route53) integration for low-latency access worldwide. Glacier provides a very cheap solution for longterm infrequently-accessed object storage, with 1 cent per gig per month. The tradeoff is that if you want to retrieve an object, this may take up to 5 hours for preparation (since it is based on magnetic-tape based storage). Finally, there is elastic block storage (EBS) which provides support up to 1 TB, is snapshotable, and is a very good choice for random IO.

Then there are database options, they provide a readily queryable system with consistency/performance options. A common use case for the database is for managing session-state for web applications. The session state store should be performant, scalable, reliable. Managing session state with the AWS database options achieve these, as well as divorcing the state from the applications (which is a great tool for architecting for failure). There are several options AWS provides for databases: DynamoDB, RDS (Relational Database Service), and NoSQL high I/O datastore solutions. AWS ElastiCache (protocol compliant with MemCached) helps provide speed support for these databases.

Data-tier scaling is the bane of the architect. It is still a hard problem, but AWS services can help make this a bit easier. There is no silver bullet, you just have to be aware of your options. First you can try vertical scaling, go for a bigger VM instance. For horizontal scaling, the simplest thing you can try is to have a master node and add read-only copies as slaves. (Using RDS and Oracle can create the read-slaves with click of a button.) If you need to go with the high-scalability version, this makes things more complex. This is the sharding and using hash-rings idea. You can shard by function or key space, in RDBMS or NoSQL solutions. DynamoDB provides a provisioned throughput NoSQL database with /fully-managed horizontal scaling/. Leveraging on its SSD-backed infrastructure, DynamoDB can pull this trick. DynamoDB is also well-integrated to AWS CloudWatch, which provided system-wide visibility into resource utilization, application performance, and operational health. Finally, AWS Redshift provides a petabyte-scale data warehousing solution optimized for high query performance on large-scale datasets.

Loose coupling.  When architecting for failure, another important strategy is to use/adopt loose coupling in your applications. The looser coupled your application is the better it scales and tolerates failures. AWS SQS (Simple Queue Service) is your best friend for this. SQS offers a reliable, highly scalable, hosted queue for storing messages as they travel between computers. You use SQS as buffers between system components to helps make the system more loosely-coupled. SQS allows for parallel processing (through fan outs) as well as tolerating failure. A relevant service here is the AWS SNS (Simple Notification Service) which can be used to fanout messages to different SQS based on detection of failure or more traffic. Also you can then add CloudWatch for auto-scaling: if one queue size gets big, you can set the appropriate response to acquiring additional AWS spot instances. (AWS spot is the name-your-own price option for VMs, sort of like PriceLine but for VMs.)

Finally, when architecting for scalability and fault-tolerance, spread the load by using AWS ELB (Elastic Load Balancing) at the frontend. ELB creates highly scalable applications, by distributing load across EC2 instances, and also supports distributing load to multiple AZs.

Architecting for security. AWS provides Identity and Access Management (IAM) helps you to securely control access to AWS services and resources for your users. IAM supports temporary security credentials, and a common use case for IAMs is to set up identity federations to AWS APIs.

Some soundbites. A couple of nice soundbites from this presentation are "scalability: so your application is not a victim of its success" and "scaling is the ability to move the bottlenecks around to the least expensive part of the architecture".

After the talk, I approached the presenter and asked him: What about tightly-synchronized apps (for example transactional processing applications), how does AWS support them? His first answer is to scale vertically by using a bigger instance. When I asked about a scale-out solution, he said that there is no direct support for that. But a possible solution could be to use DynamoDB to record transaction state. DynamoDB provides consistency over its replicas, so it should be possible to leverage on that to build a scale-out transactional application.

Big Data (and Elastic MapReduce)

The first part of the talk introduced big data and the second part talked about Elastic MapReduce (EMR) for big data analytics.

With the accelerated rate of data generation the data volume grows rapidly, and storing and analyzing this data for actionable information becomes a challenge. Big data analysis have a lot of benefits. Razorfish agency analyzed credit card transaction data to improve advertising effectivenes, and achieved they achieved 500% return on ad spend by targeted advertisements. Yelp, an AWS customer, analyzed their usage logs and identified early on a bias for mobile usage, and invested heavily in the mobile development. In Jan 2013, 10 million unique mobile devices used Yelp. Social network data analysis keeps a tab on the pulse of the planet, and a lot of companies are analyzing social network data today for market analysis, product placement, etc.

EMR is AWS's managed Hadoop service. (Read here about MapReduce.) Hadoop has a robust ecosystem with several database and machine learning tools, and EMR benefits from that. EMR providesagility for experimentation, and also cost optimizations (due to its integration with AWS spot, name-your-own-price supercomputing).

AWS datapipeline helps manage and orchestrate data-intensive workloads via its pipeline automation service, and execution and rety logic. A basic pipeline is of the form "input-datanode" -> activity -> "output-datanode", but it is easily expandable with additional checks (precondition) and notifications (of faults), and arbitrary complex fanning out and fanning-in.

Partner & Solution Expo

There was a partner and solution expo as part of day 2 of the summit. The expo was more fun than I expected, there was a lot of energy in the room. You can see a list of companies in the expo under the event sponsors category at the AWS summit webpage.

Most of these companies offer some sort of cloud management solutions as-a-service (AAS) to cloud computing customers: security AAS, management AAS, scalability AAS, analytics AAS, database AAS. When I asked some of these companies "what prevents AWS from doing this cloud management support solutions themselves as part of AWS, and whether they feel threatened by that", they said "this is not AWS's business model". This seemed like a weak answer to me, because AWS has been continuously improving the management of its cloud computing services offerings to make it easier to use for the customers. Later when I asked the same question to an AWS guy, he told me that "These third parties should innovate faster than AWS otherwise they become irrelevant", which seemed like a more logical answer. Maybe the third parties can also differentiate by providing more customized service and support to companies, where AWS provides more generic services. After his forthcoming answer to this question, I asked the AWS guy, whether AWS is starting to eat IBM's lunch. He didn't comment on that. My take on this is that by democratizing IT services, AWS is enabling (will enable) many third parties to start digging into IBM's solutions/consulting market.

I talked with the AWS training people, they are in the process of preparing more training material for AWS. This will be very useful for developers, and also could be a good resource to include in the undergrad/grad distributed systems courses with project components.

I also talked with Amazon Mechanical Turk (AMT) people. They have customers from news and media companies, and were trying to make new connections in the expo. Maybe Amazon should look into integrating AMT with AWS and provide "crowdsourcing as a service" well-integrated to other AWS services.

Finally, I also chatted with the Eucalyptus people. Eucalyptus offers a private cloud solution compatible with AWS. Eucalyptus is opensource, with a business model similar to RedHat (providing support and training). They told me that their use-case is to enable you to test your applications in your private cloud in small-scale and then deploy at scale in AWS. But, that doesn't look like an essential service to me; I don't understand why one couldn't have done that initial testing on the AWS already. Their main competitor is OpenStack, which is backed by RackSpace and a large coalition. OpenStack has been gaining a lot of momentum in recent months.

Wrapping up

AWS is by far the top cloud computing provider, and it is slowly but surely building a datacenter OS, setting the standards and APIs for this along the way. AWS has currently 33 services (S3, RDS, DynamoDB, OpWorks, CloudWatch, CloudFront, ElastiCache, ELB, EMR, SQS, SNS, etc.) to support building cloud computing applications. I had the impression that AWS is about to see an explosion in its customer base. Most of the people I met during the summit have been using AWS in a limited way, and were currently investigating if or how they can use it more intensively.

During the expo session, I had a chance to approach Werner and make acquaintance. I am a tall guy, but he is really tall and big. He is very down-to-earth person; he stood in the AWS booth and met with AWS customers to listen and help for at least 2 hours.

PS: 17 presentations from the #AWS Summit in NYC are made available here.

Friday, April 19, 2013

AWS Summit NYC, Day 2, keynote

In the morning of day 2, we were led into the very large room for the keynote. There were probably 1000-1500 people in the hall, sitting in tight packed formation. Amazon VP & CTO Werner Vogels entered the stage with fanfare (the hall darkened, upbeat music blasting from the speakers). Werner gave a superb, clear, and engaging talk. You can read my notes of it below, or watch the keynote here (around 2 hours long).

The AWS summit series kickstarted in NYC, because there is a good tech crowd here, and more importantly many AWS customers are in NYC. AWS summit series this year will be in 13 cities, for which a total of 20K people registered.

It is hard to believe that AWS S3 launched only in 2006. There has been amazing progress in 7 years in AWS. Now there are a total of 33 services running, which span compute, storage, database and application management. 100,000 businesses are using AWS This list includes even an extraterrestrial client: NASA JPL is a good customer, and Curiosity rover data flows to AWS, where many customers access this data for data-mining.

AWS provides a wide-range of technologies to avoid lock-in to a specific OS and database, tool, and programming language. Consulting partners and technology partners are growing at a tremendous rate. Last year AWS had launched the AWS marketplace, where you can search for software that is prepackaged to run on AWS. Today there are 25 categories, 778 product listings on the AWS marketplace.

AWS takes pride in their focus on customers; they had 31 price reductions since 2006. Economies of scale (and also Moore's law) draw down the costs, and AWS tries to return these back to the customers (this is actually an investment to grow faster). Werner is visibly excited by this: "Wider adoption due to lower prices will enable tremendous beautiful applications, since you will no longer be constrained by the infrasture or usage costs". Werner believes that "the real growth is just around the corner, this is just the beginning".

AWS services

Next Werner goes on to say a couple sentence for some major AWS services.

S3 (simple storage service) recently crossed over 2 trillion objects. It has seen quadratic growth since 2006, and is routinely serving more than a million storage requests a second.

DynamoDB provides managed non relational database that is consistent, high performant, and scalable. AWS focused on the database because traditionally the database has always been the bottleneck in scaling. DynamoDB provides auto-scaling, with low-latency. It is indexed by key and also supports ranges on the key. In the keynote, local secondary index support for DynamoDB is announced. Werner said "Now that, the database is no longer the bottleneck, you can focus on your application to make it scalable".

Next Werner started to talk about the transformation AWS enables: what drives it, who is doing it, and what is next?
The most radical and transformative of inventions are those that empower others to unleash their creativity and pursue their dreams. -- Jeff Bezos 

What drives this Transformation

The success of cloud is because it fell into an economic model where there is an abundance of products, intensifying competition, significant reduced consumer loyalty against products, limited capital. These drive tremendous uncertainty about whether your product is going to be successful. In times of uncertainty you need very different resource models; you need flexibility to address these uncertainties: acquire resources on demand, release resources when no loger needed, pay for what you use, leverage other's core competencies, turn fixed costs into variable. IT wanted to do this for a long time, AWS enabled it. AWS provided the following benefits.

  • AWS help trade capital expense for variable expense to get started: E.g.,Samsung TVs run software that needs to connect to backend infrastructure, so Samsung went with AWS, saved millions of dollars in fixed capital expense.
  • AWS provides lower variable expense than companies can do for themselves.
  • You no longer have to guess capacity.
  • Dramatically improved speed and agility. If you want to increase innovation you have to reduce the cost of failure in AWS you can do many experiments at the same time.
  • Stop spending money on undifferentiated heavy lifting.
  • Go global in minutes.

Who's doing it

Every industry! AWS help achieve radical transformations in every industry.

In insurance business, the climate corporation is building completely radical approach to insuring farmers. They do weather insurance, but no agent is sent to physically visit the farms. This is done in a completely data driven manner. Once a month they run 5K cores on AWS Elastic MapReduce, go through 10K scenario models, spinning 20TB of data. Climate coorp couldn't do this on their premise, massive amount of capital needed to build this infrastructure themselves.

Financial services uses AWS (see the customer talks below for Nasdaq's talk). This is a challenging industry, where there is extreme competition, and stringent regulatory compliance/security requirements.

Media and advertising companies use AWS. Werner said "almost any media you think is AWS customer" and counted among others Fox, PBS, ABC, Newsweek, Washington Post, Independent, Reuters. ABC uses AWS to stream media for exactly the device you are watching on (iphone,ipad, computer) and perform context/location-aware ad insertion. The poster child is of course Netflix, the media streaming giant which has all their operations on AWS. Netflix provides their cloud tools as open source, and sponsor a Netflix open source prize which is in progress.

Manufacturing which has become increasingly global and highly collaborative requires significant scale. AWS powered Boeing for designs. Autodesk another customer. General Electric another customer gave a brief talk, found at the bottom of this post.

Healthcare and biotech has been seeing rapid innovation and has global scale collaboration needs. Center for disease control (CDC) collects data about health risks, does realtime flu tracking and shares with healthcare workers through AWS. Unilever is using AWS to study armpit bacteria to be able to build better deodorants, they also study oral bacteria for producing more effective toothpaste. AWS S3 hosts and makes publicly available 250TB of data from the 1000 genomes project to help research. Illumina, a biotech company, uses AWS in their efforts on sequencing of human genome. Illumina produces DNA sequencing instruments, and using AWS they provide a service where their devices publish data to cloud application. This has uses in cancer treatment: everybody's cancer is different, treatment should be individualized. Illumina's customers are biologists, who using Illumina's basespace portal can now share data with ease rather than resorting to shipping harddisks.

Hospitality sector (hotel IT) is also seeing rapid growth and opportunities. Four seasons hotel, Intercontinental, and Hotelogix are AWS customers. Airbnb, hosted on AWS, upsets the hotels, and enables you to book room/apartment directly from the owners.

What is next?

Werner starts by talking about big data trends: the move to realtime information and deeper integration. These lead to internet of things. Connected devices are incredible data generators.

Shell oil company, another AWS customer, is experimenting with new sensors that generate 1TB of data each, and they are planning on dropping 10K sensors around the world. (Seems like fat sensors is becoming the new trend.) The Mount Sinai hospital uses sensors in hospital for management (beds, doctors, patients, etc).

Your music/books/content is now in the cloud, and devices serve just as a window to that.  Werner wants the treadmill machine in the gym to be like that, another window to the cloud.  The treadmill should realize who I am and set up automatically.

Smartphones are also a primary big data generator and first class citizen in internet of things. Smartphone location-aware services are very significant, and the development of these apps will be dramatically transformed by cloud support. The apps will just stitch a web of services together, this will be the modern style of app dev.

As a final point, Werner mentions security. AWS has readily available encryption services to protect customers and they want to see that encryption is dead simple to use. For example on amazon.com web site all traffic and storage is encrypted. He announce AWS Relational Database Service (RDS) support for transparent data encryption for Oracle.

AWS customers talks

To support his points Werner invited, spread around in his talk, 4 AWS customers to talk about their use of AWS. I am summarizing each of these customers' uses-cases together below.

Bristol-Myers Squib: They service to clinical pharma, molecular dynamics, and computational genomics. They use AWS to support scientists running clinical trial simulations, which are very computationally intensive. Using AWS provides them to have on demand compute capacity, time and money savings, agile innovation, and reduced patient burden (fewer subjects, fewer blood tests).

Nasdaq: Nasdaq is synanomous with stock exchange, but nasdaq also provides IT solutions to industry as their business. Using AWS, they launched FinQloud platform which provides financial-services customers with a secure financial-industry only cloud.

General Electric: They talked about a platform they will build: cloud enabled crowd-driven ecosystem for evolutionary design (CEED). There were no details or concrete applications, but CEED will help address the design and creation of complex systems that require global collaboration. (Yes, very vague I know.)

Mortar: Mortar is a big data startup that won the AWS global startup challenge competition this year.  Mortar, founded on 2011 on AWS,  provides "Hadoop as a service". They aim to democratize data analytics leveraging on AWS which democratized infrastructure.

AWS summit NYC, day 1

I attended AWS summit NYC on Wednesday and Thursday. As an academician, this was a little unconventional conference for me to attend. I regularly attend ACM, IEEE research conferences on focused areas (wireless networks, distributed systems, self-stabilization), but I wanted to go out of my comfort zone to learn about the problems AWS and the AWS ecosystem are working on. Today AWS is leading the cloud computing market, and the AWS infrastructure and platform is an amazing feat of engineering, to say the least. Jeff Bezos, Amazon CEO, has not received a fraction of the genius CEO status alloted to Steve Jobs, but by all accounts Jeff Bezos is not any less deserving. (Read Steve Yegge's platforms rants to see how in 2002 Amazon started an enormous concerted effort for building up its platforms to become what they are today.)

I arrived at 2pm on Wednesday to attend the "cloud kata for start-ups". The name sounds exciting, right? And, they promised powerpoint free presentations! Wonderful, right Unfortunately the presentations still turned out to be canned and dry, for the most part. Instead of powerpoint slides, the speakers dashed through pre-recorded screencast videos, and I experienced the screencast version of death-by-bulletpoint. The problem with screencast videos is it has the potential to combine the very-fast pace of powerpoint presentations while lacking the structure "potentially" provided by powerpoint. The session contents also did not have much to do with the startups and their needs. As a result I got underwhelmed, but fortunately, this turned out to be a false alarm. Werner Vogels, Amazon VP & CTO, wowed people with his keynote the next day, giving an energetic and concise-to-the-point presentation. Day 2 was a blast, and I had a lot of fun. I will post my day 2 notes soon. Read on for my short day 1 notes.

The presentations in the cloud-kata for start-ups session ended up giving short introductions to several AWS services. Some of these are as follows. Elastic cache service adds a scalable and configurable cache layer between your database and applications. Cloud formation service provides a templating service where you can start cloud former, generate a template for your AWS system deployment configuration (e.g., a simple LAMP and elastic load balancer front-end cluster) and store the template on AWS S3. Later you can replicate the same deployment in another region with one command using this template. Opsworks, a very recent service from AWS, provides help for operation management and building workflows. In Opsworks, you can use chef recipes, custom json, add layers (phplayer, MySQL layer), arrange behavior for highload periods, and specify additional machines coming up at certain times of day to help. Finally, AWS Redshift provides a fast, powerful, and fully managed petabyte-scale data warehouse service in the cloud. It offers you fast query performance for very large datasets using SQL-based tools.

Some startups that based their operations on AWS also presented in the session as well. Datadog presented a web-based AWS monitoring tool, which provides monitoring and simple analytics of Nagios alerts as a service. Sumall presented their "analytics for small business", which aims to make getting and visualizing data easy. Sumall gathers data from the business's Google-analytics, Twitter, Instagram, Facebook pages in order to analyze if there are relationships with sales and these social networks activity related to the company. Lastly, Codecademy provides an interactive website to enable interactive learning of programming languages.

After the session finished at 5:30pm, I walked to Google NYC  to visit a student of mine who joined there after PhD. The walk was very nice, due to the beautiful spring weather in NYC. Google had bought a very large building in Manhattan, probably costed them a lot of money. The inside decor was also very modern, with a lot of open space for the developers to relax and mingle. Probably that also costed a fortune.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...