Posts

Showing posts from 2011

Pregel: a system for large-scale graph processing

For large-scale graph processing one way to go is, of course, to use Hadoop and code the graph algorithm as a series of chained MapReduce invocations. MapReduce, however, is a functional language, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient (as I alluded to at the end of this summary).

Google Pregel provides a simple straightforward solution to the large-scale graph processing problems. The Pregel solution is to use round(superstep)-based synchronized computation at the vertices supported with message-passing between the rounds. Pregel keeps vertices and edges on the machines that perform computation, and uses network transfers only for messages. This way Pregel avoids the communication overhead and programming complexity incurred by MapReduce chained iterations.

Model
In Pregel, in each iteration (superstep), a vertex can receive messages sent to it in the previous iteration, send messages to other vertices, modify…

Reverse-scooping

"scooping (v): publish a news story before (a rival reporter, newspaper, or radio or television station)." Scooping is what happens when another team beats you to publishing results on a problem. Getting scooped is very frustrating and is dreaded by many PhD students. I heard stories about poor Math PhD students who worked on a problem for years only to discover that they got scooped by a couple months.

OK, then what is reverse-scooping? It is a term I coined last year. (Of course, the irony is that after a Google search I discovered that I was almost reverse-scooping someone else ;-). In reverse-scooping, you solve a problem and publish it first. Then several months later, another team (generally from a better-known university) solves the same problem and publish it at a more visible venue. They get all the credit, their work gets cited a lot, and it is as if your work doesn't exist! Congratulations, you got reverse-scooped.

I got reverse-scooped more than a couple of …

Scalable SPARQL Querying of Large RDF Graphs

Due to its excellent price/performance ratio, Hadoop has become the kitchen sink of big data management and analysis. Hadoop defaults are, of course, not suitable for every kind of data analysis application. For example using Hadoop on relational data processing incurs a lot of waste/inefficiencies. Similarly for graph data processing Hadoop is very inefficient. This paper by Abadi et. al. shows that Hadoop's efficiency on graph data (for semantic web subgraph matching application) can be improved 1000 times by fixing the following defaults in Hadoop.

1. Hadoop, by default, hash partitions data across nodes. For graph processing this is very inefficient. The paper advocates using a locality-preserving partitioning (METIS partitioner) that maps nearby vertices to the same worker as much as possible.

2. Hadoop, by default, replicates each data 3 times. "However, ... the data that is on the border of any particular partition is far more important to replicate than the data that …

Let's go back to the fountain pen

I love my Emacs (and org-mode), but there is something else about my fountain-pen. My fountain-pen is excellent for doodling and doing creative thinking. I guess the personality and non-uniformity of the handwriting adds a lot to the thinking process. Or maybe it is that handwriting requires more hand-eye coordination, or is more relaxing than typing. We are wired as analog and may process/control analog things better. I don't know what that special thing is. But for a task that requires deep thinking, I first hack at it with my pen and then move to Emacs to edit the text and digitally save/archive the text.

They should produce a tablet with a natural (high resolution, comfortable) pen input. That would save me a lot of time from having to type my writing to make it digitally available. I tried some tablets, and I was not happy with their pen support. I would be a loyal customer for an iPad like tablet with natural pen input. It could be a specialized device, it doesn't have …

Privacy through lies

A short sci-fi story by Vernor Vinge, synthetic serendipity, mentions "friends of privacy" which fabricates lies to re-achieve privacy that is violated by web-services *cough*Facebook*cough*. Here is an excerpt.
Doris Nguyen. Former homemaker. Mike eyed the youngish face. She looked almost his mom’s age, even though she was 40 years older. He searched on the name, shed collisions and obvious myths; the Friends of Privacy piled the lies so deep that sometimes it was hard to find the truth. But Doris Nguyen had no special connections in her past. This privacy through lies idea has been explored somewhat. I have heard of a proposal where a node sends 4 different data to be processed (1 correct and 3 incorrect), so the server cannot learn what is the correct data, and the node can use only the reply to the correct data and discards the other 3 replies. I was wondering if this "privacy through lies" approach can be or is explored further.

Trip notes from Mobicom 2011

I am attending MOBICOM'11 at Las Vegas, and mostly enjoying it. I took some notes from the talks. Here I will share my notes about a couple of talks which I found more interesting than the others.

SmartVNC: An effective remote computing solution for smartphones. The goal here is to improve on the usability of solutions (such as AndroidVNC) that allow accessing a remote PC from a smartphone. The problem with AndroidVNC is task-inflation: user needs to do a lot of zooming and panning and also the small keyboard is inconvenient for typing. The solution offered is to use macros to automate common user-specific tasks in AndroidVNC. The macros appear as a sidebar and the user can click a macro to perform several steps quickly. These macros are defined as hooks at the GUI level (instead of app-level or OS-level) so that they can be application-agnostic as well as robust and extensibleDetecting driver phone use leveraging car speakers. The goal is to block the delivery of a call or SMS fo…

Tenure advice to new faculty

A month ago, I participated at a panel as part of the new faculty orientation program of the SUNY Buffalo. There were 5 recently tenured faculty in the panel. The idea was that we would convey our tenure advice to the new faculty. I was the only one from the Engineering School, the others were mostly from Social sciences. We had 80 minutes for 5 speakers, so I expected to get 15 minutes. But, the panel was rather informal and the speaking times were not enforced. So, I got to speak as the 4th speaker for 5 minutes only. The other speakers had come with 3-4 printed pages to talk, I had keynote prepared keynote slides (which I ended up not showing with a projector). Since I feel I didn't get enough mileage out of these slides, I am sharing them here.
UBNewFacultyAdvice.pdf

The panel also exposed me to some cultural differences among Social and Engineering disciplines. I didn't find their advice actionable, and probably they found my advice blunt. I won't discuss more. I am an…

Rapid prototyping FTW

Image
Very interesting talk. It is worth your 6 minutes. Sorry, Dijkstra approach, you lose.

I must be a sellout. My research is hot

Almost all of my students left for internships this summer. Three of my students, who are working on crowdsourcing and smartphone apps, have been grabbed by Bosch Research Pittsburgh (they got two of them) and IBM Research New York. Some of the projects these students have worked on with me include "using Twitter as a middleware for crowdsourced sensing", "detecting breakpoints in public opinion using Twitter", "crowdsourcing queue lengths at campus cafes with a smartphone campus app", "crowdsourcing location-based queries". Another student working on cloud computing have been employed by VMware (after lucking out with Google since he started interviewing late). His research is on developing user-level wide-area-network filesystems. Finally one brave soul decided he would not apply for internships and stay with me this summer to concentrate on his research (last summer, he had been on an internship with Bosch Research Paolo Alto).
There are alte…

Online migration for geodistributed storage systems, Usenix ATC 2011

Image
This paper investigates the problem of migrating data between data centers. Data needs to be moved from one center to another based on the access patterns, for example, the user may have moved from East to West coast. The problem is complicated by the large size of data that needs to be moved, the requirement to perform the migration online without blocking access to any part of the data from anywhere, and finally that the data can be accessed/modified concurrently in different locations.
To address this problem, the paper proposes an overlay abstraction. The goal of the abstraction is to implement migration as a service, so that the developer does not have to deal with the race conditions that may result while migrating data in ad hoc ways. The analogy of overlay is a sheet of transparencies. Remember the old days before powerpoint? The presenters used to print the slides on transparencies, and do animation by overlaying one transparency over another. The overlay idea is similar. &quo…

Refuse to Crash with Re-FUSE

Image
This paper appeared in Eurosys'10. This is a well written paper: the paper holds your hand, and takes you for a walk in the park. At each step of the path, you can easily predict what is coming next. I like this kind of easy-reading papers, compared to the cryptic or ambiguous papers which make you wander around or try to guess which paths to take through a jungle of junctions.
The goal of this work is to provide support for restartable user-level filesystems. But, before I can tell you more about that, we first need to discuss user-filesystems. User-filesystems provides a way to add custom features (such as encryption, deduplication, access to databases, access to Amazon S3, etc.) on top of existing kernel-level filesystems. FUSE is a popular software that facilitates building user-filesystems on top of kernel-level filesystems. FUSE is available for Linux, FreeBSD, NetBSD, and MacOSX, and more than 200 userfilesystems have already been implemented using FUSE. GlusterFS, HDFS, ZFS…

Sabbatical help

My tenure at University at Bufffalo, SUNY has just become official after the President signed on it. Having completed my 6 years at UB, I am now eligible to spend a sabbatical year. I am planning to spend 6 months of my sabbatical in the industry/research-lab working on cloud computing challenges. If you have any suggestions/connections to make this happen, please send me an email.

Why are algorithms not scalable?

Recently, a colleague emailed me the following: Since you have been reading so much about clouds, CAP, and presumably lots of consensus things, you can answer better the question of algorithm scalability. How scalable are the popular algorithms? Can they do a reasonable job of consensus with 100,000 processes? Is this even a reasonable question? What are the fundamental problems, the algorithms or the lower level communication issues? These are actually the right kind of questions to ask probing for deeper CS concepts. Here are my preliminary answers to these questions.
From what I read, consensus with 100K processes is really out of question. Paxos consensus was deployed on 5 nodes for GFS and similar systems: Zookeper, Megastore, etc. As another example Sinfonia's participant nodes in a transaction is also around limited to 5-10.
So what is wrong with algorithms, why are they unscalable? I guess one obstacle against scalability is the "online" processing requirement, …

On designing and deploying Internet scale services

This 2008 paper presents hard-earned lessons from Jim Hamilton's experience over the last 20 years in high-scale data-centric software systems and internet-scale services. I liken this paper to "Elements of Style" for the domain of Internet scale services. Like the "Elements of Style" this paper is also not to be consumed at once, it is to be visited again and again every so often.
There are three main overarching principles: expect failures, keep things simple, automate everything. We will see reflections of these three principles in several subareas pertaining to Internet-scale services below.
Overall Application Design Low-cost administration correlates highly with how closely the development, test, and operations teams work together. Some of the operations-friendly basics that have the biggest impact on overall service design are as follows.
/Design for failure/ Armando Fox had argued that the best way to test the failure path is never to shut the service down …

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

Image
For performance reasons many large-scale sites (LinkedIn, Digg, Facebook, etc.) employ a pool of backend servers that operate purely on in-memory state. The frontend servers that forward the requests to the backend servers then should be very carefully implemented to ensure that they forward the requests to the correct backends. The frontend should ensure that the selected backend has the requested object in its memory and also possesses the lease to update the object. Unfortunately, things may change quickly at the backend; backend servers may fail, new ones may join, backend servers may start to cache different objects due to load-balancing requirements, and leases may exchange hands. All of these make the task of programming the frontend very challenging and frustrating.
This work (from NSDI'10) proposes Centrifuge, a datacenter lease manager that addresses this problem. The Centrifuge system has been deployed as part of Microsoft live mesh service, a large scale commercial clou…

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance

Image
This paper from FAST'09 introduces smoke and mirrors filesystem (SMFS) which mirrors files at a geographically remote datacenter with negligible impact on performance. It turns out remote mirroring is a major problem for banking systems which keep off-site mirrors (employing dedicated high-speed 10-40Gbits optical links) to survive disasters.
This paper is about disaster tolerance, not your everyday fault-tolerance. The fault model includes that the primary site may get destroyed, and some sporadic packet losses upto 1% may occur simultaneously as well, yet still no data should be lost. (Data is said to be lost if the client is acknowledged for the update but the corresponding update/data no longer exists in the system.) The primary site being destroyed may be a bit over-dramatization. An equivalent way to state the fault model would be that the paper just rules out a post~hoc correction (replay or manual correction). Here is how manual correction would work: if power outage occur…

Consistency analysis in Bloom: a CALM and collected approach

This work from CIDR11 aims to provide theoretical foundations for correctness under eventual consistency, and identifies "order independence" (independence of program execution from temporal nondeterminism) as a sufficient condition for eventual consistency.
CALM (consistency and logical monotonicity) principle states that monotonic programs guarantee order independence, and hence, eventual consistency. A monotonic program is one where any true statement remains to be true as new axioms arrive. In contrast, in a non-monotonic program, new input can cause the earlier output to be revoked. A monotonic program guarantees eventual consistency, whereas non-monotonicity requires the use of coordination logic.
To simplify the task of identifying monotonic and nonmonotonic program segments, this work proposes using program analysis in a declarative language, Bloom. In Bloom, monotonicity of a program is examined via simple syntactic checks. Selection, join, and projection operators a…

Life beyond Distributed Transactions: an Apostate's Opinion

Image
Pat Helland is one of the veterans of the database community. He worked on the Tandem Computers with Jim Gray. His tribute to Jim Gray, which gives a lot of insights into Jim Gray as a researcher, is worth reading again and again.
This 2007 position paper from Pat Helland is about extreme scalability in cloud systems, and by its nature anti-transactional. Since Pat has been a strong advocate for transactions and global serializability for most of his career, the title is aptly named as an apostate's opinion.
This paper is very relevant to the NoSQL movement. Pat introduces "entity and activities" abstractions as building primitives for extreme scalability cloud systems. He also talks about at length about the need to craft a good workflow/business-logic on top of these primitives.
Entity and activities abstractions Entities are collections of named (keyed) data which may be atomically updated within the entity but never atomically updated across entities. An entity lives on…

Tempest and Deutronomy reviews

I am 3 weeks behind in writing summaries for the papers we discuss in my seminar. In order to have some chance of catching up, I am skipping the summaries for Tempest and Deutronomy papers, and refer to student summaries for these two.

Tempest: Soft state replication in the service tier (DSN'08)

Deuteronomy: Transaction Support for Cloud Data (CIDR'11)

Megastore: Providing scalable, highly available storage for interactive services

Image
Google's Megastore is the structured data store supporting the Google Application Engine. Megastore handles more than 3 billion write and 20 billion read transactions daily and stores a petabyte of primary data across many global datacenters. Megastore tries to provide the convenience of using traditional RDBMS with the scalability of NOSQL: It is a scalable transactional indexed record manager (built on top of BigTable), providing full ACID semantics within partitions but lower consistency guarantees across partitions (aka, entity groups in Figure 1). To achieve these strict consistency requirements, Megastore employs a Paxos-based algorithm for synchronous replication across geographically distributed datacenters.

I have some problems with Megastore, but I save them to the end of the review to explain Megastore first.

Paxos
Megastore uses Paxos, a proven, optimal, fault-tolerant consensus algorithm with no requirement for a distinguished master. (Paxos is hard to cover in a blog…

Flexible, Wide-Area Storage for Distributed Systems with WheelFS

One of my students, Serafettin Tasci, wrote a good review of this paper, so I will save time by using his review below, instead of writing a review myself.
In this paper the authors propose a storage system for wide-area distributed systems called WheelFS. The main contribution of WheelFS is its ability of adaptation to different types of applications with different consistency, replica placement or failure handling requirements. This ability is obtained via semantic cues that can be easily expressed in path names. For example to force the primary site of a folder john to be X, we can specify the cue “home/users/.Site=X/john”. This representation enables preserving of POSIX semantics and minor change in application software to use the cues.
In WheelFS, there are 4 groups of semantic cues. Placement cues are used to arrange the location of primaries and replicas of a file or folder. Durability cues specify the number and usage of replicas. Consistency cues maintain a tradeoff between co…

Ceph: A Scalable, High-Performance Distributed File System

Image
Traditional client/server filesystems (NFS, AFS) have suffered from scalability problems due to their inherent centralization. In order to improve performance, modern filesystems have taken more decentralized approaches. These systems replaced dumb disks with intelligent object storage devices (OSDs)--which include CPU, NIC and cache-- and delegated low-level block allocation decisions to these OSDs. In these systems, clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes). This separation of roles improve overall scalability significantly. Yet, these systems still face scalability limitations due to little or no distribution of the metadata workload.

Ceph is a distributed file system designed to address this issue. Ceph decouples data and metadata operations completely by eliminating file allocation tables and replacing them with generating functions (called CRU…

Sinfonia: A New Paradigm for Building Scalable Distributed Systems

Image
Sinfonia is an in-memory scalable service/infrastructure that aims to simplify the task of building scalable distributed systems. Sinfonia provides a lightweight "minitransaction" primitive that enables applications to atomically access and conditionally modify data at its multiple memory nodes. As the data model, Sinfonia provides a raw linear address space which is accessed directly by client libraries.

Minitransactions
In traditional transactions, a coordinator executes a transaction by asking participants to perform one or more participant-actions (such as retrieving or modifying data items), and at the end of the transaction, the coordinator decides and executes a two-phase commit. In the first phase, the coordinator asks all participants if they are ready to commit. If they all vote yes, in the second phase the coordinator tells them to commit; otherwise the coordinator tells them to abort.

Sinfonia introduces the concept of minitransactions, by making the observation t…

PetaShare: A reliable, efficient and transparent distributed storage management system

Image
This paper by my colleague Tevfik Kosar (to appear soon) presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which manages 450 Terabytes of disk storage and spans 7 campuses across the state of Louisiana.
There are two main components in a distributed data management architecture: a data server which coordinates physical access (i.e. writing/reading data sets to/from disks) to the storage resources, and a metadata server which provides the global name space and metadata of the files. Metadata management is a challenging problem in widely distributed large-scale storage systems, and is the focus of this paper.
Petashare architectureThe back-end of PetaShare is based on iRODS. All system I/O calls made by an application are mapped to the relevant iRODS I/O calls. iRODS stores all the system information, as well as user-defined rules in centralized database, which is called iCAT. iCAT contains the information of the distributed sto…

Popular posts from this blog

I have seen things

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PigPaxos: Devouring the communication bottlenecks in distributed consensus

Frugal computing

Learning about distributed systems: where to start?

Fine-Grained Replicated State Machines for a Cluster Storage System

My Distributed Systems Seminar's reading list for Spring 2020

My Distributed Systems Seminar's reading list for Fall 2020

Cross-chain Deals and Adversarial Commerce

Book review. Tiny Habits (2020)