Monday, May 27, 2013

IPDPS'13 day1 graph algorithms

Here are some of my notes from the first day of IPDPS.

Optimizations & analysis of BSP graph processing models on public clouds

Mapreduce/Hadoop is not very suitable for graph processing (which requires iterating over and over on the same graph), and this led to the Pregel graph processing framework by Google. Pregel is based on the Bulk Synchronous Parallelism (BSP) model. Here is a summary of Pregel if you are not familiar with it. In short, Pregel uses a vertex-centric graph processing model, where the same code is executed on all vertices concurrently. Pregel uses message passing along edges and barrier synchronization at supersteps (i.e., rounds) to iterate over the graph. This paper looks at optimizations and analysis of BSP graph processing frameworks.

This group had access to Microsoft Azure cloud computing framework, and they wanted to experiment with Pregel there, so they implemented Pregel (following the description in the Pregel paper) from scratch in .NET environment. They early on noticed that Pregel gets fairly memory intensive as it holds on to all the messages sent to all the vertices in the worker. They started analyzing it further to see how the memory usage changes in the lifetime of a Pregel program over many supersteps. They discovered that there is a camel hump in the middle of the program lifetime (i.e., in the middle supersteps) for most traditional graph programs, such as all-pairs shortest path and betweenness centrality. This is because these graph programs tend to exchange more messages towards the middle supersteps as the computation flourishes and the number of messages exchanged subdues again as the computation comes closer to termination. (It turns out this hump is not present for PageRank.) This hump, of course, is important as it has an impact on how many workers you need to provision, because you need to provision for the worst-case memory usage of the workers.

So the group goes on to look into how they can constrain this hump to have a predictable memory usage through all supersteps and to facilitate managing the memory constraints of the workers. To this end, they come up with the swath concept. Swath is a subset of vertices of entire graph on which the algorithm is being initiated. Their goal is to pick the swath size that is the best fit into the main memory (amplitude) of the workers. They work on identifying swath initiation heuristics (when are subsequent swaths of verices activated) and swath size heuristics (how many vertices are active concurrently in a swath). They experiment with two approaches, sampling approach and adaptive approach, to determine when the next swath is initiated. By breaking computation into swaths of vertices and using our sizing heuristics, they are able to achieve up to 3.5x speedup over the maximum swath size that does not cause the a failure. Of course, a limitation of the swath approach is that it assumes that the program execution is embarassingly parallel and you can execute the program over swaths distributed in time without causing any correctness issues. So this approach is applicable only to those type of graph programs, such as all-pairs shortest path and betweenness centrality.

The hump observation in memory usage of BSP-based graph processing frameworks is a nice and useful one. We have also worked on BSP-based graph processing frameworks and focused on improving the opensource Giraph implementation of Pregel. We provided serializability to Giraph by introducing an optimization: internal vertices in a worker do not message each other but rather read each others' state directly from the memory of the worker they reside. Our optimization not only provides a stronger serializability to Giraph, but it also prevents this memory camel-hump haunting BSP programs as well. Our paper has been accepted to EuroPar13, and I will write a detailed post about our paper soon.

Multithreaded graph partitioning 

This paper describes the design and development of an extension to the Metis partitioner to enable multithreading, and while doing so the paper also thoroughly explores the Metis design space.

The Metis partitioner is a tool for divide a graph into minimally connected and roughly equal parts. While partitioning the graph, the constraint is that the largest partition produced should be smaller than a given size (the worker memory size), and this makes the problem an NP-hard problem. The way Metis works is via using a multilevel paradigm of 1) coarsening, 2) initial partitioning, and 3) uncoarsening. In the first two steps an approximate partitioning is made via coarsening, and then Metis does a refinement on "the bordering vertices" to find better partitioning in the last step. Since the coarse partitioning works over all vertices instead of just border vertices it is generally the bottleneck step.

The paper investigated several alternatives for each of the 3 steps above. For the coarsening step they looked at fine-grain matching (locking based), multi-pass matching, and unprotected matching (which requires a conflict resolution at the end, but this is no problem because only a small percentage of conflicts occurs). For the initial partitioning they tried parallel recursive bisectioning, and parallel k-sectioning. For the refinement step they tried coarse-grain and fine-grain approaches. They give an experimental evaluation of all these different approaches on graph datasets (roadmap and vlsi circuit) that consist of millions of vertices. They evaluated for performance and cut quality, and showed that their multithreaded metis is a clear winner. 

One of the lessons learned from the multithreaded metis is that using unprotected operations (for coarsening step) is not that dangerous or crazy, because cleaning up after race conditions turned out to be faster than preventing them. This group made their code open source at

Finally, some ramblings 

I never understood the people that go all that trouble to travel to a conference who only then sit in the lobby or the room to websurf and do emails hunching on their laptops. If there is a sensible explanation for this behavior can someone tell me, so I can stop wondering about this? Yes, presenters are not always doing a great job at explaining things, but after all that trouble to traveling to the conference, those people owe it to themselves to get the most out of the conference by being there as a whole and not just physically.

My theory about the low quality of some presentations is that the presenters often give presentations to impress and not to teach/educate/communicate. (Ted Herman once warned me about this when I was a graduate student, and I tried to do my best not to fall into that trap ever since.) I believe that by just focusing on the message to communicate and being committed to communicating it, most presenters would be able to do a decent job. Instead presenters seem to feel like they need to show off how technically through and how clever they had been, how sophisticated they are, and the result is a dry, defensive, and incomprehensible presentation. Look, you have been thinking about that problem for the last one year at least, and I am hearing/seeing about it the first time here, and you expect me to understand all the subtleties in that problem space in 10 minutes? If you are committed to teaching/educating/communicating in your allotted 25 minute time slot, you should focus on explaining the insight and the most important technical results (not all of them) in the simplest of terms. You can mention that the problem has several subtleties and you are referring the audience to the technical paper for the full details. I am not saying that you shouldn't be technical; you can be technical but not to the expense of being cryptic or exclusive.

Mobile Sensing Revolution keynote by Hari Balakrishnan

Last week I was in Boston to attend International conference on Parallel and Distributed Processing Systems (IPDPS), and I will post some of my notes from this conference here.

Coincidentally, Distributed Computing in Sensor Systems (DCOSS) was also being held at Boston at the same dates with IPDPS, and I could only reserve my room at the DCOSS hotel instead of the IPDPS hotel. Since the DCOSS keynote on mobile sensing by Hari Balakrishnan sounded more interesting to me than the IPDPS first day keynote, I decided to sneak in to the DCOSS room and attend that keynote instead. Hari's talk had two parts. He spent most of his time on the first part, and had to rush through the second part.

How can sensors improve networks?

The crossover happened this year: smartphone & tablets accessing the Internet overtook PCs & laptops accessing the Internet. Smartphone adoption rate is 7 times the rate of growth of world population, and there are more people using smartphones than people that have access to electrics, in-house toilet, or toothbrush.

Smartphones are truly mobile devices, and as a result they move through different environments and experience different network conditions. Even during a slow walk, signal-to-noise-ratio (SNR) changes drastically with time, and the bit-error-rate (BER) varies 5-6 orders of magnitude within this time interval. So there is a huge volatility of the network connection stemming from mobility. The question is: How do we deal with these rapidly changing networking conditions?

This is not a new problem. The way 802.11 deals with this is by using bit rate adaptation. To this end, 802.11 measures the wireless channel at the device (the packet loss rate, SNR, or BER), and determines the best bit rate and switches to that at the wireless access point (WAP) for communicating to that device. But the problem with this is, for smartphones, which are truly mobile devices, these measurements are quickly outdated.

So, how do we measure and determine the best bit rate for smartphones? Hari's idea is to use sensors on the smartphone to solve/alleviate this problem. Smartphone has several very good sensors including accelerometer, compass, gyroscope, GPS, light sensor, microphone, and even a barometer. We can use the readings from these sensors to tell which way the device is oriented, its position, direction, and speed, and use these hints to adapt to the different mobility modes accordingly. For example, the accelerometer can detect walking and this can help improve rate adaptation at the MAC layer. The compass can detect heading and this can help improve AP association at the network layer. The gps can detect speed and this can help improve vehicular routing at the network level. The smartphone can also use hints (sensor readings) from neighboring smartphones as part of a mobility hint protocol. You can also easily build a crowdsourced network quality/connection monitoring service.

Hari's group focused on detecting movement and built the "rapidsample" service for quick movement detection on the Android. Using rapidsample your device can move into an aggressive mode upon detection of movement in order to address/alleviate the movement-induced network problems. Rapidsample can reliably detect movement within 10-100ms on commodity Android devices by considering accelaration, magnitude, and variance. Rapidsample is able to get pretty low-energy sensor readings from the accelerometer on the Android. When I mentioned to Hari about our findings about costly accelerometers on Android, he said that this cost is due to the inefficient Java implementation of the drivers on the Android. Hari's group has custom implemented their accelerometer sensing in C to achieve energy-efficiency and low latency.

Using the sensor-augmented protocols (by employing rapidsample), they were able to achieve 40-60% higher throughput for a 50% mobile user. And for the access point selection problem for android for VOIP, they were able to achieve 40% fewer handoff disruptions without any change on the WAP part. Finally, they were able to achieve 20 times reduction in probe bandwidth as well thanks to rapidsample. Some limitations of their current work are that 1) while accelerometer and compass are cheap to use GPS is not, and 2) calibration across different device types is needed (while the movement hint required no calibration, the walking hint did).

Mobile sensing with driver safety

Today we still face severe road transportation challenges. There are 10 million accident/year in US which results in 35K deaths annually. Insurance companies offer incentives (upto 15% of the premium) via usage-based insurance (UBI) for safer driving. But UBI is implemented today using embedded devices which makes for a difficult deployment and maintenance.

Hari is working with insurance companies for assessing driver safety, by computing driver risk with real data from smartphone apps. Smartphones are key to mass adoption of this idea, so they built an Android background app for this. Their app is on the Android app store for people to try.

Their app uses only WAP based localization to keep a low-energy profile. As a result they claim that the app running in the background can span upto 50 hours of battery lifetime. The problem is that the WAP signals can be pretty sparse, so they had to develop a trajectory mapping algorithm that employs a hidden markov model to use both previous and future location information from the users. This trajectory mapping algorithm works on the cloud and can be pretty computationally intensive. For example a Houston-Chicago drive had recorded only 40 WAPs along the way and the trajectory mapping algorithm took 1 hour to compute the trajectory taken by the car.

Their app collected 2 million total miles in tests so far. They found that user mileage is very important, and certain combinations of accelerometer and speed is problematic, which they capture as a colored risk There is no ground truth to claim that the app makes the drivers safer drivers, but Hari claims that looking to the accelerometer logs you can see better driving habits (since the drivers are incentivized to get a discount on their insurance premium).

They found that most of the users don't care to plug in their smartphones to the car for recharging as they see this as a burden. So Hari's number 1 wish for the future smartphones is to include lower-power (virtually zero-power) sensors. This can be done by a better design of the hardware and software. As Hari puts it, you want the sensors to run continuously and put the sensor data into a buffer and when the CPU wakes up it can get the data from the buffer.

Scaling Memcache at Facebook

This paper, which appeared in NSDI 2013, describes the evolution of Facebook’s memcached-based architecture. Memcached is an open-source implementation of a distributed in-memory hash table over a cluster of computers, and it is used for providing low latency access to a shared storage pool at low cost. Memcached was first developed by Brad Fitzpatrick in 2003.

Memcached at Facebook targets a workload dominated by reads (which are two orders of magnitude more frequent than writes). Facebook needs to support a very heavy read load, over 1 billion reads/second, and needs to insulate backend services from high read rates with very high fan-out. To this end, Facebook uses memcache as a "demand-filled look-aside cache". When a web server needs data, it first requests the value from memcache by providing a string key. If it is a cache hit, great, the read operation is served. If the item addressed by that key is not cached, the web server retrieves the data from the backend database and populates the cache with the key-value pair (for the benefit of the upcoming reads).

For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache to invalidate the stale data, if any. Facebook chooses to delete cached data instead of updating it in cache because deletes are idempotent (which is a nice property to have in distributed systems). This is safe because memcache is not the authoritative source of the data and is therefore allowed to evict cached data.

In a Cluster: Reducing Latency and Load

Loading a popular page in Facebook results in fetching from memcache an average of 521 distinct items (a very high fan-out indeed), which have been distributed across the memcached servers through consistent hashing. All web servers communicate with every memcached server in a short period of time, and this all-to-all communication pattern can cause incast congestion or cause a single server to become the bottleneck for many web servers.

Memcached servers do not communicate with each other. When appropriate, Facebook embeds the complexity of the system into a stateless client rather than in the memcached servers. Client logic is provided as two components: a library that can be embedded into applications or as a standalone proxy named mcrouter. Clients use UDP and TCP to communicate with memcached servers. These client-centric optimizations reduce incast-congestion at the servers (via application-specific UDP congestion control) and reduce load on the servers (via use of lease-tokens[like cookies] by the clients). The details are in the paper.

For handling failures of memcached nodes, Facebook uses the redundancy/slack idea. While Facebook relies on an automated remediation system for dealing with node failures, this can take up to a few minutes. This duration is long enough to cause cascading failures and thus Facebook uses a redundancy/slack mechanism to further insulate backend services from failures. Facebook dedicates a small set of machines, named Gutter, to take over the responsibilities of a few failed servers. Gutter accounts for approximately 1% of the memcached servers in a cluster. When a memcached client receives no response to its get request, the client assumes the server has failed and issues the request again to a special Gutter pool. If this second request misses, the client will insert the appropriate key-value pair into the Gutter machine after querying the database.

Note that this design differs from an approach in which a client rehashes keys among the remaining memcached servers. Such an approach risks cascading failures due to non-uniform key access frequency: a single key can account for 20% of a server’s requests. The server that becomes responsible for this hot key might also become overloaded. By shunting load to idle servers Facebook limits that risk. In practice, this system reduces the rate of client-visible failures by 99% and converts 10%–25% of failures into hits each day. If a memcached server fails entirely, hit rates in the gutter pool generally exceed 35% in under 4 minutes and often approach 50%. Thus when a few memcached servers are unavailable due to failure or minor network incidents, Gutter protects the backing store from a surge of traffic.

In a Region: Replication

Naively scaling-out this memcached system does not eliminate all problems. Highly requested items will become more popular as more web servers are added to cope with increased user traffic. Incast congestion also worsens as the number of memcached servers increases. They therefore split their web and memcached servers into multiple "regions". This region architecture also allows for smaller failure domains and a tractable network configuration. In other words, Facebook trades replication of data for more independent failure domains, tractable network configuration, and a reduction of incast congestion.

Across Regions: Consistency

When scaling across multiple regions, maintaining consistency between data in memcache and the persistent storage becomes the primary technical challenge. These challenges stem from a single problem: replica databases may lag behind the master database (this is a per-record master as in PNUTS). Facebook provides best-effort eventual consistency but place an emphasis on performance and availability.

Facebook designates one region to hold the master databases and the other regions to contain read-only replicas, and relies on MySQL's replication mechanism to keep replica databases up-to-date with their masters. In this design, web servers can still experience low latency when accessing either the local memcached servers or the local database replicas.

Concluding remarks

Here are the lessons Facebook draws from their experience with the memcached
system. 1) Separating cache and persistent storage systems via memcached allows for independently scaling them. 2) Managing stateful components is operationally more complex than stateless ones. As a result keeping logic in a stateless client helps iterate on features and minimize disruption. 3) Finally Facebook treats the probability of reading transient stale data as a parameter to be tuned, and is willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.

Related links
AWS seems to offer elasticache, which is protocol-compliant with Memcached.
There is also the RAMCloud project, which I had summarized earlier here.

Friday, May 17, 2013

One Pomodoro, two pomodoro, three pomodoro, four

I have been using the pomodoro technique for a couple years  to improve my productivity. Pomodoro is a timer (of 20 minutes) during which you commit to do a task. After this task timer there is a short break, after which the next task timer starts again. The pomodoro technique is described here in detail.

I like pomodoro as it helps me to concentrate and get things done.  It also helps me to get started on something I detest doing: Surely I can endure doing that thing for 20 minutes, right? This helps trick myself to break my inertia and usually I find that I can keep going for multiple pomodoros on that task.

I use the Pomodoro Desktop app (by Ugo Landini) for Mac OS X, and configure it to use 20 minutes as the task timer and 10 minutes as the break timer. I modified it to disable my laptop wifi at the start of the task timer and enable it back at the end of the task timer. For this I added do shell script "networksetup -setairportpower en0 off" to the pomodoro start script and do shell script "networksetup -setairportpower en0 on" to the end script.

Adding the automatic wifi disable/enable to the pomodoro has improved my productivity greatly. Now, I do not get lured to check twitter or gmail in the middle of writing something (which unavoidably leads to checking hacker news and quora). If what I am working on really requires to look something up on the internet, I note it (either on a paper, or most of the time I actually enter the thing to look up on the inactive chrome by opening a new tab). At the end of my task timer, my wifi restores automatically, and I grab those pages in bulk. During the break time, I can also check my mail if I feel like it, or get up, walk around and flex. During the break time I also take a step back, and think about what I have been working and how it relates to other things and fits in the big picture. The break gives my brain time to get creative and make new connections.

This setup also reduces the distractions from being always connected to the Internet. With the automated wifi disabling/enabling, I am basically using my inherent laziness to my advantage. Manually enable my wifi carries a transaction cost, and I guess I am too lazy to do that and/or too proud to accept a defeat :-)

Thursday, May 16, 2013

Android reverse-scooped us with the proximity alert service

I had mentioned about Fatih Bulut, one of my PhD students, when writing about his crowdsourced line wait-time forecasting project. Fatih has recently been working on an energy-efficient proximity alert service for Android, and published his preliminary work in M. F. Bulut, M. Demirbas. Energy Efficient Proximity Alert on Android. IEEE Workshop on Pervasive Collaboration and Social Networking, 2013. The abstract from that paper reads:
The proximity alert service on Android is important as an enabler of ubiquitous location-based services, however, it is also limited in this role due to its excessive energy expenditure. In this paper, we present the design and implementation of an energy-efficient proximity alert service for Android. Our method utilizes the distance to the point of interest and the user’s transportation mode in order to dynamically determine the location-sensing interval and the location providers (GPS, GSM, or Wi-Fi) to be used. We implement our method as a middleware service in the Android open source project. Our service, for a realistic scenario, reduces GPS usage by 96.66% and increases battery life time by 75.71% compared to the baseline proximity alert in Android.

Today, in the Google I/O conference, Google unveiled a service very similar to Fatih's proximity alert service as part of their Maps API. They utilized identical features (distance and transportation mode) as Fatih did in his implementation.
Geofencing APIs
Lets your app setup geographic boundaries around specific locations and then receive notifications when the user enters or leaves those areas.
Simple but powerful APIs: Allows batch addition and removal of geofences. Ability to manage multiple geofences at the same time. Ability to filter alerts for both entry and exit or entry only or exit only.
Optimized for battery: Adjusts location updates based on user's proximity to the geofence and user's modality (still, walking, driving, and so on).

I think this reverse-scooping is actually good news for us. It showed that we have been working on an important direction. And we now have access to a solid proximity service from Google with which we can compare and build on for our future ideas for proximity alerts. (However, I don't know how this will affect the fate of his conference paper which is under-submission ;-)