Showing posts from May, 2017

Paper summary: Making sense of Performance in Data Analytics Frameworks (NSDI 15)

What are the bottlenecks for big data processing frameworks? If CPU is the bottleneck, the fix is easy: add more machines to the computation. Of course, any analytics job needs some amount of coordination across machines; otherwise you are just mapping and transforming, not reducing and aggregating information. This is where the network and the disk come into play as bottlenecks. The reason you don't get linear speedup by adding more machines to an analytics job is the network and disk bottlenecks, and a lot of research effort goes into optimizing and alleviating them. OK, this sounds easy, and it looks like we understand the bottlenecks in big data analytics. But this paper argues that more work is needed to understand the performance of big data analytics frameworks, and shows that, at least for Spark on the benchmarks and workloads they tried (see Table 1), there are some counterintuitive...

Paper review: Prioritizing attention in fast data

This paper appeared in CIDR 17 and is authored by Peter Bailis, Edward Gan, Kexin Rong, and Sahaana Suri at Stanford InfoLab. Human attention is scarce; data is abundant. The paper argues that this is how we fight back: (1) prioritize output: return fewer results; (2) prioritize iteration: perform feedback-driven development, give useful details, and allow the user to tune the analysis pipeline; (3) prioritize computation: aggressively filter and sample, trade off accuracy/completeness for performance where it has low impact, and use incremental data structures. The slogan for the system is: MacroBase is a search engine for fast data. MacroBase employs a customizable combination of high-performance streaming analytics operators for feature extraction, classification, and explanation. MacroBase has a dataflow architecture (Storm, Spark Streaming, Heron). The paper argues it is better to focus on what dataflow operators to provide than to try to design a new system from scratch (which won't...

Paper Review: Serverless computation with OpenLambda

This paper provides a great, accessible review and evaluation of the AWS Lambda architecture. It is by Scott Hendrickson, Stephen Sturdevant, Tyler Harter, Venkateshwaran Venkataramani, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, and it appeared at HotCloud 16. Virtual machines virtualized and shared the hardware so that multiple VMs can be colocated on the same machine. This allowed consolidation of machines, prevented the server sprawl problem, and reduced costs while improving manageability. Containers virtualized and shared the operating system, and avoided the overheads of VMs. They provided fast startup times for application servers. By "fast" we mean about 25 seconds of preparation time. In both VMs and containers, there is a "server" waiting for a client to serve. Applications are defined as collections of servers and services. "Serverless" takes virtualization a step further. Serverless platforms virtualize and share the runtime, and now the ...
