Metadata

Posts

Showing posts from September, 2014

Paper Summary: High-availability distributed logging with BookKeeper

- September 29, 2014

This paper is brought to you by the Yahoo Research group that developed the ZooKeeper, and it appeared in LADIS'12 . BookKeeper targets the logging problem. More specifically, the distributed logging problem where high-availability is important and where many distributed clients are interested in reading the logs. Most current applications log to the local disk, but this constitutes a single point of failure (SPOF) and betrays high-availability. A hasty remedy is to write to an NFS partition to store log files remotely. But now the NFS server becomes the SPOF. (We can of course replicate the NSF server, but the performance would suffer.) Another solution is to use NetApp filers that implement RAID. This costs money, and still does not completely solve SPOF. BookKeeper provides a no-SPOF efficient data store for serving a large number of concurrent single-writer, multiple-reader logs. It stripes log entries across servers, leading to higher throughput. BookKeeper is opensour...

Paper summary: Tango: Distributed Data Structures over a Shared Log

- September 28, 2014

This paper is from the Microsoft Research Silicon Valley (which unfortunately recently got closed), and it appeared in SOSP'13. SOSP'13 provides open access, so here is the pdf for free. The talk video is also on YouTube as part of this SOSP'13 talks playlist . I think this paper didn't get the attention it deserves. It is really a great piece of work. To facilitate construction of highly available metadata services, Tango provides developers with the abstraction of a replicated in-memory data structure (such as a map or a tree) backed by a shared log. While ZooKeeper provides developers a fixed data structure ( the data tree) for building coordination primitives , Tango enables clients to build different data structures based on the same single shared log . Tango also provides transactions across data structures. The state of a Tango object exists in two forms. 1) a history: which is an ordered sequence of updates stored durably in the shared log, 2) any number...

Revisiting the EWDs

- September 19, 2014

Dijkstra was the original hipster. He was blogging before blogging was cool. "For over four decades, he mailed copies of his consecutively numbered technical notes, trip reports, insightful observations, and pungent commentaries, known collectively as EWDs, to several dozen recipients in academia and industry. Thanks to the ubiquity of the photocopier and the wide interest in Dijkstra’s writings, the informal circulation of many of the EWDs eventually reached into the thousands." And, thanks to the efforts of the University of Texas at Austin CS Department, all of these EWDs have been accessible to the public conveniently. I remember when I first discovered the EWDs as a fresh graduate student. I was mesmerized. I read them with a lot of joy. It was as if a new world had opened to me to discover. He had many insightful observations. I recommend all CSE graduate students to read the EWDs to grow their minds. Now, I don't agree with Dijkstra on everything. He was ...

Paper summary: Can a decentralized metadata service layer benefit parallel filesystems?

- September 19, 2014

Parallel filesystems do a good job of providing parallel and scalable access to the data transfer, but, due to consistency concerns, the metadata accesses are still directed to one metadata server (MDS) which becomes a bottleneck. This is a problem for scalability because studies show that over 75% of all filesystem calls require access to file metadata. This paper proposes to adopt ZooKeeper as a decentralized MDS for parallel filesystems and test whether that improves performance. You can ask, what is decentralized about ZooKeeper, and you would be right about the update requests. But for read requests, ZooKeeper helps by allowing any ZooKeeper server to respond while guaranteeing consistency. (You would still need to do a sync operation if the request needs the read to be freshest and satisfy precedence order.) If you recall, ZooKeeper uses a filesystem API to enable clients to build higher-level coordination primitives (group membership, locking, barrier sync). This paper is...

Distributed system seminar talk: Data grouping framework for energy-efficiency in distributed storage systems

- September 12, 2014

My research group and Tevfik's research group meet jointly for a weekly distributed systems. This gives our students a chance to give talks about current project and get feedback for improvement in a friendly setting. In this week's seminar, Luigi presented his research on building energy-efficient file systems. I was initially skeptical about energy-efficiency as a research topic. Academicians like to work on things that they can quantify and improve, so I was thinking that energy-efficiency in distributed storage was an opportunistic research problem, rather than a real-world problem. Turns out, I couldn't be any more wrong: IT companies spend $10 billions every year on energy consumption (This is 3% of entire expenditure of US!). $3.5 billion of that $10 billion is energy expenditure is due to the storage systems. Dynamic power management (DPM) is the primary mechanism for energy saving at the storage systems. DPM basically means turn the disk off if you're not ...

Paper summary: ZooKeeper: Wait-free coordination for Internet-scale systems

- September 12, 2014

Zookeeper is an Apache project for providing coordination services to distributed systems. ZooKeeper aims to provide a simple kernel (a filesystem API!) for empowering the clients to build more complex coordination primitives. In this post I will provide a summary of the ZooKeeper paper , and talk about some future directions I can see this going. "Client" denotes a user of the ZooKeeper service, "server" denotes a process providing the ZooKeeper service, and "znode" denotes an in-memory data node (similar to the filesystem inode) in the ZooKeeper. znodes are organized in a hierarchical namespace referred to as the data tree. There are 2 types of znodes. "Regular": Clients manipulate regular znodes by creating and deleting them explicitly. "Ephemeral": Clients create ephemeral znodes, and they either delete them explicitly, or let the system delete them automatically when the client's session termination. Additionally, when cr...