Saturday, September 25, 2010

Pig Latin: A Not-So-Foreign Language for Data Processing

I will be honest: I haven't read the week-3 papers. But this does not stop me from writing a short summary of the Pig Latin paper based on the notes I took during its class discussion. So I apologize in advance if there are mistakes in my review.

(The second week-3 paper was Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Unfortunately we had little time to discuss this paper, and I also had to leave early, so no review for that paper. I didn't understand much, so I will just quote from its abstract: "A Dryad application combines computational vertices with communication channels to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs.")

Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo Research)

Pig Latin is a front end to mapreduce. Pig Latin provides support for chains of join and split operations, which mapreduce does not offer natively (I guess this means you have to think about how to program a join or split job in mapreduce yourself). The Pig Latin program is then translated to efficient mapreduce code to be executed in the background. Pig Latin is closer to a procedural language than SQL is, which makes it easier for programmers to use and gives them more control.
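To make the "procedural, step by step" point concrete, here is a minimal sketch in plain Python (not actual Pig Latin, and not how Pig is implemented). It mimics the style of naming each intermediate result: filter, then group, then aggregate per group. The data and the helper names filter_rows, group_by, and foreach_group are hypothetical, made up for illustration.

```python
from collections import defaultdict

# Hypothetical input rows: (url, category, pagerank). Made-up data.
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.5),
    ("c.com", "sports", 0.125),
    ("d.com", "sports", 0.625),
    ("e.com", "blogs", 0.25),
]

def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def group_by(rows, key):
    """Collect rows into a dict keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

def foreach_group(groups, fn):
    """Apply fn to the list of rows in each group."""
    return {k: fn(rows) for k, rows in groups.items()}

# Each step is a named intermediate result that can be inspected on its own.
good_urls = filter_rows(urls, lambda r: r[2] > 0.2)
by_category = group_by(good_urls, lambda r: r[1])
avg_rank = foreach_group(
    by_category, lambda rows: sum(r[2] for r in rows) / len(rows)
)

for category, rank in sorted(avg_rank.items()):
    print(category, rank)
```

In SQL the same computation would be one declarative query. Here the programmer fixes the order of the steps, which is the extra control (and the extra responsibility) that the procedural style brings.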

I am not convinced that mapreduce is unnatural for join and split operations. I think a reduce worker is already doing a join. And in the couple of examples the paper presents, mapreduce consistently uses fewer steps than the corresponding Pig Latin code.
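Here is what I mean by a reduce worker doing a join, as a rough single-machine sketch in Python. This is the usual "reduce-side join" idea as I understand it, not code from the paper: the map step tags each record with its source, the shuffle groups records by key, and the reduce step pairs up the two sides. The datasets and the function names map_phase, shuffle, and reduce_join are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inputs keyed by user name. Made-up data.
visits = [("alice", "a.com"), ("bob", "b.com"), ("alice", "c.com")]
ages = [("alice", 30), ("bob", 25), ("carol", 41)]

def map_phase(records, tag):
    """Emit (join_key, (source_tag, value)) for every input record."""
    return [(key, (tag, value)) for key, value in records]

def shuffle(pairs):
    """Group the tagged values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    return grouped

def reduce_join(key, tagged_values):
    """Split the values by source, then emit every left/right combination."""
    left = [v for t, v in tagged_values if t == "visits"]
    right = [v for t, v in tagged_values if t == "ages"]
    return [(key, l, r) for l, r in product(left, right)]

pairs = map_phase(visits, "visits") + map_phase(ages, "ages")
joined = []
for key, tagged in shuffle(pairs).items():
    joined.extend(reduce_join(key, tagged))

print(sorted(joined))
```

So the join is doable, but note how much of the code is tagging and untagging bookkeeping that the programmer has to write by hand. That may be what the paper means when it says joins are not native to mapreduce.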

Here is my reservation about Pig Latin. Mapreduce did not hide parallelism: you still have to think about and plan for parallel execution. In contrast, Pig Latin completely hides parallelism. So this may make the programmer lazy, so that they do not think about or plan for parallel execution and end up writing very inefficient code. And I don't think Pig Latin can optimize badly written code. So how bad can this get in the worst case? 10 times slower? Does someone have an example of how bad this can get? (Maybe putting a long task in FOREACH could make for very inefficient code.) Somebody in the class said this after my question: "Only a good mapreduce programmer would make a good Pig Latin programmer." I also heard that if you download Pig Latin from the Hadoop website and use it with the default configuration, it is very slow. You have to tune it to get normal efficiency.

I think the most useful thing about Pig Latin is that it provides an iterative debugging environment for developing code. The user takes an initial stab at writing a program, submits it to the system for execution, inspects the output, and repeats the process.
