Tuesday, September 15, 2015

Serving at NSF panels and what it teaches about how to pitch the perfect proposal

NSF is one of the largest funding sources for academic research.  It accounts for about one-fourth of federal support to academic institutions for basic research. NSF accepts 1000s of proposals from researchers, and organizes peer-review panels to decide which ones to fund.

Serving at NSF panels are fun. They are also very useful to understand the proposal review dynamics. NSF funding rates are around 10% for computer science and engineering research proposals, so understanding the dynamics of the panel is useful for applying NSF to secure some funding.

How do you get invited as a panelist? 

You get invited to serve at an NSF panel by the program director of that panel. (Program directors are researchers generally recruited from the academia to serve at NSF for a couple years to run panels and help make funding decisions.)

If you have been around and have successfully secured NSF funding, you will get panel invitations. They will have your name and contact you. But, if you are new, don't just wait. You can email program managers at NSF at your topic, and ask them to consider inviting you as a panelist, because you will need the experience. This doesn't always work, but it helps. I found this from NSF about volunteering as a reviewer at a panel.

Preparation for a panel: Reading, reading, reading, writing reviews

As a panelist, you will be assigned around 8 proposals to review in 3 weeks. Each proposal body consists of 15 pages. So that means a lot of reading in the next couple weeks. And it can get boring, if you just read idly. You should try to read actively, discuss with the proposal, and write notes  as you read the proposal that will help you prepare your review.

Tips on reading papers also apply somewhat to reading proposals. But proposals have their peculiarities. Proposals need to have nicely motivated vision and clearly defined research problems, but they don't need to have all the solutions worked up. Instead of full-fledged solutions, the proposals provide draft solution approaches and competitive advantage insights to attack these research questions.

Pitching a successful proposal is an art. I provide some tips at the end of this post.

NSF panel structure

You travel to NSF headquarters at Washington DC a day before the panel. (NSF pays for your travel and reimburses your stay.) Panels are generally for one and a half day. The first day all the proposals get discussed. (Actually some proposals that don't receive any "Very Good" ratings can be triaged/skipped; for those only the 3 reviewer reports are sent back without a panel discussion summary.)  The second day (i.e., the half day) is for preparing and reviewing the panel discussion summaries of proposals discussed in the first day.

In the panel, everybody is smart and knowledgeable in their fields. Of course some are smarter and more knowledgeable, and it is a treat to listen them discuss proposals. If the panel is in your narrow field of expertise, it is gonna be a geek-fest for you. You will geek out and have a lot of fun. If the panel is in a field you are familiar but is not in your specialized field of expertise, it is still a lot of fun as you get to learn new things.

At the end of first day proposals are sorted into 4 categories: HC (High Competitive), C (Competitive), LC (Low Competitive), NC (Not Competitive). HC proposals get funded. Usually only 1-2 proposals out of around 20+ proposals make it to HC. The C proposals need to get compared and ranked, this generally happens the morning of second day. Only the top 1-2 proposals in C would get funding. LC and NC proposals do not need to get sorted.

The panel does not make the final funding decision; it only provides feedback to NSF to make the final funding decisions. NSF is fairly transparent and mostly goes with panel recommendations. If a proposal is not funded, the proposers still get detailed proposal reviews with the rationale of the panel review. In contrast, some other funding agencies such as DARPA, DOE may not even provide you with reviews/evalutions of your proposal.

NSF panels are tiring. You sit for one a half day and listen and discuss. And in the evening of the first day, you often have homework (hotelwork?) to read some extra proposals to help out in comparing/ranking proposals.  Instead of trying to multitask during the panel (by answering your email, or reading other stuff), it is much better to just participate in the discussion, and listen to the discussion of other proposals, even the ones you have not reviewed. After traveling all that distance to NSF headquarters, you should try to savor the panel as much as possible.

Panel discussions, interesting panel dynamics

Panelists can make mistakes and may have biases. Common mistakes include the following:
  • Panelists may play it safe and prefer incremental proposals over novel but risky proposals. Program managers sometimes warn about this bias, and try to promote high-risk/high-reward proposals.
  • Sometimes panelists may read too much into a proposal. They may like to write/complete the proposal in their heads, and give more credit than deserved to the proposal.
  • Panelists may be overly conformist, resulting in groupthink.
  • There can be some good arguer, a charismatic person that dominates over the other panelists. 

Lessons for pitching the perfect proposal to the panel

Make sure your proposal has a novel research component and intellectual merit. Your proposed project should "advance knowledge and understanding within its own field" and "explore creative, original, or potentially transformative concepts". You need to get at least one panelist excited for you. So stand for something.

Be prepared to justify/support what you stand for. You can fool some panelists some of the time, but you can't fool all of the panelists all of the time. Don't make promises you can't deliver. Show preliminary results from your work. It is actually better to write the proposal after you write an initial interesting (workshop?) paper on the topic.

Target the correct panel and hit the high notes in the CFP. If your proposal falls into the wrong panel, it will get brutally beaten. When in doubt about the scope and field of a CFP, contact the program director to get information. Most academic sub-disciplines/communities will have their biases and pet peeves. You want to target a panel that will get your proposal and is not adversarial to those ideas.

Write clearly and communicate clearly. Remember, the panelists are overworked. They need to review 8 proposals over a short time. It gets boring. So make it easy for the reviewer. Don't make the reviewer do the work. Spell out the contributions and novelty clearly, put your contributions in the context of the literature on that topic. If you make the reviewer work, you will leave him angry and frustrated.

Don't forget to write about the proposal's broader impact including education and minority outreach. If you omit it, it will bite you back.

All being said, there is still a luck factor. An adversarial or cranky panelist may ruin your proposal's chances, or a panelist that loves your work may make your case and improve your chances. The acceptance rate is under 10%. So good luck!

Disclaimer: These are of course my subjective views/opinions as an academician that participated in NSF peer-review panels. My views/opinions do not bind NSF and may not reflect NSF's views/stance.

paper summary: One Trillion Edges, Graph Processing at Facebook-Scale

This paper was recently presented at VLDB15.
A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, "One Trillion Edges: Graph Processing at Facebook-Scale." Proceedings of the VLDB Endowment 8.12 (2015).

This paper is about graph processing. Graphs provide a general flexible abstraction to model relations between entities, and find a lot of demand in the field of big data analysis (e.g., social networks, web-page linking, coauthorship relations, etc.)
You think the graphs in Table 1 are big, but Frank's laptop begs to differ. These graphs also fail to impress Facebook. In Facebook, they work with graphs of trillion edges, 3 orders magnitude larger than these. How would Frank's laptop fare for this? @franks_laptop may step up to answer that question soon. This paper presents how Facebook deals with these huge graphs of one trillion edges.

Apache Giraph and Facebook

In order to analyze social network data more efficiently, Facebook considered some graph processing platforms including Hive, GraphLab, Giraph in the summer of 2012. They ended up choosing Apache Giraph for several reasons: it is open source, it directly interfaces with Facebook's internal version of HDFS and Hive, it is written in Java, and its BSP model is simple and easy to reason about.

(The BSP model and Pregel, which Apache Giraph derived from, was covered in an earlier post of mine. You can read that first, if you are unfamiliar with these concepts. I have also summarized some of the Facebook data storage and processing systems before, if you like to read about them.)

However, chosing Apache Giraph was not the end of the story. Facebook was not happy with the state of Apache Giraph, and extended, polished, optimized it for their production use. (And of course Facebook contributed these back to the Apache Giraph project as open source.) This paper explains those extensions.

Significant technical extensions to Giraph

Several of Facebook's extensions were done in order to generalize the platform. Facebook extended the original input model in Giraph, which required a rather rigid and limited layout (all data relative to a vertex, including outgoing edges, had to be read from the same record and were assumed to exist in the same data source) to enable flexible vertex/edge based input. Facebook added parallelization support that enabled adding more workers per machine, and introduced worker local multithreading to take advantage of additional CPU cores. Finally Facebook added memory optimizations to serialize the edges of every vertex into a byte array rather than instantiating them as native Java objects.

I was more interested in their extensions to the compute model, which I summarize below.

Sharded aggregators

The aggregator framework in Giraph was  implemented over ZooKeeper rather inefficiently. Workers would write partial aggregated values to znodes (Zookeeper data storage). The master would aggregate all of them, and write the final result back to its znode for workers to access it. This wasn't scalable due to znode size constraints (maximum 1 megabyte) and Zookeeper write limitations and caused a problem for Facebook which needed to support very large aggregators (e.g. gigabytes).

(That was in fact a bad use of the ZooKeeper framework, as outlined in this post there are better ways to do it. Incidentally, my student Ailidani and I are looking at Paxos use in production environments and we collect anectodes like this. Email us if you have some examples to share.)

In the sharded aggregator architecture implemented by Facebook (Figure 3), each aggregator is randomly assigned to one of the workers. The assigned worker is in charge of gathering the values of its aggregators from all workers and distributing the final values to the master and other workers. This balances aggregation across all workers rather than bottlenecking the master and aggregators are limited only by the total memory available on each worker. Note that this is not fault-tolerant; they lost the crash-tolerance of ZooKeeper.


Worker and Master Phase Extensions

For the worker-side, the methods preSuperstep(), postSuperstep(), preApplication(), and postApplication() were added. As an example, the preSuperstep() method is executed on every worker prior to every superstep, and can be used in k-means clustering implementation to let every worker compute the final centroid locations just before the input vectors are processed.

Similarly, Facebook added master computation to do centralized computation prior to every superstep that can communicate with the workers via aggregators. This is generally a lightweight task (easily computable without requiring much data analysis) that has a global scope (applies as input to all workers in the next supercomputing step).

Superstep Splitting

When operating on very large scale graphs, a superstep may generate a lot of data to share with other workers (e.g., in the friends-of-friends score calculation), that the output does not fit in memory. Giraph can use disk but this slows things signification. The superstep technique is for doing the same computation all in-memory for such applications. The idea is that in such a message heavy superstep, a worker can send a fragment of the messages to their destinations and do a partial computation that updates the state of the vertex value.

Operational experience

Facebook uses Apache Giraph for production applications, for a variety of tasks including label propagation, variants of PageRank, and k-means clustering. The paper reports that most of Facebook's production applications run in less than an hour and use less than 200 machines. Due to the short execution duration and small number of machines, the chance of failure is relatively low and, when a failure occurs, it is handled by restarts.

The production application workflow is as follows. The developer first develops and unit tests the application locally. Then tests the application on small number of servers on a test dataset (e.g., Facebook graph for one country). Then the application is run at scale on 200 workers. After tests, the application is ready for production use.

Evaluations

This is where Facebook shows off. They ran an iteration of unweighted PageRank on the 1.39B Facebook user dataset with over 1 trillion social connections. They were able to execute PageRank in less than 3 minutes per iteration with only 200 machines.

Conclusions

The paper gives the following as concluding remarks:
First, our internal experiments show that graph partitioning can have a significant effect on network bound applications such as PageRank.
Second, we have started to look at making our computations more asynchronous as a possible way to improve convergence speed.
Finally, we are leveraging Giraph as a parallel machine-learning platform.
These are of course related to what we mentioned recently about the trends in distributed systems research in cloud computing. (Part 1, Part 2)

Links

After I wrote this summary, I found that Facebook already has a nice post summarizing their paper.