HPTS trip report (day 2)

This is a belated wrap-up of my HPTS trip. If you haven't already, read the first part of the trip report here.

The first day of HPTS had been busy, productive, and tiring. Since my body clock was still on East Coast time, I was wide awake at 6am again. I went for a longer run this time: 5 miles along the beach on the aptly named Sunset Drive. Asilomar, on Monterey Bay, is an eerily beautiful piece of earth.

After breakfast, HPTS started with the data analytics session. The night before, it had been announced that Chris Re (from Stanford) won a MacArthur genius award. Chris was scheduled to talk in this first session of HPTS day 2, and his talk was very engaging. He talked about the DeepDive macroscopic data system his group has been working on.

DeepDive is a system that reads texts and research papers and collects macroscopic knowledge about a topic, with a quality that exceeds paid human annotators and volunteers. This was surprising news to me; another victory for the singularity team. (I had written about the singularity before. It is coming.)

They demonstrated this system by extracting paleobiological facts to build a high-coverage fossil record, reading/curating the research papers published on the topic. There have been human volunteers (are there any other kind of volunteers?) working on the paleo fossil project: it took 329 volunteers 13 years to go through 46K documents and produce fossil record tables. The PaleoDeepDive system processed 10x the documents and produced 100x the extractions with better precision and recall than the humans. All this in 45 minutes of runtime (over 80 cores and 1 TB of RAM, using SQL processing and statistical operations), instead of two decades of human reading.
Holy cow!

Imagine this being applied to, or integrated with, Google Scholar searches. Wow! That would be transformative. There are already some developments toward smarter (more semantics-based) indexing/search of research papers.

I am not a machine learning guy, but from what I gathered, DeepDive uses a probabilistic bootstrapping approach to refine its guesses and make them precise. DeepDive uses distant/weak supervision: the user writes some crappy rules that serve as crappy training data, and DeepDive progressively denoises the data and learns more.
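To make the distant/weak supervision idea concrete, here is a toy sketch (not DeepDive's actual API; the rules, sentences, and function names are all made up). A few "crappy" rules each vote on whether a sentence reports a fossil mention (+1), doesn't (-1), or abstain (0). A weighted vote, with each rule's weight re-estimated from its agreement with the consensus, crudely denoises the rule outputs — a stand-in for the statistical inference a real system performs.

```python
# Toy weak-supervision sketch (hypothetical rules/data, not DeepDive's API).
# Each rule labels a sentence: +1 (fossil mention), -1 (not), 0 (abstain).

def rule_latin_binomial(s):
    # Crude cue: a "-us"-ending capitalized word followed by a lowercase
    # word, e.g. "Tyrannosaurus rex".
    words = s.split()
    for a, b in zip(words, words[1:]):
        if a[:1].isupper() and b.islower() and a.lower().endswith("us"):
            return +1
    return 0

def rule_says_fossil(s):
    return +1 if "fossil" in s.lower() else 0

def rule_is_question(s):
    return -1 if s.strip().endswith("?") else 0

RULES = [rule_latin_binomial, rule_says_fossil, rule_is_question]

def denoise(sentences, n_iters=5):
    """Alternate between voting on labels and reweighting rules by
    how often each rule agrees with the consensus when it votes."""
    weights = [1.0] * len(RULES)
    labels = [0] * len(sentences)
    for _ in range(n_iters):
        for i, s in enumerate(sentences):
            score = sum(w * r(s) for w, r in zip(weights, RULES))
            labels[i] = +1 if score > 0 else (-1 if score < 0 else 0)
        for j, r in enumerate(RULES):
            votes = [(r(s), y) for s, y in zip(sentences, labels) if r(s) != 0]
            if votes:
                weights[j] = sum(v == y for v, y in votes) / len(votes)
    return labels

sentences = [
    "Tyrannosaurus rex fossils were recovered from the formation.",
    "Is this section about trilobites?",
    "The fossil record table lists Stegosaurus stenops occurrences.",
]
print(denoise(sentences))  # -> [1, -1, 1]
```

The point is that no rule is trusted individually; the aggregate, reweighted signal is what serves as (noisy but usable) training data.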

They are using DeepDive in other domains to accelerate science, such as drug repurposing and genomics. Another interesting application of DeepDive is fighting human trafficking by analyzing the dark web. Here is a talk by Chris on the DeepDive system. (Not from the HPTS venue.)

After Chris's talk, Todd Mostak, CEO of MapD, gave a talk about a GPU-powered end-to-end visual analytics platform for interactively exploring big datasets. Todd was Sam Madden's PhD student, and his startup evolved from his thesis. His demo looked interesting and fluid/quick.

The other sessions were on Big Data platforms, distributed system platforms, and storage. The distributed system platforms session was dominated by talks about containers and Docker.

So what was all the buzz about at HPTS'15? It was about NVRAMs. NVRAMs are becoming more widely available off the shelf, which is leading us toward new applications and design choices. RDMA was attracting similar excitement. Of course, there was also a lot of discussion about high-performance transaction systems (the HPTS of the workshop's name), and some about in-memory databases.

Overall, HPTS was a great experience. I am amazed by the sheer volume and quality of interactions HPTS cultivated through hallway conversations. Keeping the workshop limited to at most 100 participants helped, and so did providing ample opportunities for conversations. I met/talked to at least half of the 100 workshop attendees over the two days, and I am a shy person. Compared with other conference experiences, the difference is stark. I hope more conferences can learn from HPTS and improve their interaction volume and quality. That is why we travel many hours to go to conferences, after all.

