Twitter, open the firehose!

Twitter has 30 million users in the US alone. Users tweet about everything that interests or bothers them. These tweets are public, and over 50 million tweets are added to this pool every day. It is no surprise that data mining researchers have been swarming to Twitter data to mine public opinion and trends.

Data mining researchers have a problem, though. At this year's KDD conference (arguably the top conference on data mining), researchers were unanimously vocal about the difficulty of acquiring the data they need from Twitter. They need to learn the Twitter API and other networking tools to acquire the data, and Twitter imposes a strict and very low quota on the data it serves per IP. Of course, Twitter does this to ensure that its system is not overwhelmed by third-party requests and can continue to serve its users. Getting access to more data requires a lengthy whitelisting process from Twitter. But, I guess, Twitter is unlikely to whitelist and serve firehose data to many users due to scalability problems.

Enter my cloud storage project idea for tweets. Wouldn't it be nice to store and serve tweets to data mining researchers worldwide? If Twitter gives us access to the firehose, we can replicate all tweets daily in a cloud storage system, and index and store them for easy sharing. In particular, to make the researchers' job easier, we can store tweets matching the terms that users subscribe to in separate files. For example, a researcher mining data about US politics would be interested in tweets that mention Obama, and should get access to these tweets quickly.
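To make the per-term storage idea concrete, here is a minimal sketch in Python. It assumes tweets arrive as dictionaries with a "text" field and that each subscribed term gets its own JSON-lines file on local disk; the function names, the file layout, and the whole-word matching rule are illustrative assumptions, not a description of an existing system or of the firehose format.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def matching_terms(text, terms):
    """Return the subscribed terms that appear as whole words in the text.

    Matching is case-insensitive. This rule is an assumption made for the
    sketch; a real service would likely need hashtags, phrases, and so on.
    """
    words = set(re.findall(r"\w+", text.lower()))
    return [term for term in terms if term.lower() in words]


def route_tweets(tweets, terms, out_dir):
    """Append each tweet to one JSON-lines file per matching term.

    `tweets` is an iterable of dicts with a "text" key (hypothetical format).
    Returns a dict mapping each term to the number of tweets stored for it.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = defaultdict(int)
    for tweet in tweets:
        for term in matching_terms(tweet["text"], terms):
            path = out_dir / f"{term.lower()}.jsonl"
            with open(path, "a", encoding="utf-8") as f:
                f.write(json.dumps(tweet) + "\n")
            counts[term] += 1
    return dict(counts)
```

With this layout, a researcher subscribed to "Obama" would only need to fetch obama.jsonl instead of filtering the full stream. Opening the file once per tweet keeps the sketch short; a real pipeline would batch writes and write to cloud storage rather than local disk.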

Twitter firehose access would also help us scale up our existing opinion mining projects using Twitter. One example is our Upinion project. Traditional polls offer snapshots of public opinion, and fail to identify breakpoints in public opinion. So in the Upinion project, we consider the problem of identifying breakpoints in public opinion by using and categorizing emotion words in tweets. We develop methods to detect changes in public opinion, and find the events that cause these changes. The details of this work are available in our recent paper: Identifying Breakpoints in Public Opinion.
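As a rough illustration of what detecting a breakpoint from emotion words can look like, here is a toy sketch in Python. It is not the method from the paper: the word lists, the daily score, and the before/after window comparison are all assumptions chosen to keep the example short.

```python
import re

# Tiny illustrative word lists (assumed here, not taken from the paper).
POSITIVE = {"happy", "love", "great", "hope"}
NEGATIVE = {"angry", "hate", "sad", "fear"}


def daily_score(tweets):
    """Score one day's tweets in [-1, 1] from emotion word counts.

    Returns (positive - negative) / (positive + negative), or 0.0 when the
    day contains no emotion words at all.
    """
    pos = neg = 0
    for text in tweets:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def find_breakpoints(scores, window=3, threshold=0.5):
    """Return indices where the mean score shifts by at least `threshold`.

    For each day i, the mean of the `window` days before i is compared with
    the mean of the `window` days starting at i.
    """
    breaks = []
    for i in range(window, len(scores) - window + 1):
        before = sum(scores[i - window:i]) / window
        after = sum(scores[i:i + window]) / window
        if abs(after - before) >= threshold:
            breaks.append(i)
    return breaks
```

Given one score per day, find_breakpoints returns the days at which opinion appears to shift; matching those days against news events would be a separate step.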


Murat said…
This is the reply I got from Twitter:

"Thank you for your interest in the firehose. We have seen a lot of inbound interest in licensing our data and in order to best field requests, please fill out the form at **************** as completely as you can. This will help us get back to you as quickly as possible, based on how your request aligns with our company goals."

I think our chances of getting firehose access are very low; access is tied mostly to company/collaboration interests.
Murat said…
Another blog post on the same topic appeared today and hit the front page of Hacker News.
