Part of my research is on crowdsourcing. Basically, crowdsourcing means performing micro-collaborations with many people to complete a task. You divide the task into microtasks and outsource it to people. They provide solutions to your microtasks, and you aggregate those to obtain the solutions to the microtasks, and then ultimately to your task.
Aggregating the responses from the crowd is a challenge of itself. If the questions are asked as open ended questions, the answers would come in a variety of types, and you would not be able to aggregate them automatically with a computer. (You may use human intelligence again to aggregate them, but how are you going to aggregate/validate these next level aggregators?)
To simplify the aggregation process, we use multiple-choice question answering (MCQA). When the answers are provided in choices, a, b, c, or d, they become unambiguous and easier to aggregate with a computer. The simplest solution for aggregation of MCQA is the majority voting: whichever option was chosen most is provided as the ultimate answer.
Recently, we started investigating MCQA-based crowdsourcing in more depth. What are the dynamics of MCQA? Is majority voting good enough for all questions? If not, how can we do better?
To investigate these questions, we designed a gamified experiment. We developed an Android app to let the crowd answer questions with their smartphones as they watch the Who Wants To Be A Millionaire (WWTBAM) quiz show on a Turkish TV channel. When the show is on air in Turkey, our smartphone app signals the participants to pickup their phones. When a question is read by the show host, my PhD students would type the question and answers, which would be transmitted via Google Cloud Messaging (GCM) to the app users. App users play the game, and enjoy competing with other app users, and we get a chance to collect precious data about MCQA dynamics in crowdsourcing.
Our WWTBAM app has been downloaded and installed more than 300,000 times and has enabled us to collect large-scale real data about MCQA dynamics. Over the period of 9 months, we have collected over 3 GB of MCQA data. In our dataset, there are about 2000 live quiz-show questions and more than 200,000 answers to those questions from the participants.
When we analyzed the data we collected, we found that majority voting is not enough for all questions. Although majority voting does well in the simple questions (the first 5 questions) and achieves more than 90% accuracy rate, as the questions get harder, the accuracy of majority voting plummets quickly to 40%. (There are 12 questions in WWTBAM. The question difficulty increases with each question. Questions 10, 11, 12 are seldom reached by the quiz contestants.)
We then focused on how to improve the accuracy of aggregation. How can we weigh the options to give more weight to correct answers and let them win even when they are in the minority?
As expected, we found that the previous correct answers by a participant indicate higher likelihood of being correct in this answer. By collaborating with colleagues in data mining, we came up with a page-rank like solution for history-based aggregation. This solution was able to raise the accuracy of answers to 90% for even the harder questions.
We also observed some unexpected findings from the data collected by our app. Our app collected the response time of the participants, and we saw that the response time has some correlation to correct responses. But the relation is funny. For the easier questions (the first 5), earlier responses are more likely to be correct. But for the harder questions, delayed responses are more likely to be correct. We are still trying to see how we can put this observation into good use.
Another surprising result came recently. One of my PhD students, Yavuz Selim Yilmaz, proposed a simple approach, which at the end provided as effective as the sophisticated history-based solution. This approach did not even use the history of participants, and that makes it more applicable. Yavuz's approach was to /use the interests of participants to weigh their answers/.
In order to obtain the interests of the participants, Yavuz had a very nice idea. He proposed to use the category of the apps installed in the participants phone. Surprised, I asked him how he plans to learn the other apps installed in the participant phones. Turns out this is one of the basic permissions Android gives to an installed app (like our WWTBAM app): it can query and learn about the other installed apps in the users phone. (That it is this easy is telling about Android privacy and security. We didn't collect/maintain any identifying information on users, but this permission can potentially be used for bad.)
Yavuz assigned interest categories to participants using Google Play Store's predefined 32 categories for apps (e.g. Books and Reference, Business, Comics, Communication, Education, Entertainment, Finance). If a participant has more than 5 apps installed in one of these categories, the participant was marked as having interest in that category. We used half the data as training set and found which interest categories produce the highest accuracy for a given question number. Then in the testing set, the algorithm is simply to use majority voting among the category which is deemed most successful for a given question number. Is this too simplistic an approach?
Lo and behold, this approach lifted the accuracy to around 90% across all level of questions. (This paper got the outstanding paper award in the Collaboration Technologies and Systems (CTS 2014) Conference)
Ultimately we want to adopt the MCQA-crowdsourcing lessons we learned from WWTBAM in order to build crowdsourcing apps in location-based recommendation services.
Another application area of MCQA-crowdsourcing would be performing market research. A lot of people in the industry, consumer goods, music, and politics are interested in market research. But market research is difficult to get right, because you are trying to predict if a product can get traction by asking about it to a small subset of people which may not be very relevant, representative. The context and interests of the people surveyed are important in weighing out the responses. (I hope this blog post will be used in the future to kill some stupid patents proposed on this topic ;-)