Saturday, May 26, 2012

A bit of fun...

Can we say something about large societal events by studying the huge amount of online data that is publicly available? Is it possible to forecast the outcome of an election by studying the millions of tweets posted every day? These are very interesting and relevant questions with huge, and sometimes worrying, implications. In recent years many studies have looked at these problems, with mixed results. Many caveats, limitations, and issues have been pointed out, but the proliferation of online communication channels and the increasingly widespread adoption of smartphones make it possible to overcome some of them.

Just to have a bit of fun, we tried an experiment. We considered an extremely popular TV show here in the US: American Idol. The format of the program is very simple. On Wednesday the contestants' performances are aired. At the end of the program the voting window opens, and for two hours people can vote for their favorite by calling, texting, or clicking. The next day the contestant with the fewest votes is announced and eliminated. We tested a very simple idea: the number of tweets related to each contestant during the show and the voting window that follows is proportional to the number of votes that contestant will get. Of course, this is just a first-order approximation of a complex phenomenon. We tested this idea against nine eliminations among the Top 10 contestants, with very nice results.

We then went public, putting the paper containing this analysis on the arXiv. We did it before the season finale (which aired last Tuesday and Wednesday). After the final performances of the top two, we analyzed the new data, and hours before the winner of the season was announced we updated our paper with our prediction of the winner. Considering just the raw count of tweets, Jessica should have won the program. However, considering only the subset of tweets geolocated in the US, we saw that the other contestant (Phillip) was more popular.
Indeed, Jessica has family roots in the Philippines, and a large fraction of the tweets about her came from there. Voting is open only to people in the US. So, discarding the possibility of votes coming from abroad, we claimed that Phillip was going to be the winner. This turned out to be the case: he won the competition.

The experiment clearly shows the potential of these data for predicting and monitoring large-scale events (132 million votes were collected for the finale alone). Moreover, the geolocation of the signal turns out to be crucial for a better understanding of the process. For more details, see http://arxiv.org/abs/1205.4467
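For the curious, the idea can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the paper: the tweet records, the field names `contestant` and `country`, and all the helper functions are hypothetical, and the toy numbers are made up to mimic the finale, not taken from the real data. It counts tweets per contestant, optionally keeps only tweets geolocated in the US, and treats the tweet shares as a first-order estimate of the vote shares.

```python
from collections import Counter


def tweet_counts(tweets, us_only=False):
    """Count tweets per contestant, optionally keeping only US-geolocated ones."""
    counts = Counter()
    for tweet in tweets:
        # Hypothetical field: "country" holds the geolocation of the tweet.
        if us_only and tweet.get("country") != "US":
            continue
        counts[tweet["contestant"]] += 1
    return counts


def estimated_vote_shares(counts):
    """First-order assumption: vote share is proportional to tweet share."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def predicted_winner(counts):
    """The contestant with the most tweets is predicted to win."""
    return max(counts, key=counts.get)


def predicted_eliminated(counts):
    """The contestant with the fewest tweets is predicted to be eliminated."""
    return min(counts, key=counts.get)


# Toy data, invented for illustration only (not the real tweet counts).
toy_tweets = (
    [{"contestant": "Jessica", "country": "PH"}] * 6
    + [{"contestant": "Jessica", "country": "US"}] * 3
    + [{"contestant": "Phillip", "country": "US"}] * 5
)

raw = tweet_counts(toy_tweets)
us = tweet_counts(toy_tweets, us_only=True)

print("All tweets:", predicted_winner(raw), estimated_vote_shares(raw))
print("US only:   ", predicted_winner(us), estimated_vote_shares(us))
```

With these toy numbers the raw counts favor Jessica, while the US-only counts favor Phillip, which is the same qualitative flip described above. Note that a contestant with no tweets at all never appears in the counts, so a real analysis would need to start from the full list of contestants.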
