Wikipedia searches help predict epidemics

The use of social networks such as Twitter, or search engines such as Google, as tools to predict mass behavior is growing steadily. It began as a series of academic experiments, but many companies and organizations are already working with these tools to harness the wisdom of big data: millions of Internet users doing the same thing at the same time has to mean something.

The problem is calibration: working out what a flood of tweets or searches in one direction, at one time and in one place, really means. The latest tool to join the social data party is Wikipedia, after researchers at Harvard Medical School found that its usage can accurately predict, in real time, the arrival of flu outbreaks in the United States.

Since this online encyclopedia is so present in our lives, it seems logical that certain peaks or trends in its use could be a case of where there’s smoke there’s fire. Not for nothing is Wikipedia now the leading source of medical information for patients and health workers alike. If searches about a contagious disease soar on a given day, it is reasonable to suspect that an epidemic is brewing.

The researchers David McIver and John Brownstein focused on the visits received by 35 flu-related entries in the English Wikipedia: from “common cold” to “rush”, through all known varieties of the virus (H1N1, H5N1, etc.) and medicines such as Tamiflu. They collected data covering 294 weeks, during which the articles received an average of 30,000 visits a day, with peaks of 334,000. Cross-checking these data against statistics from the Centers for Disease Control and Prevention (CDC), they found that they could accurately estimate the number of flu cases, with a difference of only 0.27% from the official data.
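The idea of cross-checking page visits against official figures can be illustrated with a minimal sketch. It is not the researchers' model: it swaps in a plain one-variable least-squares fit, and the weekly visit counts and official flu percentages are invented purely for illustration. The names `fit_line` and `mean_abs_difference` are hypothetical helpers.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def mean_abs_difference(estimates, official):
    """Average absolute gap between model estimates and official values."""
    return mean(abs(e - o) for e, o in zip(estimates, official))


# Invented example data (assumption): weekly visits to flu-related
# articles and the official percentage of flu-like illness for the
# same weeks.
weekly_views = [20000, 35000, 60000, 90000, 70000, 40000]
official_rate = [1.1, 1.8, 3.0, 4.6, 3.5, 2.1]

a, b = fit_line(weekly_views, official_rate)
estimates = [a + b * v for v in weekly_views]
gap = mean_abs_difference(estimates, official_rate)
print(f"mean absolute difference: {gap:.2f} percentage points")
```

A real study would fit the model on one set of weeks and measure the difference on weeks the model has not seen; fitting and scoring on the same six points, as here, only shows the mechanics.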

Most importantly, these estimates can be produced in near real time, two weeks ahead of the health authorities, who take that long to make their own from their information systems. This is possible because Wikipedia allows the usage statistics of each entry to be queried and updates them daily, providing plenty of data to researchers who want to use them.

“The main advantage of Wikipedia is that the data are completely open and available to all, so that anyone can create their own models or improve ours”, David McIver tells Matter, in reference to Google Flu Trends (GFT), the search-based tool developed to predict flu outbreaks, which has generated intense academic debate since it started to fail. The data Google uses are known only to Google, whereas Wikipedia's are open access, which makes it possible to do science with them: to reuse them as many times as needed to replicate results or improve on those of others.


One of GFT's weaknesses was that it was very sensitive to the influence of the media: flu-related searches are not only driven by personal need, but also by the tsunami of information, as when global pandemics fill front pages and news programs. “Our model has shown that during times of high media attention, such as the H1N1 swine flu pandemic, the 35 Wikipedia articles we studied were very successful at accurately estimating flu levels at the time”, McIver says.

Until now, searches on Wikipedia have been used to try to make many kinds of predictions, such as box-office hits, by measuring the activity on the entry of a particular film ahead of its premiere. In the case of the flu, however, there is an important limitation: the incidence of the disease cannot be located geographically. Google does not publish its data, but it is known to use the IP addresses of users’ computers to make predictions for specific countries and regions.

If many users consult the German-language Wikipedia article on a film about to premiere, we can assume it will succeed in Germany. But for languages spread much more widely around the world, such as English or Spanish, predictions become more complicated. The Harvard researchers openly acknowledge that this is an important limitation, and yet their flu model worked even though 59% of visits to the English-language articles come from outside the U.S. (11% from the United Kingdom).


For this reason, there have been separate, relatively successful experiments using the social network Twitter, which allows messages to be geotagged, to predict epidemics in real time in specific locations by tracking terms such as “medicine”, “mouth” or “cough”.

On the other hand, Wikipedia articles are not free from the influence of the news agenda: on Friday, after the death of the football coach Tito Vilanova, visits to “parotid” (the gland affected by his cancer) multiplied by more than 100 compared with the usual daily average. Logically, a peak of visits like this will never have epidemiological significance. The point, therefore, is not to rely on the data from this tool (or any other) in isolation, but to use them in conjunction with everything else available.
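A peak of this kind is easy to flag by comparing a day's visits with a recent baseline. The sketch below is only an illustration under stated assumptions: the visit counts are invented, the median of the previous days stands in for the "usual daily average", and the threshold of 10 is an arbitrary choice. The function names `spike_ratio` and `is_news_spike` are hypothetical.

```python
from statistics import median


def spike_ratio(history, today):
    """Ratio of today's visits to the median of recent daily visits."""
    baseline = median(history)
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a ratio")
    return today / baseline


def is_news_spike(history, today, threshold=10.0):
    """Flag days whose visits exceed the baseline by the given factor."""
    return spike_ratio(history, today) >= threshold


# Invented example data (assumption): a quiet week of visits to one
# article, followed by a day dominated by a news event.
recent_days = [110, 95, 120, 100, 105, 90, 115]
news_day = 12000

ratio = spike_ratio(recent_days, news_day)
print(f"visits multiplied by about {ratio:.0f}")
print("flag for review:", is_news_spike(recent_days, news_day))
```

A flag like this says nothing about why visits rose; it only marks days that should be checked against other sources before being read as a health signal.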


“Using this type of data from social media or other websites to make estimates and forecasts is still a science in its infancy”, says McIver. He adds: “We believe that these data hold great promise because of their size, depth and ubiquity, but we are still creating models as we develop the discipline”.

According to the epidemiologist, predictions about public health or diseases made with this type of data should be used alongside traditional surveillance sources such as the CDC and the World Health Organization. “They are not designed to replace them. The ultimate goal is to find a way to unite all these different sources of data to obtain the most accurate and timely picture of public health we can get”.

