One of the premises of Big Data is that it can be “theory free”: rather than starting with a hypothesis (“men at buffets eat more when women are present,” “more people will click this button if I move it here,” etc.) and then gathering data to validate your guess, you just gather a ton of data and look for patterns in it.
The thing is, patterns emerge in every large dataset, without necessarily being representative of a wider statistical truth. Think of the celebrated rise and fall of Google Flu: researchers examined the 45 search terms that were most prevalent where the flu had spread and concluded that these were predictors of flu, but the predictive power turned out to be an illusion. Every place has 45 top search terms, all the time, and some of them will coincide with flu outbreaks, but without a causal theory that you can test, all you know for sure is that you’ve found an instance of correlation, and you have no way to know whether the correlation is coincidence or a newly discovered iron law.
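To make the point concrete, here is a minimal sketch in Python (not anything from Smith's article or from Google Flu itself). It assumes a made-up "flu" series and a pile of made-up "search term" series that are all pure random noise, picks the 45 terms that best track the flu over the first half of the year, and then checks how those same terms do over the second half. The function names, the number of terms and the number of weeks are all hypothetical choices for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spurious_demo(n_terms=2000, n_weeks=52, top_k=45, seed=0):
    """Mine random noise for 'predictors' and test them on unseen weeks.

    Every series here is independent Gaussian noise (an assumption made
    for illustration), so any correlation found is coincidence by design.
    """
    rng = random.Random(seed)
    flu = [rng.gauss(0, 1) for _ in range(n_weeks)]
    terms = [[rng.gauss(0, 1) for _ in range(n_weeks)] for _ in range(n_terms)]

    half = n_weeks // 2

    # "Data mining" step: keep the terms that best track flu in the first half.
    ranked = sorted(
        terms,
        key=lambda t: abs(pearson(t[:half], flu[:half])),
        reverse=True,
    )
    top = ranked[:top_k]

    # Average strength of the mined correlations, in and out of sample.
    in_sample = sum(abs(pearson(t[:half], flu[:half])) for t in top) / top_k
    out_sample = sum(abs(pearson(t[half:], flu[half:])) for t in top) / top_k
    return in_sample, out_sample


if __name__ == "__main__":
    in_sample, out_sample = spurious_demo()
    print(f"mean |r| on the weeks used for mining: {in_sample:.2f}")
    print(f"mean |r| on the held-out weeks:        {out_sample:.2f}")
```

Because the top 45 are selected for looking good, their correlation on the mined weeks is strong even though nothing connects them to the flu series, and it falls back toward chance level on the weeks that were not used for selection.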
Writing in Wired, Pomona College economist Gary Smith (author of books on statistical malpractice like Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics and The AI Delusion) runs down several examples of how theory-free data-mining got its practitioners into trouble, including a celebrated Cornell professor who was forced to resign after telling his grad student to “Work hard, squeeze some blood out of this rock” by looking for patterns in a data-set about buffet eaters.
Big Data is still a useful statistician’s tool, and can be examined to gain intuition that leads to new hypotheses — but those hypotheses then need to be investigated with statistical rigor.
Good research begins with a clear idea of what one is looking for and expects to find. Data mining just looks for patterns and inevitably finds some.
The problem has become endemic nowadays because powerful computers are so good at plundering Big Data. Data miners have found correlations between Twitter words or Google search queries and criminal activity, heart attacks, stock prices, election outcomes, Bitcoin prices, and soccer matches. You might think I am making these examples up. I am not.
The Exaggerated Promise of So-Called Unbiased Data Mining [Gary Smith/Wired]
(Image: Big Data: water wordscape, Marius B, CC-BY)