Big Data Bummers

Big data is "suddenly everywhere" these days, say New York University Professors Gary Marcus and Ernest Davis in the New York Times.

"Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers," they write. But is it time to fire all the researchers and let the machines take over? Nope. Marcus and Davis have some bones to pick with big data, nine bones at least. It turns out that some sticky issues crop up when vast volumes of numbers get crunched.

While big data can be very useful in detecting correlations, it is not very good at distinguishing whether or not those correlations are meaningful. A data analysis might show that the murder rate in the US between 2006 and 2011 correlates well with the loss of market share of Internet Explorer. Those trends are probably not related.
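One way to see how unrelated trends can correlate is a minimal sketch with made-up numbers. The two series below are synthetic and independent (they are not the murder-rate or browser data), and the names `trending_series` and `differences` are hypothetical helpers for this illustration. Both series simply drift downward, which is enough to produce a correlation near 1; comparing the step-to-step changes instead of the raw levels is one common way to check whether anything besides the shared trend is going on.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def trending_series(rng, n_points, slope):
    """Synthetic series: a straight-line trend plus independent noise."""
    return [slope * t + rng.gauss(0, 1) for t in range(n_points)]


def differences(series):
    """Step-to-step changes, which strip out a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
series_a = trending_series(rng, 50, slope=-1.0)
series_b = trending_series(rng, 50, slope=-1.0)

# Raw levels: dominated by the shared downward drift.
r_levels = pearson_r(series_a, series_b)
# Changes: the drift is removed, leaving only unrelated noise.
r_changes = pearson_r(differences(series_a), differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The correlation of the levels comes out very high even though neither series has anything to do with the other, while the correlation of the changes is far weaker.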

Big data might help support or expand scientific inquiry but it is not a replacement, they write.

"Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence…. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis," they write.

Big data also can be gamed. Take the example of programs used in grading student essays, which examine sentence length and word sophistication and have correlated well with the grades given by humans. If those programs replaced humans, what would stop students from simply writing essays that are full of long sentences and fancy words, but otherwise poorly written?

Some big data success stories have lost their luster upon later review. The Google Flu Trends that seemed so exciting a few years back – even beating the CDC at detecting flu spread – now has made more bad predictions than good ones two years running. Search engines change all the time, and comparing data from one year against outputs from a previous edition of the engine may not be very useful.

Big data also can succumb to "the echo-chamber effect," because so much of it comes from the web, and various sources like Google Translate may feed into others like Wikipedia, which then may be fed back into Google Translate, and so on. That could create and magnify errors in the data, and those errors could be difficult to sift out and account for, Marcus and Davis say.

There also is a risk of "too many correlations," which can arise from searching a data set over and over until correlations turn up that appear to be statistically significant but in reality reflect nothing more than chance.
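The "too many correlations" risk can be sketched with a small simulation. This is an illustration under stated assumptions, not anything from Marcus and Davis: the function name `count_spurious_hits` is hypothetical, all data are random noise, and the cutoff of 0.361 is the approximate critical value of the correlation coefficient for 30 data points at the conventional p < 0.05 level. Testing one target against 1,000 unrelated series should flag roughly 5 percent of them by chance alone.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(n_series=1000, n_points=30, r_critical=0.361, seed=0):
    """Count pure-noise series that pass a p < 0.05 correlation cutoff.

    r_critical is an assumed, approximate two-tailed cutoff for n_points=30;
    it would need to change if n_points changes.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_series):
        # Each candidate is generated independently of the target,
        # so any "significant" correlation is a false positive.
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson_r(target, noise)) > r_critical:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious_hits()
    print(f"{hits} of 1000 unrelated series look 'significant' at p < 0.05")
```

Every hit here is meaningless by construction, which is why analysts who run many tests on one data set typically tighten the significance threshold or confirm findings on fresh data.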

"Big data is here to stay, as it should be. But let’s be realistic: It’s an important resource for anyone analyzing data, not a silver bullet," they conclude.
