Google, Goldman, and GWAS


The US National Security Agency plans to begin operating a new, 1-million-square-foot data center in Bluffdale, Utah, by September 2013. The Utah Data Center is designed to analyze every nook and cranny of the Web as well as all the data passing through cellular networks at any given time. In roughly five years, the amount of data passing through the Internet could reach upwards of 950 exabytes (950 billion gigabytes) a year, so the data center's designers have built the facility to handle a yottabyte of data, or a quadrillion gigabytes.
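
For readers who want to check the unit arithmetic, here is a minimal sketch that assumes decimal (SI) units, where a gigabyte is 10^9 bytes. The constant and function names are illustrative only, and the final ratio is simply the two quoted figures divided, not a forecast of how the facility will be used.

```python
# Decimal (SI) storage units, expressed in bytes. Assumption: the article's
# figures use powers of ten rather than binary (1024-based) units.
GIGABYTE = 10**9
EXABYTE = 10**18
YOTTABYTE = 10**24


def to_gigabytes(n_bytes: int) -> int:
    """Convert a byte count to whole gigabytes."""
    return n_bytes // GIGABYTE


# 950 exabytes of annual Internet traffic, in gigabytes.
annual_traffic_gb = to_gigabytes(950 * EXABYTE)

# One yottabyte of capacity, in gigabytes.
capacity_gb = to_gigabytes(YOTTABYTE)

# Illustrative ratio only: years of traffic at that rate per yottabyte.
years_to_fill = YOTTABYTE / (950 * EXABYTE)

print(f"{annual_traffic_gb:,} GB per year")
print(f"{capacity_gb:,} GB of capacity")
print(f"about {years_to_fill:,.0f} years of traffic at that rate")
```

Under these assumptions, 950 exabytes works out to 950 billion gigabytes and a yottabyte to a quadrillion gigabytes, matching the figures above.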

On a smaller scale, companies like Amazon, Facebook, and Google regularly process and analyze petabytes of multi-layer datasets using big data analytics. They employ advanced machine learning techniques to identify patterns and create predictive models or Bayesian networks — visual depictions of probability distributions — that establish correlations as well as cause-and-effect relationships.

With genomics researchers bemoaning the ever-growing amount of data produced by next-generation sequencing and the challenge of integrating that data with electronic medical health records to support personalized medicine, is it time for bioinformatics to take a cue from how bigger data users think about analytics?


With currently available sequencing technologies producing several terabytes of raw data per patient, the data management and analysis challenges for the clinic are hardly trivial. To be useful at the bedside, that data must be integrated with electronic medical health records, insurance records, and pharmaceutical data. Traditional relational databases will not be enough when billions of clinical data points have to be made available for every patient who comes in and out of a hospital or clinic. Integrating and analyzing information on the fly, including everything from an individual's genome and indications of a particular drug response to which doctor administered the treatment, will require a significant overhaul of biomedical and healthcare informatics.

"People talk about the data tsunami in the bio world, but for the most part, it's not really that bad — just look at the quantities of data that people have been routinely using in financial trading or e-commerce. This is well beyond the amount of data that's being collected during clinical trials," says Colin Hill, CEO of Gene Network Sciences Healthcare. "Now next-gen sequencing is starting to push up the amounts of data, but it's still pretty early — you're talking about small amounts of data relative to the amounts Amazon and Facebook process all the time in terms of statistical analysis."

Big analytics

Hill's company is a biomedical analytics firm that has developed a system — similar to the type used in e-commerce — for analyzing and integrating large amounts of electronic health records and genomics data. Based on a variation of Bayesian networking theory, GNS Healthcare's reverse-engineering and forward simulation, or REFS, analytics platform uses high-performance computing to create causal network models and perform simulations using that data. REFS works by breaking datasets into trillions of pieces and applying probabilistic scores to each piece to establish relationships between data points, and then simulating possible outcomes.
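
The general idea of scoring candidate relationships and then simulating outcomes can be illustrated with a toy example. The sketch below is not the REFS platform or GNS Healthcare's method. It assumes a hypothetical three-variable chain (variant, expression, response) with made-up probabilities, scores candidate parent sets with a plain maximum-likelihood log score, and forward-simulates an intervention. Real systems would typically add a complexity penalty or a Bayesian score and search over far larger spaces of structures.

```python
import math
import random
from collections import Counter


def simulate(n, rng, force_expression=None):
    """Forward-sample n hypothetical patients from variant -> expression -> response.

    All probabilities are invented for illustration. Passing force_expression
    fixes the expression variable, mimicking an intervention.
    """
    rows = []
    for _ in range(n):
        variant = rng.random() < 0.3
        expression = rng.random() < (0.8 if variant else 0.2)
        if force_expression is not None:
            expression = force_expression
        response = rng.random() < (0.7 if expression else 0.1)
        rows.append({"variant": variant, "expression": expression, "response": response})
    return rows


def log_likelihood(rows, child, parents):
    """Maximum-likelihood log score of how well `parents` explain `child`."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)
    return sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())


rng = random.Random(0)
data = simulate(5000, rng)

# Score three candidate explanations for the response variable.
score_none = log_likelihood(data, "response", [])
score_variant = log_likelihood(data, "response", ["variant"])
score_expression = log_likelihood(data, "response", ["expression"])

# Forward simulation: compare the observed response rate with the rate
# when expression is forced on for every simulated patient.
baseline_rate = sum(r["response"] for r in data) / len(data)
forced = simulate(5000, rng, force_expression=True)
forced_rate = sum(r["response"] for r in forced) / len(forced)

print(f"scores: none={score_none:.1f} variant={score_variant:.1f} "
      f"expression={score_expression:.1f}")
print(f"response rate: baseline={baseline_rate:.3f} forced={forced_rate:.3f}")
```

In this toy setup the direct parent (expression) scores higher than the more distant variable (variant), and forcing expression on raises the simulated response rate, which is the kind of "what if" question a causal network model is meant to answer.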


In a paper published in PLoS Computational Biology in March 2011, researchers from GNS and Biogen reported that the REFS platform proved effective at identifying novel therapeutic intervention points from multi-layered data that included sequence variations, gene expression data, and standard clinical measurements of drug effectiveness. The analysis covered 77 arthritis patients who did not respond to TNF-α blockade, a commonly used anti-inflammatory treatment regimen. The intervention points the platform identified may lead to the development of alternatives to TNF-α blocker treatments.

"This whole 'big data' thing is really about the analytics. Even though people get so excited about the tools for generating the data and storing the data and accessing and visualizing the data, why are we collecting the data in the first place? To get knowledge from the data," Hill says. "And to do that, we need a next generation of analytics so we're focused on how to turn data into models to predict the behaviors of human systems and disease so clinicians can intervene in the right way. It requires a different type of analytics from what is commonly used in bioinformatics."

Tricks of the trade

An advisory committee headed up by Eric Schadt, chair of genetics and genomics at Mount Sinai School of Medicine in New York and chief scientific officer at Pacific Biosciences, is bringing together expertise in big data information management from a range of areas, including social networking and quantitative financial trading. The goal of the committee is to map out ways to apply these varied techniques for large-scale, multilayered data analysis to a clinical, personalized genomics setting. In addition to computer scientists, the group includes Cognizant CEO Francisco D'Souza, whose company provides solutions for businesses to manage information, and a team of quantitative trading experts at Goldman Sachs, who developed the firm's quantitative trading models.

"What the traders want to do is take all the data on the planet and build models so they know when to place a bet. In a hospital setting, we want to integrate all of our genomics and medical records information to decide how best to treat that patient," Schadt says. "So we formed an advisory board of experts of people across that domain, and got people on a consulting basis like Jeff Hammerbacher, who did all the HPC compute stuff for Facebook that allows them to mine social networking data. We're leveraging many of those different domains to form best practices for us at Mount Sinai."

However, it might not be as simple as implementing, for example, Amazon's recommendation algorithms, which use natural language processing and machine learning to mine user behavior patterns and predict what a customer may want to purchase. While this type of basic pattern matching has a lot in common with how genome-wide association studies are structured, the world of e-commerce is much simpler than that of GWAS. Analyzing biomedical data demands greater precision, and medical records contain more variation in the language used to describe and annotate critical data points.

"There's some fundamental differences to big data when it's in e-commerce or consumer behaviors versus the world of healthcare," Hill says. "The type of data is different. ... But with genomics, you have a very different shape of data matrix that has many more dimensions."

Not only is a hodgepodge of genomics and electronic health records data as complex as, if not more complex than, the types of data e-commerce uses, but the ways in which biomedical information is stored are also highly formalized and structured. This can make the data difficult to mine and is, in a sense, a world away from the freewheeling way that Google accesses all the right information for a single query.


"You have this electronic medical records system, it's heavily transaction phased, typically built on Oracle using classical, relational-database designs, and it's very complicated to get data out in meaningful ways and actually mine," Schadt says. "So one of the things that companies like Google have mastered — that we're trying to learn from — is how to actually deconstruct that kind of data to make it way more accessible."

Facebook and Google can operate the types of huge data warehouses that a typical hospital or clinic would be unlikely to afford. But could cloud computing help make up for hospitals' shallower coffers?

"It goes beyond just gaining access to those kinds of architectures to be able to manage and compute on data. I don't want to diminish the impact cloud computing can have in this space, but it's absolutely just one leg of the stool and probably not the most important leg," Schadt says. "This takes fundamentally rethinking how you want to organize that information and how you want to query it."
