ASHG: Data Science to Help Genomics Move from 'Artisanal' to 'Factory'

SAN DIEGO (GenomeWeb) – While genomics has shown promise at a small scale for matching patients to treatments, scaling that capability up so that personalized medicine can be realized for all will require sifting through far more data, speakers at this year's American Society of Human Genetics meeting said.

"How can we apply those breakthroughs in data technology to help with the transition from world of one to a world of millions?" asked Google's David Glazer, referring to the increasing number of people who have undergone genome sequencing, at ASHG.

Genomics is generating a storm of data, not just in terms of sequencing reads coming off of newer and faster machines, but also in terms of sheer research output as more journal articles showing links between variants and disease are published. At the same time, groups are working on building data standards to facilitate the sharing of clinical and genomic data.

IBM's Ajay Royyuru also noted at ASHG that between 6,000 and 10,000 articles mentioning cancer are published each year — far more than anyone can realistically read, even though researchers and clinicians need to keep up to date to find the best treatment for patients.

"This is a problem that really deserves help," Royyuru said.

The key requirements for such a process, Royyuru said, are that it be comprehensive and objective as well as scalable and fast. Additionally, he said, it has to be transparent and show the reasoning that led it to its conclusions.

He and his colleagues at IBM are turning to the supercomputer Watson to digest those papers and assess how their findings may relate to patients.

Through the Precision Oncology workflow he and his colleagues developed, patient sequencing data is fed into Watson, which then compares it to what's housed in databases like PubMed, the National Cancer Institute's Pathway Interaction Database, and DrugBank, among others. From this, Watson develops a conceptual model of the disease and outputs a set of treatment options. It also provides the reasoning behind those possible therapies, which may then be presented to a tumor board, for example.
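As a rough illustration of what such a matching step might look like, the minimal Python sketch below scores candidate therapies against a patient's variants using a toy evidence table. The Watson pipeline itself is not public; the table, drug names, and scoring here are hypothetical stand-ins for curated sources like PubMed and DrugBank.

```python
# Hypothetical sketch of a variant-to-therapy matching step, loosely
# modeled on the workflow described above; none of this reflects
# Watson's actual implementation.
from collections import defaultdict

# Toy knowledge base: variant -> list of (drug, evidence note) pairs,
# standing in for PubMed, pathway, and drug databases.
EVIDENCE = {
    "BRAF V600E": [("vemurafenib", "inhibits mutant BRAF kinase"),
                   ("dabrafenib", "inhibits mutant BRAF kinase")],
    "EGFR L858R": [("erlotinib", "targets activating EGFR mutation")],
}

def rank_therapies(patient_variants):
    """Collect supporting evidence per drug, rank by amount of support."""
    support = defaultdict(list)
    for variant in patient_variants:
        for drug, note in EVIDENCE.get(variant, []):
            support[drug].append(f"{variant}: {note}")
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))

for drug, reasons in rank_therapies(["BRAF V600E", "EGFR L858R"]):
    print(drug)
    for reason in reasons:  # the reasoning a tumor board would review
        print("  -", reason)
```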

The process of generating a report takes five to 10 minutes, he said.

Additionally, Royyuru said, Watson would learn from the process as data regarding the treatment given to a patient and that patient's response are fed back in.
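Continuing the same hypothetical sketch, the feedback step could be as simple as nudging per-drug evidence weights as outcomes arrive; again, this illustrates the idea of the loop, not IBM's actual approach.

```python
# Hypothetical feedback loop: fold observed patient outcomes back into
# per-drug weights so that later rankings reflect real-world responses.
WEIGHTS = {"vemurafenib": 1.0, "dabrafenib": 1.0, "erlotinib": 1.0}

def record_outcome(drug, responded, step=0.1):
    """Raise a drug's weight on response, lower it on non-response."""
    WEIGHTS[drug] = max(0.0, WEIGHTS[drug] + (step if responded else -step))

record_outcome("vemurafenib", responded=True)   # weight rises to 1.1
record_outcome("erlotinib", responded=False)    # weight falls to 0.9
```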

Currently this pipeline is a prototype that IBM is working on in conjunction with the New York Genome Center, and Royyuru added that IBM plans on recruiting additional beta testers next year.

In addition to Watson, other data technologies and expertise from the computer science field could be refashioned to analyze genomic data.

Companies like Google have experience working with large amounts of data. For instance, Glazer noted that 100 hours of video are uploaded to YouTube every minute and that the number of Gmail users is 150 times the number of US PhDs.

He and his colleagues also have begun to test their tools — like Dremel and BigQuery — on genomic data from the 1000 Genomes Project. The first step of a principal component analysis of 1000 Genomes Project data is to build a similarity matrix, which takes, he said, about two hours on 60 eight-core machines.
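Glazer's numbers refer to a Dremel/BigQuery computation distributed across those machines; as a single-machine illustration of the underlying math (assuming the standard genotype-matrix formulation, which his remarks do not spell out), a short NumPy sketch:

```python
# Single-machine sketch of the similarity-matrix step Glazer described.
# A genotype matrix (samples x variants, allele counts 0/1/2) is centered
# and multiplied by its own transpose, giving a sample-by-sample
# similarity (Gram) matrix whose eigenvectors are the principal
# components. The distributed version shards this same computation.
import numpy as np

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(500, 10_000)).astype(float)  # toy data

centered = genotypes - genotypes.mean(axis=0)            # center per variant
similarity = centered @ centered.T / centered.shape[1]   # samples x samples

# Eigendecomposition of the symmetric matrix yields the components;
# eigh returns eigenvalues in ascending order, so reverse for the top ones.
eigvals, eigvecs = np.linalg.eigh(similarity)
top_pcs = eigvecs[:, ::-1][:, :10]
print(top_pcs.shape)  # (500, 10)
```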

Being able to quickly pose research questions one after another is part of what's needed for innovation, Glazer added.

Still, he noted that to move genomics and personalized medicine from its current "artisanal" status to "factory" mode, better standards are needed. The Global Alliance for Genomics and Health, whose members include Google and organizations like BGI-Shenzhen, Genome Canada, the US National Institutes of Health, and the Wellcome Trust, is working on developing such standards to improve interoperability and enable data sharing. The group also is working with the Genome in a Bottle Consortium to develop benchmarking references.

Glazer believes these efforts will lead to more innovation to analyze and explore data.