DNAnexus, Baylor Project Shows Cloud's Efficacy for Large-Scale Clinical, Research Analysis Studies


This week, DNAnexus and the Human Genome Sequencing Center at Baylor College of Medicine shared details of a cloud-based collaborative analysis project in which they processed whole-genome and whole-exome data from more than 14,000 individuals for a study that aims to understand genetics' contributions to heart disease and aging.

The data was generated at BCM, which is one of five institutions involved in the Cohorts for Heart and Aging Research in Genomic Epidemiology, or CHARGE, consortium — a group whose goal is to facilitate genome-wide association study meta-analyses and replication opportunities among multiple large and well-phenotyped longitudinal cohort studies. In addition to generating and analyzing data for the consortium, researchers at Baylor also developed Mercury, a semi-automated variant calling and annotation pipeline that's now serving as the core variant analysis for CHARGE researchers.

With help from Amazon, DNAnexus and HGSC deployed Mercury on DNAnexus' Amazon Web Services-based platform and used it to analyze more than 3,700 whole genomes and more than 10,000 exomes. According to the partners, the analysis took about 2.4 million core-hours of computational time and, at its peak, ran on more than 20,000 compute cores. They also used the cloud to share both the pipeline and the results of the analysis, around 440 terabytes of data, with the 300 researchers who are involved in CHARGE consortium projects.
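To put those figures in perspective, the short Python sketch below works through some back-of-the-envelope arithmetic. It is illustrative only: the inputs are the rounded lower bounds quoted by the partners, the function names are invented for this example, and the per-sample average lumps genomes and exomes together even though a whole genome takes far more compute than an exome.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# All inputs are rounded values reported by the partners; the helper
# names are hypothetical and exist only for this illustration.

CORE_HOURS = 2_400_000   # "about 2.4 million core-hours"
PEAK_CORES = 20_000      # "more than 20,000 compute cores" at peak
GENOMES = 3_700          # "more than 3,700 whole genomes"
EXOMES = 10_000          # "more than 10,000 exomes"


def min_wall_clock_hours(core_hours: float, cores: int) -> float:
    """Shortest possible run time if every core were busy the whole time."""
    return core_hours / cores


def mean_core_hours_per_sample(core_hours: float, samples: int) -> float:
    """Average compute per sample, ignoring the genome/exome difference."""
    return core_hours / samples


if __name__ == "__main__":
    hours = min_wall_clock_hours(CORE_HOURS, PEAK_CORES)
    per_sample = mean_core_hours_per_sample(CORE_HOURS, GENOMES + EXOMES)
    print(f"Lower bound on wall-clock time at peak: {hours:.0f} hours "
          f"({hours / 24:.0f} days)")
    print(f"Rough average per sample: {per_sample:.0f} core-hours")
```

Because the job only reached 20,000 cores at its peak, the real elapsed time would have been longer than this lower bound; the partners did not report an actual wall-clock figure.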

Scientists from both DNAnexus and HGSC discussed the project, as well as some results from their preliminary analysis of the data, during separate sessions at the American Society of Human Genetics annual meeting in Boston, Mass., this week.

DNAnexus CEO Richard Daly told BioInform that the partnership shows that researchers involved in large-scale studies needn't be handicapped by the limited compute power and storage capabilities of local clusters. It also shows, he said, that DNAnexus can provide the computational resources that scientists need to analyze their data in a timely and cost-effective fashion and to share that information with collaborators in a secure environment.

It also shows that moving large quantities of data to the cloud and deploying tools to run on the infrastructure in an optimized and cost-effective manner can be done smoothly and efficiently. Andrew Carroll, a DNAnexus scientist who worked on the Baylor project, told BioInform that the partners were able to deploy Mercury in a distributed fashion in the cloud quite quickly. He added that they checked the results of the cloud analysis against those generated by running the pipeline on a local cluster to ensure that findings from both platforms matched.

More generally, the project casts the cloud as a viable analysis infrastructure for population-based research and clinical studies, many of which require computational infrastructure to manage and analyze genomes at a scale that "exceeds the capacity of most institutional resources," according to Jeffrey Reid, an assistant professor in BCM's department of molecular and human genetics.

He also highlighted the opportunities that the cloud offers for the genomics community to share analysis software more broadly. "Tools that I wrote when I was in graduate school never propagated out," he told BioInform. That's because it requires a lot of expertise "to take other people's research and development and tools and install them in your own environment," something that small labs might not be able to manage on their own. With infrastructure provided by companies like DNAnexus, "nobody again has to worry about those installation problems," he said. "That’s going to be transformative for the way that these tools … are delivered. It’s a very exciting moment."

DNAnexus is making the Mercury pipeline freely available to customers of its cloud platform. Reid said that the HGSC team will work with DNAnexus to ensure that the updates they make to Mercury at the local level are incorporated into the cloud instance of the tool and made available to the company's customers. "We are committed to keeping the pipeline updated and evolving and doing that on both sides," he told BioInform.

Mercury, which provides tools for analyzing and annotating next-generation sequencing data in research and clinical contexts, is the primary data analysis pipeline that HGSC uses internally for several sequencing-based studies in addition to the CHARGE project.

It's composed of open source software that researchers at Baylor and other institutions developed. These include things like the Burrows-Wheeler Aligner, which handles sequence data alignment and mapping; the Broad's Genome Analysis Toolkit, which is used for base quality recalibrations and local realignment; and SAMTools and Picard.

Mercury uses internally developed software called Atlas to call single nucleotide polymorphisms and insertions and deletions, and then annotates them using another Baylor-made program called Cassandra, which incorporates tools such as ANNOVAR and VCF as well as data sources like dbSNP and the 1000 Genomes Project.

Reid said that his team intends to publish a paper in BMC Bioinformatics that will provide details about Mercury. They're also working on new applications to add to the pipeline, including a tool for discovering mobile element insertions and a quality control program for assessing the performance of capture reagents, he said. These applications will also be included in DNAnexus' instance of the tool.

"We want to [provide] as much and as useful access to the tools as we [at HGSC] have now," particularly for HGSC's collaborators, he said. "Historically they had to come to us because they often don't have a compute cluster, a team of bioinformatics people … or enough storage or compute power, to really run these things."

HGSC "will continue to support people who need us to run these things, but as we [develop] more tools, we will [deliver] them into DNAnexus so that our collaborators can access [them] without necessarily having to involve us," he said.