NEW YORK (GenomeWeb) – The Centers for Common Disease Genomics (CCDG) program has created a standardized pipeline to help its scientists analyze data from tens of thousands of genomes across multiple institutions.
The National Institutes of Health's National Human Genome Research Institute funded the initiative last year with a grant of $280 million, to be distributed over a four-year period. Its four main centers are housed at the Broad Institute of MIT and Harvard, Baylor College of Medicine's Human Genome Sequencing Center, the New York Genome Center (NYGC), and the McDonnell Genome Institute at Washington University School of Medicine in St. Louis. However, smaller sequencing efforts are happening at other institutions alongside the research at the four main centers.
One of the overarching goals of the project is to discover new variants that underlie common diseases, with an initial focus on cardiovascular, autoimmune, and psychiatric diseases, Ira Hall, one of the CCDG's principal investigators from Washington University and associate director of the McDonnell Genome Institute, said in an interview.
Each center has an area of disease expertise, which means that not all of the centers are working on all three focus areas. Researchers at WashU, for example, are primarily focusing on sequencing cohorts with coronary artery disease and type 1 diabetes, while researchers at the NYGC are generating and processing whole-genome sequence data to study psychiatric disorders and asthma. Similarly, Baylor and Broad Institute researchers have certain areas of focus.
Like pioneering genomics efforts such as the 1000 Genomes Project and the Exome Aggregation Consortium, the CCDG is working to sequence a high volume of whole genomes. Over the lifetime of the project, the researchers hope to sequence between 150,000 and 200,000 whole genomes from a diverse array of sources.
"None of the funding [for the CCDG] is going towards the collection of new samples," Hall said. Instead, the consortium is collaborating with the medical community to sequence samples from already established cohorts. For example, the NYGC is studying genetic risk factors for autism and its researchers obtained the bulk of its samples from the Simons Foundation Autism Simplex Collection.
The researchers are also putting a strong emphasis on the ethnic diversity of sequencing samples they will be including, and aim to include as many samples as possible from people not of European descent. WashU, for example, has already begun sequencing as many samples as possible from people of African American descent that fit within the requirements for its cardiovascular disease study. To date, the WashU group has been sequencing samples donated from a number of institutions, including Mt. Sinai Health System, Duke University, and the University of Pennsylvania, Hall said. The researchers also aim to include samples from people of Latino, Hispanic, and Asian descent in their study to ensure the sequences are representative of the larger population.
The first hurdle the CCDG researchers had to overcome was to standardize the sequencing analysis. While raw sequencing data is generated in essentially the same way at most institutions, the set of variants used for downstream analysis can differ significantly, said Benjamin Neale, an assistant professor in the Analytic and Translational Genetics Unit at Massachusetts General Hospital and the Broad Institute.
"We haven't harmonized everything up to that point, but rather have harmonized the most compute-intensive, and by extension costly components of the pipeline," he explained.
Neale noted that a critical component of standardization was ensuring that all of the institutions used the same cutoff for quality scores. "Quality scores help us determine how good the evidence is for every single base call from the sequencer," he said.
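For readers unfamiliar with quality scores, the short Python sketch below shows the standard Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10). It is an illustration only: the function names are hypothetical, the ASCII offset of 33 assumes the common Sanger/Illumina 1.8+ FASTQ encoding, and nothing here reflects the specific cutoff or tooling the CCDG centers agreed on.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    The offset of 33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # Print the error probability for each base in a short quality string.
    for q in decode_fastq_qualities("I5!"):
        print(q, phred_to_error_prob(q))
```

Under this convention, a score of 30 means roughly a one-in-1,000 chance that the base call is wrong, which is why an agreed cutoff matters when data from different centers are combined.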
This first step of analyzing genome sequences (read alignment to the reference genome, sorting, duplicate marking, base quality score recalibration, and various other steps that precede variant calling) is where 70 to 80 percent of analysis costs come from, Hall said.
Generally in whole-genome sequencing data analysis projects, researchers have to ensure that all of the results are comparable to one another. This can be tricky when comparing sequences generated by different institutions, but most researchers have previously solved that problem by reprocessing genome datasets using the same pipeline to ensure comparability.
But that's not feasible in a large-scale project like this one, Hall said. Reprocessing tens of thousands of datasets is extremely costly and would require researchers to put in a lot of extra hours to do work that has already been done once.
"This has been a problem in the genomics community for a long time," Hall said. Most institutions spend a great deal of time and money to create their own pipelines, so it's difficult to put something entirely new in place, he added.
To begin the process of standardizing the pipeline across the institutions, Neale and Hall worked with a number of people at the four CCDG centers, the Genome Sequencing Program Coordinating Center (GSPCC) at the NIH, and several other institutions involved in smaller projects for the CCDG. This included Michael Zody, senior director of computational biology at the NYGC; William Salerno, senior bioinformatics programmer from Baylor College of Medicine; Goncalo Abecasis, professor in the department of biostatistics at the University of Michigan; Tara Matise and Steve Buyske, co-directors of the GSPCC; and others.
The group began by determining what aspects of the pipelines were different at each center, which might affect what people can and can't change, Zody said. Most of the centers were working with Illumina sequencing technologies to generate raw data, but had different raw data processing methods. Consequently, Zody and his colleagues concentrated their attention on standardizing how the centers would process and store the raw data before analysis.
"A lot of things across mapping and alignment are relatively stable," Neale said. "[Genome alignment] was one area we were able to completely standardize," Zody added.
In other areas, it made the most sense to modify software to achieve goals that the group considered a priority. "There are technical computing reasons to do things in different orders," Zody said. Marking duplicates, for example, is one area where they had to be more flexible. "[It] is a very computing-intensive process. And each center had different priorities for how they needed to be marked," he explained.
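As a rough illustration of what duplicate marking involves, the Python sketch below groups aligned reads by chromosome, 5' position, and strand, keeps the read with the highest summed base quality in each group, and flags the rest. This is a deliberately simplified assumption about how duplicates are defined; the `Read` type and `mark_duplicates` function are hypothetical, and the sketch does not represent the method used by any CCDG center.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int                # 5' alignment position (assumed already computed)
    strand: str             # "+" or "-"
    base_quality_sum: int   # sum of per-base Phred scores


def mark_duplicates(reads: list[Read]) -> set[str]:
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, pos, strand) are treated as copies of one
    fragment; the highest-quality read in each group is kept.
    """
    groups: dict[tuple[str, int, str], list[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    duplicates: set[str] = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.base_quality_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


if __name__ == "__main__":
    example = [
        Read("a", "chr1", 100, "+", 300),
        Read("b", "chr1", 100, "+", 350),
        Read("c", "chr1", 100, "-", 200),
        Read("d", "chr2", 100, "+", 100),
    ]
    print(mark_duplicates(example))
```

Even in this toy form, the result depends on how groups are defined and how ties are broken, which hints at why centers with different priorities could end up marking duplicates differently.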
Another area where the process was a little more complicated was data storage. Several of the institutions already had relationships with particular cloud storage providers, for example. Ideally, the whole program would work with one cloud storage provider, since that is the most cost-effective way to store data, Matise said. That wasn't possible, but the institutions were able to simplify their needs to use two cloud-based services, Google Cloud Services and DNAnexus, alongside existing internal infrastructure at each respective institution, she added.
The researchers did work to optimize how they planned to store data. "We minimized the amount of information we were storing, but made sure we weren't taking anything away," Zody said. "The main goal of minimizing file size is to reduce storage costs, but reducing network transfer time is also a major consideration," he added.
While devoting much of the first year's effort to this standardization may seem tedious, the result is that the centers can all have a higher degree of confidence that the sequences processed using these standards will be comparable, Neale said, although he noted there is still a lot of upstream variation in processing that may come into play.
The researchers finalized the pipeline standards in November, and have begun the first large wave of sequencing and data processing. They plan to publish the work on pipeline standardization later this year to ensure that it is publicly available to any researcher who wants to either use data from the project or have a baseline to start future large data processing projects.
The group hopes that the pipeline standardization will make it easier for large data production efforts to coordinate sequencing across different institutions. "One of the values of these large-scale projects is that it's the opportunity to establish these kinds of standards and best practices," Zody said. "While a lot of what we are doing looks similar to the 1000 Genomes Project, we've updated it [to current sequencing methods]," he added.
Another benefit is that when an individual lab is doing a small-scale study, it could use the standardized pipeline for its sequence data, which would allow it to compare its data to the population database that the CCDG hopes to create, Hall said.