Managing and analyzing the data from the recently announced 1,000 Genomes Project will pose a number of challenges, some related to the nature of the next-generation sequencing platforms, but organizers say the results will boost both medical research and scientists’ understanding of human evolutionary history.
The study consortium, which aims to sequence at least 1,000 and up to 2,000 human genomes within three years, includes two data-related working groups (see In Sequence 1/22/2008). A data flow group will be responsible for collecting and archiving the sequence reads, helping map them to a reference genome, and making the data available to the research community in different formats and levels of detail. Meanwhile, an analysis group will focus on aligning the reads, reconstructing the 1,000 genomes from the data, calling genetic variants, and interpreting the results.
A major challenge will be the “sheer volume of data,” according to Gil McVean, a professor of statistical genetics at the University of Oxford, who co-chairs the analysis group. The consortium expects the study to produce on the order of 6 terabases of data, or 60 times the sequence data that has been deposited in public DNA databases over the last 25 years.
“It’s a fantastically large amount of data to process and store and access,” McVean said. “Just the informatics challenges are pretty horrific.”
According to McVean, the analytical tasks fall into three broad areas: technology-related tasks that focus on translating the raw data into DNA sequence and mapping the sequence reads to a reference genome; calling genetic variants such as SNPs and structural variations and reconstructing individual genomes; and using the results to help disease studies and other research projects.
On the technology side, data analysis experts have to grapple with the fact that the nature of the data produced by existing next-generation sequencers is still in flux. “The data that comes out of the machines is changing pretty much month by month as the engineering improves,” McVean said. “The image analysis gets better day by day, the optical analysis gets better, the exact protocols for how you do the sequencing get better.”
Still, researchers will be thinking about ways to improve how they use the data. “There are in-house algorithms in all of these machines, but can you do any better? Most people believe that there is some room for improvement,” McVean said.
Mapping the sequence reads to a reference genome requires dealing with short reads, which “provide a very different kind of view of the genome, and this very ‘bitty’ look of what’s going on,” he said.
Short-read alignment tools already exist, such as Illumina’s Eland program, Mosaik from Gabor Marth’s group at Boston College, and MAQ from Richard Durbin’s group at the Sanger Institute. But the analysis group, in collaboration with the data flow group, will need to make a choice about which algorithm and what parameters to use, McVean said. “It’s just not possible to serve five different mappings; making use of that information is a nightmare.”
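The core task these tools perform can be illustrated with a deliberately naive exact-match mapper, sketched here in Python purely for illustration (real aligners such as Eland, Mosaik, and MAQ use compressed indexes and tolerate mismatches and quality scores; none of this code reflects their actual implementations):

```python
from collections import defaultdict

def build_index(reference: str, read_len: int) -> dict:
    """Index every read-length substring of the reference by its sequence."""
    index = defaultdict(list)
    for pos in range(len(reference) - read_len + 1):
        index[reference[pos:pos + read_len]].append(pos)
    return index

def map_reads(reads, reference):
    """Return {read: [positions]} for exact matches only."""
    read_len = len(reads[0])
    index = build_index(reference, read_len)
    return {read: index.get(read, []) for read in reads}

# Toy data: short reads against a short "reference"
reference = "ACGTACGTGGTACGT"
reads = ["ACGT", "GGTA", "TTTT"]
print(map_reads(reads, reference))
```

Even this toy version shows why "five different mappings" would be a problem: a read like "ACGT" maps to several positions at once, and every downstream step depends on which placement, and which mapper's rules, the project settles on.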
The reference genome to be used has not been decided upon, he said, but “I will push strongly to use multiple,” he added, even though “how you combine information across multiple ones is not obvious.” According to Paul Flicek from the European Bioinformatics Institute, who co-chairs the data flow group, the initial reference genome will be the NCBI 36 assembly.
Once the reads are mapped to the reference, the scientists will move on to calling sequence variants, such as SNPs, indels, and rearrangements. Structural variation is “one of the big, exciting things about this; we might really begin to get a sense of how well we can identify and call various types of structural variants,” McVean said.
But besides detecting variants, the project is also about reconstructing individual genomes, or estimating haplotypes, he said. Such reconstruction will be necessary because the consortium plans to sequence each of the 1,000 genomes only at a low coverage, so there will be gaps in each genome. However, since much of the genome exists in common haplotype “clumps,” McVean and his colleagues believe they can reconstruct individual genomes by “borrowing the information across individuals.”
“Although in any one of these individuals sequenced at 2x, you will have plenty of gaps, we think that we can do an extremely good job of filling in those gaps by comparing them to the others,” McVean explained. “And I see that as the primary academic novelty that will come out of this: how well we can do that.”
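The gap-filling idea can be sketched with a toy example (the data and the simple best-match rule below are hypothetical, not the statistical models the consortium will actually use): a sparsely observed genome is compared against a panel of complete haplotypes, and each missing site is copied from the panel haplotype that agrees with it at the most observed sites.

```python
def impute(observed, panel):
    """Fill None sites in `observed` by copying from the panel haplotype
    that agrees with it at the most observed sites."""
    def agreement(hap):
        return sum(1 for o, h in zip(observed, hap) if o is not None and o == h)
    best = max(panel, key=agreement)
    return [o if o is not None else h for o, h in zip(observed, best)]

# Panel of complete haplotypes over five biallelic sites (0/1 alleles)
panel = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
# Low-coverage individual: three sites observed, two missed by sequencing
observed = [0, None, 1, 1, None]
print(impute(observed, panel))  # → [0, 0, 1, 1, 0]
```

Because the observed alleles match the first panel haplotype best, both gaps are filled from it. This is the sense in which information is "borrowed across individuals": the shared haplotype clumps make low-coverage genomes mutually informative.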
Finally, the analysis group will look into how researchers can use the results from the study to “boost the power” of disease-focused genome-wide association studies and other projects.
For example, scientists could flip through the list of variants from the 1,000 Genomes Project and see if any of them occur in a genome region identified in a genome-wide association study.
“You might find there is a mutation in here which has a fairly obvious effect on gene function or gene structure, that’s a good candidate, and you go and look carefully at that mutation,” McVean said.
But by knowing about the haplotype clumps, researchers could also use the data from the 1,000 Genomes Project to fill in the sequence gaps. “Basically, if you typed 500,000 or a million SNPs in your genome-wide association study, you can take those individuals and then fill in the variants that you did not look at in this study but which are in the 1,000 Genomes Project,” according to McVean.
Besides helping with medical studies, the results could also help researchers learn about processes such as genome instability or patterns of recombination, and human evolutionary history, he said.
On the data side, the EBI’s Paul Flicek and his colleagues at the National Center for Biotechnology Information in the US are responsible for ensuring that the data flows smoothly so the analysis group and other researchers can use it.
Both NCBI and EBI will set up short-read archives that will be synchronized, where the sequencing centers will deposit their raw reads, similar to the existing trace archives for Sanger sequence reads. The data flow group will then extract the 1,000 Genomes Project data from the archives and run it through “pipelines” that will perform different types of analyses, starting with aligning the reads to the reference genome, Flicek told In Sequence’s sister publication BioInform last week.
The goal of the pilot projects will be to find out how well the analyses work. Since the technologies are so new, “we almost have a chance to define the input and output formats for a lot of the different alignment programs or other analysis programs,” he said.
In the end, Flicek said, the data flow group wants to produce “virtual genome sequences,” consisting of observed bases with confidence scores and information about variants. However, users will also be able to access the raw data, at least linked to the archive, and the aligned data.
EBI and NCBI have already been working on file formats and standards, such as a short read format, SRF, which is nearly completed (see In Sequence 3/27/2007); FASTQ files, which are 10 times smaller than trace files and can be used with many alignment programs; and a standard for the virtual genome sequence, “the hardest” to establish, according to Flicek.
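FASTQ's compactness relative to trace files comes from storing only the called base and a single per-base quality character, rather than the raw signal. A minimal record and parser, sketched in Python (the record content is invented for illustration, and the quality offset of 33 is the Sanger-style encoding; other encodings have been used):

```python
def parse_fastq(text):
    """Yield (read_id, sequence, qualities) from FASTQ-formatted text.
    A Phred quality is the ASCII code of the quality character minus an
    offset (33 in the Sanger-style encoding assumed here)."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):       # each record spans four lines
        read_id = lines[i][1:]              # strip the leading '@'
        seq = lines[i + 1]                  # line 3 is the '+' separator
        quals = [ord(c) - 33 for c in lines[i + 3]]
        yield read_id, seq, quals

record = "@read_1\nACGTACGT\n+\nIIIIHHHH\n"
for read_id, seq, quals in parse_fastq(record):
    print(read_id, seq, quals)
# 'I' encodes Phred quality 40, 'H' encodes 39
```

One byte per base plus one byte per quality value is what makes the format roughly an order of magnitude smaller than a trace file.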
One concern his group has “is whether the data flow can keep up with the data coming off the sequencers,” he said, although he has “confidence that we can keep up with this.”
In terms of visualizing the data, “no display really scales well to 1,000 individuals,” Flicek acknowledged, but it will be relatively easy to include a catalog of variations with estimated allele frequencies in a genome browser.
“Although many people will use the individual genome sequences, I think it’s the deep catalog that becomes the most valuable,” he said.
— Bernadette Toner contributed reporting for this article.