The National Center for Biotechnology Information and the European Bioinformatics Institute are currently building the data-management framework for an international collaboration to sequence the genomes of at least a thousand individuals with the goal of building the most detailed map of human genetic variation to date.
The three-year initiative, called the 1,000 Genomes Project, is expected to cost between $30 million and $50 million and is led by the National Human Genome Research Institute, the Wellcome Trust Sanger Institute, and the Beijing Genomics Institute in Shenzhen. All data from the project — from raw sequencing reads to SNP calls to the final catalog of variants — will be made publicly available through the NCBI and EBI, with a mirror site at BGI Shenzhen.
The project, which kicked off this week, will rely on next-generation sequencing technology to generate an unprecedented amount of genomic information — approximately 60 times more data than has been placed in public repositories in the last 25 years, according to consortium estimates.
“When up and running at full speed, this project will generate more sequence in two days than was added to public databases for all of the past year,” Gil McVean of the University of Oxford, a co-chair of the consortium’s analysis group, said in a statement.
In an interview this week with BioInform’s sister publication In Sequence, McVean described the informatics challenges of the project as “pretty horrific.”
In addition to the analysis group, which is responsible for mapping reads to the reference genome, SNP calling, haplotype estimation, and other analytical tasks, the consortium has created a data coordination, or “data flow,” group, co-chaired by EBI’s Paul Flicek and NCBI’s Stephen Sherry. This group is charged with collecting the raw data from the sequencing centers, archiving it, and making it available to the research community in a usable manner.
The first step in the project will be to create short-read archive sites at NCBI and EBI that will be similar to the current trace archives for Sanger sequencing reads. The five sequencing centers in the project — Sanger, BGI, the Broad Institute, Washington University’s Genome Sequencing Center, and Baylor College of Medicine’s Human Genome Sequencing Center — will deposit all their raw data into these archives, and the information will be synchronized between the two groups.
Flicek said that he and his colleagues will then run that data through an analysis “pipeline” that will first map the reads to the reference genome, which will initially be the NCBI 36 assembly.
There are a number of available algorithms for this task, including Illumina’s Eland program, Mosaik from Gabor Marth’s group at Boston College, and MAQ from Richard Durbin’s group at the Sanger Institute. Flicek said that the consortium plans to first “determine the ones that perform best” in order to narrow it down to a few that will “produce the type of data that people want to work with.”
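The mapping task these tools perform can be illustrated with a toy example. The sketch below reflects none of the internals of Eland, Mosaik, or MAQ; it is a minimal exact-match mapper that indexes the reference by k-mer and looks up where each short read aligns, whereas the real programs tolerate mismatches and score mapping quality.

```python
# Toy read mapper: index the reference by k-mer, then seed each read
# by its first k-mer and verify the full match. Real mappers such as
# Eland, Mosaik, and MAQ handle mismatches, quality values, and
# genome-scale indexes; this only illustrates the basic task.
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k):
    """Return reference positions where the read matches exactly."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

reference = "ACGTACGTTAGCCGATTACA"
index = build_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))  # one exact hit
```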
McVean noted that it will be important for the consortium to pare down these mapping algorithms to avoid confusion. “We are not going to be providing five different mappings of each read using five different algorithms,” he said. “Making use of that information is a nightmare.”
Flicek said that the remainder of the analysis pipeline will likely be determined during the first year of the project, which will be a three-part pilot phase. In the first stage, the scientists will sequence the genomes of six adults — two sets of parents with their adult children — at 20-fold coverage. In the second stage, the consortium will sequence the genomes of 180 additional individuals at two-fold coverage, and in the third part, the researchers will sequence the exons of around 1,000 genes in 1,000 individuals.
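The first two pilot stages imply very different sequence yields. Assuming a roughly 3-gigabase human genome (an illustrative round figure, not a consortium estimate), the arithmetic works out as follows; the exon pilot's yield depends on the lengths of the 1,000 genes and is omitted.

```python
# Back-of-the-envelope yield for the first two pilot stages,
# assuming ~3 gigabases per human genome (an approximation).
GENOME_GB = 3.0

trio_pilot = 6 * 20 * GENOME_GB        # six individuals at 20-fold coverage
low_coverage_pilot = 180 * 2 * GENOME_GB  # 180 individuals at 2-fold coverage

print(trio_pilot)          # gigabases from the deep trio stage
print(low_coverage_pilot)  # gigabases from the low-coverage stage
```

Even at only two-fold coverage, the 180-genome stage dominates the total by roughly a factor of three.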
“The pilot project is largely going to be about figuring out what will work,” Flicek said. “I feel like we’re in a very good position because the technologies are so new that we … have a chance to define the input and output formats for a lot of the different alignment programs or other analysis programs.”
In addition to providing the raw reads and the aligned data, Flicek said that the consortium plans to produce a “virtual genome sequence” for each individual in the project, which will include “the information about whether a base has been observed, whether it’s the same as the reference or a variant base, [and] what sort of information we know to make basically a confidence score on that — be that the number of reads or some function of the number of reads plus the uniqueness of the mapping.”
Beyond this information, “[I]f it turns out that there are other things that are useful and the analysis group wants them and builds things that do that analysis, we plan … to plug those into the pipeline,” Flicek said.
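The per-base record Flicek describes might be modeled as below. The scoring formula here is a placeholder of our own invention, since the consortium had not settled on one; it simply combines read depth and mapping uniqueness, the two ingredients Flicek mentions.

```python
# Sketch of one record in the "virtual genome sequence": whether the
# base was observed, whether the call matches the reference, and a
# confidence score. The confidence formula is hypothetical, not the
# project's actual method.
from dataclasses import dataclass

@dataclass
class VirtualBase:
    position: int
    observed: bool          # covered by any read?
    call: str               # called base, or "N" if unobserved
    matches_reference: bool
    depth: int              # number of reads covering the position
    unique_fraction: float  # fraction of those reads mapped uniquely

    def confidence(self):
        """Placeholder score: depth weighted by mapping uniqueness,
        saturating at 1.0 around ten reads."""
        if not self.observed:
            return 0.0
        return min(1.0, (self.depth / 10.0) * self.unique_fraction)

b = VirtualBase(position=1042, observed=True, call="G",
                matches_reference=False, depth=8, unique_fraction=0.75)
print(round(b.confidence(), 2))
```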
“It’s going to be interesting to see what sort of analysis methods are produced and how they scale with the amount of data and the amount of memory that’s available,” he added. “Something that takes 256 gigabytes of RAM [to] analyze 3 gigabases of sequence is not something that we can scale across the project.”
McVean said that one of the more “exciting” possibilities for the analysis group is developing methods for calling more challenging variants, such as rare SNPs, and structural variations like insertions and deletions, microsatellites, and minisatellites.
However, he added, “the project isn’t just about detecting variants, it’s all about estimating the haplotypes, getting the individual genomes out of this.” McVean said that the analysis group plans to take advantage of the high degree of similarity across genomes to estimate common haplotypes by “borrowing” information from other genomes.
“Although in any one of these individuals sequenced at 2x you will have plenty of gaps, we think that we can do an extremely good job of filling in those gaps by comparing them to the others,” McVean said. “And I see that as the primary academic novelty that will come out of this — how well can we do that.”
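The gap-filling McVean describes can be caricatured with a nearest-match rule: a low-coverage sample with missing calls borrows them from the panel haplotype that agrees best at the sites it did observe. The statistical methods the analysis group had in mind are probabilistic haplotype models, not this simple rule, which serves only to show the idea of borrowing information across genomes.

```python
# Toy imputation: fill '?' sites in a low-coverage sample by copying
# from the panel haplotype with the fewest mismatches at observed
# sites. Real methods use probabilistic haplotype models.

def impute(target, panel):
    def mismatches(hap):
        return sum(1 for t, h in zip(target, hap) if t != "?" and t != h)
    best = min(panel, key=mismatches)
    return "".join(h if t == "?" else t for t, h in zip(target, best))

panel = ["AACGT", "AACTT", "GGCTA"]
print(impute("A?C?T", panel))  # gaps filled from the closest haplotype
```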
Flicek said that NCBI is in the “final development phase” of the short-read archive and that the two bioinformatics centers are currently using test data sets to “see how they load into the archive and how data can be pulled out of the archive and how we can exchange with them.”
One of the biggest challenges for the project, according to Flicek, will be transferring the raw data between NCBI and EBI. During its production phase, the project will be producing around 8.2 gigabases — more than two human genomes’ worth of data — per day.
“I think the biggest concern … is whether the data flow can keep up with the data coming off the sequencers, and we’re still testing that,” he said. “That said, based on the amount of data that the Sanger, for example, is processing, I do have confidence that we can keep up with this.”
A key step in this process will be establishing standard file formats for exchanging data. Flicek said it’s likely that EBI will use the SRF standard under development by a group of next-generation sequencing vendors, genome centers, and other organizations for the short-read archive [BioInform 03-27-07].
“The hardest thing to come up with a standard for is the virtual genome sequence,” he said, noting that the EBI’s Ensembl group has an internal standard for representing sequencing data that should be a “good start.”
Another expected challenge will be the large number of reads that don’t map to the reference genome, either because they contain sequencing errors or because they include insertions or other variations.
“One of the things that will come out of our mapping pipeline right away for basically every individual is a big pile of reads that doesn’t get mapped,” Flicek said. He added that the data-analysis group is expected to develop methods for analyzing copy number variation and structural variation to determine whether those orphan reads are sequencing errors or variations.
As for the final presentation of the data in Ensembl and other genome browsers, Flicek said that “a lot of this is going to be driven by the types of analysis that people want to do.” He noted that while some users may want to see every individual read behind each SNP call, for most researchers “it’s the deep catalog of variation that becomes the most valuable.”
The consortium expects the data to be of particular use to researchers conducting genome-wide association research, because it will enable them to “fill in the gaps” in these studies, McVean said. “Basically, if you typed 500,000 or a million SNPs in your genome-wide association study, you can take those individuals and then fill in the variants that you didn’t look at in this study but which are in the 1,000 Genomes Project.”
While acknowledging that “no display really scales well to 1,000 individuals,” Flicek said that it should be relatively “easy” to include the catalog of variations and the estimated allele frequencies on those variations within the current browser framework.
“Displaying the deep catalog is conceptually no more difficult than displaying the contents of dbSNP on the browser. It’s a variant position with allele frequencies,” he said.
Another “huge advantage” for the project is the fact that the genomes of the 1,000 people in the study are highly redundant. “As soon as we can get to just a description of the differences between the people, where we’re using a common reference genome as a backdrop, then it’s easy to pass around just the differences,” he said. “I think that’s the key.”
— Julia Karow, editor of In Sequence, contributed to this article.