An international team of researchers this week announced plans to use next-generation platforms to sequence at least 1,000 and up to 2,000 human genomes with the goal of producing a detailed catalog of genetic variants.
The three-year “1,000 Genomes Project,” led by the US National Institutes of Health’s National Human Genome Research Institute, the Wellcome Trust Sanger Institute in England, and the Shenzhen branch of the Beijing Genomics Institute in China, will not only map SNPs but also produce a high-resolution map of structural variants, such as insertions, deletions, and rearrangements.
The study is expected to add at least 6 terabases of new data to the public databases — 60 times more than all sequence data deposited in these databases over the last 25 years — and will run between $30 million and $50 million in costs, according to a consortium estimate.
While existing databases list genetic variations found in at least 10 percent of a population, the new map will include variants across the genome that are present in as few as 1 percent of humans, and variants within genes that occur in 0.5 percent or fewer people.
So far, researchers studying genetic causes of diseases have focused on either very rare variants that cause diseases such as cystic fibrosis or Huntington’s disease, or more common polymorphisms that increase the risk for common diseases such as diabetes or heart disease.
“Between these two types of genetic variants — very rare and fairly common — we have a significant gap in our knowledge,” David Altshuler, an associate professor of genetics and medicine at Harvard Medical School and a researcher at the Broad Institute, and co-chair of the consortium’s steering committee, said in a statement. “The 1,000 Genomes Project is designed to fill that gap, which we anticipate will contain many important variants that are relevant to human health and disease.”
The results of the project could also help researchers better interpret data from existing and future genome-wide association studies. In these studies, researchers often find genomic regions that correlate with disease, but they do not know the causal variants.
The 1,000 Genomes Project will help these researchers by producing “a ready-made catalog that you can [use to] follow up” on genome-wide association studies, Adam Felsenfeld, program director of the large-scale sequencing program at NHGRI, told In Sequence last week. “Instead of resequencing [the region], you just take [possible variants] out of the database and you start testing them.”
The steering committee, co-chaired by Altshuler and Richard Durbin, a principal investigator at the Sanger Institute, will manage the project. The consortium further consists of several groups, responsible for production; analysis; data flow; and samples and ethical, legal, and social issues.
The project will kick off with three pilot studies that are expected to last about a year, followed by a two-year production phase. The anticipated cost of $30 million to $50 million, which Felsenfeld called a “ballpark estimate,” is based on “our best information of what we think the new technology platforms can do.” Using Sanger-based sequencing, the project would likely cost more than $500 million, which would be “prohibitive,” he said.
The NHGRI expects to fund up to about half of the project, he said, using existing funds assigned to the large-scale sequencing centers. Representatives of the Sanger Institute and BGI did not say how much of the total cost their institutions expect to take on.
Five centers will generate the sequence data: the Sanger Institute, BGI Shenzhen, and the NHGRI’s three large-scale sequencing centers, namely the Broad Institute of MIT and Harvard, Washington University School of Medicine’s Genome Sequencing Center, and Baylor College of Medicine’s Human Genome Sequencing Center.
Additional participants may join the consortium later, according to Felsenfeld, who added that the project has already had inquiries from interested parties. These potential partners will need to contribute “towards the project goal” and, if they want to produce sequence data, “bring a significant amount of sequencing capacity to the table,” he said.
Felsenfeld said the five participating centers will use all three commercially available next-generation sequencing platforms: Roche/454’s Genome Sequencer FLX, Illumina’s Genome Analyzer, and Applied Biosystems’ SOLiD sequencer.
New technologies could be added later, Felsenfeld said. “We don’t want to peg this to any specific technology if we know that there are other things bubbling under,” he said. “If [other technologies] look good, and they can deliver the [same] quality and better cost, I’m sure people will want to use them.”
The first 1,000 samples will come from the International HapMap Project and from the extended HapMap set. These anonymous samples, a total of 1,085, were collected from several populations originating in Africa, Japan, China, Europe, India, and Mexico.
According to Felsenfeld, they are “unbiased with respect to any disease” and were collected with the “appropriate consents” to be used for a whole-genome sequencing study.
The consortium may also collect and sequence up to 900 additional samples in order to better represent certain populations, according to Lisa Brooks, program director of the genetic variation program at the NHGRI.
“At the end of the day, you want the variation represented in the catalog to be, in many ways, representative of the populations that you are interested in for future disease studies,” Felsenfeld explained.
Each of the sequencing groups will participate in the pilot projects; they expect to receive samples within the next couple of months.
In the first pilot, the scientists will sequence the genomes of six adults — two sets of parents with their adult children — at 20-fold coverage. This pilot will “help the project figure out how to identify variants using the new sequencing platforms, and serve as a basis for comparison for other parts of the effort,” according to the consortium’s statement.
In the second pilot project, the scientists will sequence the genomes of 180 additional individuals at two-fold coverage, which will allow them to learn whether this kind of data can identify sequence variants.
In the third pilot project, the researchers will sequence the exons of about 1,000 genes in some 1,000 individuals in an effort to determine the best way to sequence all exons in the human genome.
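For a rough sense of scale, the coverage figures quoted for the first two pilots can be turned into approximate raw sequence volumes. The sketch below is only a back-of-the-envelope illustration: the genome size of about 3 gigabases is an assumed round number, and the function and variable names are hypothetical, not anything the consortium has published.

```python
# Back-of-the-envelope estimate of raw sequence volume for the two
# whole-genome pilots. GENOME_SIZE_BASES is an assumed round figure,
# not a number provided by the consortium.
GENOME_SIZE_BASES = 3_000_000_000  # assumption: ~3 gigabases per human genome


def raw_bases(num_genomes, fold_coverage, genome_size=GENOME_SIZE_BASES):
    """Approximate raw bases needed: genomes x coverage x genome size."""
    return num_genomes * fold_coverage * genome_size


pilot_one = raw_bases(6, 20)   # six genomes at 20-fold coverage
pilot_two = raw_bases(180, 2)  # 180 genomes at two-fold coverage

for name, bases in [("pilot 1", pilot_one), ("pilot 2", pilot_two)]:
    print(f"{name}: about {bases / 1e12:.2f} terabases")
```

Under that assumed genome size, the two whole-genome pilots together would come to roughly 1.4 terabases of raw sequence, a fraction of the at least 6 terabases the consortium expects the full project to add.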
The centers have not yet determined how they will split up the samples during the production phase, according to Felsenfeld.
Wash U’s Genome Sequencing Center will use its 454, Illumina, and ABI next-gen sequencing platforms in the pilot studies. “We felt that since there was broad applicability for different platforms in different pilot projects that we would devote the platform with the best ‘fit’ to each project,” Elaine Mardis, co-director of the center, told In Sequence by e-mail.
She did not mention which specific platform her center has chosen for each pilot project but said that “for example, paired ends are desirable for the whole-genome projects, ... longer read length for the exon coverage, etc.”
The GSC has not yet decided which capture method it will use in the exon-sequencing pilot study, Mardis said, but it will consider both an in-house method and approaches based on Agilent and NimbleGen technology.
While Wash U’s sequencing capacity will be sufficient initially, the center is “planning to increase both the sequencing and data storage [and data] analysis capacity during this spring,” Mardis said.
The Sanger Institute also plans to use all three commercially available next-gen sequencing platforms in the pilots, Durbin told In Sequence. He noted that the institute has made a “substantial investment” in Illumina’s sequencing technology but said that he and his colleagues are also “potentially interested in new technologies” for the pilot projects.
Zhuo Li, who is in charge of BGI’s worldwide collaborations, told In Sequence that the institute plans to use next-generation sequencing technologies in the pilot studies, but did not specify which ones. As of this month, BGI had seven Illumina Genome Analyzers and two ABI SOLiD systems in house (see In Sequence 1/15/2007). Zhuo also said BGI plans to acquire additional sequencers and computing equipment for the project.
Neither Baylor’s HGSC nor the Broad Institute responded to requests for comment before deadline.
All participating researchers will place raw pilot data into a new short-read repository, maintained by the NIH’s National Center for Biotechnology Information, according to Felsenfeld. In addition, project data will be held and distributed by the European Bioinformatics Institute and NCBI, and will be available from a mirror site at BGI Shenzhen.
The analysis will likely pose a series of challenges. “Assembling each individual genome to the extent possible, figuring out which are SNP variants, which are structural variants — just getting those basic data is going to be a challenge with the new technologies,” NHGRI’s Brooks said.
Also, developing so-called imputation methods, which allow researchers to use sequence data from multiple individuals at low coverage to make inferences about genotypes, “is a major challenge, but it’s a real basis for the project,” she said.
Finally, she added, assembling the sequence data into haplotypes is going to be a difficult task.
More information about the 1,000 Genomes Project and about a workshop last year that laid the foundation for it can be found here.
Researchers started thinking about a project like this five years ago when the first human genome was sequenced, according to Durbin. “It’s clearly in the historical tradition of genome projects,” he said.
But the study is not the only large-scale human genome sequencing project. Earlier this month, BGI announced the Yanhuang project, which plans to sequence at least 100 genomes from Chinese individuals in order to study polymorphisms in the Chinese population (see In Sequence 1/8/2008). According to BGI’s Zhuo, that project was “initiated solely by us,” but part of it will contribute to the 1,000 Genomes Project.
Another study is the Personal Genome Project led by George Church at Harvard Medical School, which kicked off last summer with its first 10 volunteers (see In Sequence 7/31/2007). The project, which plans to use technology developed by Church’s lab and commercialized by Danaher Motion (see In Sequence 12/4/2007) to sequence the exome of participants, aims eventually to recruit 100,000 volunteers. Unlike the 1,000 Genomes Project, the PGP will correlate genotypes with medical and other phenotypic information from the participants.
Last fall, the J. Craig Venter Institute said it plans to sequence the genomes of 10 to 50 individuals this year as a pilot study for a large-scale project that aims to sequence approximately 10,000 people over 10 years. JCVI said at that time that it wanted to test a variety of new sequencing technologies for the project, including platforms from 454, Illumina, ABI, and the Church lab.