By Julia Karow
The 1000 Genomes Project, an international effort to increase the catalog of human genetic variation, is on track to sequence almost 1,000 additional samples at low coverage by early next year and is in the process of collecting new samples for the study, In Sequence has learned.
Following three pilot projects, results of which are currently being analyzed, the project now expects to sequence an additional 985 samples at 4x coverage by early 2010, and 2,000 in total. One of the pilots, designed to test the three second-generation sequencing technologies employed, showed that all three produce data that is suitable for the project.
According to Richard Durbin, a principal investigator at the Wellcome Trust Sanger Institute and co-chair of the project's steering committee, participants originally agreed last year to analyze 1,500 samples, 500 each of European, African, and East Asian origin. Each batch of 500 is made up of five population samples of 100 individuals each.
However, following discussions at the American Society of Human Genetics annual meeting last year, the project decided this year to add 500 samples from North, Central, and South America, specifically from populations of mixed ancestry "that reflect at least part of the genetic history of the people now living in the Americas," Durbin told In Sequence by e-mail.
Lisa Brooks, who manages the 1000 Genomes Project for the National Human Genome Research Institute, which provides funding for the three US large-scale sequencing centers participating in the project, explained that the project decided the data "would be more useful if samples from additional populations were included, especially Hispanic and African-American populations."
To collect new samples required for the project, Durbin said, the project earlier this year established sampling and consent structures, with oversight from its samples and ELSI subgroup. "Samples are now at various stages of the collection, cell-line establishment, and growth process, with some new DNAs now becoming available for sequencing," he said.
According to Brooks, samples from Han Chinese in Beijing (CHB); Japanese in Tokyo (JPT); CEPH samples with ancestry from Northern and Western Europe in Utah (CEU); Toscani in Italia (TSI); Yoruba in Ibadan, Nigeria (YRI); Luhya in Webuye, Kenya (LWK); African-Americans in the Southwest of the US (ASW); and samples with Mexican ancestry from Los Angeles (MXL) are currently available for sequencing.
Samples from the Chinese Dai Xishuangbanna (CDX); Chinese Han South (CHS); Kinh in Ho Chi Minh City, Vietnam (KHV); British from England and Scotland (GBR); Finnish in Finland (FIN); Iberian populations in Spain (IBS); Gambian in the Western Division of Gambia (GWD); Ghanaian in Navrongo, Ghana (GHN); Malawian in Blantyre, Malawi (MAB); African-Americans in Jackson, Miss. (AJM); African Caribbeans in Barbados (ACB); Puerto Ricans in Puerto Rico (PUR); Colombians in Medellin, Colombia (CLM); and Peruvians in Lima (PEL) will be or have already been collected and processed and will be added to the project as they become available.
The 1000 Genomes Project, which kicked off in early 2008, aims to catalog human genetic variants, including SNPs and structural variants, that are present in the genomes of as few as 1 percent of individuals, and in the genes of as few as 0.5 percent of people (see In Sequence 1/22/2008).
The study began with three pilot projects: one to sequence HapMap samples of two parent-child trios of European (CEU) and African (YRI) origin at high coverage; another to sequence 180 unrelated HapMap samples from European-ancestry (CEU), African (YRI), Han Chinese (CHB), and Japanese (JPT) populations at low coverage; and a third to sequence approximately 1,000 genes and conserved elements in about 1,000 individuals at high coverage.
Data for the pilots was produced by the Wellcome Trust Sanger Institute, the Beijing Genomics Institute in Shenzhen, the Broad Institute, Washington University School of Medicine's Genome Center, Baylor College of Medicine's Human Genome Sequencing Center, and the Max Planck Institute for Molecular Genetics in Berlin, as well as by Illumina, Applied Biosystems, and Roche's 454 Life Sciences (see In Sequence 6/17/2009).
Brooks told In Sequence that data production for all three pilots was completed earlier this year and that the data are currently being "cleaned carefully" prior to their final release. The analysis of the data is ongoing, she said, and a publication on the results is expected in early 2010.
During the pilot phase, the project implemented data quality control and tracking in parallel with the sequencing and analysis, "which has at times been painful," Durbin said, adding that "we believe this is getting progressively smoother going forward."
Regarding the analysis, he said, "it is looking like it helps to realign or reassemble around insertions and deletions using all the available reads, after initial mapping of reads to the reference."
The trio pilot in particular was designed to test the suitability of the three platforms — Illumina's Genome Analyzer, Applied Biosystems' SOLiD, and Roche/454's Genome Sequencer FLX — for the project, and Brooks said that "all three platforms work — they were all good."
At the Biology of Genomes meeting earlier this year, Goncalo Abecasis, a researcher at the University of Michigan who gave an update on the project, said that integrating data from several platforms has yielded better SNP call sets than data from one platform alone (see In Sequence 5/12/2009).
The low-coverage project was intended to determine whether 2x or 4x sequence coverage would be enough to identify sequence variants using imputation methods. "We don't have the final data analyzed but the preliminary analyses show that 4x should be sufficient," Brooks said.
As a result, the project is now sequencing additional samples at 4x coverage, using both Illumina GA and SOLiD technology for now, according to Brooks.
The project is also open to exploring new sequencing platforms as they become available but has no concrete plans to do so at the moment, she said.
Durbin said that the goal for the scaled-up low-coverage sequencing is to complete 985 individuals, in addition to the 180 already sequenced during the pilot phase, at 4x depth by early 2010. Data for these samples is already accumulating, and "the sequence from these individuals should enable us to meet our original goals for at least some populations," he said.
Additional samples are being added "as soon as we have them available," according to Brooks.
She said that it is not clear yet whether the third pilot project — which selectively sequenced genes — will be expanded into a larger project.