COLD SPRING HARBOR, May 8 – The 1000 Genomes Project is continuing to analyze data from its three pilot projects, generated using a mix of second-generation sequencing technologies from Illumina, Applied Biosystems, and 454 Life Sciences, and plans to sequence at least 1,200 individuals – both HapMap and newly collected samples – at low coverage by the end of this year.
The aim of the project, announced a little over a year ago, is to catalog genetic variants in the human genome that occur at a frequency of at least one percent – including single-nucleotide polymorphisms, insertions and deletions, copy number variations, and structural variants – by sequencing more than 1,000 HapMap samples and possibly samples from additional populations. The hope is that these data will provide additional markers for genetic association studies.
The 1000 Genomes Consortium, which includes a variety of academic groups and genome centers as well as the three manufacturers of the new sequencing platforms, started out with three pilot projects: sequencing 180 HapMap individuals from four populations at low coverage; sequencing two HapMap “trios” of parents and child; and sequencing 1,000 selected genes in several hundred HapMap individuals. In December of last year and in January, the consortium posted data and initial analyses of the first two pilots on its website.
At the Biology of Genomes meeting here at Cold Spring Harbor today, Gonçalo Abecasis of the University of Michigan presented an updated analysis of the first two pilot projects, which the consortium generated earlier this month and plans to post on its website shortly. The analysis of the third project, he said, is not yet complete.
Asked by an audience member after his presentation which of the three sequencing platforms is best suited for different aspects of the project, he said that integrating data from several platforms has yielded better SNP call sets than using data from any individual platform. As an example, he cited one of the trio samples, which was sequenced independently at 30-fold coverage on the Illumina and the SOLiD platforms and yielded the best results when both datasets were combined. “Each platform has different characteristics; none of them is uniformly better than the other,” he said.
For each of the two HapMap trios – one from the CEU sample set of European origin, the other from the YRI set of African ancestry – the consortium has generated more than 100-fold sequence coverage in total, using all three sequencing platforms. The parents of the CEU trio were mostly sequenced on the Illumina Genome Analyzer, Abecasis noted, while the child was analyzed with a mix of the three technologies.
The researchers found approximately 4 million SNPs in the CEU trio, of which 85 percent were already present in the dbSNP database. By comparison, the YRI trio yielded 5 million SNPs, of which only 71 percent were contained in dbSNP. Based on these results, the scientists can estimate how many SNPs to expect in the second phase of the project – for example, up to 13 million SNPs in total for the several hundred CEU samples they plan to sequence during that phase.
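To make the trio figures concrete, the number of SNPs not yet in dbSNP follows directly from the reported totals and percentages. The short Python sketch below works this out; the function name and the rounded inputs are illustrative assumptions based on the approximate figures quoted here, not numbers released by the consortium.

```python
# Rough arithmetic on the approximate trio figures quoted in this article.
# novel_snp_count is a hypothetical helper, not part of any consortium tool.

def novel_snp_count(total_snps: int, percent_in_dbsnp: int) -> int:
    """Return the number of SNPs absent from dbSNP, using integer math."""
    return total_snps * (100 - percent_in_dbsnp) // 100

# Approximate values as reported: (total SNPs, percent already in dbSNP)
trios = {
    "CEU": (4_000_000, 85),
    "YRI": (5_000_000, 71),
}

for name, (total, pct_known) in trios.items():
    print(f"{name}: about {novel_snp_count(total, pct_known):,} SNPs not in dbSNP")
```

On these rounded inputs, the CEU trio would contribute roughly 600,000 previously uncataloged SNPs and the YRI trio roughly 1.45 million.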
For the low-coverage whole-genome pilot project, the consortium has collected sequence data on 178 individuals from four HapMap populations, and has compared these data across populations and haplotypes. In total, the scientists found more than 21 million SNPs, of which about 11 million are novel. Five million of the 21 million SNPs are shared among all four populations, and one million of those shared SNPs are novel.
These results, although not free of errors because of the low coverage, can be used to re-analyze existing data from genome-wide association studies by imputing missing genotypes, Abecasis said. As an example, he cited two diabetes studies by the Wellcome Trust Case Control Consortium in which the data could be used to identify new potential disease loci.
Besides SNPs, the consortium has also started analyzing short insertions and deletions, he said. So far, it has called more than 400,000 indels up to 10 base pairs in length, a call set that will soon be released. In addition, the scientists have begun to analyze structural variations using several approaches, such as recording abnormal read pairs and varying read depths, and have called more than 4,000 validated copy-number variants so far.
Interesting results have also come from a comparison of the 1000 Genomes data with two new, unrelated de novo assemblies of human genomes by the Beijing Genomics Institute, Abecasis said. This analysis has allowed the researchers to define several genome regions that are not included in the current human reference genome.
Processing large amounts of data derived from different sequencing platforms has been a challenge, Abecasis noted, and the consortium has developed common formats to analyze data from all three platforms.
By the end of this year, the consortium plans to sequence at least 1,200 individuals – both HapMap and newly collected samples – of European, East Asian, and African ancestry at 4-fold coverage. It also plans to expand the sample sets to additional populations in the future.