COLD SPRING HARBOR, NY (GenomeWeb News) – Members of the 1000 Genomes Project plan to sequence as many as 1,100 samples by the end of this summer as part of the project's main sequencing effort, attendees at the Biology of Genomes meeting heard here today.
Speaking on behalf of the project, co-chair Richard Durbin, a bioinformatics researcher at the Wellcome Trust Sanger Institute, said that nearly 500 samples have been sequenced as part of the main 1000 Genomes Project as of this month.
The team expects to sequence up to 1,900 samples by the end of this year and has committed to sequencing about 2,500 samples by the end of 2011, including samples from India and Southern Asia.
At the moment, data simulations suggest European populations harbor an estimated 19 to 20 million SNPs, including eight million SNPs with an allele frequency greater than one percent, six million SNPs at between 0.1 and one percent frequency, and 2.5 million singletons. Roughly half of the alleles appear to be private to each population, pointing to an estimated 60 million SNPs overall, Durbin explained.
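The three frequency classes quoted above do not add up to the full European estimate. A short Python tally, offered only as an illustrative sketch using the figures reported here (the variable and function names are hypothetical, not from the project), makes the remainder explicit. The report does not say which frequency classes account for that remainder.

```python
# Figures as quoted in the report, in millions of SNPs (European populations).
reported_classes = {
    "allele frequency > 1%": 8.0,
    "allele frequency 0.1% to 1%": 6.0,
    "singletons": 2.5,
}
estimated_total_range = (19.0, 20.0)  # "19 to 20 million SNPs"


def tally(classes, total_range):
    """Return the itemized sum and the (low, high) remainder not itemized."""
    itemized = sum(classes.values())
    low, high = total_range
    return itemized, (low - itemized, high - itemized)


itemized, remainder = tally(reported_classes, estimated_total_range)
print(f"Itemized classes: {itemized} million")
print(f"Not itemized: {remainder[0]} to {remainder[1]} million")
```

On these numbers, the listed classes cover 16.5 million SNPs, leaving roughly 2.5 to 3.5 million of the estimate outside the categories named.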
Over the past few years, the 1000 Genomes Project has undertaken a set of pilot projects including high-depth sequencing of European and African trios, low-coverage sequencing of 60 individuals from each of three populations, and exon capture and sequencing of 700 samples.
Through these pilots, Durbin said, the researchers identified about four million and five million SNPs in the European and African trios, respectively, 14.5 million SNPs in the low-coverage samples, and about 12,700 SNPs in the exon sequencing pilot, as well as thousands more deletions, structural variants, and mobile element insertions.
Data from these pilot projects are already yielding information about common variants in the populations tested and providing insights into strategies for the main stage of the project, Durbin explained.
Of the 14.5 million SNPs identified in the low-coverage data set, for instance, about eight million were novel, with more than one million novel SNPs turning up in the Yoruban population alone.
The high-coverage trio data, meanwhile, are offering insights into substitution rates, suggesting, for instance, that somatic substitutions in cell lines are about seven to 12 times as common as germline de novo mutations.
Data so far suggest that the low-coverage strategy is sufficient for finding variants when applied across hundreds of samples, Durbin said. He also noted that the researchers are doing some local realignment and assembly when necessary to overcome limitations associated with mapping sequences to the reference genome.