By Monica Heger
This story was originally published on May 16.
The 1000 Genomes Project is well into its production phase, having completed low-coverage whole-genome sequencing of 1,094 samples, whole-exome sequencing for 997 of those, and SNP array genotyping for 1,542 individuals.
The project's organizers have also expanded the study to include additional populations from Africa, Asia, and Europe, as well as admixed populations in the Americas.
Gabor Marth, an associate professor of biology at Boston College whose lab is performing data analysis for the project, reported initial results from the first production phase last week at the Biology of Genomes meeting in Cold Spring Harbor, NY.
Last fall, the team published results of the project's pilot phase, and said it planned to expand the project to include 2,500 individuals (IS 11/2/2010).
The first production phase, which is sequencing 1,167 samples already collected from 13 populations, is expected to be completed this year. The second phase includes 633 samples from seven populations, and the team has already begun sequencing them. The final phase will consist of 700 samples and will not begin until late 2011.
To date, they've analyzed the four-fold coverage whole-genome sequencing data and detected around 39 million SNPs, more than 20 million of which were novel, as well as 4.7 million indels.
The whole-exome capture and sequencing enabled the team to analyze 50 base pairs beyond the exons. For the 458 individuals they have analyzed, they identified 400,000 SNPs, or about 1 SNP per exon.
Marth said the exome data will be particularly relevant because of the host of medical sequencing projects that are underway, such as the Exome Sequencing Project, jointly funded by the National Heart, Lung, and Blood Institute and the National Human Genome Research Institute; NHLBI's Large Scale DNA Sequencing Project; and NHGRI's Medical Sequencing Discovery Project.
For example, while the 1000 Genomes Project itself is studying human variation in healthy individuals, "the allele frequency estimates in normal genomes can be used in interpreting rare and common variants in medical sequencing projects," Marth added.
Because the samples whose exomes were sequenced also had low-coverage whole-genome sequencing done, the researchers were able to compare the two methods.
"At low allele frequency, the low-coverage sequencing misses a lot," said Marth, but at about 0.5 percent allele frequency, "your sensitivity is almost perfect," even with low-coverage whole-genome sequencing.
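Rough arithmetic suggests why sensitivity could climb so steeply with allele frequency. The sketch below is a toy model, not the project's variant-calling method. It assumes that reads are pooled across all 1,094 samples, that the number of reads carrying a variant follows a Poisson distribution, and that a variant counts as detected once a hypothetical minimum of four supporting reads is seen. The function names and the threshold are illustrative assumptions.

```python
import math


def poisson_cdf(k, mean):
    """Return P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def detection_sensitivity(allele_freq, n_samples=1094, coverage=4.0, min_alt_reads=4):
    """Toy estimate of the chance that a variant is seen in pooled low-coverage data.

    Assumptions (illustrative only): diploid samples, reads split evenly
    between the two chromosome copies, Poisson read counts, and a
    hypothetical detection threshold of `min_alt_reads` supporting reads.
    """
    # 2 * n_samples chromosomes, a fraction allele_freq of which carry the
    # variant, each covered by coverage / 2 reads on average.
    expected_alt_reads = n_samples * allele_freq * coverage
    return 1.0 - poisson_cdf(min_alt_reads - 1, expected_alt_reads)


if __name__ == "__main__":
    for freq in (0.0005, 0.001, 0.005, 0.01):
        print(f"allele frequency {freq:.4f}: sensitivity {detection_sensitivity(freq):.3f}")
```

Under these assumptions, a variant at 0.5 percent frequency is expected to be covered by about 22 supporting reads across the cohort, whereas one at 0.05 percent is expected to be covered by only about two. That is in line with Marth's description of poor sensitivity for the rarest variants and near-complete sensitivity at about 0.5 percent.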
Additionally, a number of SNPs in coding regions of the genome were found only in the low-coverage whole-genome sequencing data, he said. These SNPs could be false positives because coverage is not high enough, or they could indicate "failures or biases" in the exome capture, he said.
Most of the novel variants that the team identified were rare — below 1 percent allele frequency — and had functional effects. Around 84,000 SNPs were non-synonymous, with 621 predicted to be splice disrupting, and 1,654 predicted to cause a stop codon. The researchers have only begun the analysis of indels, and do not have data on those variants yet.
Detecting structural variation was difficult, said Marth, and the team's most accurate calls came by doing a read-depth analysis. They were unable to accurately call deletions and tandem duplications, indicating that new informatics methods are needed to detect these types of structural variations with currently available sequencing platforms.
The team has also developed methods to detect mobile element insertion events. Marth did not present data on this work, but said that being able to detect these events "greatly expands our resources for population genetics."
Finally, he said that the researchers are currently integrating datasets and variant types to reconstruct haplotypes. Expanding the project to include admixed populations in the Americas will be particularly useful for haplotype reconstruction, he said.
Results from the project, as well as other large-scale sequencing efforts, have already contributed vast amounts of knowledge to researchers' understanding of human variation, Marth said.
For instance, in 2000, 98 percent of an individual's variation would not be found in dbSNP. Currently, though, only 1 percent of an individual's variation is not in dbSNP. Identifying rare variants will be much more difficult, he said, but will also contribute to the understanding of disease.