This story has been updated from a previous version published on Oct. 27 to include additional information.
By Monica Heger
The 1000 Genomes Project, an international effort to characterize human genetic variation, last week published the initial results of its three pilot projects in Science and Nature.
The project began in 2008 and is scheduled to run for three years, with the goal of producing a detailed catalog of genetic variants (IS 1/22/2008). The consortium released the initial results of the three pilot projects — comprising approximately 7.3 terabytes of data — in June (IS 6/22/2010).
The researchers now plan to sequence a total of 2,500 individuals at low coverage using the Illumina Genome Analyzer and HiSeq 2000 as well as Life Technologies' SOLiD system. In addition, they will do deep whole-exome sequencing for those 2,500 individuals.
The Nature article outlines the results of the three pilot projects: whole-genome sequencing of 179 individuals to between 2-fold and 6-fold coverage; whole-genome sequencing of six individuals in two trios to an average of 42-fold coverage; and exon sequencing of 8,140 exons in 697 individuals.
The sequencing was done at nine different centers: the Wellcome Trust Sanger Institute, BGI-Shenzhen, the Broad Institute, Washington University School of Medicine's Genome Center, Baylor College of Medicine's Human Genome Sequencing Center, the Max Planck Institute for Molecular Genetics in Berlin, Illumina, Life Technologies, and Roche's 454 Life Sciences.
Researchers generated 4.9 terabases of sequence on three platforms: the Illumina GA, SOLiD, and the 454 GS FLX.
The three platforms were used in different combinations for each pilot. For instance, in the trio pilot, all six individuals were sequenced on the Illumina platform with both paired-end and single-end sequencing, while one sample from each trio was sequenced on the SOLiD using paired-end sequencing, and one sample from each trio was sequenced twice on the 454 — once with paired-end sequencing and once with single-end sequencing. Each individual was sequenced to an average 42-fold coverage.
In the exon sequencing study, only the Illumina and 454 platforms were used, with both paired- and single-end sequencing on the Illumina and only single-end sequencing on the 454. Finally, in the low-coverage sequencing pilot, both paired- and single-end sequencing were used on all three platforms. In that pilot, 38 samples were sequenced on the SOLiD, 185 on Illumina, and 29 on 454.
The researchers called around 15 million SNPs, of which 55 percent were novel, over 1 million indels, and more than 20,000 structural variants.
While most of the high-frequency SNPs called were already found in dbSNP, lower-frequency SNPs and the vast majority of structural variants were not found in any public database.
Additionally, the researchers found that populations with African ancestry contributed the highest fraction of novel variants. In the low-coverage project, for instance, 63 percent of the novel SNPs came from African populations, compared with 33 percent from European populations and about 22 percent from Asian populations. Because some of the novel SNPs were shared among populations, these percentages sum to more than 100.
The sequencing of many individuals to low coverage enabled the researchers to detect low-frequency variants, defined as having a minor allele frequency of less than 5 percent.
In a press briefing, Richard Durbin, a senior investigator at the Wellcome Trust Sanger Institute and co-chair of the project, said that the results have "produced a more complete catalog" of human variation than was previously available. For example, he said, of the roughly 3 million variants in any individual's genome, more than 95 percent would be found in the catalog. "This has been a real shift in how we can approach human genetics," he said.
In addition, the researchers discovered that each individual carries on average between 250 and 300 loss-of-function variants in annotated genes and between 50 and 100 variants previously implicated in disease.
"Some of the genes are commonly inactivated, so [they] may be genes that are not strongly required," said Durbin. But most of the loss-of-function variants — which include premature stop codons, frameshift mutations, and changes in splicing — are "substantially enriched for rarer variants."
That finding indicates two things: first, "many of those variants are functional, and the reason why they're not more common is that they've been selected away." It also suggests that "we are all probably carrying more or less private, certainly rare, defective copies of genes," Durbin said. While those defective copies don't necessarily lead to disease because there are two copies of each gene, he said there could be a phenotypic effect, but determining the consequences would require further research.
Another paper, published in Science, focused on analyzing copy number variation. Senior author Evan Eichler of the University of Washington said in the briefing that the focus of this analysis was to examine copy number variation at the individual level, rather than the population level.
Eichler said that his team was able to measure absolute copy number for duplications, finding a range of zero to 48 for specific regions in each genome. This has been difficult to date because most copy number variants lie in repetitive regions, which are difficult to characterize.
Using computational and statistical methods, Eichler's team created heat maps that showed copy number for 159 genomes at 3-kilobase-pair resolution. They validated the results using fluorescence in situ hybridization, array-CGH, and qPCR. They identified 952 large CNVs greater than 50 kilobase pairs across all individuals, the majority of which overlapped with duplications. Additionally, they found 22 regions longer than 100 kilobase pairs that showed evidence of differences between Asians, Europeans, and Africans. They also identified regions that had been incorrectly labeled as diploid in the reference genome, but were in fact duplicated.
The team found that "most copy number variable genes map to historically duplicated regions of the genome," said Eichler. Also, he added, when comparing the four populations, "there is more genetic differences in those regions in terms of copy number, when compared to unique regions of the genome."
The analysis also allowed the researchers to study human evolution. In particular, they observed that since humans diverged from the great apes, a family of genes related to neural development has expanded significantly.
The next step, said Eichler, is to do functional studies of the duplicated regions. Approximately 1,000 genes lie in these regions, which have previously been difficult to study due to their repetitive elements. Now, he said, "we can explore the functional properties of these untouched genes," including expression differences, methylation changes, and associations with phenotype and disease.