NEW YORK (GenomeWeb) — An international team of scientists from the 1000 Genomes Project Consortium has created the world's largest catalog of genomic differences among humans.
This achievement, documented in a pair of papers published yesterday in Nature, marks the completion of the 1000 Genomes Project, which found more than 99 percent of variants in the human genome that occur at a frequency of at least 1 percent in the populations studied.
"The 1000 Genomes Project was an ambitious, historically significant effort that has produced a valuable resource about human genomic variation," Eric Green, director of the National Human Genome Research Institute, said in a statement. "The latest data and insights add to a growing understanding of the patterns of variation in individuals' genomes, and provide a foundation for gaining greater insights into the genomics of human disease."
Advances in DNA sequencing and bioinformatics were vital to completing the project. Over the course of the 1000 Genomes Project, the scientists developed improved methods for large-scale DNA sequencing and for analyzing and interpreting genomic information, and found better ways to store such large amounts of data.
In one of the Nature studies, the consortium's researchers expanded analysis beyond bi-allelic events, as they had done in previous studies, to include multi-allelic SNPs, indels, and a diverse set of structural variants. Their analysis included samples from the 26 populations focused on throughout the 1000 Genomes Project, which included groups from Africa, the Americas, Europe, South Asia, and East Asia.
Researchers used 24 sequence analysis tools and machine-learning classifiers to separate high-quality variants from potential false positives while balancing sensitivity and specificity. Then they constructed haplotypes for the samples. To control the false discovery rate of SNPs and indels, a variant quality score threshold was defined using high-depth PCR-free sequence data generated for one individual per population. For structural variants, additional orthogonal methods were used for confirmation, including microarrays and long-read sequencing, resulting in false discovery rates of less than 5 percent for deletions, duplications, multi-allelic copy-number variants, and Alu and L1 insertions, and less than 20 percent for inversions, SVA composite retrotransposon insertions, and nuclear mitochondrial DNA variants.
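As a rough illustration of how a quality score threshold can be chosen to cap the false discovery rate, the Python sketch below ranks candidate calls by score and keeps the lowest cutoff at which the share of unconfirmed calls stays within a target. It is not the consortium's pipeline: the function name `pick_score_threshold`, the `(score, confirmed)` input format, and the toy numbers are all assumptions made for this example, with "confirmed" standing in for agreement with a high-depth validation dataset.

```python
from typing import List, Optional, Tuple


def pick_score_threshold(
    calls: List[Tuple[float, bool]], max_fdr: float
) -> Optional[float]:
    """Return the lowest score cutoff whose retained calls have FDR <= max_fdr.

    calls: hypothetical (quality_score, confirmed) pairs, where `confirmed`
    says whether the call was supported by an independent validation set.
    Returns None if no cutoff meets the target.
    """
    # Walk from the highest-scoring call downward, tracking false positives.
    ranked = sorted(calls, key=lambda c: c[0], reverse=True)
    best = None
    false_pos = 0
    for i, (score, confirmed) in enumerate(ranked, start=1):
        if not confirmed:
            false_pos += 1
        # Calls with tied scores are kept or dropped together, so only
        # evaluate the cutoff at the last call in each tie group.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        # FDR among calls with score >= this cutoff.
        if false_pos / i <= max_fdr:
            best = score
    return best


# Toy data, invented for illustration only.
toy_calls = [(9.0, True), (8.0, True), (7.0, False),
             (6.0, True), (5.0, False), (4.0, False)]
threshold = pick_score_threshold(toy_calls, max_fdr=0.25)
```

With the toy data, keeping every call scoring 6.0 or higher retains four calls, one of them unconfirmed, which sits exactly at the 25 percent target; lowering the cutoff further would exceed it.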
The study discovered, genotyped, and phased more than 88 million variable sites, about 12 million of which had common variants that the scientists believe are likely shared by many of the populations. The project has now contributed or validated 80 million of the 100 million variants currently in the public dbSNP catalogue. It has also enhanced scientists' knowledge of genetic variation within South Asian populations (which account for 24 percent of novel variants) and African populations (28 percent of novel variants).
In the second Nature study, scientists examined differences in the structure of the genome in the 2,504 samples. They found nearly 69,000 differences, known as structural variants, including deletions, insertions, and duplications. The researchers then created a map of eight classes of structural variants that potentially contribute to disease.
"Structural variation is responsible for a large percentage of differences in the DNA among human genomes," Jan Korbel, lead author on the paper and research council investigator of the European Molecular Biology Laboratory's Genome Biology Unit, said in a statement. "No study has ever looked at genomic structural variation with this kind of broad representation of populations around the world."
Korbel and his colleagues discovered that structural variants were often more complicated than they originally thought. For example, the majority of inversions, in which DNA sequences change their orientation in the genome, occur alongside other structural changes.
One of the more immediate uses of 1000 Genomes Project data is for genome-wide association studies, which compare the genomes of people with and without a disease to search for regions that contain genomic variants associated with that disease. Scientists can now combine genome-wide association study data with the more detailed 1000 Genomes Project data to home in on regions affecting disease more precisely without having to sequence the genomes of all the people in a study, which remains expensive.
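The case-versus-control comparison at the heart of a genome-wide association study can be sketched for a single variant as a chi-square test on allele counts. The Python below is a minimal illustration only, not a method described by the consortium: the function name `allele_chi_square` and the counts are invented for the example, and real studies add steps such as quality control, imputation against a reference panel, and correction for testing millions of variants.

```python
def allele_chi_square(case_alt: int, case_ref: int,
                      ctrl_alt: int, ctrl_ref: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    Rows are cases and controls; columns are counts of the alternate and
    reference allele at one variant. Larger values mean the allele
    frequency differs more between the two groups.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    return n * (a * d - b * c) ** 2 / denom


# Invented counts: the alternate allele is seen 30 times among 100 case
# chromosomes and 10 times among 100 control chromosomes.
stat = allele_chi_square(30, 70, 10, 90)
```

A statistic this large at one site would only be a starting point; a genome-wide scan repeats the test across many variants and applies a much stricter significance cutoff.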
"The 1000 Genomes Project has laid the foundation for others to answer really interesting questions," 23andMe's Adam Auton, the main study's senior author and principal investigator who until recently was assistant professor of genetics at the Albert Einstein College of Medicine in New York City, said in a statement. "Everyone now wants to know what these variants tell us about human disease."