NEW YORK (GenomeWeb) – Researchers from the Personal Genome Project have published more than a hundred phased personal genomes.
In GigaScience, researchers led by Complete Genomics' Brock Peters described how they used the company's Long-Fragment Read technology to generate sequencing and haplotype information for 184 genomes. The researchers also reported uncovering 2.6 million high-quality rare variants not present in either the Single Nucleotide Polymorphism database or the 1000 Genomes Project Phase 3 data.
"The vast majority of genomic data that has been generated to date is without experimentally derived haplotypes," Peters, a senior director of research and project leader at Complete Genomics, said in a statement. "This represents a very unique set of data that is freely available for anyone to use through open access data publication."
As part of the Personal Genome Project, Peters and his colleagues collected blood samples from 184 participants, who consented to have their genotypic and phenotypic data made freely available. These samples were then processed using Complete Genomics' Long-Fragment Read technology, and sequenced to an average read coverage depth of 100X, which the researchers said is three-fold higher than most whole human genome assemblies.
Of these 184 genomes, 114 are available in the GigaDB repository, Peters and his colleagues noted. While the participants all agreed to make their data available, they can review it before it is released, and the researchers said the other 70 genomes are still going through this data-release process.
To gauge reproducibility, the researchers sequenced 20 of the genomes at least twice, with independent LFR barcodes. A pairwise comparison of those doubly sequenced genomes suggested the LFR phasing was reproducible. They also sequenced seven genomes using the standard Complete Genomics sequencing approach.
The researchers reported that for most of the genomes in their dataset, more than 98 percent of heterozygous SNPs could be placed into long haplotypes with an average N50 of 800 kilobases. More than 85 percent of the haplotypes did not contain any errors, the researchers added, and most of the remainder harbored a single phasing error.
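For readers unfamiliar with the N50 statistic cited above, the following is a minimal Python sketch of how it is commonly computed for a set of haplotype blocks. It is not the researchers' code: the function name `n50` and the block lengths are hypothetical and chosen only for illustration. N50 is the block length at which blocks of that size or longer account for at least half of the total phased length.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block lengths.

    N50 is the length L such that blocks of length >= L together
    cover at least half of the summed length of all blocks.
    """
    lengths = sorted(block_lengths, reverse=True)  # longest blocks first
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("block_lengths must not be empty")


# Hypothetical haplotype block lengths in kilobases (not data from the study).
example_blocks_kb = [1200, 900, 800, 400, 300, 200]
print(n50(example_blocks_kb))
```

Because the longest blocks are counted first, a few very long blocks can dominate the value, which is why N50 is usually reported alongside the fraction of heterozygous SNPs that were phased.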
From this set of genomes, Peters and his colleagues also uncovered 2.6 million variants not included in either the SNP database or the 1000 Genomes Project dataset. These variants could represent rare or family-specific variants or de novo mutations, but they could also be false positives. Still, based on the expected number of false positives and the expected number of de novo mutations, the researchers suggested that most of these variants are rare population variants or family-specific ones.
Peters and his colleagues noted that the variant and phasing data can be found at GigaScience and that the corresponding reads and mappings are being made available through dbGaP.