By Monica Heger
This story was originally published on April 11.
As large-scale sequencing projects such as the 1000 Genomes Project and the Cancer Genome Atlas begin generating extensive data sets from multiple sequencing platforms, parsing real variants from sequencing artifacts can be challenging. Not only do different genome centers and institutions use various sequencers, but they also use a wide range of mapping, alignment, and variant-calling tools, which can all contribute to inconsistencies in downstream analysis.
Recognizing this challenge, researchers from the Broad Institute have put together an analytical framework to ensure that scientists have a consistent and reliable way to identify real variants in sequencing data, no matter which sequencing platform or sequencing strategy they use.
"Our focus has been to develop methods that can analyze all sequence data," Mark DePristo, who led the study, told In Sequence.
In a paper published this week in Nature Genetics, the researchers describe a "coherent framework for how to go from sequence output to reliable SNP calls," DePristo added. The methods have been added to the Broad's Genome Analysis Toolkit for analyzing next-gen sequencing data.
"This is one of the first articles coming out from one of the major groups suggesting a pathway for making next-gen sequencing data useful for analysis," said Rasmus Nielsen, an associate professor in evolutionary genomics at the University of California, Berkeley, who was not involved with the project.
The Broad team's model can discern real variants from sequencing artifacts. The group trains the model to recognize the properties of variants that are known to be real, and then uses the model to decide how different or similar new variants are from those known sites, said DePristo.
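The article does not say what kind of model the team uses, so the following is only a minimal sketch of the idea DePristo describes: fit a model to annotations of variants known to be real, then score new calls by how closely they resemble that training set. The independent-Gaussian model, the two annotation columns, and every number here are illustrative assumptions, not the Broad's implementation.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(training_rows):
    """Fit an independent (diagonal) Gaussian to each annotation column."""
    columns = list(zip(*training_rows))
    # Floor the standard deviation so a constant column cannot divide by zero.
    return [(mean(col), max(pstdev(col), 1e-6)) for col in columns]

def log_likelihood(model, row):
    """Log-density of one variant's annotations under the fitted model."""
    total = 0.0
    for (mu, sigma), x in zip(model, row):
        z = (x - mu) / sigma
        total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return total

# Hypothetical annotations per variant: (quality score, strand-bias score).
known_sites = [(20.0, 0.1), (22.0, 0.0), (18.0, 0.2), (21.0, -0.1), (19.0, 0.05)]
model = fit_gaussian(known_sites)

# A call that resembles the known sites, and one that does not.
typical = (20.5, 0.05)
outlier = (3.0, 4.0)
scores = {
    "typical": log_likelihood(model, typical),
    "outlier": log_likelihood(model, outlier),
}
```

A call whose annotations sit far from those of the known sites receives a much lower score, and a threshold on that score would then separate likely variants from likely artifacts.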
In the first part of the framework, the researchers transform the raw read data — which contains platform-specific biases — into a single, generic representation. The reads are then mapped to the reference. Then they eliminate duplicates, refine alignments with a local alignment tool, and calculate a per-base error rate.
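Two of the steps above, duplicate removal and the per-base error rate, can be illustrated with a small sketch. The read and observation layouts below are hypothetical, and the approach (keying duplicates on position and strand, and counting reference mismatches per platform and reported quality) is an assumption about how such steps are commonly done rather than a description of the Broad's code.

```python
import math
from collections import defaultdict

def drop_duplicates(reads):
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

def empirical_quality(observations):
    """Map (platform, reported_q) to an empirical Phred-scaled quality.

    observations: iterable of (platform, reported_q, is_mismatch) tuples,
    one per sequenced base.
    """
    counts = defaultdict(lambda: [0, 0])  # [mismatches, total bases]
    for platform, reported_q, is_mismatch in observations:
        bucket = counts[(platform, reported_q)]
        bucket[0] += int(is_mismatch)
        bucket[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the rate away from 0 and 1.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table

# Hypothetical reads: the second is a duplicate of the first.
reads = [
    {"chrom": "chr1", "start": 100, "strand": "+"},
    {"chrom": "chr1", "start": 100, "strand": "+"},
    {"chrom": "chr1", "start": 100, "strand": "-"},
]
unique_reads = drop_duplicates(reads)

# Hypothetical bases from "platform_a", all reported as Q30, where
# 9 of 998 bases disagree with the reference.
observations = [("platform_a", 30, i < 9) for i in range(998)]
recalibrated = empirical_quality(observations)
```

In this toy input, bases reported at Q30 mismatch the reference about 1 percent of the time, so their empirical quality comes out near Q20. Because the table is keyed by platform, each machine's bias is measured separately.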
In the second phase, the group analyzes the data to look for alternate alleles, including SNPs, short indels, and copy number variations. Finally, they integrate known sites of variation, individual genotypes, linkage disequilibrium, and family and population structure with the raw variant calls to separate true polymorphic sites from artifacts, and then determine genotypes for each sample.
The basic idea of the framework is to "learn the properties of good variation and use that to find other good variation in the genome," DePristo said.
The team tested the framework by sequencing the whole genome of a HapMap sample on the Illumina HiSeq 2000 and the whole exome of the same individual on an Illumina Genome Analyzer.
In their test of whole-genome sequencing with the HiSeq, the researchers sequenced the individual to 60-fold coverage with 101-base paired-end reads. After the initial alignment, their analysis tools showed that around 15 percent of the reads spanning homozygous indels were misaligned, and realignment resolved around 1.8 million loci with mismatching bases. Additionally, the first phase of their framework eliminated 300,000 SNP calls, or more than one-fifth of the raw calls. More than 90 percent of those calls were false positives, the researchers determined.
Next, they applied the same analysis to exome data of the same individual generated on the GA. They used Agilent's hybrid capture, and sequenced to an average of 150-fold coverage with 76-by-101-base paired-end reads. Similar to the HiSeq whole-genome sequencing, the data-processing tools eliminated about 20 percent of the new calls, more than half of which were false positives.
Importantly, the researchers found that despite very different protocols (one data set was whole-genome and the other whole-exome, and each used different sequencing machines and alignment algorithms), the results were consistent between the two data sets after they applied the data-processing steps.
Finally, the team applied the steps to data generated from low-pass sequencing of individuals within the 1000 Genomes Project. In total, they looked at the whole genomes of 61 individuals of European descent, including the same individual whose genome and exome the team sequenced to higher coverage. The individuals in the data set had been sequenced on a variety of platforms, including the Illumina GA, the 454 GS FLX, and the SOLiD.
Variant discovery and genotyping of multiple samples using a low-pass resequencing strategy pose an additional challenge, the authors wrote, because at any particular locus, there is little evidence from which to call a variant. Additionally, different sequencing platforms generate different artifacts, so there is no single consistent method for distinguishing true variants from artifacts.
The Broad team found that the data-processing steps eliminated about four times as many variants in the 1000 Genomes dataset as were eliminated from the HiSeq dataset, which the authors attributed to the lower coverage in the 1000 Genomes set.
For common variant sites, though, the proportion of variants called was near 100 percent. The Broad researchers were able to identify all variants that had been observed more than five times in the samples, as well as 1.4 million new variants.
"Calling multiple samples simultaneously, even with only a handful of reads spanning a SNP for any given sample, enables one to detect the vast majority of common variant sites present in the cohort with a high degree of sensitivity," the authors concluded. However, the low-pass sequencing limited variation discovery and genotyping when compared to the deeper sequencing.
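The authors' point about pooling thin evidence across a cohort can be illustrated with a toy likelihood-ratio calculation. This is a sketch under stated assumptions (a fixed per-base error rate, Hardy-Weinberg genotype priors, an assumed allele frequency, and made-up read counts), not the method used in the paper.

```python
import math

def genotype_likelihood(ref_reads, alt_reads, alt_copies, error=0.01):
    """P(reads | genotype carrying alt_copies alternate alleles out of 2).

    The binomial coefficient is omitted because it cancels in the ratio below.
    """
    frac = alt_copies / 2
    p_alt = frac * (1 - error) + (1 - frac) * error
    return (p_alt ** alt_reads) * ((1 - p_alt) ** ref_reads)

def sample_log_likelihood(ref_reads, alt_reads, alt_freq, error=0.01):
    """Log P(reads | allele frequency), summing over the three genotypes."""
    priors = [(1 - alt_freq) ** 2, 2 * alt_freq * (1 - alt_freq), alt_freq ** 2]
    total = sum(
        prior * genotype_likelihood(ref_reads, alt_reads, copies, error)
        for copies, prior in enumerate(priors)
    )
    return math.log(total)

def log_likelihood_ratio(samples, alt_freq, error=0.01):
    """Log-likelihood of 'site varies in the cohort' minus 'site is reference'."""
    variant = sum(sample_log_likelihood(r, a, alt_freq, error) for r, a in samples)
    reference = sum(sample_log_likelihood(r, a, 0.0, error) for r, a in samples)
    return variant - reference

# Hypothetical low-pass cohort: four reads per sample as (ref_reads, alt_reads).
cohort = [(2, 2)] * 3 + [(4, 0)] * 7
single_llr = log_likelihood_ratio(cohort[:1], alt_freq=0.15)
cohort_llr = log_likelihood_ratio(cohort, alt_freq=0.15)
all_reference_llr = log_likelihood_ratio([(4, 0)] * 10, alt_freq=0.15)
```

Each sample contributes only a few reads, but summing the log-likelihoods across samples gives the cohort far more support for a variant than any one sample provides. A site with no alternate reads anywhere scores below zero.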
DePristo said that the goal of the study was to develop a framework that could be applied to any type of next-gen sequencing data and would allow researchers to compare data generated from different platforms.
Nielsen added that it is currently difficult to compare data from different sequencing platforms. "Different analyses look different in different packages," he said. Additionally, each platform has its own "unique signature" and the "error structure differs from platform to platform." So, he said, researchers "can't use the same method on all the different platforms."
The framework addresses that issue by taking into account the types of errors specific to each platform in order to standardize the data. "When they do the recalibration, they recalibrate specifically for each platform," Nielsen said. This will allow researchers to compare data from across sequencing platforms. He added that the framework seemed feasible for even smaller labs to use.
DePristo said that researchers at the Broad have been using the framework to analyze samples from the 1000 Genomes Project, the Cancer Genome Atlas, and a type-2 diabetes sequencing project, among other projects. In total, he said the team has processed around 8,000 samples.
"For us, the real novelty is the uniformity of the framework and its ability to process across lots of machinery," DePristo added.