By Andrea Anderson
A Belgian-led team has optimized short-read sequence error filters for differentiating between errors and authentic variants in genome sequence data generated using Complete Genomics or Illumina platforms.
The researchers have released their error-filtering pipeline under the name GenomeComb.
As they reported in the advance online version of Nature Biotechnology this week, the researchers started by comparing whole-genome sequence data for identical twins sequenced using the Complete Genomics sequencing-by-ligation platform.
After characterizing a dozen of these short-read error filters in the twin genomes, assessing their performance in combination with one another and at different stringency thresholds, the researchers showed that the error-filtering tools could also be used to prioritize variants in cancer genomes and to home in on genuine single nucleotide variants in genome sequence generated on another short-read platform, the Illumina GAII.
"We did an optimization for all the different filters with the twin study and then we optimized everything and applied them to different genomes," University of Leuven researcher Diether Lambrechts, co-corresponding author on the study, told In Sequence. "We also made combinations of the different filters, which was something that was not done previously."
"If you look at the papers that have been published in the past, people would use various methods, but they never combined all these methods," he explained. "And they also never optimized them."
After doing a literature search to see what kinds of error-filtering methods had been applied in previous whole-genome and exome studies, the team brought together all of the existing error filters, most of which are designed to address errors caused by the presence of poor quality sequence data or difficult-to-map repetitive DNA sequences.
"We took everything that was used previously and that was what we optimized," Lambrechts said. "So we actually took a very systematic approach."
Along with error correction strategies based on relatively straightforward read quality criteria, such as read depth, the team also assessed some filters that were more complicated.
For example, one of the filters, known as a consensus filter, involved re-mapping sequence reads and re-calling variants using algorithms that differed from those used to map and call variants in the genome initially. They then looked at which potential SNVs were identified by both methods.
To do consensus filtering in the current study, the team relied on a method called RTG2.0, developed by the San Francisco firm Real Time Genomics. Real Time Genomics researcher John Cleary is also a co-author on the study.
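The study used RTG2.0 for the independent re-mapping and re-calling step; the article does not describe that software's internals. The core idea of consensus filtering, though, is simple set logic: keep only the variants that two independent pipelines agree on. The sketch below assumes a simplified variant representation of (chromosome, position, alternate allele) tuples; real pipelines compare normalized VCF records.

```python
# Illustrative sketch of consensus filtering: retain only SNVs called
# both by the primary pipeline and by an independent re-mapping and
# re-calling run. The (chrom, pos, alt) tuples are a simplification.

def consensus_filter(primary_calls, independent_calls):
    """Return the SNVs supported by both call sets."""
    return set(primary_calls) & set(independent_calls)

# Toy example with hypothetical calls
primary = {("chr4", 1500, "T"), ("chrX", 900, "G"), ("chr2", 50, "A")}
secondary = {("chr4", 1500, "T"), ("chrX", 900, "G"), ("chr7", 10, "C")}

confirmed = consensus_filter(primary, secondary)
# Only the two calls present in both sets survive
```

Calls unique to one pipeline are treated as likely artifacts of that pipeline's mapper or caller, which is why agreement across independent algorithms is a strong error filter.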
For the first phase of their error-filter assessment, the researchers attempted to come up with optimal combinations of filters and filter stringency settings using whole-genome sequence data from a pair of monozygotic twins who were discordant for schizophrenia.
They reasoned that because the twins' genomes should be more or less identical, most SNVs found in both genomes but not in the human reference sequence were likely to be real. On the other hand, nearly all of the variants at which the twins showed discrepancies were expected to be false positives.
"As monozygotic twins have near-identical genomes, the majority of discordant SNVs are likely to represent genotyping errors in one of the twins," the study authors explained. "For each individual filter, filter thresholds were determined such that they removed a maximum number of discordant SNVs and a minimum number of shared SNVs."
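The thresholding logic the authors describe can be sketched as a simple trade-off search: for each candidate cutoff, count how many discordant SNVs (presumed errors) it removes versus how many shared SNVs (presumed real) it sacrifices. The scores, cutoffs, and the exact objective below are illustrative assumptions, not the paper's actual optimization.

```python
# Sketch of twin-based threshold selection for a single quality filter:
# choose the cutoff that removes the most discordant SNVs while losing
# the fewest shared SNVs. Scores and candidates are made-up values.

def pick_threshold(shared_scores, discordant_scores, candidates):
    """Return the cutoff maximizing (discordant removed - shared lost)."""
    def benefit(t):
        removed = sum(d < t for d in discordant_scores)  # errors filtered
        lost = sum(s < t for s in shared_scores)         # real SNVs lost
        return removed - lost
    return max(candidates, key=benefit)

shared = [40, 45, 50, 55, 60, 62, 70, 75, 80, 90]      # presumed real
discordant = [10, 12, 15, 18, 20, 22, 25, 60, 65, 70]  # presumed errors
cutoff = pick_threshold(shared, discordant, candidates=[15, 25, 35, 45])
```

In this toy data a cutoff of 35 filters seven of the ten presumed errors without touching any shared SNV; pushing to 45 would start discarding real variants.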
The twins' genomes had been sequenced to average coverage depths of 37.9- and 37.3-fold, respectively, using the Complete Genomics platform and genomic DNA isolated from white blood cell samples.
From the unfiltered data, the researchers initially detected more than 2.7 million SNVs that were shared between the twin genomes, which had been assembled using a Complete Genomics assembly tool. Even after removing uncertain calls, however, they were left with tens of thousands of discordant variants prior to error filtering.
Applying filters designed to weed out low quality sequence reads based on coverage depth, SNV quality scores, insertion and deletion patterns, and SNV cluster data, the team trimmed this collection of potential differences to around 9,500 — a much smaller, but still unruly set of SNVs to validate.
By adding in consensus filtering and repetitive DNA-based filters, researchers brought the number of candidate variants down to fewer than 900. They were then able to directly test the remaining 846 discordant SNV candidates by Sanger sequencing.
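The successive narrowing described above, from tens of thousands of discordant candidates down to 846, amounts to chaining filters, with each one discarding the variants that fail its criterion. The sketch below is a hypothetical illustration of that pattern; the filter names, fields, and thresholds are invented for the example and are not GenomeComb's actual settings.

```python
# Illustrative filter chain: each filter is a (name, predicate) pair,
# and applying them in sequence whittles down the candidate list.
# Field names and thresholds here are assumptions for the example.

def apply_filters(variants, filters):
    remaining = list(variants)
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
    return remaining

variants = [
    {"pos": 100, "depth": 35, "qual": 60, "in_repeat": False},
    {"pos": 200, "depth": 4,  "qual": 55, "in_repeat": False},  # low depth
    {"pos": 300, "depth": 40, "qual": 12, "in_repeat": False},  # low quality
    {"pos": 400, "depth": 38, "qual": 70, "in_repeat": True},   # repeat region
]

filters = [
    ("coverage", lambda v: v["depth"] >= 10),
    ("quality",  lambda v: v["qual"] >= 30),
    ("repeats",  lambda v: not v["in_repeat"]),
]

survivors = apply_filters(variants, filters)
# Only the first variant passes all three filters
```

Because each filter only removes candidates, the order does not change the final set, though logging per-filter counts shows where most candidates drop out.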
In the process, they verified two genuine differences between the twins' genomes: one SNV on chromosome 4 and another on the X chromosome. The group "would never have been able to find the variants" without the error filtering, Lambrechts said. Moreover, preliminary follow-up studies hint that the latter variant, which was mosaic in both twins, is found more frequently in the twin with schizophrenia.
Along with the identification of the two discordant SNVs, the researchers used the identical twin genomes to optimize error filter settings and estimate their performance.
For example, from their error analyses of the genomes, combined with follow-up Sequenom MassArray genotyping and Sanger sequencing experiments, the team estimated that a combination of error filters can decrease the error rate associated with newly identified SNVs by 290-fold.
Nevertheless, they explained, the combination of filters and filter stringencies that are most appropriate for a given study will vary depending on the research questions being asked and the samples used.
Indeed, when the researchers turned their attention to genome sequences generated for matched tumor-normal samples from three individuals with ovarian cancer, they found that their combination of error filters removed authentic variants as well as errors when used at sensitive error detection thresholds.
"In the cancer genomes, you don't have so many errors and you have, also, a lot of somatic mutations that differ from the germline sample," Lambrechts noted. "We found that if we use very stringent filtering we also remove a lot of the true differences."
The genomes, which had been sequenced on the Complete Genomics platform, were generated using DNA from fresh-frozen tumor samples rather than cancer cell lines. Consequently, the samples contained some normal cells as well, Lambrechts noted, a feature that can complicate the detection of somatic mutations owing to contamination of the tumor genome by normal genome sequences.
"When you sequence the tumor sample that comes from the patient, you're actually sequencing two genomes: you're sequencing part of the normal genome and part of the tumor genome," he said. "That really contaminates the analysis, because then you are looking at two genomes — and that also leads to a lot of errors in the analyses."
Nevertheless, by using less rigorous filtering than they had applied to the identical twins' genomes, the researchers were able to find a large proportion of the true differences between the tumor and normal genomes. Error filtering also helped in differentiating the normal and tumor sequences from one another, Lambrechts explained.
"We've shown that by applying the filters and the methods that we used, we could remove a large portion of the errors that are coming from this contamination of normal cells," he said.
Finally, using genome sequence data for a Yoruban woman sampled through the HapMap project who had her whole genome sequenced as part of the 1000 Genomes Project, the team looked at how its error-filtering strategies performed not only on Complete Genomics sequence data, but also on short-read data generated on the Illumina GAII.
Because the woman had had her genome sequenced on three independent platforms — the GAII, the Complete Genomics platform, and a Life Technologies SOLiD instrument — it was possible for researchers to establish a set of reference variants from her genome that were present in reads from at least two of the platforms and, therefore, believed to be genuine, Lambrechts explained.
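The two-of-three-platform rule used to build the reference variant set is a straightforward majority vote across call sets. The sketch below assumes the same simplified (chromosome, position, allele) tuples as before; the toy calls are hypothetical, not data from the study.

```python
# Illustrative sketch of the cross-platform reference set: keep the
# variants observed on at least two of the three sequencing platforms.
from collections import Counter

def reference_set(platform_calls, min_platforms=2):
    """Variants seen in at least min_platforms of the call sets."""
    counts = Counter(v for calls in platform_calls for v in set(calls))
    return {v for v, n in counts.items() if n >= min_platforms}

# Hypothetical calls from the three platforms
gaii  = {("chr1", 100, "A"), ("chr2", 200, "T"), ("chr3", 300, "G")}
cg    = {("chr1", 100, "A"), ("chr2", 200, "T")}
solid = {("chr1", 100, "A"), ("chr5", 500, "C")}

ref = reference_set([gaii, cg, solid])
# Variants seen on two or three platforms make the reference set
```

Variants seen on only one platform are excluded as possible platform-specific artifacts, which is what makes the resulting set a usable truth benchmark for measuring each platform's false positives and false negatives.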
Once they had established this reference set, the team compared it to sequences from the Complete Genomics and Illumina genomes to get a sense of the types of errors associated with each.
"We found that Complete Genomics genomes have fewer false-positive errors compared to Illumina, but they also miss quite a lot of variants, so they have more false-negative errors," Lambrechts said.
This finding is in line with the results of a Stanford University comparison of the Complete Genomics and Illumina platforms. That study was also published in Nature Biotechnology this week (see story, same issue).
Appreciating the kinds of sequencing errors associated with each platform can help in determining the types of filters that are most useful for analyzing sequence data, he added. "It can determine the set of filters that you will use, depending on which genome-sequencing technology you use."
For example, the researchers' findings suggest that filters focusing on repeat regions and indels worked very well for data generated on the Complete Genomics platform, hinting that errors in genomes generated on that platform may be somewhat more likely to fall in repeat-rich regions or in parts of the genome containing microsatellites or indels.
On the other hand, the approach that appeared to show the most promise for filtering out errors in Illumina genome sequence data was the independent mapping and variant calling algorithm, Lambrechts noted, suggesting that the tools typically used to map and call variants in Illumina genomes could still be improved.
The researchers have made the GenomeComb pipeline available to other members of the research community. The algorithms and tools for filtering Complete Genomics or Illumina genome sequence data can be downloaded from the GenomeComb site.
The thresholds for some of the filters, such as the coverage depth filter, can be adjusted depending on the research question at hand, Lambrechts explained. Other filters — for instance, the repeat region filter — can either be used as is or left out of filter combinations.
"The combination of filters that you use actually depends on the biological question that you are asking," he said. For example, "you could start with a limited set of filters, and do your analysis, and if you don't find a mutation [related to what you're looking at], you could then apply additional filters so you become more and more stringent."
Lambrechts explained that it may be advantageous to leave out repeat-focused filters for some cancer genomes. "You have a lot of errors [in repeat regions], but of course if you have a tumor that has mutations in the repeat regions — such as the microsatellite-instable tumors — you're not going to use that set of filters," he said.
For his part, Lambrechts predicted that the importance of different error filters will likely change over time as sequencing methods continue to advance and read quality improves.
At the moment, though, he and his colleagues are using the error-filtering method to aid in their analysis of cancer sequences. They are primarily focused on sequencing ovarian and other gynecological cancers.
"We see the tools as a prioritization method," Lambrechts said. "You can look at the genome without filtering and you have too many variants to validate, but then you can gradually apply the filters, and the variants that you identify when you apply all the filters are mostly the variants that have the highest confidence."