NEW YORK (GenomeWeb) – Researchers from Vanderbilt University and other institutions have developed an improved statistical tool for calling de novo mutations in whole-genome or whole-exome data from parent-proband trios.
Unlike some current methods, the new tool does not rely on pre-specified mutation rates, and successfully avoids the limitations associated with that approach.
In a paper published last week in Bioinformatics, the researchers wrote that the freely available tool, called TrioDeNovo, uses a Bayesian model selection approach to evaluate evidence for de novo mutations. Unlike existing methods, it does not require users to pre-specify mutation rates. Those existing methods include PolyMutt, which was also created by some members of the TrioDeNovo development team, and DeNovoGear, one of the tools developed under the auspices of the 1000 Genomes Project. These methods, according to the paper, "[entangle] de novo mutations with Mendelian inheritance in the same model such that the joint likelihood of the data depends on the pre-specified mutation rate," and that poses several challenges.
Using a pre-specified mutation rate "has a strong influence on the mutation calling," the researchers wrote, and can result in a "loss of accuracy when inappropriate mutation rates are used." Moreover, mutation rates vary across the genome with different classes of mutations exhibiting different patterns, which makes selecting a single mutation rate a less-than-optimal approach, they noted. Also, rate estimates for complex mutations such as insertions and deletions "are largely unknown" and choosing an inappropriate rate "may result in dramatically reduced mutation calling efficiency," according to the researchers. Finally, interpreting the evidence of mutations that are called using these existing tools is "less intuitive ... because of the entanglement of the pre-specified mutation rate in the data likelihood calculation," the paper states.
TrioDeNovo addresses these limitations by defining two models for each relevant genomic site. Both models are generated from the genotype likelihood values found in VCF files. One represents a scenario in which a de novo mutation is present in the offspring's data, and the other a scenario in which there is no mutation in the child's data, Bingshan Li, an assistant professor of biostatistics at Vanderbilt and one of the authors on the paper, explained to GenomeWeb.
The method computes the likelihood of the data under each of the modeled scenarios and then takes the ratio of the two likelihoods to represent the quality of the evidence that supports the presence of a de novo mutation in the genomic region in question. Once they've calculated this ratio, users can then enter different relative mutation rates for the genomic sites into the models to find additional support for the presence of the de novo mutations in these areas.
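The two-model comparison described above can be illustrated with a minimal Python sketch. This is not TrioDeNovo's actual implementation, and every name in it is hypothetical. It assumes a biallelic site, Phred-scaled genotype likelihoods such as the VCF PL field, Hardy-Weinberg priors on the parents' genotypes with an assumed alternate allele frequency, and a simplified de novo model in which exactly one of the two transmitted alleles changes state.

```python
import math
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def pl_to_likelihoods(pl):
    """Convert Phred-scaled likelihoods (e.g. VCF PL) to linear scale."""
    return tuple(10 ** (-value / 10) for value in pl)


def hwe_prior(genotype, alt_freq):
    """Hardy-Weinberg genotype prior (an assumption of this sketch)."""
    ref_freq = 1.0 - alt_freq
    return (ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2)[genotype]


def child_given_parents(g_mother, g_father, de_novo):
    """P(child genotype | parental genotypes) under one of the two models."""
    probs = [0.0, 0.0, 0.0]
    for a, b in product((0, 1), repeat=2):  # allele passed on by each parent
        p_a = g_mother / 2 if a else 1 - g_mother / 2
        p_b = g_father / 2 if b else 1 - g_father / 2
        weight = p_a * p_b
        if de_novo:
            # Simplified mutation model: one transmitted allele flips state.
            probs[(1 - a) + b] += 0.5 * weight
            probs[a + (1 - b)] += 0.5 * weight
        else:
            probs[a + b] += weight  # plain Mendelian inheritance
    return probs


def model_likelihood(gl_mother, gl_father, gl_child, de_novo, alt_freq=0.01):
    """Likelihood of the trio data, summed over all genotype configurations."""
    total = 0.0
    for g_m, g_f in product(GENOTYPES, repeat=2):
        parents = (hwe_prior(g_m, alt_freq) * gl_mother[g_m]
                   * hwe_prior(g_f, alt_freq) * gl_father[g_f])
        child = child_given_parents(g_m, g_f, de_novo)
        total += parents * sum(child[g] * gl_child[g] for g in GENOTYPES)
    return total


def de_novo_score(gl_mother, gl_father, gl_child, alt_freq=0.01):
    """log10 ratio of the de novo likelihood to the no-mutation likelihood."""
    with_mutation = model_likelihood(gl_mother, gl_father, gl_child, True, alt_freq)
    without = model_likelihood(gl_mother, gl_father, gl_child, False, alt_freq)
    return math.log10(with_mutation / without)


def posterior_log10_odds(score, mutation_rate):
    """Combine the ratio with a mutation rate supplied after the fact."""
    return score + math.log10(mutation_rate / (1.0 - mutation_rate))


# Hypothetical PL values: both parents homozygous reference, child heterozygous.
hom_ref = pl_to_likelihoods((0, 60, 90))
het = pl_to_likelihoods((60, 0, 60))
print(de_novo_score(hom_ref, hom_ref, het))  # strongly positive
print(de_novo_score(het, hom_ref, het))      # close to zero
```

In this sketch the ratio never involves a mutation rate. A rate enters only in `posterior_log10_odds`, mirroring the article's point that rates can be applied after the evidence has been computed. A heterozygous child with a heterozygous parent scores near zero because the genotype data alone cannot distinguish inheritance from mutation in that configuration.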
The paper describes the results of running TrioDeNovo and DeNovoGear on simulated and real datasets that were both sourced from the 1000 Genomes Project, highlighting the improved sensitivity and specificity that TrioDeNovo offers. Because DeNovoGear requires pre-specified mutation rates, Li et al. ran it on the simulated datasets of varying sequencing depth using three different mutation rates. For DeNovoGear, at a sequencing depth of 17x, "the maximum achievable sensitivity is 40.2 percent when 10^-12 was used as the prior mutation rate, and at 51x and 68x the false positive rates increased when a prior mutation rate of 10^-4 was used," the paper states. When TrioDeNovo was run on the same datasets without using the prior mutation rates, the researchers report that it outperformed DeNovoGear in all cases.
On the real datasets, TrioDeNovo also outperformed DeNovoGear, according to the paper, showing higher sensitivity than DeNovoGear at each of the mutation rates tested in the study. In one experiment, which focused on known germline mutations in a trio from the 1000 Genomes Project, the researchers reported that their solution achieved a sensitivity of 100 percent compared to a maximum sensitivity of 95.8 percent for DeNovoGear.
The researchers also investigated the impact of sequencing coverage on mutation calling accuracy by applying both methods to only 75 percent and 25 percent of the reads from the trio. They report that when they used 75 percent of the reads, the results for both methods were similar to those achieved with the full read set. There was a difference, however, when only 25 percent of the reads were used. Here, TrioDeNovo "achieved higher sensitivity without sacrificing specificity" compared with DeNovoGear, they wrote.
The researchers believe that the method could help identify mutations associated with autism and other neurological conditions as well as rare diseases in both research and clinical contexts, Li told GenomeWeb. As a next step, the team is working on extending the TrioDeNovo framework to take into account additional data from related family members such as siblings to further improve the accuracy of de novo variant calls, he said.