NEW YORK (GenomeWeb) – Researchers from Washington University in St. Louis and Arizona State University have been awarded a five-year, $500,000 grant from the National Institutes of Health to expand DeNovoGear, a software package for detecting de novo mutations in sequence data.
One of the principal investigators on the project, Donald Conrad, an assistant professor in WUSTL's departments of genetics and of pathology and immunology, developed DeNovoGear with colleagues at other institutions under the auspices of the 1000 Genomes Project. They created the software to detect novel genetic insertions, deletions, and point mutations in sequence data collected from familial and somatic tissue samples as part of the project. The developers published a paper in Nature Methods last year in which they described the software in detail.
Now, Conrad, co-PI Reed Cartwright, an assistant professor of evolutionary medicine and informatics at ASU, and their teams are using the new National Human Genome Research Institute grant to improve DeNovoGear's ability to detect new mutations in familial trios, in data from multiple tissues collected from a single individual, and in single-cell sequencing data.
According to its abstract, the objectives for the grant are to "determine the probability that an apparent DNA sequence change is due to a de novo mutation when analyzing short-read sequencing data from families" while accounting for possible sources of error or noise such as "sequencing error, population diversity, and chromosome segregation."
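To make that first objective concrete, here is a minimal Python sketch of how a posterior probability of a de novo mutation could be computed for one biallelic site in a family trio. It is not DeNovoGear's actual model. The function names, the binomial read model, the Hardy-Weinberg prior, and the default values for the sequencing error rate, alternate allele frequency, and per-allele mutation rate are all assumptions made for illustration. The three noise sources named in the abstract appear as the `error`, `alt_freq`, and Mendelian transmission terms.

```python
from itertools import product
from math import comb


def read_likelihood(ref_reads, alt_reads, genotype, error=0.01):
    """P(read counts | genotype), where genotype = number of alt alleles (0, 1, 2).

    Assumed binomial model: a read shows the alt allele with probability
    `error`, 0.5, or 1 - `error` for genotypes 0, 1, and 2 (sequencing error).
    """
    p_alt = {0: error, 1: 0.5, 2: 1.0 - error}[genotype]
    n = ref_reads + alt_reads
    return comb(n, alt_reads) * p_alt**alt_reads * (1.0 - p_alt) ** ref_reads


def genotype_prior(genotype, alt_freq):
    """Hardy-Weinberg prior on a parental genotype (population diversity)."""
    q = alt_freq
    return {0: (1.0 - q) ** 2, 1: 2.0 * q * (1.0 - q), 2: q**2}[genotype]


def transmitted(parent_genotype, mu):
    """Yield (allele, mutated, probability) for the allele a parent passes on.

    Each parental allele is chosen with Mendelian probability (chromosome
    segregation) and then flips to the other allele with probability `mu`.
    """
    p_alt = parent_genotype / 2.0
    for allele, p in ((0, 1.0 - p_alt), (1, p_alt)):
        if p == 0.0:
            continue
        yield allele, False, p * (1.0 - mu)
        yield 1 - allele, True, p * mu


def de_novo_posterior(mother, father, child, alt_freq=0.001, mu=1e-8, error=0.01):
    """Posterior probability that at least one transmitted allele mutated.

    `mother`, `father`, and `child` are (ref_reads, alt_reads) tuples.
    """
    total = 0.0
    with_mutation = 0.0
    for gm, gf in product(range(3), repeat=2):
        # Joint weight of this pair of parental genotypes given the parents' reads.
        parents = (
            genotype_prior(gm, alt_freq) * read_likelihood(*mother, gm, error)
            * genotype_prior(gf, alt_freq) * read_likelihood(*father, gf, error)
        )
        for (am, mut_m, pm), (af, mut_f, pf) in product(
            transmitted(gm, mu), transmitted(gf, mu)
        ):
            child_genotype = am + af
            joint = parents * pm * pf * read_likelihood(*child, child_genotype, error)
            total += joint
            if mut_m or mut_f:
                with_mutation += joint
    return with_mutation / total


if __name__ == "__main__":
    # Both parents look homozygous reference at high depth; the child looks heterozygous.
    print(de_novo_posterior((30, 0), (30, 0), (15, 15)))
    # The mother is clearly heterozygous, so the child's variant is most likely inherited.
    print(de_novo_posterior((15, 15), (30, 0), (15, 15)))
```

Under these assumptions, the posterior is high only when both parents are confidently homozygous for the reference allele. With shallow parental coverage, an unobserved heterozygous parent becomes a more plausible explanation than a new mutation, and the posterior drops.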
Next, the researchers will expand the capabilities of the software to enable the detection of somatic de novo mutations such as those found in matched tumor-normal datasets. Lastly, "we will develop new models to handle sequencing data from single-cell sequencing, which generates different probabilities of error compared to those discussed previously," the abstract states.
Essentially, the researchers are trying to make DeNovoGear a more general mutation detector, one that can locate mutations that occur not just in the germline, Conrad told BioInform. The idea dates back to the 1000 Genomes Project, according to Conrad. In analyzing those datasets, "it occurred to me that there was a really common statistical modeling problem underlying what we were trying to do," he said. Basically, researchers were trying to reliably infer mutations in sequence information collected from a set of related entities, whether that data came from a single individual's tissues or from a family trio. DeNovoGear has already been applied successfully to familial data, as a 2011 Nature Genetics study shows, and the goal now is to extend it to other applications in which a researcher might want to infer mutations, for example in evolutionary genetics and model organism studies, Conrad said.
In updating DeNovoGear, Conrad and his colleagues are focusing on "algorithms that can do fast computations on graph structures using all of the data [in these structures] simultaneously," he said. They plan to incorporate statistical tools such as hierarchical modeling, which tries to fit data from multiple samples to specific parameters at the same time.
Some of the techniques they'll use are already implemented in software such as Samtools and the Genome Analysis Toolkit (GATK), but the group will develop others from scratch, according to Conrad. For example, they'll develop new error models for handling data from single-cell sequencing experiments, because these experiments have a unique library preparation process that introduces artifacts, which lead to errors in the mutation identification step if software such as Samtools or GATK is used, he said.
To provide a sense of what the other planned additions to DeNovoGear might be, Conrad referred to the NIH-funded Genotype-Tissue Expression program, a project in which he is involved that aims to study gene expression and regulation in multiple human tissues. Although the project focuses largely on mapping expression quantitative trait loci, the datasets could provide some insights into somatic mutations, since the researchers are collecting samples from multiple tissues in each study participant, Conrad said. He and his team plan to add extensions to DeNovoGear that help researchers infer mutations from the RNA-sequencing data while accounting for confounding biological processes such as RNA editing or allele-specific expression.