Scientists at the Genome Institute of Singapore of the Agency for Science, Technology and Research have developed a pair of algorithms that use techniques from signal processing theory, a branch of engineering and applied mathematics, to identify genomic features in data from a variety of sequencing assays.
Both tools were developed to process signals from sequence tags, short sequences used for identification purposes, in sequencing-based functional assays such as ChIP-seq, DNase-seq, and formaldehyde-assisted isolation of regulatory elements followed by sequencing, or FAIRE-seq, the developers explain in a paper published last week in Nature Biotechnology. The paper also provides detailed descriptions of the algorithms and their applications to sequencing datasets from a variety of cell lines.
The Detection Filter, or DFilter, one of the two algorithms, identifies regulatory features in a variety of sequencing datasets. According to the Nature Biotech paper, it is based on a "linear detection filter" that "maximize[s] the difference between filter outputs at true-positive regions and noise regions." The second algorithm, called the Estimation Filter, or EFilter, is used to estimate mRNA levels from histone profiles. It uses "a linear least-squares approach, and incorporates two additional features to facilitate the removal of bias and the use of cell types for training that are distinct from the target cell type," the researchers wrote.
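To make the "linear least-squares approach" concrete, here is a minimal sketch of ordinary least squares in plain Python. It is not EFilter itself: the two "histone-mark" features, the expression values, and all function names are made up for illustration, and the bias-removal and cross-cell-type training features the paper mentions are not reproduced.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(features, targets):
    """Return weights (intercept last) that minimize squared prediction error."""
    rows = [list(f) + [1.0] for f in features]  # append an intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """Apply the fitted weights to one gene's feature values."""
    return sum(w * x for w, x in zip(weights, list(feature_row) + [1.0]))


# Hypothetical training data: two histone-mark signals per gene, with
# expression values generated from a known linear rule so the fit can be checked.
train_marks = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 2.0), (1.0, 3.0)]
train_expr = [2.0 * a - 1.0 * b + 0.5 for a, b in train_marks]

weights = fit_least_squares(train_marks, train_expr)
predicted = predict(weights, (2.0, 2.0))
print("weights:", weights, "prediction for a new gene:", predicted)
```

Because the toy expression values follow an exact linear rule, the fit recovers that rule. Real histone and mRNA data would be noisy, and the weights would only approximate the underlying relationship.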
In a statement, Rob Mitra, a professor of computational biology and an associate professor in the genetics department at the Washington University School of Medicine, who was not involved with the research, said that the GIS team's work provides "an elegant solution to a ubiquitous problem: separating the signal from the noise in deep-sequencing datasets." DFilter in particular, he said, "represents a significant advance because it is widely applicable … more accurate than existing algorithms, [and] can be used to analyze virtually any sequence-tag analysis of DNA binding."
This is in contrast with the status quo — "a plethora" of assay-specific bioinformatics tools that have been developed for processing sequence-tag signals, according to the GIS researchers. Two examples highlighted in their paper are a blind deconvolution approach used for transcription factor ChIP-seq analysis and non-local means, which is used to detect genomic segments that are enriched with RNA polymerase II. The "specialized nature" of these and other similar algorithms "makes it difficult to compare, integrate, or uniformly analyze data from multiple sources," they wrote.
There are "too many different ad hoc approaches and every one seems to be treating this as 20 different problems when it's actually all one and the same problem," Shyam Prabhakar, the associate director for GIS' integrated genomics arm and a co-author on the paper, told BioInform. "[Our] goal was to … come up with one algorithm that you can throw any dataset at [and that] will intelligently figure out how to call peaks in that dataset without any expertise on the user's part."
For the GIS team, the development process began with recognizing that "many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems; signal detection and signal estimation," according to the Nature Biotech paper.
It's "looking at all these profiles and saying how do you tell that this is a peak and this is not a peak, or this is a region that has this histone mark and this is a region that doesn't, or this is a region that has this transcription factor binding [site] and the other regions don't … once you recognize that, the answer is obvious," Prabhakar said. These are "standard signal processing problems" that have already been addressed in fields like engineering, he said.
Both methods use variations of a technique known as a pre-whitening matched filter, which is widely used in mobile phones and radar. According to the GIS team, this is the first time that the technique has been adapted to the analysis of high-throughput sequencing data. They do note in the Nature Biotech paper that there are other methods that borrow from the engineering field.
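For readers unfamiliar with the term, the following is a minimal textbook-style sketch of a pre-whitening matched filter in plain Python, not the GIS team's implementation. It assumes the background noise is correlated from one position to the next with a known coefficient (`rho`, an AR(1) model chosen here for simplicity) and that the expected peak shape (`template`) is known. Both are illustrative assumptions, as are all names and numbers in the code.

```python
import random


def whiten(x, rho):
    """Undo AR(1) correlation: n[t] = rho * n[t-1] + e[t] becomes e[t]."""
    return [x[0]] + [x[t] - rho * x[t - 1] for t in range(1, len(x))]


def prewhitened_matched_filter(signal, template, rho):
    """Correlate the whitened signal with the whitened template."""
    white_signal = whiten(signal, rho)
    # Pad with one zero so the whitened template keeps its trailing term.
    white_template = whiten(list(template) + [0.0], rho)
    m = len(white_template)
    return [
        sum(white_signal[start + k] * white_template[k] for k in range(m))
        for start in range(len(white_signal) - m + 1)
    ]


# Made-up example: a peak-shaped template hidden in correlated noise.
rng = random.Random(0)
rho = 0.8                      # assumed noise correlation
template = [3.0, 6.0, 9.0, 6.0, 3.0]
true_start = 60                # position where the peak is planted

noise = [0.0]
for _ in range(119):
    noise.append(rho * noise[-1] + rng.gauss(0.0, 1.0))

signal = list(noise)
for k, value in enumerate(template):
    signal[true_start + k] += value

scores = prewhitened_matched_filter(signal, template, rho)
best_start = max(range(len(scores)), key=scores.__getitem__)
print("planted at", true_start, "highest filter output at", best_start)
```

The whitening step first flattens the correlated background so that it no longer resembles a peak. The matched-filter step then scores each position by how closely the whitened data resemble the whitened peak shape. Calling peaks amounts to thresholding those scores.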
One method that does borrow from engineering, described in a paper published last year in BMC Bioinformatics, is a signal processing approach that combines a "signal denoising algorithm with a false discovery rate approach" and was developed for detecting "enriched regions" in RNA polymerase II.
In an email to BioInform, one of the authors of that paper, Kun Huang, an Ohio State University associate professor of biomedical informatics and co-director of its Biomedical Informatics Shared Resource, described the GIS study as a step in the right direction.
Commenting specifically on ChIP-seq analysis, Huang noted that in spite of "the large volume and long history of signal processing work in electrical engineering and physics, most of the existing ChIP-seq algorithms are based on local statistics without explicitly resort[ing] to these tools."
That's why "we proposed the denoising approach followed by a statistical analysis," he said. "However, I always believed that more rigorous signal processing analysis methods can be developed and will lead to promising results. This work takes a step towards that direction using a signal processing approach and the results are convincing to me."
The GIS team also claims that because their methods are based on "uniform" and "formal optimal" mathematical techniques, they are more accurate than current algorithms, many of which are based on heuristics.
Graphs included in the paper show that DFilter either outperformed or performed as well as four competing tools (MACS, QuEST, F-Seq, and ZINBA) on ChIP-seq, DNase-seq, and FAIRE-seq datasets. In one experiment, EFilter was used to estimate mRNA levels from histone ChIP-seq data and its results were compared to those of an existing linear regression method. According to the paper, EFilter provided more accurate predictions than the competing method.
In addition to reporting improved accuracy, the paper highlights what Prabhakar described as some "biologically interesting discoveries," which the team made using the algorithms. These findings also point to the potential utility of the tools, particularly in clinical settings, he said.
It turns out that histone modifications "are astonishingly predictive of how much mRNA is produced by [a] gene," he said. By doing "some mathematical processing on the extent of various chemical modifications of the chromatin in and around the transcription start site … [you can] do a really good job of predicting how much mRNA is produced from that gene."
This has "huge implications" for old tissue samples where the mRNA has already degraded and can't be measured by conventional methods, he said. "If you profile chemical changes to chromatin and predict mRNA levels, then you can get all these insights into clinical samples that have been sitting around in paraffin at room temperature where the mRNA is shot," he said.