NEW YORK (GenomeWeb) – Researchers from the National Institutes of Health's National Human Genome Research Institute have assembled a computational pipeline for analyzing epigenetic data. Called SigSeeker, it combines the outputs of multiple bioinformatics tools at each step of the analysis process to improve the quality of predicted results.
Jens Lichtenberg, a post-doctoral fellow at NHGRI and one of SigSeeker's developers, presented the pipeline during the Bioinformatics Open Source Conference, one of several special interest group meetings held prior to the start of the Intelligent Systems for Molecular Biology conference in Boston earlier this month.
SigSeeker fills a need for a more integrated approach to analyzing epigenetic marks such as DNA methylation, transcription factor binding, and histone modifications. Next-generation sequencing techniques such as MethylSeq and ChIP-seq have made genome-wide analysis of these markers possible, but existing tools either use different algorithms for analysis, leading to inconsistent results, or are developed to answer specific research questions.
Rather than pit individual tools against each other to determine which offers the best results, SigSeeker combines them in its framework and applies each to the input data, producing more accurate results than would be possible if the component methods were run in isolation. In fact, the developers claim that internal benchmarks indicate that their ensemble approach can offer up to a 300 percent increase in sensitivity for detecting true positives.
In his talk and in a poster presentation at ISMB, Lichtenberg said that he and his colleagues have used the pipeline to analyze data from hematopoiesis studies at the NIH. They've published at least one paper in Genome Research focusing on the role of DNA methylation in hematopoiesis. In that study, they used SigSeeker to analyze data collected and sequenced from primary mouse blood cells. However, the software can be used more broadly for any kind of epigenetic analysis in species such as human, mouse, and dog.
SigSeeker overcomes the shortcomings of other single-technique approaches "by considering the complete set of established expression and epigenetic data during the analysis process," the researchers explained in the conference abstract. Furthermore, "it allows comparisons of user-generated data as well as correlations of these data to publicly available epigenetic and expression data," they said. The outputs of each module in SigSeeker's framework are "evaluated for their statistical significance during ... each stage of the analysis process as well as in a final report."
It builds on concepts used in existing methods such as EpiGraph, according to the ISMB poster, by adding features such as functional, regulatory, and structural annotation for epigenetic peak profiles; and it is able to infer "regulatory genomic signatures that go hand in hand with the epigenetic profiles." Also, "unlike the submission- and resubmission-based EpiGraph analysis pipeline, SigSeeker is designed to support the generation and analysis of epigenetic data at a much closer level, allowing for frequent adjustments and parameter modification of various stages of the analysis," according to the developers.
SigSeeker includes modules for mapping sequencing reads, detecting enriched regions within the mapped reads, and correlating these regions with RNA expression and other datasets. Input data are supplied to each analysis package included in the pipeline and the results for the different packages are compiled into a single profile, which can be correlated with mRNA expression datasets. Statistical analyses of the report are also provided. Since several tools are used to analyze the data at each step of the process, the researchers have higher confidence in the results that are generated at the end of each stage.
The ISMB poster provided more details about the exact components of SigSeeker. For quality control, where the researchers evaluate which sequenced samples are suitable for further analysis, SigSeeker runs both the FastQC and HTQC software and then integrates results from both reports. The filtered reads from this step are then fed into three tools in the alignment phase: BWA, Bowtie, and Bowtie2. The output of each alignment tool is integrated and serves as input into the peak calling phase of the analysis. In terms of integrating the results from the alignment tools, the researchers look for overlap in the predictions, Lichtenberg explained. That means that an alignment prediction counts only if it is mapped by more than one tool to a particular genomic location.
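The overlap rule for alignments can be illustrated with a minimal Python sketch. It is not SigSeeker's code: the function name `consensus_alignments`, the `min_tools` parameter, and the simplified representation of an alignment as a (chromosome, position, strand) tuple are all assumptions made for illustration.

```python
from collections import defaultdict


def consensus_alignments(calls_by_tool, min_tools=2):
    """Keep read placements reported by at least `min_tools` aligners.

    `calls_by_tool` maps a tool name to {read_id: (chrom, pos, strand)}.
    This structure is hypothetical; real aligners emit SAM/BAM records.
    """
    support = defaultdict(set)
    for tool, calls in calls_by_tool.items():
        for read_id, location in calls.items():
            # Record which tools placed this read at this exact location.
            support[(read_id, location)].add(tool)
    return {
        key: sorted(tools)
        for key, tools in support.items()
        if len(tools) >= min_tools
    }


# Toy input: r1 is placed identically by all three tools, r2 is placed
# at two different positions, and r3 is reported by only one tool.
calls = {
    "bwa": {"r1": ("chr1", 100, "+"), "r2": ("chr2", 500, "-")},
    "bowtie": {"r1": ("chr1", 100, "+"), "r2": ("chr2", 900, "-")},
    "bowtie2": {"r1": ("chr1", 100, "+"), "r3": ("chrX", 42, "+")},
}
kept = consensus_alignments(calls)
```

In this toy input only read r1 survives, because it is the only read that more than one tool maps to the same location.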
For peak calling — which identifies regions of enrichment in the genome — SigSeeker combines output from five tools. In this step, SigSeeker looks at the overlap in the results generated by the different tools in terms of both location and intensity of the overlap, Lichtenberg said. This sets it apart from other integrative analysis approaches which look only at the location information, he said. The tools used for this step of the process are MACS, MACS2, cisGenome, BroadPeak, and SoleSearch.
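One way to picture a combined location-and-intensity check is the Python sketch below. The article does not describe how SigSeeker measures either quantity, so everything here is an assumption: peaks are (chromosome, start, end, score) tuples with half-open coordinates, location agreement means overlapping at least half of the shorter peak, intensity agreement means scores within a twofold ratio, and scores from different tools are assumed to be already normalized to a common scale.

```python
def overlap_bp(a, b):
    """Base pairs shared by two (chrom, start, end, score) peaks."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def peaks_agree(a, b, min_overlap_frac=0.5, max_score_ratio=2.0):
    """True if two peaks agree in both location and intensity.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    shorter = min(a[2] - a[1], b[2] - b[1])
    if shorter <= 0:
        return False
    location_ok = overlap_bp(a, b) / shorter >= min_overlap_frac
    low, high = sorted((a[3], b[3]))
    intensity_ok = low > 0 and high / low <= max_score_ratio
    return location_ok and intensity_ok


def consensus_peaks(peaks_by_tool, min_tools=2):
    """Return (tool, peak, supporters) for peaks backed by enough tools."""
    kept = []
    for tool, peaks in peaks_by_tool.items():
        for peak in peaks:
            supporters = {tool}
            for other, other_peaks in peaks_by_tool.items():
                if other != tool and any(peaks_agree(peak, q) for q in other_peaks):
                    supporters.add(other)
            if len(supporters) >= min_tools:
                kept.append((tool, peak, sorted(supporters)))
    return kept


# Toy input: the first pair of peaks agrees in position and score; the
# second pair shares a position but differs tenfold in score.
peaks_by_tool = {
    "macs": [("chr1", 100, 200, 10.0), ("chr1", 1000, 1100, 50.0)],
    "macs2": [("chr1", 120, 220, 12.0), ("chr1", 1000, 1100, 5.0)],
}
supported = consensus_peaks(peaks_by_tool)
```

A location-only comparison would accept both pairs of peaks in this example; adding the intensity test rejects the second pair.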
Along with the integrated report created in this step, SigSeeker generates a list of peaks that are assigned to genes, in an ensuing partitioning step, based on genomic landmarks such as CpG islands, ancestral repeats, and promoters. As part of their exploration of the hematopoietic process in mice, SigSeeker's developers created the Systems Biology Repository (SBR), which holds existing epigenetic datasets from the Gene Expression Omnibus and other databases. This information comes in handy "when we have our peaks correlated with genes during partitioning," Lichtenberg said. "It allows us to directly correlate our epigenetic data with our transcriptomic data" and determine which genes are expressed and which are silenced.
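As a rough illustration of partitioning followed by expression lookup, the Python sketch below assigns peaks to genes whose promoter intervals they overlap and then labels each gene as expressed or silenced. It is a simplified stand-in, not SigSeeker's or SBR's actual logic: the function names, the gene names, the half-open interval convention, the use of promoters as the only landmark, and the expression threshold are all hypothetical.

```python
def assign_peaks_to_genes(peaks, promoters):
    """Map each gene to the peaks overlapping its promoter interval.

    `peaks` is a list of (chrom, start, end) tuples; `promoters` maps a
    gene name to a (chrom, start, end) tuple. Intervals are half-open.
    """
    hits = {}
    for gene, (chrom, start, end) in promoters.items():
        for p_chrom, p_start, p_end in peaks:
            if p_chrom == chrom and p_start < end and start < p_end:
                hits.setdefault(gene, []).append((p_chrom, p_start, p_end))
    return hits


def label_expression(genes, expression, threshold=1.0):
    """Label genes by comparing an expression value to a cutoff.

    The threshold is an illustrative assumption; genes missing from
    `expression` are treated as having a value of 0.0.
    """
    return {
        gene: "expressed" if expression.get(gene, 0.0) >= threshold else "silenced"
        for gene in genes
    }


# Toy input with made-up gene names and coordinates.
promoters = {"GeneA": ("chr1", 900, 1100), "GeneB": ("chr2", 4000, 4200)}
peaks = [("chr1", 1000, 1200), ("chr2", 5000, 5100)]
expression = {"GeneA": 0.2, "GeneB": 8.5}

gene_hits = assign_peaks_to_genes(peaks, promoters)
labels = label_expression(gene_hits, expression)
```

Here only GeneA receives a peak, and its low expression value marks it as silenced, which is the kind of peak-to-transcript link Lichtenberg describes.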
In a final step, the listed genes are annotated with functional and regulatory information using resources such as the Gene Ontology terms and the Kyoto Encyclopedia of Genes and Genomes. The output from the annotation step is integrated into a final functional analysis report that's returned to users.
Although the pipeline was developed and used in the context of hematopoiesis, it can be used to analyze epigenetics in diseases such as cancer. So far, Lichtenberg and his colleagues have used it to analyze ChIP-seq data from human breast cancer and canine prostate cancer.
The developers are currently preparing a paper for publication in the fall that will describe SigSeeker in detail. They plan to publish in either PLOS Computational Biology or Bioinformatics, Lichtenberg said. Planned developments include the addition of RNA-seq tools such as Cufflinks and eXpress, and integration into existing bioinformatics platforms such as Galaxy.
The source code for both SigSeeker and SBR is available through the Google Code Repository.