Researchers at the University of Edinburgh have developed a web-based application for analyzing high-throughput sequence data.
Simon Tomlinson, a professor of bioinformatics at the University of Edinburgh, told BioInform that his group developed the workflow-based application, dubbed GeneProf, to provide an integrated analysis system that can handle large datasets while keeping a record of all the steps used to generate the results in a format that's "easy to understand" and repeat.
In a paper published last month in Nature Methods, the researchers explain that GeneProf is similar to other workflow-based systems, such as Pennsylvania State University's Galaxy, but is intended for non-expert users. In particular, GeneProf provides analysis "wizards" or web forms that are meant to simplify the process of constructing workflows — a task that can be difficult for researchers with little bioinformatics expertise who try to use current software, according to the developers.
These wizards "reconceive common, best-practice analysis steps as a series of logical stages" to generate workflows that researchers can apply to their datasets as they are, or they can customize these workflows to meet specialized analysis requirements, the team explained in the paper.
The tool also generates summary statistics and plots that help researchers identify "flaws" in their analysis, the researchers wrote.
A third feature of the tool is that it couples data and analyses in the form of "virtual experiments" that are supplemented by "all intermediate results and a history of the analysis procedure" and can be "directly linked in publications," the authors said.
This helps researchers avoid "irreproducible ... methodologies," they said. In addition, users can use this capability to share their analysis results and pipelines with collaborators prior to publication and with the larger scientific community after publication.
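To make the idea of coupling data with its analysis history concrete, here is a minimal Python sketch. It is not GeneProf's data model or API, which the paper excerpted here does not describe at this level. The `VirtualExperiment` and `Step` classes and the toy `trim` and `drop_short` functions are all hypothetical names chosen for illustration. The sketch only shows how recording each step's parameters and intermediate output lets an analysis be replayed from the original data.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One recorded analysis step: its name, parameters, and output."""
    name: str
    params: Dict[str, Any]
    output: Any


@dataclass
class VirtualExperiment:
    """Hypothetical container pairing input data with its analysis history."""
    data: Any
    history: List[Step] = field(default_factory=list)

    def run(self, name: str, func: Callable[..., Any], **params: Any) -> Any:
        # Apply the step to the latest result, or to the raw data if none yet.
        current = self.history[-1].output if self.history else self.data
        result = func(current, **params)
        # Keep the intermediate result and the exact parameters used.
        self.history.append(Step(name, dict(params), result))
        return result

    def replay(self, steps: Dict[str, Callable[..., Any]]) -> Any:
        # Re-run every recorded step on the original data, in order.
        current = self.data
        for step in self.history:
            current = steps[step.name](current, **step.params)
        return current


def trim(reads: List[str], length: int) -> List[str]:
    """Toy step: cut every read down to at most `length` bases."""
    return [r[:length] for r in reads]


def drop_short(reads: List[str], min_length: int) -> List[str]:
    """Toy step: discard reads shorter than `min_length` bases."""
    return [r for r in reads if len(r) >= min_length]


exp = VirtualExperiment(data=["ACGTACGT", "ACG", "TTGACCA"])
exp.run("trim", trim, length=5)
final = exp.run("drop_short", drop_short, min_length=4)
replayed = exp.replay({"trim": trim, "drop_short": drop_short})
print(final, replayed == final)
```

Because the parameters are stored alongside each intermediate result, anyone holding the record can re-run the same steps and check that they arrive at the same output.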
Under the hood, GeneProf incorporates several open-source packages for short-read alignment and processing and for ChIP peak detection, such as Picard, SAMTools, Bowtie, TopHat, and MACS. It also relies on data from Ensembl and the Gene Ontology for functional and gene annotation.
Currently, the tool is tailored to work specifically with data from Illumina's Genome Analyzer, but Tomlinson said the developers plan to expand GeneProf to handle data from all "standard" sequencing platforms in the future.
Additionally, developers can write their own R-based components and submit them to the system, he said.
The researchers note in their paper that GeneProf has some capabilities in common with well-known sequence data analysis packages such as Myrna, GATK, Taverna, and Galaxy. For example, GeneProf, Galaxy, and Myrna all offer alignment tools, while GeneProf, Galaxy, GATK, and Taverna all have flexible workflows, they said.
There are some differences, however. For example, GeneProf outputs data in "dynamic tables" that users can interact with while Myrna, GATK, Taverna, and Galaxy all output data in "static files." Additionally, the authors claim that GeneProf is the only tool that uses analysis wizards to help users set up workflows for analyzing their data.
Also, while GeneProf provides details of the "full workflow integrated with data, history of the individual process outputs, and intermediate data," Galaxy offers a history that integrates data and analysis but in a "logical flow" that can be "difficult to recapitulate in large histories," they said.
To further flesh out what sets GeneProf apart, in the paper's supplementary materials the researchers provide detailed results of a comparison between their system and Galaxy when both were run on a mouse dataset taken from the Sequence Read Archive.
The authors determined that GeneProf provided a simpler analysis workflow than Galaxy while generating comparable results — GeneProf required 24 analysis components while Galaxy used 65, and GeneProf required only four manual changes to its parameters, while Galaxy needed 28.
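For a sense of scale, the counts reported above can be turned into relative reductions. The short Python sketch below uses only the four figures quoted from the paper. The percentages are derived here and are not stated by the authors, and the `percent_fewer` helper is a hypothetical name.

```python
def percent_fewer(smaller: int, larger: int) -> float:
    """Return how much smaller `smaller` is than `larger`, as a percentage."""
    return round(100 * (1 - smaller / larger), 1)


# Counts reported in the paper's supplementary comparison.
components = {"GeneProf": 24, "Galaxy": 65}
manual_changes = {"GeneProf": 4, "Galaxy": 28}

component_reduction = percent_fewer(components["GeneProf"], components["Galaxy"])
parameter_reduction = percent_fewer(manual_changes["GeneProf"], manual_changes["Galaxy"])

print(f"Analysis components: {component_reduction}% fewer in GeneProf")
print(f"Manual parameter changes: {parameter_reduction}% fewer in GeneProf")
```

By this arithmetic, the GeneProf workflow used roughly 63 percent fewer components and about 86 percent fewer manual parameter changes than the Galaxy workflow in this single comparison.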
The results also showed that GeneProf was able to do some things that Galaxy couldn’t, the authors said. For example, unlike Galaxy, GeneProf was able to provide additional information about read number, lengths, and quality score, as well as provide customized genomic coverage plots.
Tomlinson noted, however, that the two systems are "complementary" to each other because they are intended for different purposes.
He explained that while the two tools "could both be used to run similar analysis workflows" in practice, "Galaxy is commonly used as a toolbox for genomic analysis to run individual tasks or chain together simple tools." GeneProf, on the other hand, "has less of a focus on the analysis tools and more on the biological outputs and is designed to encapsulate entire analysis workflows, from start to finish."
GeneProf's approach, "we believe ... will appeal more to experimental biologists who are strongly driven by discovery," he said.
However, Anton Nekrutenko, a professor at Pennsylvania State University and one of Galaxy's developers, told BioInform in an e-mail that GeneProf "tries to completely replicate Galaxy functionality" rather than complement it, adding that it will be "interesting to see if it will catch up."
He also pointed out that the paper includes some erroneous statements about Galaxy. For instance, it says that Galaxy "has no way to explore parameter settings" and that it has "limited Genome Browser support," both of which are incorrect, he said.
GeneProf "seems like a nice development, yet it is surprising that the authors completely brush off others and Galaxy, in particular," he said, adding that the authors "describe 'other workflow solutions,' noting that they are difficult for biologists to use. However, in making this statement it is clear that [they] never actually used them."
Nekrutenko also noted that the authors do not include details about GeneProf's user base and how they plan "to scale up to really address large-scale analyses by large number[s] of people."
Tomlinson told BioInform that GeneProf has a dedicated cluster at the university that includes 100 compute cores and about 50 terabytes of storage. He also said that it could be run on cloud infrastructure or an external compute service if its in-house resources prove to be insufficient.
Tomlinson said that hundreds of users have signed up for accounts on the GeneProf site since the system launched in December. He added, however, that the developers need to collect data over a longer period before they can provide accurate statistics on the number of users.
According to the paper, GeneProf is designed to appeal to experimental biologists and clinicians who may use the tool as a means of accessing publicly available data; researchers with some data but little analysis expertise who need basic analysis pipelines; more experienced researchers and computational biologists who can customize the pipelines to meet specific analysis needs; and algorithm developers who can build and add new workflow components.
The university's research and innovation arm is investigating the commercial potential of GeneProf. However, Tomlinson said that the development team's primary focus is incorporating additional capabilities such as pipelines to handle microRNA data and histone modification data.