This article has been updated from a previous version to clarify a statement from the paper regarding the software's capabilities.
Investigators from the University of Maryland's Institute for Genome Sciences and elsewhere have recently released Ergatis, a web interface and software system that lets users build, run, and monitor the pipelines they use to analyze genomic data.
In a paper published in a special issue of Bioinformatics related to the upcoming Intelligent Systems for Molecular Biology conference, the authors write that “Ergatis uses a modular, scalable, and extensible approach to pipeline creating and management on local or distributed compute resources.”
The authors say that Ergatis lets researchers construct pipelines from common bioinformatics analysis tools and then store the information so that the pipelines can be reused or applied to new datasets. The system also provides prepackaged pipelines for genomic analysis.
Users can build their own project pipelines out of modular analysis components. Each component is described by an XML file and a configuration file that define the necessary steps and parameters. The current version of Ergatis, v2.12, contains 162 components that can be used to form the pipelines.
Once the pipelines are created, they are saved as XML files. The system uses the Sun Grid Engine to schedule and manage each job on the cluster and track its execution from start to finish. The system also records detailed information for each step so that the same pipeline template can be rerun on the same sequence data or reapplied to a different dataset. The templates, which are small enough to be sent via e-mail, contain information about the pipeline layout and each component’s configuration options.
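As a rough illustration of the template idea, the sketch below saves a small pipeline layout as XML and then reapplies it to a different input file. The element names, attribute names, component names, and options are all hypothetical and are not meant to reflect the actual Ergatis template schema.

```python
import xml.etree.ElementTree as ET


def build_template(components):
    """Serialize an ordered list of (name, options) pairs to an XML string.

    The <pipeline>/<component>/<option> layout is invented for this sketch.
    """
    root = ET.Element("pipeline")
    for name, options in components:
        comp = ET.SubElement(root, "component", {"name": name})
        for key, value in options.items():
            ET.SubElement(comp, "option", {"key": key, "value": str(value)})
    return ET.tostring(root, encoding="unicode")


def apply_template(template_xml, input_path):
    """Return the template's steps bound to one specific input dataset."""
    root = ET.fromstring(template_xml)
    steps = []
    for comp in root.findall("component"):
        options = {opt.get("key"): opt.get("value") for opt in comp.findall("option")}
        options["input"] = input_path  # only the dataset changes between runs
        steps.append((comp.get("name"), options))
    return steps


# Hypothetical two-step layout: a gene finder followed by a similarity search.
template = build_template([
    ("gene_finder", {"min_gene_length": 90}),
    ("blast_search", {"evalue": "1e-5"}),
])

# The same saved layout is reused for two different datasets.
run_a = apply_template(template, "genome_a.fasta")
run_b = apply_template(template, "genome_b.fasta")
print(run_a)
print(run_b)
```

Because the template holds only the layout and the options, the resulting string stays small regardless of how large the input data are.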
When the analysis is complete, the system outputs a set of gene clusters that are converted and stored as BSML files. The files can then be loaded into a Chado relational database for further analysis.
In addition to enabling users to build their own pipelines, the system comes with prepackaged pipelines that researchers can use for whole-genome annotation, comparative genomics, and pangenome diversity studies. Users can also run their pipelines locally or distribute them across a compute grid.
According to the team, the system has been used to build and run analysis and annotation pipelines for several published genomes and multiple comparative genome studies. Ergatis has been used in compute grids of up to 600 cores at the J. Craig Venter Institute and the IGS at the University of Maryland.
This week, BioInform spoke with Joshua Orvis, a researcher at the University of Maryland and a co-author of the paper. Below is an edited version of the interview.
What does the name Ergatis mean?
It’s just a Greek word for “worker,” since it does the work of several institutes.
What does Ergatis do and how does it do it?
In terms of what it does, it’s a web interface. We had the idea that biologists or engineers want to be able to piece together analysis tools, like a gene prediction tool and then [use] Blast to get the results of that gene prediction tool and [they want] to do it without running scripts.
We wanted [the system] to be able to run on your desktop machine or on a 1,000-node grid cluster but using the same interface. So that’s what [Ergatis] does. It has over 100 analysis tools like Blast or different gene-finding tools built into it, so that you can piece them together into pipelines and then run them.
In terms of how it does it, most of this is hidden from the users but once someone uses the interface to build their pipelines and then clicks the run button, [Ergatis] represents all the work to be done in XML files and then a workflow engine processes the files and does the work of sending jobs to the grid and monitoring them.
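The general pattern Orvis describes, in which the work is written down as XML and a separate engine walks that description, can be sketched in a few lines. This is a minimal illustration only: the `<step>` element, the step names, and the `run_pipeline` function are hypothetical, and the steps are run as local subprocesses, whereas a production engine would hand them to a grid scheduler.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical work description; not the real Ergatis XML.
PIPELINE_XML = """
<pipeline>
  <step name="predict_genes" />
  <step name="search_hits" />
</pipeline>
"""

# Placeholder commands standing in for real analysis tools.
COMMANDS = {
    "predict_genes": [sys.executable, "-c", "print('predicting genes')"],
    "search_hits": [sys.executable, "-c", "print('searching hits')"],
}


def run_pipeline(xml_text, commands):
    """Run each step in document order and record its final state.

    Execution stops at the first failing step, so later steps are
    simply absent from the returned mapping.
    """
    states = {}
    for step in ET.fromstring(xml_text).findall("step"):
        name = step.get("name")
        result = subprocess.run(commands[name], capture_output=True, text=True)
        if result.returncode != 0:
            states[name] = "failed"
            break
        states[name] = "complete"
    return states


states = run_pipeline(PIPELINE_XML, COMMANDS)
print(states)
```

Keeping the description separate from the engine is what allows the same pipeline to run on a desktop or a cluster: only the part that launches each command needs to change.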
Does the software analyze whole genomes or other types of genomic data?
The project started at The Institute for Genomic Research [TIGR]. TIGR’s focus at the time was whole-genome projects so that’s what it was initially used for. Most of the tools that are in [Ergatis] are for doing whole-genome annotation including prokaryotic and eukaryotic annotation. Like the paper mentions, [we have used it] in the annotation of Aedes aegypti as well as in comparative genomic studies in prokaryotes.
Did you compare this system with any other systems?
There’s a lot of overlap with [systems] like Galaxy and Taverna and things like that. I don’t know that there are any that fill the exact same niche as this one. Part of it is also that even though the publication is new, the system has been in use for seven years or so. It predates a lot of them.
What’s the niche that it fills?
Some of the tools out there are designed for very advanced users. [They are] for building very precise, step-by-step workflows that have very specific purposes and usually it’s developers who would do those. Then there are some other ones [that] are geared completely [toward] the end user, who in many cases isn’t a developer at all, so it's bench scientists, [people] like that who want to do manipulations on FASTA files or group output files or file conversions or things like that.
This [system] is sort of in between those. It’s fairly easy and straightforward to build a pipeline but you [can] add a component like a gene finder without any knowledge of how it works or how it runs. You just say, 'I want a gene finder called Glimmer,' you put it in the pipeline and you can say where your input is, but that’s really all you have to do. Developers can add new components of their own pretty easily, but by default this [system] doesn’t assume that you have a great knowledge about how any of that works.
You said the system has been in use for seven years. Why is it just being published now?
In the beginning, it was developed as an internal tool. As we used it more and more and as we started forming collaborations with other institutes on many of our genome projects, like the higher eukaryotes, other people began to be interested in it. So we took the source code out of our internal servers and put it on SourceForge so it’s publicly available.
What does Ergatis do differently compared to other workflow management systems?
Taverna and Galaxy are the two [systems] that I get questions about the most. I’ll do Galaxy first. The biggest difference is that Ergatis was started as a pipeline-management tool. The idea of a pipeline or workflow was only something that was recently added to Galaxy. Ergatis was built with the idea of project-centered pipelines from the beginning and the idea of large analysis tools or components that can be shared with other users.
The big difference between Taverna and Ergatis is that Taverna and some others like BioMoby are based off of the idea of web services. So you are not necessarily using tools you have in house but you are accessing services by other providers. Ergatis can use web services but it’s certainly not designed around it. Ergatis assumes its power users are people who have large clustering or grid computing ability in house.
Who is the primary target group for this software?
I would say it’s mostly for people who have a bit more computer-intensive needs than just small point-and-click interactivity. These are people who on a daily or weekly basis have larger pipelines that they need to run and manage. It requires a little bit more than just a completely naïve user.
Is the software user friendly?
It certainly is to some extent. It’s used [at the Institute for Genome Sciences] internally by both engineers and what we call analysts who are people with mostly a biological background who don’t know programming or technical things like that. So to some extent it is [user friendly] but that wasn’t its primary focus. Its primary focus was very much full logging of everything that’s done in a project as well as very large pipeline management.
What are some applications of this software?
One thing to mention is that it comes with pre-built pipelines. So you can build your own pipelines with it using modular tools but it also has pre-built pipelines in it. One of the more popular ones is a pipeline that does prokaryotic genome annotation. The tool itself is quite a bit more advanced than doing gene prediction and then [using] Blast to get a gene product name. [The system] actually has about 60 components or so to do a much better job of annotation than simple, regular gene-finding and Blast steps.
There’s a lot of start site codon analysis, gene overlapping analysis, it does searches with Blast, HMMs, different levels of non-coding gene finding tools, and things like that. The user simply says, 'Here is my input genome, here’s the genus and species,' and all they have to do is run the pipeline after that. We finished each prokaryotic annotation in about six to eight hours on our grid. So it’s really robust.
Then there’s also a pipeline for comparative genomics. We have several PIs who have 50 or 60 strains of a given bacterium and this [system] has a protein-based comparative pipeline. The useful part of this is that compared to some of these other systems, [Ergatis] output can be loaded into a Chado database and then there are a great number of tools [that] can read the results out of Chado. For the comparative genomics example, there is a tool called Sybil that gives you a very nice visual representation of your genome comparison.
In your paper, you say that some aspects of the system may incur extra computational overhead. Could you elaborate on what that means for researchers who would be interested in using the system?
That’s mostly because any time you want to chain different tools together, writing custom software to do that is always going to run fastest. If you want to parse the output of something and make it an input into something else, writing a specific direct parsing script to do that is going to run faster than making things generic.
One thing we decided to do in Ergatis [was pick] a common output format. Ergatis has at least a dozen different gene prediction tools in it, each one of which has its own format. So if anybody wants to use any of those, they would need to learn the different output formats for every one of the different gene finders.
We added a step to Ergatis that transforms all those outputs into a single common format called BSML. So that step adds overhead to what would normally be just running a gene prediction tool. Normally you would just use its default output if you wanted to connect to something else but we chose to make it generic. That helps you do a lot more things in terms of making larger and more interesting pipelines but it does add overhead to running each one of the tools.
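The trade-off can be illustrated with a small sketch in which two tools emit different text formats and a conversion step maps both onto one shared record type. Both input formats, the tool names, and the `GenePrediction` record are invented for illustration; they are not BSML and not the output of any real gene finder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenePrediction:
    """Shared record that every downstream step can rely on."""
    seq_id: str
    start: int
    end: int
    strand: str


def parse_tab_format(line):
    """Hypothetical tool A: seq_id<TAB>start<TAB>end<TAB>strand."""
    seq_id, start, end, strand = line.rstrip("\n").split("\t")
    return GenePrediction(seq_id, int(start), int(end), strand)


def parse_range_format(line):
    """Hypothetical tool B: seq_id:start..end(strand)."""
    seq_id, rest = line.strip().split(":")
    coords, strand = rest.rstrip(")").split("(")
    start, end = coords.split("..")
    return GenePrediction(seq_id, int(start), int(end), strand)


PARSERS = {"tool_a": parse_tab_format, "tool_b": parse_range_format}


def normalize(tool, lines):
    """The extra conversion step: one more pass over each tool's output."""
    parser = PARSERS[tool]
    return [parser(line) for line in lines]


def count_forward_strand(predictions):
    """A downstream step written once against the shared record."""
    return sum(1 for p in predictions if p.strand == "+")


predictions = (
    normalize("tool_a", ["contig1\t100\t400\t+"])
    + normalize("tool_b", ["contig1:900..1500(-)"])
)
print(predictions)
print(count_forward_strand(predictions))
```

The conversion costs an extra pass over every tool's output, but each downstream step needs to understand only one format, however many tools feed into it.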
Would you at some point reconfigure the system to retain the default format for the steps or keep the conversion step?
For any of the steps, like the BSML conversion steps, you can configure the components to skip them and just run the tool and do nothing else outside of that. But by default, the components do those steps.
Would that be something researchers could do on their own or would they need experts to do it?
They can type in any steps they want to be skipped on their own when they build the pipeline.
Have you gotten any feedback from people who have used the software?
Yes, there is a users list maintained by SourceForge and there [are] emails every week on that users list so it’s moderately active.
What’s been the response to the software?
I think this is true when anybody takes a tool that was only internal and tries to make it usable for external users: There are things you miss or forget or users who are outside of your group have neat ideas that you never needed because of your own internal setup. They have been really useful in teasing all that stuff out and making it more generically interesting for the community rather than just our own purposes.
Do you plan to make any changes to the software based on the feedback?
We do very often. A lot of the changes now are based on that feedback.
There was a new grant that was received by researchers at our institute to make a version of Ergatis externally usable [for] the public. So that need is very actively driving its development now.
[The grant referred to was awarded by the National Science Foundation to the University of Maryland at Baltimore on 1/15/2010 and is estimated to expire on 12/31/2012. The total amount awarded to date is $1,894,381. According to the grant abstract, the funds were used to build the "Data Intensive Academic Grid," or DIAG, which includes 100 nodes for high-throughput computational analysis and five nodes for high-performance computational analysis. The abstract further states that the bioinformatics community will use Ergatis and other bioinformatics tools to access DIAG — Ed.]
What’s the timeline for that?
The first version will be for people who are part of the grant that are in other universities that took part and asked to be the first users. They will start using it within a few weeks and I believe they will be the only group of limited users for now.
You had mentioned that the software’s been in use for seven years?
Well, the first versions of it are about that old. It’s been constantly reworked and reformed.
How long did it take to develop the very first version of the software?
I guess it probably started after just a few months but the need set was quite a bit smaller back then. It didn’t have a web interface like it does now. It started as something the engineers needed to maintain their own pipelines. Then we realized that we were doing the same sorts of pipelines all the time so [we thought that] if we could make an interface that the regular users could launch themselves then it would save us work.