A year ago, researchers at Texas A&M University ran into a data analysis bottleneck soon after acquiring an Illumina Genome Analyzer to sequence multiple strains of Mycobacterium tuberculosis with the aim of pinpointing mutations associated with drug resistance.
In response, they turned to Raffaele Montuoro and his colleagues at the Texas A&M Supercomputing Facility to request access to the facility's 832-core "Hydra" cluster, an IBM Power system running AIX. Montuoro developed a parallel version of Illumina's Genome Analysis Pipeline, called pGAP, to run on the cluster, which has sped up the analysis by a factor of four.
More recently, the Texas A&M Supercomputing Facility installed another IBM cluster — a 2,592-core, 27.14-teraflop Linux system known as "Eos." Montuoro and his colleagues will be modifying pGAP for that system as well, and are also running a number of other bioinformatics applications on the new machine.
BioInform spoke to Montuoro this week about the process of parallelizing GAP and other projects underway at the Supercomputing Facility. The following is an edited version of the interview.
Can you provide some background on why you needed to modify the Illumina software for this work?
The Illumina Genome Analysis Pipeline works in three steps. The first step is the analysis of the raw data that comes from the machine, and the raw data comes in the form of picture files. In one of our typical experiments, there are 252,800 pictures, which is about two terabytes of data.
The problem was that the Illumina software wouldn't allow you to do this analysis on a real cluster. The software was designed so that it would just run on a single box. And running on a single box, basically you'd only be able to analyze eight pictures at a time. So, depending on the size of the experiment, it would take over a day just to process the first step of this pipeline.
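A rough back-of-the-envelope calculation shows how that limit plays out; the per-image processing time below is an assumed figure chosen purely for illustration, while the image count and eight-way concurrency come from the interview.

```python
# Back-of-envelope estimate of the single-box bottleneck described above.
# The per-image processing time is an assumption for illustration only;
# the 252,800-image count and 8-way concurrency come from the interview.
images = 252_800          # tile images in one experiment
concurrent = 8            # images processed at once on a single box
seconds_per_image = 3.0   # assumed average time to process one image

hours = images / concurrent * seconds_per_image / 3600
print(f"Estimated first-step runtime: {hours:.1f} hours")  # roughly a day or more
```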
Then you have to work on the results, which are text files. Their size is much smaller [than the image files], but they need to be processed efficiently and they have to be sent to the third step, which is the alignment [against the reference genome of the organism].
The researchers studying Mycobacterium tuberculosis … weren't really concerned with the alignment step because they had their own software to do that, but they said, 'We wish we could use your cluster,' so we looked at the program and we said, 'Well, this is really not going to work because at most you could use a single box with eight CPUs.'
So it took some major work to design our parallel Genome Analysis Pipeline. We started with porting the software onto our “Hydra” cluster and then we tackled the task of parallelizing. The pipeline is not just a single program. It contains many different sub-programs, and each one of them does a separate task. So you have to coordinate all these things together to run it in parallel.
After a couple of months, we presented the first version of parallel GAP, which was based on an early version of the Illumina software. And then the software changed again so we had to go back and redo the work.
What did the modifications entail?
My approach was as conservative as possible. I didn't actually modify the code that does the analysis, but I changed the way the code is executed on the machine. So there are portions of the pipeline that have been rewritten to run in parallel, across the entire cluster, but they can always be run in the way they were originally conceived to be run, on a single box.
For instance, you usually want to run the image processing in parallel, because that portion is the main bottleneck in the analysis. So you leave that on. But for other, less critical portions, you might not need to run them in parallel.
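pGAP's internals are not reproduced here, but a minimal sketch of the general approach Montuoro describes, in which the per-image analysis is left untouched and MPI is used only to spread the image files across the cluster, might look like the following; the command name and paths are placeholders.

```python
# A minimal sketch of the "conservative" parallelization described above:
# the per-image analysis command itself is left untouched, and MPI is used
# only to spread the image files across the cluster. The command name and
# paths are placeholders, not pGAP's actual interface.
from mpi4py import MPI
import glob
import subprocess

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

images = sorted(glob.glob("run_folder/Images/**/*.tif", recursive=True))

# Round-robin assignment: each MPI rank processes every size-th image.
for path in images[rank::size]:
    subprocess.run(["image_analysis", path], check=True)  # unmodified analysis step

comm.Barrier()  # wait for all ranks before the next pipeline stage
if rank == 0:
    print(f"Processed {len(images)} images on {size} cores")
```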
One of the other things that we updated is the interface. At the Supercomputing Facility we try to make things as simple as possible for our users, because people may not be comfortable with a cluster, so I remodeled the interface so that it's like running a single command.
Since the cluster is a shared resource for the entire university, we manage its workload through a job scheduling system. Our pipeline is completely integrated with IBM LoadLeveler, the resource manager on our IBM Hydra cluster. Therefore, once you ask LoadLeveler for the resources you need, the only thing you have to specify is basically, 'run pipeline.' It's a simple command, and this runs the analysis, reads the settings in your config files, and knows exactly how many processors you want to use in the whole cluster.
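As a rough illustration, a wrapper along these lines could read its settings from a config file and detect the processors granted by LoadLeveler; the environment variable and config keys shown are assumptions, not pGAP's actual interface.

```python
# A sketch of how a single "run pipeline" command can discover its resources.
# LoadLeveler exports information about the allocated job step to the
# environment; the variable name below (LOADL_PROCESSOR_LIST) and the config
# keys are illustrative assumptions, not pGAP's actual interface.
import os
import configparser

config = configparser.ConfigParser()
config.read("pipeline.config")          # run folder, reference, options, etc.

# Hosts granted by LoadLeveler for this job step (one entry per task slot).
slots = os.environ.get("LOADL_PROCESSOR_LIST", "").split()
ncores = len(slots) if slots else 8     # fall back to a single 8-core box

print(f"Running pipeline on {ncores} cores")
print(f"Run folder: {config.get('run', 'folder', fallback='.')}")
# ...launch the parallel image-analysis and base-calling stages here...
```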
We also set up a pipeline between the sequencer and our cluster, so that data can be transferred automatically to the cluster’s filesystem as it is produced. Considering that an experiment takes between two-and-a-half and 11 days to complete, you really have the time to pull each individual picture file into the filesystem without waiting until the end. Otherwise it would take additional time to transfer the whole data set. In our researchers’ experiments, that would be about nine hours for 1.8 terabytes. We did this for a single [sequencer], and are expecting to do the same with other recently purchased machines.
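The exact transfer mechanism isn't described, but a simple sketch of streaming data off the instrument during a run, assuming an rsync-style copy and a marker file written at the end of the run, could look like this; the hostnames, paths, and polling interval are illustrative only.

```python
# Illustrative sketch of pulling tile images to the cluster while a run is in
# progress, assuming rsync is available; paths, hostnames, and the end-of-run
# marker file are placeholders, not the facility's actual setup.
import os
import subprocess
import time

SRC = "/illumina/runs/current/Images/"             # instrument workstation
DST = "cluster:/gpfs/sequencing/current/Images/"   # cluster file system
DONE_MARKER = "/illumina/runs/current/Run.completed"  # assumed end-of-run marker

while True:
    # Copy only files that have appeared (or grown) since the last pass.
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)
    if os.path.exists(DONE_MARKER):
        break
    time.sleep(600)  # poll every 10 minutes over the 2.5- to 11-day run
```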
I would imagine that every lab with an Illumina instrument faces the same bottleneck with the image analysis step.
Their software is evolving. We were focusing on the processing of the raw data because that was the bottleneck for us, and it still is.
Illumina’s new software allows processing of the raw data on the fly so you don't really need to analyze this huge amount of picture files and worry about the time that it would take. However, our researchers prefer to save their raw data because it's easier to get advances in software than in the machine itself. So as the software improves, you could potentially extract a little more information from the same picture.
Would this pipeline be available for other labs if they wanted to install it on their clusters?
We're working with our Office of Technology Commercialization to see what the limitations of the license are. When you buy the machine, you get the software and the source code, and you're allowed to modify it for your own purposes. But you can't share it with anyone else because they would have to have the same machine in order to have the license to use the code. So that's kind of tricky.
Software changes very quickly, so maybe in six months or a year from now something else will come along, and that's fine. But the good thing is that, right now, this helps us go faster, so it would be amazing to be able to share it.
About how much sequencing are the researchers performing as part of this tuberculosis project?
Last year when we started working with them I think the rate was about a genome a month. At the time the machine was new so they were trying to get familiar with it. And it also takes time to get the strains and work with them.
One of the reasons they wanted us to speed up this program is that each run costs thousands of dollars, but it's also the time that you waste. Let's say that not all the runs [are successful]. There might be errors, the reagents may not be working well for some reason, the operator might have made a mistake. So you want to be able, at some point after your run starts, to run a preliminary analysis. And if everything looks good, then you keep going. This would save you time and money.
How many new sequencers has the university ordered?
At this moment, we have two Illumina sequencers on campus. There is also a Roche 454, but we haven't worked on that.
We're trying to coordinate all the resources on campus, connecting them with equal access to our Supercomputing Facility.
The software is running on the facility's 832-core IBM Hydra cluster. Does it use all of those cores?
If you wish you can do that. Theoretically, the module that analyzes the raw pictures, called Firecrest, scales very well. It's almost linear. So in principle, if you have 252,800 pictures and 252,800 CPUs, you would run this in a couple of minutes.
There is one thing, however. All of the modifications that I have done for the program are as conservative as possible [because] we don't want to modify any of the numerical algorithms that could introduce errors. So the second step in the pipeline, the Bustard [base-calling] module, could not scale over 100 [CPUs], though the cut-off depends on [the number of tiles per lane in] the experiment. This holds you back a little so instead of effectively linear scaling, [it begins to taper off] after 100 CPUs.
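A small model makes that scaling behavior concrete: Firecrest's time drops almost linearly with CPU count, while Bustard stops benefiting past roughly 100 CPUs. The single-CPU baseline times here are assumed values chosen to show the shape of the curve, not measured figures.

```python
# Toy model of the two-step scaling described above: Firecrest (image analysis)
# scales almost linearly, while Bustard (base calling) stops benefiting beyond
# roughly 100 CPUs. The single-CPU baseline hours are assumptions, not
# measurements from the Hydra cluster.
def pipeline_hours(cpus, firecrest_1cpu_h=200.0, bustard_1cpu_h=40.0,
                   bustard_cap=100):
    firecrest = firecrest_1cpu_h / cpus               # near-linear speedup
    bustard = bustard_1cpu_h / min(cpus, bustard_cap)  # tapers off past the cap
    return firecrest + bustard

for n in (8, 64, 128, 256, 832):
    print(f"{n:4d} CPUs -> {pipeline_hours(n):6.2f} hours")
```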
The other issue is you really need to have a parallel file system that can give you the throughput you need to read all these pictures in parallel and write the results. So IBM's [General Parallel File System] is delivering the throughput that we need.
So the problem itself, since those pictures are all independent, is very simple in principle. The application, however, may be complicated, depending on the way the software is conceived. You need processing power and a very good parallel file system to sustain the throughput.
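To give a sense of why the file system becomes the limiting factor, here is a trivial calculation of how long it takes just to read roughly two terabytes of tile images at different aggregate throughputs; the bandwidth figures are arbitrary examples, not measurements of the Hydra GPFS installation.

```python
# How long it takes just to read ~2 TB of tile images at various aggregate
# file-system throughputs; the bandwidth values are arbitrary examples.
data_tb = 2.0
for gb_per_s in (0.1, 0.5, 1.0, 2.0):
    hours = data_tb * 1024 / gb_per_s / 3600
    print(f"{gb_per_s:4.1f} GB/s -> {hours:5.1f} hours to read the images once")
```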
Now, we're talking again about the first and the second step of the pipeline, but we haven't talked about alignment because our researchers weren't interested in alignment. And for alignment, basically every research group has its own program, and there is no standard application. Illumina’s pipeline comes with the ELAND [Efficient Local Alignment of Nucleotide Data] algorithm, and this scales at the most up to eight CPUs — one per lane for the Illumina sequencer.
So the next step for us would be to try and parallelize the alignment, but there are several algorithms and they're all different. For example, some alignment algorithms try to find a perfect match. ELAND is way faster than any of them, but the reason is that it allows for [up to two] mismatches. So it depends on the quality of the alignment that you're looking for. But that would be great to do as a next step.
Will you be adapting the pipeline for the 2,592-core iDataPlex that IBM just installed at the facility?
Yes. It's a different architecture and a different operating system. The Hydra cluster, which is the original one, is based on IBM AIX; it's a Power5+ system. The new one is Intel-based and runs Linux. We ported the code to AIX and it runs perfectly fine. The machine is now over three years old, but it's still a very valid architecture. But most software is developed on Linux platforms, so the transition will be pretty easy.
Do you have a timeline for that?
No, we're waiting for other sequencers to be connected to our new system. I would say a few months. But that's a need that we have on campus. The new machine is currently hosting an instance of Galaxy from Penn State, and we have a mirror of the UCSC [Genome Browser] that we just installed, and we are trying to import some new genomes into the database so that people on campus can work with these things pretty fast.
One of the possible steps would be to integrate our parallel pipeline into this tool. People like web interfaces much more than command line interfaces. We're trying to go out and see what the needs are from the larger community, rather than focusing on a single group.
What are your next steps in the short term?
The short-term priority is to try and implement this pipeline with the other sequencers because the software is there. Then we can address other goals like speeding up the alignment step, and possibly developing core expertise on campus in bioinformatics.