For more than a decade, proponents of grid computing have promised that scientists would soon have transparent access to unlimited computational and data resources from their PCs. That day has been slower to arrive than some had predicted, but it appears to be drawing closer for the bioinformatics community.
In the last several months, a number of projects have begun using shared resources like the National Science Foundation’s TeraGrid for biomedical applications, while some bioinformatics developers are eyeing TeraGrid as a legitimate alternative to in-house clusters.
NSF kicked off the TeraGrid initiative — a national network of supercomputers that includes more than 102 teraflops of computing capability and more than 15 petabytes of storage — in 2001 [BioInform 08-27-01], but the resource wasn’t fully operational until 2004.
Some bioinformatics developers who dabbled in grid computing before TeraGrid was completed found that the approach wasn’t quite ready for prime time. For example, Don Gilbert, director of the Genome Informatics Lab at Indiana University, Bloomington, said that he was eager to jump onto the grid in 2002, “but I learned that you have to have somebody else willing to set up the grid infrastructure and allow you to use their computer, which is of course part of shared computing.”
The problem, he said, is “that didn’t really happen. There were efforts to install Globus grid software and you could find it on small clusters around the university, but it never really happened that I could find a large set of computers with grid software installed four years ago or three years ago.”
Now it appears that grid computing is finally coming of age. Gilbert described TeraGrid as “basically just a big, cost-effective cluster” that should appeal to smaller labs that can’t afford massive computational resources of their own.
Gilbert recently used TeraGrid to annotate the newly sequenced Daphnia pulex genome and 12 Drosophila genomes by comparing them to nine reference proteomes with a parallelized version of tBlastn. A TeraGrid run for each genome took around 18 hours on 64 processors, according to a white paper on the IU Genome Informatics Lab’s website.
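To put those figures in perspective, the short Python calculation below converts them into processor-hours. It is a back-of-the-envelope sketch only: the 18-hour and 64-processor numbers come from the white paper as reported here, while the assumption of perfect scaling down to a single processor, and the variable names, are illustrative and not drawn from Gilbert's work.

```python
# Reported figures: roughly 18 wall-clock hours on 64 processors per genome.
HOURS_PER_RUN = 18
PROCESSORS = 64
GENOMES = 1 + 12  # Daphnia pulex plus 12 Drosophila genomes

# Processor-hours consumed by one genome's run.
cpu_hours_per_genome = HOURS_PER_RUN * PROCESSORS

# Assumption: perfect scaling, so one processor would need this many days.
single_cpu_days = cpu_hours_per_genome / 24

# Total across all 13 genomes, if each run matched the reported figure.
total_cpu_hours = cpu_hours_per_genome * GENOMES

print(cpu_hours_per_genome, single_cpu_days, total_cpu_hours)
```

Under those assumptions, each genome works out to roughly 1,150 processor-hours, or about a month and a half on a single processor, which suggests why a lab without its own cluster might look to shared resources.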
Pleased with these results, Gilbert said that he is developing a grid toolkit for the open source GMOD (Generic Model Organism Database) project that would enable smaller labs to annotate new genomes using shared computational resources.
“My plan is basically to automate just what I’ve been doing over the last year by hand, and make it so that anybody with a little bit of bioinformatics expertise and a new genome could do the same job,” he said.
Gilbert said that the “plummeting” cost of DNA sequencing is enabling many more labs to have their favorite genome sequenced. However, he noted, “once they’ve got their genome, they haven’t gotten all the way there. They need to get a genome annotation, which involves a lot of genome informatics, and that’s where I think the GMOD project using TeraGrid and other shared computing resources could make a big impact.”
TeraGrid for Smaller-Scale Bioinformatics
While Gilbert focuses on whole-genome analysis, other efforts aim to bring TeraGrid resources to biologists who may be studying only a single gene. The North Carolina Bioportal (http://www.tgbioportal.org/), which launched in May [BioInform 05-26-06], provides a user-friendly front end for the TeraGrid.
The goal of the NC Bioportal is to make grid computing “seamless” for biologists, John McGee, a project manager with the Renaissance Computing Institute, told BioInform. RENCI, a joint institute of the University of North Carolina at Chapel Hill, Duke University, and North Carolina State University, developed the portal as part of NSF’s “science gateway” initiative — an effort to broaden the TeraGrid user base by making it more accessible to non-computational scientists.
The Bioportal offers access to a number of popular bioinformatics packages, including EMBOSS, GLIMMER, HMMer, the NCBI toolkit, and Phylip, as well as a large number of standard databases. One important aspect of the system, McGee said, is a workflow component, based on the open source Taverna software, that chains these tools together into a streamlined process requiring very little end-user input.
“For our users, it looks like just any other portal application,” he said. “They see a form, they fill in the input values, and they hit ‘submit.’ But what happens is when the job gets submitted to the compute and data engines, it is actually Taverna that is fired up, and the workflow that we have designed executes and runs and hits all the appropriate databases and services and executes a series of these bioinformatics programs. And after the workflow is completed, everything is made available to the user through the portal interface.”
McGee said that the workflow capability is still under development, and that it will be an area of intense focus over the next few months because it’s “incredibly significant” for adoption of the Bioportal. “It’s incredibly rare for anyone to use any one particular application in their research,” he said. “The process pretty much always involves a series of programs and munging the data between them, which is why these workflow engines are so important.”
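The idea McGee describes, a series of programs with data "munged" between them, can be sketched in a few lines of Python. This is not Taverna's interface or the Bioportal's actual workflow; the step functions (`to_fasta`, `reverse_complement`, `gc_report`) and the `run_workflow` helper are hypothetical stand-ins chosen only to show one step's output feeding the next.

```python
from typing import Callable, List

Step = Callable[[str], str]

def run_workflow(steps: List[Step], data: str) -> str:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data

def to_fasta(raw: str) -> str:
    """Munging step: wrap a bare sequence in a minimal FASTA record."""
    return ">query\n" + raw.strip().upper()

def reverse_complement(fasta: str) -> str:
    """Analysis step: reverse-complement the sequence, keeping the header."""
    header, seq = fasta.split("\n", 1)
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return header + "\n" + "".join(pairs[base] for base in reversed(seq))

def gc_report(fasta: str) -> str:
    """Reporting step: summarize sequence length and G/C count."""
    _, seq = fasta.split("\n", 1)
    gc = sum(base in "GC" for base in seq)
    return f"length={len(seq)} gc={gc}"

# The user supplies one input; the chain handles every hand-off.
result = run_workflow([to_fasta, reverse_complement, gc_report], " atgcgc\n")
print(result)
```

In a portal setting, the user would see only the form input and the final result, while the intermediate hand-offs stay hidden inside the workflow.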
McGee said that RENCI is also working on ways to “dynamically select where a particular job gets sent based on its characteristics and the kinds of resources that are available.” Currently, he said, “we’re shooting them out in round-robin fashion, basically, just to distribute the jobs evenly across all those compute engines.”
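The difference between the two dispatch strategies McGee mentions can be illustrated with a minimal Python sketch. Everything here is assumed for illustration: the `Engine` class, the site names, the job sizes, and the "most free CPUs" rule are hypothetical and do not describe RENCI's scheduler.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List, Optional, Tuple

@dataclass
class Engine:
    name: str
    free_cpus: int

Job = Tuple[str, int]  # (job name, CPUs requested)

def round_robin(jobs: List[Job], engines: List[Engine]) -> Dict[str, str]:
    """Hand each job to the next engine in turn, ignoring job size."""
    ring = cycle(engines)
    return {name: next(ring).name for name, _ in jobs}

def by_capacity(jobs: List[Job], engines: List[Engine]) -> Dict[str, Optional[str]]:
    """Send each job to the engine with the most free CPUs, if it fits."""
    free = {e.name: e.free_cpus for e in engines}
    placement: Dict[str, Optional[str]] = {}
    for name, cpus in jobs:
        best = max(free, key=free.get)
        if free[best] < cpus:
            placement[name] = None  # no engine can take the job right now
            continue
        free[best] -= cpus
        placement[name] = best
    return placement

engines = [Engine("site-a", 64), Engine("site-b", 8)]
jobs = [("blast-1", 32), ("blast-2", 16), ("hmmer-1", 4)]

print(round_robin(jobs, engines))
print(by_capacity(jobs, engines))
```

In this toy case, round-robin sends the 16-CPU job to a site with only 8 free CPUs, while the capacity-aware rule keeps all three jobs on the larger site. A production scheduler would weigh many more factors, such as queue wait times and data locality.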
So far, around 178 users have registered for the Bioportal, Karen Green, director of marketing and public relations for RENCI, said. “The more seamless we make it, the more people are going to use it,” she said.
Another TeraGrid science gateway, the Life Science Gateway, is located at Argonne National Lab (http://lsgw.mcs.anl.gov/) and has ties to an early grid-based bioinformatics effort there called GADU (Genome Analysis and Database Update) [BioInform 05-05-03].
Ian Foster, director of the Distributed Systems Lab at ANL and a founding father of grid computing, has had a hand in both projects and told BioInform that they’re a sign of a “transition … toward service-oriented architectures for bioinformatics, where you access computational analysis procedures and data over the network rather than having to maintain information on your own local system.”
In the early days of grid computing, Foster said, target users were “simulation scientists who wanted to perform incredibly challenging computations to study some very specific problem.” Now, he said, “we’re finding that there is actually a much larger community of people who want access to large amounts of computing resources, but they don’t necessarily have the expertise to develop their own simulation codes and learn how to use something like TeraGrid.”
While biologists have much to gain from TeraGrid-based projects, bioinformatics developers may have a tougher time building systems to run on the grid than computational scientists in other disciplines, Foster said.
“Biology is particularly challenging as a discipline because often the data is very diverse and comes from many sources and it often has some degree of commercial value or patient privacy,” he said.
Other disciplines that don’t have these characteristics “have made more rapid progress” in grid computing, Foster said. In the case of high-energy physics, he noted, “their big advantage is they really only have a few data sources. They have the accelerators that generate data, and your whole experiment may be based on data from a single accelerator.” Astronomy, on the other hand, has many data sources, but that data is very uniform, he said.
Nevertheless, while bioinformatics may be a bit slower in making its way to the grid, Foster noted that the “strong tradition of collaboration” in the field is “driving fairly rapid progress.”
Collaboration — and funding — will be the key to Gilbert’s goal of developing his TeraGrid package for GMOD. While he intends to get a package together on his own before the end of the year, he is seeking collaborators to contribute to the open source effort.
“It’s sort of the chicken and egg thing,” he said. “If there’s enough interest, it’ll go forward, and if it goes forward, there will be more interest.”