Let’s face it: Most bioinformatics developers don’t have access to a Cray supercomputer — a fact that the company readily admits, and isn’t expecting to change any time soon. But a recent project to improve the portability of bioinformatics code between the Cray system and more common platforms could enlarge the company’s user base.
For several years, Cray has offered a set of bioinformatics routines called the Cray Bioinformatics Library (CBL) that takes advantage of the company’s unique vector processing architecture to solve some thorny problems in the areas of sequence alignment, searching, and sequence manipulation. Now, a bioinformatics team at the University of Alaska, Fairbanks, has released version 1.1 of the Portable CBL — an open source C implementation of the CBL that is designed to run on a variety of Unix and Linux platforms. Jim Long, technical leader of biotechnology computing research at the Institute of Arctic Biology at UAF, said that Cray approached him in the fall of 2002 with the idea of developing a version of the library that could run on other platforms. “The idea to get a portable version was because a lot of biologists are on limited budgets, and it would be really great if they could use this library, develop it on their low-end Linux box, get their code all debugged, get some small problems running properly, and then go get some time on a big Cray if they need it — and have it be portable so it can just go right over,” he said.
Building the Library
Bill Long (no relation to Jim), a bioinformatics software developer at Cray, began developing CBL in late 2000. As a Fortran specialist at the company, Long was very familiar with certain instructions in the Cray system “that are particularly aimed at manipulation and logical-type operations that aren’t normally on other computers,” he said. At the time, he said, as media coverage of the Human Genome Project approached its peak, “I read some articles about some of the sequence comparison stuff, and I said, ‘Well gee, you could actually do that with these special instructions.’”
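For readers wondering what "logical-type operations" have to do with sequence comparison, here is a minimal sketch in C. It is not taken from CBL or from Cray's code. The 2-bit encoding and the names `base_code`, `pack32`, and `count_mismatches` are hypothetical, chosen only for illustration, and the plain loops stand in for the hardware bit instructions Long describes. It packs up to 32 bases into one 64-bit word and then counts mismatches with XOR, shift, and mask operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-bit encoding: A=0, C=1, G=2, T=3 (anything else maps to 3). */
static uint64_t base_code(char b) {
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    default:            return 3;
    }
}

/* Pack up to 32 bases into one 64-bit word, first base in the lowest bits. */
static uint64_t pack32(const char *seq, size_t n) {
    assert(n <= 32);
    uint64_t w = 0;
    for (size_t i = 0; i < n; i++)
        w |= base_code(seq[i]) << (2 * i);
    return w;
}

/* Count differing positions among the first n bases of two packed words. */
static int count_mismatches(uint64_t a, uint64_t b, size_t n) {
    uint64_t x = a ^ b;                                   /* nonzero 2-bit group = mismatch */
    uint64_t m = (x | (x >> 1)) & 0x5555555555555555ULL;  /* one flag bit per base */
    if (n < 32)
        m &= (1ULL << (2 * n)) - 1;                       /* ignore unused high positions */
    int count = 0;
    while (m) {                                           /* portable population count */
        m &= m - 1;
        count++;
    }
    return count;
}
```

Calling `count_mismatches(pack32("ACGTACGT", 8), pack32("ACGAACGT", 8), 8)` compares eight bases with a handful of word-level operations rather than eight character comparisons. That is the general kind of saving that bit-oriented hardware instructions can amplify.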
The first test of his handiwork came in 2001, in a demonstration project with the National Cancer Institute that mapped the entire set of short tandem repeats in the human genome in under 10 minutes [BioInform 07-16-01]. Since then, Long has continued to develop a full set of routines for Cray, and is preparing to release CBL version 2.1 in about a month.
Long said he developed the library to take advantage of the unique aspects of the Cray architecture for bioinformatics, and also to raise a bit of awareness about the company’s platform. “People didn’t know about those kinds of machines — they weren’t on the radar screens of most people in bioinformatics, and they really had no experience in programming machines like that,” he said. “The idea of the library was to give users access to the capabilities of these machines without having to learn how to do the coding.”
But the bioinformatics community has been slow to adopt the toolkit, mainly because there are so few Cray systems in use. With the list price for the company’s X1 system starting at $2.5 million, Cray is targeting a very different buyer than the one courted by IT vendors hawking Linux clusters. Rather than the small labs and research groups that make up the bulk of the bioinformatics market, regional supercomputing centers and government labs are the main customers for Cray’s systems. Many of these systems are available to a broader group of researchers, but the bioinformatics community hasn’t taken advantage of that fact, a situation that could change, according to Cray’s Long. “The idea here is that we would encourage people to use the machines that are at these major research centers. The expectation for people to buy their own Cray machine is low … but having them be able to get access to this sort of capability through one of the centers that has a system is sort of the approach I’m looking at,” he said.
One reason the company opted to work with the UAF group on the portable version of the library was that “we had seen some codes that people tried to port to the Cray machines and they weren’t very successful because of the way that the codes were written. So this was an attempt to give people tools that would work in both spheres.”
Beyond the Cray
Jim Long at UAF has even bigger plans for the portable version of the library. While it was originally intended to make life a bit easier for the small number of bioinformatics developers who actually have access to a Cray, “as it turns out,” Long said, “the library runs pretty quick itself on the Linux box.”
In comparisons of the Portable CBL on a selection of Cray, Intel, and AMD processors against CBL on Cray’s SV1, runtimes of the portable version on the Cray SV1, SX6, and X1 architectures were about two to four times as long as those of the native version on the SV1, while runtimes on the Linux systems were one and a half to eight times as long as those of CBL on the SV1. The UAF team has not yet compared the performance of the Portable CBL against other bioinformatics libraries, but Long said the performance is good enough to be of interest to developers used to working on other platforms. “We’re hoping that industry will pick up on this,” he said. “It would be great if different vendors would optimize the library for their hardware.” Long cited Apple’s G5 as a particular platform of interest because the G5 chip has a vector unit. “That could really get some screaming results,” he said.
Bill Long noted that Cray doesn’t expect bioinformaticists to “abandon their clusters and switch to Cray,” but the extension of its toolkit to other platforms could help encourage use of those Cray machines that are already in place. He said there is a small but growing group of people who are starting to use CBL at the larger computing centers, “and I would like to see people do a little more of that.”
Long added that he attends bioinformatics conferences such as RECOMB and ISMB every year, “and it’s still a surprise [to me] … how many people don’t know about Cray, or about supercomputers in general, and because of that, they’ve probably never even thought about trying some really big problems.”
Some Cray Systems Used for Bioinformatics Research
• NCI’s Advanced Biomedical Computing Center: SV1
• Arctic Regional Supercomputer Center: X1
• National University of Singapore Medical Faculty: SV1
• South African National Bioinformatics Institute: SV1
• Ohio Supercomputer Center: SV1