Borrowing a page from the sequence-analysis sector, the field of proteomics is adopting specialized hardware to grapple with a mass spec data deluge.
Last week, a team of researchers led by Christopher Hogue of the Blueprint Initiative said that they had developed an approach based on field-programmable gate arrays — reconfigurable circuits that can be customized for particular processing tasks — to accelerate data analysis for mass spectrometry-based proteomics experiments.
Meanwhile, Thermo Electron is preparing to ship an FPGA-based workstation that runs the Sequest database search algorithm up to 50 times faster than a typical CPU, according to the company.
FPGA-based methods should be of interest to instrument providers looking to speed up their systems, according to Hogue, who said that Toronto’s Mt. Sinai Hospital, where Blueprint is based, “is interested in taking this particular piece of intellectual property into a commercial environment, so we’re interested in talking with mass spec vendors about the possible integration of a system like this into a mass spectrometer.”
FPGAs and similar hardware-based methods have been used for many years in bioinformatics for sequence alignment, and companies like TimeLogic, Paracel, and Compugen have commercialized standalone bioinformatics “accelerators” based on the same technology.
But the approach has so far found little use in bioinformatics beyond speeding up Blast, Smith-Waterman, and other sequence-based algorithms.
Thermo’s system, called Sequest Sorcerer, was introduced at ABRF in February, and the company plans to begin shipping the product “shortly,” according to Robert Barkovich, product marketing specialist for bioapplications software at Thermo. Built in collaboration with Sage-N Research, a data appliance company that focuses on life sciences infrastructure, Sequest Sorcerer takes about a minute to run 1,000 MS/MS spectra against the human IPI database, which is about 26 megabytes, and 10 minutes against NCBI’s non-redundant database, which is about 968 megabytes.
The Blueprint team, in collaboration with Jonathan Rose of the University of Toronto’s department of electrical and computer engineering, designed its FPGA-based system to search 3 billion base pairs — the equivalent of the entire human genome — per second. This allows the team to query MS/MS peptide fragments against DNA sequence databases, rather than against mass spectra or protein sequence databases, as Sequest does.
This approach — which translates the mass spectra into a peptide sequence and then into a “wildcard pattern of all the possible DNA sequences that encode that protein” — would be too computationally demanding to perform on a standard compute cluster, but offers a number of advantages over other mass spec analysis approaches, Hogue said.
“[It] allows us to search through DNA sequence with no annotation on it,” Hogue said. “So if you’re looking at a peptide in a genome where you question the sequences of the genes that are predicted, where you may have a lot of intron/exon boundaries and other uncertainties about the unfinished sequence, it’s better to search through DNA space than protein space. You can’t make a new discovery on a piece of gene sequence if you’re searching through a peptide database.”
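The core idea — reverse-translating a peptide into a wildcard pattern covering every DNA sequence that could encode it, then scanning raw, unannotated DNA — can be sketched in a few lines. This is only an illustration of the concept, not Blueprint's implementation, which performs the matching in FPGA logic at genome scale; the function names and toy sequence below are invented for the example.

```python
import re

# Standard genetic code: every codon that encodes each amino acid.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "C": ["TGT", "TGC"],
    "D": ["GAT", "GAC"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "M": ["ATG"],
    "N": ["AAT", "AAC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "Q": ["CAA", "CAG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
}

def peptide_to_dna_pattern(peptide):
    """Reverse-translate a peptide into a regex matching every DNA
    sequence that could encode it: one alternation group per residue."""
    return "".join("(?:%s)" % "|".join(CODONS[aa]) for aa in peptide)

def find_coding_matches(peptide, dna):
    """Scan one strand of unannotated DNA at every offset (covering all
    three reading frames on this strand) for stretches that could
    encode the peptide; return the match start positions."""
    pattern = re.compile(peptide_to_dna_pattern(peptide))
    return [m.start() for m in pattern.finditer(dna)]

dna = "AAATGGCGTTTAAAGGG"  # toy sequence
print(find_coding_matches("MAF", dna))  # [2]: ATG-GCG-TTT starts at offset 2
```

In software, the alternation explodes combinatorially with peptide length, which is why Hogue's team found the approach impractical on clusters; an FPGA can evaluate such wildcard patterns against a streaming genome in parallel hardware.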
Hogue said that the seeds for the method were planted several years ago, when the Blueprint researchers hit a computational wall in analyzing large amounts of MS/MS data. He said that his team tried scaling up several proteomics search algorithms on “a number of cluster-based systems, [but] it became apparent that the problem wasn’t in the software, but rather the interconnect of the cluster that was continuously stopping us from getting close to the time goal” of searching 3 billion bases a second or less.
After deciding that FPGAs were the way to go, Hogue said he spoke to some of the accelerator vendors, “but they told me at that time that they would only do an application in mass spectrometry when it was clear what the software solution was to the problem, and they would simply adapt that to the FPGA.”
While the mass spectrometry software landscape has opened up a bit in the ensuing years, at the time “it was all commercial software and nothing getting as fast as we wanted it to,” Hogue said. So the team opted to build its own system.
Faster than a Speeding Peptide
Blueprint’s system, described in the March 30 issue of Rapid Communications in Mass Spectrometry, was implemented on a Gidel ProcStar FPGA board with 2 GB of RAM.
Depending on the scoring algorithm used, the search engine can process as much as 4 megabases per second. In an example used in the paper, the system required 1.6 seconds per query against the human genome. According to the authors, the same query using a 600 MHz Pentium III would require 210 seconds, and 52.5 seconds on a 2.4 GHz processor. The search time could be reduced to less than a second on a 64-processor cluster, the authors note, adding that “two [FPGA] hardware units could deliver this performance at a cost 40 times lower than that of an equivalently capable software cluster.”
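The per-query times reported in the paper imply speedup factors that follow from simple arithmetic (the variable names below are just for illustration):

```python
# Per-query search times reported in the paper.
fpga_s = 1.6         # Gidel ProcStar FPGA board
p3_600mhz_s = 210.0  # 600 MHz Pentium III
p4_2_4ghz_s = 52.5   # 2.4 GHz processor

print(f"{p3_600mhz_s / fpga_s:.0f}x faster than the 600 MHz Pentium III")  # 131x
print(f"{p4_2_4ghz_s / fpga_s:.0f}x faster than the 2.4 GHz processor")    # 33x
```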
According to Gidel, pricing for its ProcStar boards starts at $2,000.
Hogue said that other labs could probably replicate the work, and that the Blueprint team would make “components” of its instructions available to other groups, but he was quick to note that FPGA-based methods are a bit more demanding than some bioinformatics developers may be used to.
“It’s a little harder to write code around hardware, and you do need specially trained people,” he said. “You can’t take a Perl scripter and get them to write FPGA code. Forget it.”
Pricing for Sequest Sorcerer is not yet available, according to Amy Zumwalt, product marketing specialist for proteomics at Thermo. However, in an e-mail message she said that “pricing will be significantly lower than the hardware and software equivalent of a 16-20 node Sequest cluster.”
Hitting the Market
Thermo’s Sequest Sorcerer is a standalone desk-side tower running on an IBM server with 250 gigabytes of data storage. It can be configured for remote access, or placed on a dedicated network, and users can access their data through Sorcerer’s own web interface or Thermo’s BioWorks user interface.
Hogue said that he’d actually prefer to see Blueprint’s FPGA system “tightly coupled” to a mass spectrometer, rather than as a standalone system, “because part of the benefit of the hardware-accelerated protein identification is really in a feedback cycle to the mass spectrometer. If you can feed back information to the mass spectrometer about the small discoveries it’s making as it’s burning sample, you can actually create what are called dynamic exclusion lists, and have the mass spectrometer carry out a smarter protocol of analyzing tandem — or more than tandem — mass spectra.”
The upshot, he said, “is that [the mass spec] can more intelligently select fragments that are flying through the quadrupoles, and analyze those, and not keep reanalyzing information from highly expressed proteins.”
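The dynamic exclusion lists Hogue describes work by remembering which precursor ions the instrument has already fragmented and skipping them for a set time window, so acquisition cycles go to lower-abundance peptides instead. A minimal sketch of the bookkeeping, with illustrative (not instrument-derived) window and tolerance values:

```python
class DynamicExclusionList:
    """Toy model of a dynamic exclusion list: precursor m/z values that
    were recently selected for fragmentation are skipped until their
    exclusion window expires. Parameter values are illustrative only."""

    def __init__(self, window_s=30.0, tol_mz=0.05):
        self.window_s = window_s  # how long an m/z stays excluded (seconds)
        self.tol_mz = tol_mz      # match tolerance in m/z units
        self._entries = []        # list of (mz, time_added)

    def should_fragment(self, mz, now):
        # Drop entries whose exclusion window has expired.
        self._entries = [(m, t) for m, t in self._entries
                         if now - t < self.window_s]
        # Skip any precursor already on the list.
        if any(abs(m - mz) <= self.tol_mz for m, _ in self._entries):
            return False
        self._entries.append((mz, now))
        return True

dex = DynamicExclusionList()
print(dex.should_fragment(524.3, now=0.0))   # True: new precursor
print(dex.should_fragment(524.31, now=5.0))  # False: still excluded
print(dex.should_fragment(524.3, now=40.0))  # True: window expired
```

Hogue's point is that building this list requires identifications fast enough to keep up with acquisition, which is where hardware-accelerated search comes in.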
Thermo’s Barkovich questioned this approach, however, noting that the company prefers to have “the data processing going on in a separate system than the data acquisition.”
Eric Andrade, managing director of Blueprint, said the team is confident that mass spec vendors will be interested in its FPGA method, citing Thermo’s licensing of the Sequest algorithm as one example of how instrumentation firms are looking to get an edge in the competitive proteomics marketplace via informatics.
But have companies marketing accelerated systems for sequence alignment missed the boat on proteomics analysis? Considering that Paracel has closed shop [BioInform 10-04-04], and that Compugen has divested its accelerator business [BioInform 07-18-03], it’s probably safe to say that the commercial market for these systems was never all that robust to begin with. But it’s possible that new application areas could give surviving efforts in the field a bit of a boost.
TimeLogic declined to discuss any details of its development efforts in this area, but Michael Sievers, the company’s senior manager of R&D, said in response to a request for comment from BioInform that “TimeLogic has a 10-year history of addressing life science computing bottlenecks. We recognize that analysis of proteomics data is a growing computational burden and are developing fast, cost-effective solutions for this and other research areas.”
Martin Gollery, formerly director of research at TimeLogic and currently associate director of the center for bioinformatics at the University of Nevada at Reno, speculated that “it’s not that the commercial guys are not interested” in developing dedicated proteomics FPGAs, “it is simply that they have not had the resources to pull it off.”
David Chiang, CEO of Sage-N Research, agreed. In an e-mail message to BioInform, he said that while “FPGA technology can readily offer 10x, 100x, and in some cases 1,000x or more acceleration over PCs for certain classes of applications, the challenge has been cost — both in terms of porting costs and final product costs. I believe that is why there continues to be academic interest in FPGA-based algorithm acceleration … but relatively few products.”
Gollery added that academic development of FPGA-based systems will likely become more common in the future. Commercial vendors, he said, “can’t hope to come up with algorithms for everything that everyone wants to do.” Meanwhile, he said, “Generic FPGA boards are cheap, whereas power, AC, personnel and floor space are expensive. So the answer is to come up with solutions yourself.”
Some academic groups beyond the Blueprint team are already working to extend FPGA-based methods beyond sequence analysis. Martin Herbordt at Boston University, for example, is developing FPGA-based systems for analyzing microarray data, modeling rigid molecule interactions, and processing repetitive structures in sequences. David Eisenberg’s lab at UCLA is using FPGAs to reconstruct biological networks, as described in a recent Nature Biotechnology paper, “In silico simulation of biological network dynamics” [2004; 22 (9): 1017-1019].
Hogue said that his team is also looking to extend the approach to “other high-compute problems in systems biology,” including “some of the bigger problems of integrating large amounts of protein pathway interaction complex and network data, and coming up with realistic simulations of that data.”
The Blueprint team is also working with Rose at the University of Toronto to improve the scalability of the hardware design.
“I believe there’s a great future in specific problems done on FPGAs, since we’re not seeing standard computers get much faster,” Hogue said.