In mid-May, Genome Technology hunkered down with four high-performance computing experts to continue its IT solutions roundtable series. After going through cost issues in the last discussion, this conversation focused on performance questions. What’s the real secret to optimizing your infrastructure? During the hour-long event, our gurus examined HPC from clock speed to ASICs to the interaction of biologists with computer scientists. As you’ll see from the following excerpts, they didn’t need much prodding from GT staff to get the debate going. Take it away, guys.
Christopher Botka, computational biologist, Bauer Center for Genomics Research at Harvard University
Kent Gilson, chief technology officer, Starbridge Systems
Michael McManus, vice president, BioIT group, Fujitsu America
John Morris, manager, technology operations — bioinformatics, Wyeth Research
Genome Technology: Starting broadly, what is the main thing people need to look at in terms of performance? Is it really all about the hardware, is it the software, or is it a combination?
Gilson: It’s a combination. One thing that most people don’t realize is that it’s not about CPU speed, megahertz, or the clock; it’s more about how machines are coupled together, the architecture of the machine. Algorithms are bound by the architectural capability of the machine. Clusters, for example, are loosely coupled, so you’ve got to have problems that are embarrassingly parallel, without a lot of communication between the processors.
It’s a combination of your hardware, your architecture, and your algorithms that you choose to actually map onto that architecture. That’s really the big problem: balancing all those issues and trying to get your science done.
Botka: I think you hit the nail on the head. You have to define the problem very well prior to making some commitment to whatever hardware, software, architecture, [or] algorithm you want to do. It seems to me that the average — even the above-average, computer-savvy — biologist doesn’t have enough depth in the architecture domain to be able to talk to whoever would be in charge of buying or designing the hardware. The interplay between the two groups to figure out what the problem is, exactly what they want to do, is the most important place to start.
In our building there are four examples of completely different high-performance computing architectures, and they all do some things better than others. One general-purpose area that we’ve built is a cluster; it’s somewhat unusual to have a loosely tied-together cluster of commodity machines as a general-purpose tool.
McManus: As a scientist by training, I would agree that you need to look at the solution — what is it that you’re trying to accomplish? If it’s software based or if there’s an algorithm, you need to understand if it was written for a PC or, if you need more horsepower, can it even run on another machine.
Botka: Faster’s always better. And faster’s more better in some scenarios than others. If you don’t have to wait, you get to do the next experiment earlier.
McManus: Sometimes there are architectural issues you have to consider — in a massively parallel situation you may have to rewrite things a little bit.
Gilson: What about that interface? That seems to be the most important issue: you’ve either got to be disciplined in biology and bioinformatics and computer architecture, or the communication mechanism between them has to be just ultra-tightly coupled.
Botka: I think that’s what it has to be. You really need to talk to the people who understand the IT side of it and there has to be someone on the biology side that can [explain the science].
Gilson: It’s a bidirectional dialogue.
McManus: [There’s an opportunity] to use some visual programming tools to make it easier for the bioinformaticist or the biologist.
Botka: If there’s a way to do visual programming that would cut out the copy and pasting through Excel to get things to a production algorithm faster, that’s what would really alleviate a lot of the pressure that falls on the computational biologist to write small pieces of code. That will help the pipeline move more smoothly.
One of the weak points in that whole area is the language — a minimal descriptor language for science that can be used to develop a context for your programming environment. You have to be able to describe the biological problem so the biologists can understand it but in a small enough vocabulary that it can be coded against.
Gilson: Historically the problem with building an abstraction in a computer language is you get farther away from the actual architecture so there’s more and more work that has to be done with the compilers. The problem with compilers is that they don’t have a mechanism for taking abstraction and reducing it into the architecture.
McManus: I think the scientist doesn’t want to be bothered with any of this stuff, they just want it to work. It’s the informaticist that’s in between that sort of has to carry a bit of both camps.
Botka: I’ll change gears for a second. Say you can solve 20 or 30 of the algorithm problems fast; now you have 20 or 30 sets of results that you have to integrate somehow. It’s not a trivial task. If it took 16 times longer to do your Hmmer and your Blast, then you had that time period in between to digest and figure out how you’re going to integrate those two datasets. Now you don’t have that time anymore and you want to look at the datasets — the thing that takes most biologists a lot of time [is] once you get those results, how do you integrate your Blast and your Hmmer results? Can the visual programming environment piece be stuck on top of the data integration piece? We’ve made the first step really fast — how do we make the second step at least come close to being efficient?
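Editor's sketch: the integration step Botka describes can be pictured as grouping each tool's hits by query sequence. The snippet below is a minimal illustration only. The function name, the record layout, and every value in it are hypothetical, and it does not parse real Blast or Hmmer output files.

```python
def integrate_hits(blast_hits, hmmer_hits):
    """Group similarity hits and domain hits by query sequence ID."""
    merged = {}
    for seq_id, subject, evalue in blast_hits:
        entry = merged.setdefault(seq_id, {"blast": [], "hmmer": []})
        entry["blast"].append((subject, evalue))
    for seq_id, domain, evalue in hmmer_hits:
        entry = merged.setdefault(seq_id, {"blast": [], "hmmer": []})
        entry["hmmer"].append((domain, evalue))
    return merged


# Hypothetical, hand-made records: (query ID, hit name, E-value).
blast_hits = [("seq1", "subjectA", 1e-30), ("seq2", "subjectB", 1e-5)]
hmmer_hits = [("seq1", "domainX", 1e-12), ("seq3", "domainY", 1e-8)]

merged = integrate_hits(blast_hits, hmmer_hits)

# Queries with evidence from both tools, listed in sorted order.
both = sorted(s for s, hits in merged.items() if hits["blast"] and hits["hmmer"])
```

Keying everything on the query ID is a design choice made for this sketch; a real pipeline would also have to reconcile identifiers and score thresholds across tools.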
McManus: On a higher level, what you’re saying is that if we have access to high-performance computing and we are producing results faster, it doesn’t solve the problem if we can’t interpret or make sense of those results just as fast.
Botka: It just pushes the bottleneck out. Maybe the logical first step is to stick the programming-free environment on the people who are actually doing the programming first, resolve any issues you have there, and after that put it onto the biologist.
Gilson: It’s the OS model. It’s the idea that before OSes were around, everybody built their own keyboard handler. The idea of capturing that scientist/computer scientist interface once and doing it right and using that new ontology that biologists can actually interact with.
You mentioned that you’ve got a number of different architectures. Can these machines talk to each other?
Botka: They were built separately hardware-wise and they’re now being merged. We have a single-memory-image multiprocessor machine that runs our microarray analysis data storage and warehousing. We have a sort of grid-within-a-cluster model that’s a little bit different from [a regular grid]; they talk through shared storage and shared interconnect. So now we have a NAS/SAN solution that’s shared with our single-memory machine. There are another couple of machines that do things on their own and share interconnect and disk, and jobs from those machines can be directly shunted off to the cluster if they’re parallel jobs.
Genome Technology: What about communication between processors? What’s the balance between costs and communication? What are the performance issues?
Gilson: That is the million-dollar question right there. There’s always the rule of thumb that says you’ve got to commute to compute — you always need communication, and you always need a balance. You need as much communication performance as you have computational performance and an equal amount of memory bandwidth.
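Editor's sketch: one way to picture the balance Gilson describes is a toy timing model in which compute time shrinks as processors are added while communication cost grows. The function names, the linear communication term, and all of the numbers are illustrative assumptions, not measurements from any panelist's system.

```python
def estimated_runtime(work_seconds, processors, comm_seconds_per_processor):
    """Toy model: perfectly divisible compute plus a communication cost
    that grows linearly with the number of processors (an assumption)."""
    compute = work_seconds / processors
    communication = comm_seconds_per_processor * (processors - 1)
    return compute + communication


def best_processor_count(work_seconds, comm_seconds_per_processor, max_processors):
    """Return the processor count with the lowest estimated runtime."""
    return min(
        range(1, max_processors + 1),
        key=lambda p: estimated_runtime(work_seconds, p, comm_seconds_per_processor),
    )


# Hypothetical job: 1,000 seconds of total work on up to 64 processors.
# With no communication cost, the job behaves as embarrassingly parallel.
loosely_coupled = best_processor_count(1000, 0.0, 64)

# With a chatty algorithm, adding processors eventually makes things slower.
chatty = best_processor_count(1000, 10.0, 64)
```

Under these assumed numbers the communication-free job keeps improving up to the largest machine allowed, while the chatty job is fastest on a much smaller one, which is the point of "commute to compute."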
McManus: The creation of the algorithms to run in a distributed or grid environment is not as difficult as we’ve made it sound, but it’s true that you can’t do all algorithms that way. Some will not be amenable to a grid or distributed computing environment. The other piece of the spectrum that’s missing here: you have big, symmetric, multiprocessing iron; you have grids; you have this hypercomputer idea [from] Starbridge; but one thing we haven’t talked about is that there are a lot of companies making custom ASICs, or application-specific integrated circuits, that are for Blast or Hmmer, and you have to buy a whole machine that does one thing. You’ve got this dilemma: do you spend and have a bunch of machines sprinkled around the office that are application-specific, or do you have machines that are more generally applicable that are more of a resource like a grid, or do you have big 64-processor or larger machines that are more like mini-mainframes?
Morris: We’ve gone through the entire spectrum [at Wyeth]. We’ve made a choice to try to get away from specialized application-specific or hardware-specific solutions — that is, “one tool for one problem,” but then again, our problems are usually more diverse. [We aim to use] a general computing structure that can meet the majority of the needs. You have to try to reduce complexity not to one environment but to a manageable subset that will allow you enough breadth and diversity to meet those current needs but be flexible enough to scale for future demands.
Botka: The story is not done, or has barely begun, for a lot of disciplines. [For] sequence analysis [and] comparative genomics, a lot of the algorithms have been in place for a long time and people probably aren’t going to change them anytime soon. But [for] things like microarray analysis or proteomics, the tools are still being developed, so building something that’s general purpose is much better for us and buys us more flexibility.
Morris: That model holds as long as your demands are variable. If you have an organization where you’re doing 100,000 Blasts a day, every day for the next three years, you might want to consider something like [an ASIC].
McManus: You have this range from the high-end, multi-processing machine to a souped-up desktop. Whatever it is you provide has to be generalizable enough that it can be used in a wide variety of circumstances, because once it gets pigeonholed, it gets old.
Genome Technology: What’s the takeaway here? What can our readers do in the next week to optimize their compute performance?
Botka: Set up a weekly meeting with their IT group.
McManus: There is a huge firewall between the bench scientist and the computational group. Even in pharma where they’re trying to be fully integrated, it’s really hard to get people to talk to one another. They just have different languages.
Morris: I like to think that the firewall has holes in it — bridging that gap is very important.
McManus: So the bottom line is, make sure that you have effective communication between you as a scientist and your informaticist and your IT people and any of your support people that are in the technology area. The only way that you’re going to get a job done is if you can communicate what you need clearly enough that the next person in the chain can act on it and ultimately work together to give you a solution.