Genome Technology assigned seven industry experts a hypothetical task: Build a computing infrastructure for your startup genomics company. They raised more questions than answers.
Jamie Cuticchia, Director, Genome Database, Bioinformatics Supercomputing Center, Hospital for Sick Children, Toronto
Nat Goodman, Senior Vice President, 3rd Millennium, Cambridge, Mass., and Genome Technology’s IT Guy columnist
Linda Kirsch, President, Scientific Sales Solutions, San Carlos, Calif.
Jochen Kumm, Head, Computational Biology/Bioinformatics, Roche Pharma, Palo Alto, Calif.
Kenneth Kupfer, Technology Manager, Bayer, Berkeley, Calif.
Jon Morris, CEO, biomedical informatics startup Genmatics, La Jolla, Calif.
Matthew Trunnel, Director of Research, Blackstone Technology Group, Worcester, Mass.
After agreeing that their fantasy firm would validate genomic drug targets, that they would probably begin by buying access to a variety of genomic databases, and that their computing budget would be unlimited, our seven panelists hashed out an agenda for installing a genomics computing infrastructure. What follows is an abridged version of their conversation, which took place in March over turkey sandwiches at the Fairmont Hotel in San Francisco.
Goodman: So, we are the committee responsible for introducing a high-performance computing system into our company. How do we start? What are the issues we need to think about as we set out on this exciting venture?
Kirsch: Should we ask ourselves what’s the most compute-intensive aspect of the process?
Cuticchia: Is it going to be a process or processes? If it’s just going to be a single process, then that may steer us toward one type of high-performance solution. If we’re going to be doing multiple processes, that may steer us toward an entirely different one. Sort of like the cluster versus the scale or multiple processor models.
Kumm: So you’re asking, are we doing one thing very well, or are we doing a little bit of proteomics and a little bit of genomics?
Morris: Do you think moving forward that we’ll be moving more and more into distributed computing environments and collaboration and disparate datasets?
Kumm: It’s true that if you have one thing that you need to do and you get dedicated hardware for that thing, and that’s all you need to do, at least you’re going to outperform most kinds of Linux cluster environments. Not always, but in certain cases.
Kupfer: We’re right down to Linux! The deliverable is going to be specific genes that satisfy some partner’s requirements for targets and that can either be validated at some level or not validated. But certainly bioinformatics is going to be the early part.
Cuticchia: Increasingly [bioinformatics is] going to be integrated through every stitch.
Goodman: So what are the starting points for this search?
Kupfer: We’re going to have to run BLAST and other homology searches, probably attuned to our own parameters. So there’s a certain scale.
Trunnel: Focusing at the application level will get us to computing infrastructure more quickly than trying to design a business model for a hypothetical company. Large-scale genomic homology searching and genome assembly are two very different computational problems, so they’re good things to bring into this.
Kupfer: So far it’s just a kind of a standard 16- or 32-CPU platform.
Trunnel: Well, let’s talk about what standard is. There is a standard genomics or high-throughput-bioinformatics computing platform. HPC is often the wrong term. A lot of computing is much better characterized by high throughput rather than high performance. High throughput is well matched to what you were describing: 16 or 32 CPUs that you can throw at a problem that divides very well based on the data — an embarrassingly parallel task that you can break down at the data level.
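To make the data-level split concrete, here is a minimal Python sketch of an embarrassingly parallel job. It is an illustration only, not anything the panelists ran: the per-sequence work (`gc_content`) is a hypothetical stand-in for a real analysis such as a homology search, and the names `chunk`, `score_chunk`, and `run_parallel` are invented for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk(items, n_chunks):
    """Split items into n_chunks contiguous slices of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def gc_content(seq):
    """Hypothetical per-sequence task: fraction of G and C bases."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def score_chunk(seqs):
    """Work done by one CPU: it needs only its own slice of the data."""
    return [gc_content(s) for s in seqs]


def run_parallel(seqs, workers=4):
    """Farm one chunk out to each worker process, then merge in order.

    On platforms that spawn new interpreters for workers, call this
    from under an `if __name__ == "__main__":` guard.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(score_chunk, chunk(seqs, workers))
        return [score for part in parts for score in part]
```

Because no chunk depends on another, adding workers changes only how many chunks are processed at once, which is the sense in which such a job is high throughput rather than high performance.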
Kupfer: Is there a difference between serving a community of people doing distributed queries where you want to turn around the query very quickly, versus doing a large number of calculations on the large number [of nodes] and having it all be fast?
Trunnel: Where you have lots of people doing individual queries who want a rapid response, that’s a real high-performance issue. You want minimal latency. You want to get your answer as quickly as possible.
Kupfer: For target identification I would claim we’d need both. Pharma might only need high performance, but a target-ID company is expected really to leverage genomics so they’re going to need high-throughput computing to do these massive annotations, and they’re also going to need a high-performance system to service their biologists internally. They’re ultimately going to have to mine the data.
Morris: This gets beyond just the actual box you’re using. You’re talking network topology and how you’re actually doing your distribution, because presumably you’re not in four walls and one ceiling under one roof. It goes beyond just the performance.
Kupfer: That’s a challenge for a big multinational.
Morris: It’s true just across the US as well.
Kupfer: I was thinking of a couple of buildings all within a reasonably close geographic area.
Kirsch: Very few organizations are like that.
Trunnel: Even if we don’t move into the wide area, there is a significant service component in design and implementation, especially for distributed infrastructure, even if it’s just a single cluster of 50 or 100 machines. It’s different from buying one large box that rolls in from a vendor.
Goodman: Can you elaborate on that?
Trunnel: As your problem grows, if you can scale your infrastructure by throwing CPUs at it, which has been the case for analyzing genomic data, the administrative cost of ownership for 64 individual computers is potentially much higher than for a single multiprocessor box that has 64 CPUs in it. On the surface, the cost of a Linux cluster may be much less than buying a 64-way high-end SMP machine, but it can take a lot more work to keep it running and to make it run efficiently. This is something that the industry is really learning the hard way. [Blackstone has] worked in a number of large-scale genomics environments (a 500-CPU cluster, very clever people running the cluster, running large-scale BLAST searches) that are seeing about 20 percent utilization on their resources because there are scaling issues that you run into.
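The utilization figure can be turned into a rough back-of-the-envelope calculation. The sketch below is only illustrative: the 500 CPUs and 20 percent come from Trunnel's example, while the price per CPU is a made-up placeholder and the function names are invented for this example.

```python
def effective_cpus(total_cpus, utilization):
    """CPUs' worth of useful work actually delivered by the cluster."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_cpus * utilization


def cost_per_effective_cpu(price_per_cpu, utilization):
    """Sticker price per CPU scaled up by the capacity left idle."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_cpu / utilization


# Figures from the discussion: a 500-CPU cluster at about 20 percent.
busy = effective_cpus(500, 0.20)

# Hypothetical price of 1,000 per CPU, chosen only to show the ratio.
real_cost = cost_per_effective_cpu(1000.0, 0.20)
```

At 20 percent utilization, a 500-CPU cluster delivers the work of roughly 100 fully used CPUs, so each useful CPU costs about five times its purchase price before any administration cost is counted.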
Morris: It would be a disservice to our employers if we didn’t actually bring this up. We may not solve this here, but we wouldn’t want to ignore it, because, as you said, you can go out and make a significant investment in your computing environment and get 20 percent performance or intolerable performance on the part of your nodes, wherever your nodes are, and your network, if you don’t take that into account.
Trunnel: One of the other things that’s been driving the industry in the last couple of years is that IT resources have been getting more expensive. It’s hard to find good IT people.
Kirsch: They’re just not available.
Kupfer: Ah, that’s the key.
Trunnel: If you have a small 10-CPU BLAST test cluster and you decide to scale it to 50, are you going to add IT staff to do that? I talked to one company that’s going from 100 to 1,000 CPUs. “How many IT staff do you have?” “We have three.” “Are they busy?” “Yes.” “You’re going to make your infrastructure 10 times as big? Are you going to hire any more IT staff?” “Well, no, we weren’t planning on it.”
Cuticchia: I found that cluster computing represents a significant savings if the human resources are free. Everybody knows that a nice Linux cluster performs great when you have a couple of graduate students you can chain to it. It becomes an incredibly large drain on human resources to keep this going, both with respect to keeping the hardware running optimally and because bioinformatics software right now is really in its infancy when it comes to being ported efficiently to this type of environment. There’s only a handful of programs that scale well on clusters right now.
Goodman: Since we’re down here in the dirt of Linux clusters versus other things, let’s stir the dirt and see if we can find some good drugs in it. We hear that Linux clusters are a lot of work to maintain, but then we also have the examples of Incyte with their 3,000 CPUs.
Kupfer: But we’re a target-ID company. Incyte designed for assembly. I think we stay away from that problem.
Goodman: Well, look at the examples of the kinds of systems people build. Incyte built this cluster, so you’ve got an example that’s sort of pure Linux rack-mounted farm management. Then you get someone like Celera who will go out and buy 1,000 Compaq Alpha CPUs. Then you have academics who buy beige boxes from the local PC store and put together 10 to 100 CPUs on shelves and wire them together. Are all of these reasonable choices? Look at the price difference. The beige box that the academic group might use is maybe a factor of five cheaper than what Incyte probably did, which in turn is another factor of five cheaper than what Celera did.
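Goodman's two "factor of five" steps compound, which a few lines of arithmetic make explicit. This is a sketch of his rough estimate only: the baseline of 1.0 is an arbitrary unit, no actual prices are implied, and the function and key names are invented for this example.

```python
def relative_costs(step=5.0):
    """Relative cost per CPU for the three build styles Goodman lists.

    The beige-box cluster is the arbitrary baseline (1.0); each step up
    multiplies the cost by the same rough factor he quotes.
    """
    beige_box = 1.0
    rack_mounted_linux = beige_box * step
    alpha_cluster = rack_mounted_linux * step
    return {
        "beige_box": beige_box,
        "rack_mounted_linux": rack_mounted_linux,
        "alpha_cluster": alpha_cluster,
    }


costs = relative_costs()
```

On his numbers, the Alpha-based system comes out at roughly 25 times the per-CPU cost of the beige-box cluster, which is the gap the panel goes on to debate.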
Kumm: I’m not sure I buy into that. It’s definitely economically viable to get a rack-mounted Linux cluster for less than it would cost to buy a whole bunch of PCs and put them together, simply because there’s an economy of scale in putting together nodes for the sole purpose of being a computer cluster.
Goodman: OK, then, so just using the rack-mounted Incyte example versus the Celera example. What’s the tradeoff there? Why did one work for one group? Both are smart teams with very sophisticated, cutting-edge science.
Trunnel: It depends entirely on the applications in which you’re working. If all you care about is BLAST, you can’t beat a high-clock-speed chip. So the cheaper you can get a high clock speed, the better. And if that’s all you’re doing, then Intel-based is the way to go.
Kupfer: These are assembly problems. I thought there was a little bit more of a cooperative effect going on in an assembly problem.
Trunnel: Right, that’s an interprocess communication thing. But when you get down to the processor level, the big difference between Alpha and Intel is floating-point performance. So if you’re doing work that’s floating-point intensive, and it just happens that a lot of bioinformatics work is not floating-point intensive, then the Alpha is a clear win.
Cuticchia: There’s a whole class we’re ignoring here. There’s the Alpha versus Intel cluster approach. There’s also the classical SMP computing architecture — the Sun Starfires and the SGI Origins. In our case we had to go with that type of architecture because we had to serve up about 200 different programs to approximately 3,000 researchers in 280 laboratories around Ontario and there was just no way we could bring in a Linux cluster-type architecture to serve that. So I think when you have very well defined projects then you can start looking at Intel versus Compaq versus that.
Kirsch: And high throughput versus high performance.
Trunnel: It’s also something even more. The big argument for Unix machines in general is that they’re general purpose. You can do all sorts of different things with them. That’s why people buy Unix machines instead of buying dedicated hardware, or accelerators (which is something we haven’t talked about) as a piece of computational infrastructure. But general purpose is a big value-add. I think that clusters built appropriately are not truly general purpose. To build them well you build them with a specific application domain in mind.
Morris: As bioinformatics evolves, as the questions change, have you built yourself into a corner? If you do that, are you not leaving yourself without an out for the evolution three years from now, when the questions that you’re asking are different or the datasets that you’re working on are different?
Kirsch: As long as you know you’ve done that. You have to make decisions at some point in time. You have to make some purchases. You have to build it. And then it changes. And so at some point you … make different decisions.
Cuticchia: While those processes are going on, Moore’s Law is being applied too. Your $10 million computer is worth $5 million, then $2 million, then $1 million, and it’s time to put it on eBay. You have to have your technology rollout, your technology upgrade path and plan in there as well. You don’t just bring in the high-performance computer and then let it sit there.
Kupfer: If the primary job of this thing is to serve people’s requests, you’re still going to be right. People are going to have more and more complicated requests, databases will double. Both systems are going to have to be upgraded.
Goodman: Let’s look at some of the issues that are raised here: We’ve talked about Linux versus more powerful clusters or SMPs. The point is being made that the Linux clusters probably give you the most bang for the buck, but it has to be specialized for an application. And if it’s a well-defined, characterized application that happens to be pleasantly parallel, which many of them are, then a Linux cluster can be a very effective solution. It would seem that in this company we’re designing some of our applications to fit in that space so it might be that a Linux cluster could be one element of our computing infrastructure.
Kirsch: Does the ASP model offer any opportunities? Do you think it’s something that can be utilized?
Trunnel: It’s something that provides a very valuable resource. There’s no reason technically that it can’t be a viable component of a computational solution. I think there’ve been a lot of things getting in the way of the development of the business and will continue to be — people are concerned about various aspects, be it security, be it reliability, whatever.
Kirsch: Those are solvable issues, aren’t they?
Morris: The issue hasn’t been the technology. The issue has been failure or concerns on the part of adoption because of the business. Nobody wants to let their data outside of their own environment. People don’t want to have data being managed in someone else’s space.
Kirsch: But other industries have done that and done it successfully. In even more secure areas, like our money.
Cuticchia: Do you see ASPs being something that’s going to be readily adopted by the pharmaceutical companies, or is it just going to create a warrior-like atmosphere inside the bioinformatics departments inside the pharmaceutical companies? Let’s assume that security is a given: This is the safest, most robust system in the world. Are pharmaceutical companies still going to be ready to adopt that, or is there going to be resistance from IT, bioinformatics, the research group?
Kupfer: In pharma, IT and bioinformatics are two separate bodies, and in general they’re warring and fighting.
Kumm: From my point of view it’s already that. The IT group is separate and they provide that service. So if they can do it cheaper than an outside group that’s great.
Morris: But they can’t. That’s part of the problem. One of the things we’ve started to see in the last two years is, on the part of IT, more outsourcing in pharma. But you’re doing things for desktop management, you’re doing things like call-center support. You’re outsourcing training or strategic partnerships, depending what buzz phrase you want to give it, but the value is that you’re offloading it so you don’t have to hire the 150 IT specialists — you’re able to partner to have that done. I think the other challenge if you look at sending that to other environments is when you’ve got 25 or 50 or 200 homegrown in-house programs that weren’t commercially purchased, so you’ve built your own environment, you’ve got 12 different sites.
Kupfer: I think we could outsource. There are probably a couple of specific points. One of them would be use of software. We were talking so much about the hardware. What about what we’re actually going to run on these machines so that people will use those tools? I almost consider that part of the infrastructure, because really I consider that bioinformatics is actually doing the data mining, doing the analysis. Really you need two steps. You need hardware infrastructure and you need software infrastructure. And then you’re ready.
Goodman: How is that going to drive us? I think Jamie made the point that he had to run hundreds of different applications … there’s a lot of code and software that’s going to have to run on this infrastructure. How’s that going to drive us? Does that drive us to SGI, does that drive us to Sun? Do we need a mixture of different boxes because of that?
Kupfer: The answer is probably going to be that you need a mixture just to deal with the software and compiling issues. We were already at the point where we needed two systems. And my guess is that one of those systems that we were calling the high-performance system is actually going to be heterogeneous. It’s going to be a bunch of different boxes that people will tap into on the Web, where some of them are running some software and some of them are running others. And the high-throughput system is probably going to be designed more to do whatever this particular target-ID company decides is its core competence.