• Jack Collins, scientific computing manager, NCI’s advanced biomedical computing center
• James Evans, CSBi research scientist; Whitehead MIT bioimaging center
• Dmitri Mikhailov, informatics and knowledge management; Novartis Institutes for BioMedical Research
• Andres Rodriguez, CEO, Archivas
• Mark Collins, senior product manager, informatics; Cellomics
Even die-hard computer scientists can find themselves snoozing when it comes to issues of storage. But with the roomful of experts rounded up for GT’s third roundtable in the high-performance computing series, this conversation was our liveliest yet. In a conference room in Boston, they tackled head-on some of the main storage issues — how to handle metadata, global access, the truth about backups, and more — to bring you the inside information on this aspect of computing. The following pages include excerpts of their discussion.
Genome Technology: Our focus is on storage in high-performance computing. Let’s just jump right in: what’s the biggest problem people are facing right now? Is it simply too much data, or is it something more complicated?
Jack Collins: It’s having all the machines talk to each other — we have multiple vendors, multiple operating systems, and you want to be able to share the data. So you have to have some good way to have very fast access and some sort of storage area network that works and talks to all of the machines and updates your databases — a fairly centralized way that you can do your backups. If it takes more than a day to back up all your data, you’re not going to back them up much.
Evans: I would echo that. We have a wide range of hosts: IRIX, SGI, Linux, and Windows. To be able to share image data between those, it’s not just getting the data to each of them but getting it to them at a decent rate. They’re these big data sets and you don’t want to replicate them for each of the different operating systems.
Rodriguez: What’s the capacity of the rate?
Evans: We can multiplex 2-gigabit Fibre Channel links [and] potentially get about a gigabyte per second.
Jack Collins: [What] about lots of groups that want to use the same data? [You might have to] have everything sitting in one spot and then pump it out over the network to a user 60 miles away and have them interact.
Mark Collins: That whole idea of global sharing of data — that’s another challenge of data: do I replicate it everywhere (it’s going to cost me quite a few dollars to do that) or do I beef up my network? Which do I invest in, thicker pipes or replicating my data silo? That’s a tough call.
Mikhailov: I’d say in addition to vendor complexity — because you have to deal with hardware from multiple vendors — we also have additional application complexity. You have different applications with different requirements for computational resources. Some applications are compute-intensive while others have data or memory I/O as their main bottlenecks.
Jack Collins: One of the problems that we’re running into in proteomics is that you have a lot of applications that are written in a PC-centric [way] and they just don’t scale. So you try to scale to larger high-performance computing platforms or a cluster or something like that [and] the software’s not there to be able to do it properly.
Evans: That brings in the idea of workflow issues. [If you take] something that works really well on a UNIX platform and you want to do this one thing that’s really nice on a PC but you don’t want to have to babysit the data flow from one to another — but you get some specialized software that just happens to run only on a laptop …
Mark Collins: [Another issue is] what we need to back up and what we need to keep. We say we need to be able to expand storage to petabytes plus — so what are we keeping and what can we throw away? Certainly in image-based stuff, is the image really the raw data or is transforming that into numbers the raw data? With the rise of image-based techniques in medicine and pharma and basic research, that’s a heck of a lot of images.
Rodriguez: Prior to founding Archivas we did the archives for the New York Times, and I fought tooth and nail to get the paper to keep the raw data — they wanted to compress it and just keep the PDF files of all the images. The message is there are always new techniques for processing data and you never know what the target platform is going to be in the future. You always want to keep the raw data, so you want to design systems that will keep your raw data in the most affordable way.
Jack Collins: None of our scientists want to throw away any bit of data because they want to go back and reanalyze it. Some of the ones that are working out procedures for the FDA have to keep all of their data. The other thing is keeping track of where all of it came from and where you stored it all and what type it is so that you can go back and retrieve it.
Rodriguez: It’s all about managing the metadata.
Jack Collins: You may have it on tape, but if you can’t find it, it’s worthless.
Mark Collins: All our customers and all the people I’ve ever met in pharma say, ‘Oh, I just can’t throw anything away at all — FDA regulations say you can’t.’
Mikhailov: Months or years down the road we’re going to come up with a new algorithm and have to reanalyze [the data].
Mark Collins: The challenge is, we can probably store it [but] we have to have metadata to find it.
Jack Collins: You have to make sure you store enough data to be able to answer the question that you ask six months or three years later that you didn’t know you were going to ask when you took the data to start with.
Evans: That brings up a good point. Image data is so noisy and so complex that not only is the data huge, but because you understand it so poorly you’ve got to have a ton of it. So you’ve got a ton of huge data that you need to get to a CPU quickly, so you need really high performance. For these 3D [images] you need multiple gigabytes per second of memory [bandwidth]. The thing that you wait for then is I/O performance.
Jack Collins: If you actually do a lot of processing on it, you’re also creating a ton of data on the back end. So now you’re sucking it in at a phenomenal rate [and] you’re shooting it back out at a phenomenal rate. You have to be able to annotate it somewhere along the way. It’s not just static data either — it’s now dynamically changing.
Evans: We see roughly anywhere between five and 25 times data expansion from raw data.
Rodriguez: What’s your typical ratio of data that you’re keeping online versus data that’s been sent to tape systems?
Evans: Currently we have everything online — between 10 and 20 terabytes.
Mark Collins: We have about 12-plus terabytes online at our facility, and probably something like 25 terabytes backed up on tape.
Rodriguez: So you never expect to go back to that data?
Mark Collins: No. We scrubbed the data to the point that we just kept the stuff that was interesting. That’s what we find our customers do too — they’ll stick it on tape [and] back up what they’re not really interested in.
Jack Collins: We have people who have wanted to go back and get data that was 10 years old. The problem is, it’s in the archives but we don’t have a machine that reads the format anymore because the operating system is gone. You have to have some sort of common format that 10 years down the line you can still get back your data.
Rodriguez: Our argument is that you want to keep it online. Because when it’s online it’s always available. The moment you put it on a physical device that’s off the network it begins to die.
Mark Collins: It’s a tough task. If you could say we’ll keep it online, it’s as cheap as a tape and it’s as reliable in terms of disaster recovery, then you might have an argument to play with. But until you can do that [it’s no contest].
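The figures quoted in the discussion above (about a gigabyte per second of throughput, and 10 to 20 terabytes kept online) allow a rough check of the one-day backup rule of thumb. The following is a minimal sketch, not anything the panelists described: the function name `transfer_hours` is hypothetical, decimal units are assumed (1 TB = 1,000 GB), and the rate is treated as sustained end to end, which real backup systems rarely achieve.

```python
def transfer_hours(data_tb, rate_gb_per_s):
    """Hours needed to move data_tb terabytes at a sustained rate in GB/s.

    Assumes decimal units (1 TB = 1000 GB) and ignores protocol overhead,
    tape-drive limits, and contention from other users.
    """
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = data_tb * 1000 / rate_gb_per_s
    return seconds / 3600


# Illustrative inputs only: 20 TB online, at two assumed sustained rates.
for rate in (1.0, 0.1):
    hours = transfer_hours(20, rate)
    verdict = "fits in a day" if hours <= 24 else "exceeds a day"
    print(f"{rate} GB/s: {hours:.1f} h ({verdict})")
```

Under these assumptions, 20 terabytes at a full gigabyte per second takes about 5.6 hours, while the same data at a tenth of that rate takes more than two days, which is the situation in which backups stop happening.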
Genome Technology: How about a tip for our readers — how do they know how much storage they’re going to need, and how do they test it?
Mark Collins: It will depend on the kind of biology you’re doing; it will depend on the number of compounds you’re screening, perhaps, or the number of targets.
Jack Collins: That also depends on what kind of computation you’re going to do on the data. We have some computational chemistry programs or other database analysis programs where you’re going through and you’re running Pfam on the entire protein database. You may need a terabyte just of scratch disk space.
Mikhailov: Look at how much you’ve been using in the past — historical data’s really useful for this.
Evans: That’s definitely one of the things that we’re doing: keeping a record of which users and which groups [are collecting data], how quickly they collect it, what they do with it, and where they keep it.
Jack Collins: There’s also the problem of new technologies. A new microscope comes in and all of a sudden you don’t get twice the amount of data, you get 10 or 20 times the amount of data. You get a new mass spec that comes in and the resolution isn’t a factor of two, it’s a factor of 10 or 100. Technology’s changing by orders of magnitude and computers are changing by Moore’s law.
And every time you add a [software or infrastructure] layer, it has to work with the previous layer. So if there’s a change in the operating system, if there’s a new patch, if there’s a new application — you’re sitting there waiting for one thing to fail so that the whole thing comes down.
Mark Collins: You can have the whole thing fall down like a house of cards because one driver of one application gets updated.
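The sizing advice in this exchange combines two ideas: extrapolate from historical usage, then allow for a new instrument that multiplies the data rate by 10 or more. The sketch below is one hedged way to put numbers on that, not a method the panelists described. The function names, the `instrument_factor` parameter, and the sample figures are all hypothetical, and a straight-line fit is assumed to be adequate for the historical trend.

```python
def fit_monthly_growth(usage_tb):
    """Least-squares slope (TB per month) and intercept for monthly usage samples."""
    n = len(usage_tb)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    mean_x = (n - 1) / 2
    mean_y = sum(usage_tb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_tb))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_tb(usage_tb, months_ahead, instrument_factor=1.0):
    """Projected storage in TB after months_ahead more months.

    instrument_factor scales the rate of *new* data (for example, 10 for a
    microscope that produces ten times as much); data already stored is
    left unchanged.
    """
    slope, intercept = fit_monthly_growth(usage_tb)
    current = intercept + slope * (len(usage_tb) - 1)
    return current + slope * instrument_factor * months_ahead


history = [10, 11, 12, 13]  # made-up monthly totals in TB
print(forecast_tb(history, 12))                        # 25.0
print(forecast_tb(history, 12, instrument_factor=10))  # 133.0
```

With this made-up history, a year of steady growth needs 25 TB, but the same year with a tenfold jump in data rate needs 133 TB, which is why a forecast built on history alone can fall short when a new instrument arrives.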
Genome Technology: What about solutions from other industries that can be implemented here?
Mark Collins: It’s an interesting point as to whether the challenge we face in the life sciences is the same challenge that banking or insurance or retail face. Wal-Mart must have a huge data storage problem — they must have solved it or they would’ve gone out of business.
Jack Collins: Wal-Mart probably has some solution for its data management — unfortunately, I think their budget for solving those problems is larger than what our budget is for solving those problems.
Mark Collins: Obviously Wal-Mart thinks it is significantly important to their business to invest heavily in this. Maybe what we haven’t done is made the argument in life sciences that really it’s an information problem and the money you spend on the data management part of it affects the bottom line.
Genome Technology: If you could blue-sky it, what’s the breakthrough technology needed in this field?
Mark Collins: Part of it is hardware — faster, high-density disks — but the other part of it is some kind of layers of abstraction to allow us to abstract storage, abstract processing, and abstract the metadata.
Jack Collins: Access control for the users, and the intelligent metadata. There are systems out there where you can tell it what the metadata is — you type it all in. But I don’t want to do that, I don’t have time. It should discover it.
Mikhailov: I’d say technology that allows efficient data access and sharing, especially among remote sites across a wide area network.
Evans: We need scalable performance independent of the size of the storage so we can dedicate resources on the fly to new technology.
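As a small illustration of the wish for metadata that is discovered rather than typed in, the sketch below walks a directory tree and records what the file system already knows about each file, plus a checksum. It is only a sketch under stated assumptions: the function name `discover_metadata` and the record fields are hypothetical, and it captures only generic attributes, not the instrument or experiment details a laboratory would also need.

```python
import hashlib
import os


def discover_metadata(root):
    """Return one metadata record per file found under root."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Hash in 1 MiB chunks so large image files are not loaded whole.
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            records.append({
                "path": os.path.relpath(path, root),
                "bytes": info.st_size,
                "modified": info.st_mtime,  # seconds since the epoch
                "extension": os.path.splitext(name)[1].lower(),
                "sha256": digest.hexdigest(),
            })
    return records
```

Records like these could be loaded into any searchable catalog. The checksum also gives a way to confirm later that an archived copy is still readable and unchanged.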