While the possibilities of next-gen sequencing may have researchers all aflutter, the IT and informatics personnel are the ones left with the data management headache. The challenges of dealing with massive amounts of data being created on a weekly basis by sequencing centers both large and small can't be remedied with the most high-tech storage technology or super-sophisticated server room alone. Instead, each center must find its own way in the terabyte jungle.
Even though the Broad Institute, for example, tripled its storage capacity over the last year, the IT team's main concern was not finding the most cutting-edge storage technology, but rather one that would allow for scalability. "We made sure that we bought storage that you could add to incrementally, so the core storage is there," says Toby Bloom, director of informatics for the genome sequencing platform at the Broad. "Am I worried about backup and fault tolerance? Yes, but I don't see that it's a hardware challenge." The Broad, which had to add air conditioning to its server rooms earlier this year because of the increase in hardware, now has power capacity first and foremost in mind, rather than the allocation of adequate disk space.
In addition to the cost of power consumption and the need for backup generator systems, the inadequacies of existing networking infrastructure can be a huge problem. "Our IT infrastructure, literally the networking in the buildings, was just not ready for this data flow," says Dick McCombie, a professor at Cold Spring Harbor Laboratory. "We had 1 gig Ethernet, and that doesn't cut it when you're moving 600 GB or 800 GB bundles of files around campus, so we had to put in new cabling and new switches."
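To put McCombie's numbers in perspective, here is a back-of-the-envelope sketch of how long such a bundle takes to cross the network. It assumes decimal units and an ideal link running at full line rate, so real transfers would be slower. The 10 Gbit/s comparison is an assumption for illustration only, since the article does not say what speed the new cabling and switches provide, and the function name is made up for this example.

```python
def transfer_minutes(size_gb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Minutes needed to move size_gb gigabytes over a link of link_gbit_per_s gigabits per second.

    efficiency is the assumed fraction of line rate actually achieved (1.0 = ideal).
    """
    size_gbit = size_gb * 8  # gigabytes to gigabits (decimal units)
    seconds = size_gbit / (link_gbit_per_s * efficiency)
    return seconds / 60


# Bundle sizes quoted in the article; the 10 Gbit/s link is a hypothetical comparison.
for size_gb in (600, 800):
    for link in (1, 10):
        minutes = transfer_minutes(size_gb, link)
        print(f"{size_gb} GB over {link} Gbit/s: {minutes:.0f} min")
```

Even under these ideal conditions, a single 600 GB bundle ties up a 1 Gbit/s link for 80 minutes, which helps explain why gigabit Ethernet "doesn't cut it" once several instruments are producing data at the same time.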
To nip infrastructure issues in the bud, the Washington University School of Medicine is putting the finishing touches on a new data center designed explicitly with the challenges of next-gen sequencing in mind. In previous years, the university's Genome Sequencing Center data facility consisted of several 1,000-square-foot rooms housing file storage. And while the ever-increasing miniaturization of storage hardware shrinks the physical space a data center needs, the power and architectural requirements, not to mention the networking and informatics infrastructure requirements, keep growing.
"We already have 10 Illumina machines and eight 454 machines, so you just need more disk space for all that stuff, and we really were starting to max out electrical and cooling," says Rick Wilson, director of WashU's Genome Sequencing Center. "We wanted to start fresh and see if we could put up a building across the street that we could build all of this requisite storage and cooling in such a way that it wasn't going to be obsolete in five years as the hardware miniaturization continues."
Tale of the tape
A consistent thread that runs through most major data centers in the throes of dealing with next-generation sequencing data is the idea of a hybrid hardware configuration. Just as sequencing centers often run a range of sequencing equipment from various vendors to meet different needs, they mix computing hardware to get the most out of the technology.
"You can have disks that are really fast, but they might be a little prone to crashing, or you can have disks where maybe the access speed isn't so good but they're much more reliable," Wilson says. "What we have is a number of different types and vendors of disk units, based on the various advantages and disadvantages of hardware." They might use slower, more reliable disk units for backup and long-term storage, whereas the more high-speed hardware is used for everyday stuff and is backed up every 24 hours or so.
Before the arrival of next-gen sequencing, the Broad would back up everything to tape. But now, with an average of 10 terabytes of data coming in every week — a number that continues to rise rapidly — the IT team has had to seriously revamp its approach. "Images we're storing for 60 days, and we set a minimum time we'll store it — over time, that may shrink and then at that point, when we run out of storage, earliest goes," says Bloom. This is a notable shift in data management from pre-next-generation sequencing days when archiving all raw data was the norm.
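One way to read the policy Bloom describes is as a minimum retention window combined with oldest-first deletion once storage runs short. The sketch below illustrates that reading only; it is not the Broad's actual system, and the class, function, run names, sizes, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ImageSet:
    run_id: str      # hypothetical identifier for a sequencing run
    acquired: date   # date the images were written
    size_tb: float   # storage used, in terabytes


def select_for_deletion(image_sets, today, capacity_tb, min_days=60):
    """Return run IDs to delete, oldest first, until usage fits within capacity_tb.

    Image sets younger than min_days are never selected, even if storage is
    still over capacity afterwards.
    """
    used = sum(s.size_tb for s in image_sets)
    to_delete = []
    for s in sorted(image_sets, key=lambda s: s.acquired):
        if used <= capacity_tb:
            break  # enough space has been freed
        if today - s.acquired < timedelta(days=min_days):
            break  # list is sorted, so every remaining set is younger still
        to_delete.append(s.run_id)
        used -= s.size_tb
    return to_delete


# Hypothetical example data.
sets = [
    ImageSet("run-a", date(2001, 1, 1), 4.0),
    ImageSet("run-b", date(2001, 2, 1), 4.0),
    ImageSet("run-c", date(2001, 3, 20), 4.0),
]
print(select_for_deletion(sets, today=date(2001, 4, 10), capacity_tb=6.0))
```

Shrinking the window "over time," as Bloom anticipates, would amount to lowering `min_days` in this sketch.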
McCombie says his group has gone back and forth on the tape issue, and at least initially they were not backing up to tape. "Then we realized that there are people looking at this data and either have or are in the process of writing new base callers, for example, and then we thought maybe we should save it," he says. "I don't know how long we'll save it and how long we'll keep doing it, but I think it's about $50 per tape. The runs are $3,000 to $5,000, so it's not much to add."
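McCombie's cost argument is easy to check. The sketch below assumes one tape per run, which the article does not state, and the function name is invented for illustration.

```python
def tape_overhead_percent(run_cost_usd: float, tape_cost_usd: float = 50.0, tapes_per_run: int = 1) -> float:
    """Tape cost as a percentage of the cost of a sequencing run.

    tapes_per_run=1 is an assumption; the source gives only a per-tape price.
    """
    return 100.0 * tape_cost_usd * tapes_per_run / run_cost_usd


# Run costs quoted by McCombie.
for run_cost in (3000, 5000):
    print(f"${run_cost} run: tape adds {tape_overhead_percent(run_cost):.1f}%")
```

Under that assumption, tape adds between 1 and 2 percent to the cost of a run.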
Strict standards will be tough to come by in these early days. The question of how precious a sample is, which can often be a subjective matter, complicates things and defies a strictly date-stamp-based approach to scheduling backup and deletion. Just as the informatics for old-style Sanger sequencing eventually became standardized, next-gen informatics should settle in too, though that will take some time given the still-rapid pace of development. "The field is so very new, so there's probably going to be new base callers, new quality value assignment software," says McCombie. "Once it stabilizes and appears that if you throw something away you're not going to want to reanalyze [it] with the next new program, then I expect people will probably start tossing data, more of the image data, which is probably about 75 percent of the run."
In terms of data management, and specifically the issue of tape backups, ultimately all agree that the importance of a good filtering and backup plan cannot be emphasized enough. "What we're really learning as we start going through some of these early projects with these technologies is what we need to keep long term, what we can keep short term, and what we throw away after each run," says Wilson. "It's human nature to want to save things as long as you can, so, we have to train ourselves to be a bit more proactive in cleaning stuff out and getting rid of things we don't need anymore."