BOSTON — A flood of data from next-generation sequencing instruments is straining the IT infrastructures of some of the world’s leading genome centers, according to speakers at Cambridge Healthtech Institute’s Bio-IT World conference, held here this week.
During the conference, IT directors from the Wellcome Trust Sanger Institute, the Broad Institute, and the Washington University Genome Sequencing Center said that the IT systems that they developed to sequence the human genome — and perfected over several subsequent years of Sanger sequencing — are insufficient to handle data from new instruments made by 454 Life Sciences, Illumina, Applied Biosystems, and Helicos.
As these new instruments are coming online, IT managers are struggling to design new systems that can capture, analyze, manage, and store several terabytes of data per day.
For some, the challenge is a familiar one. Phil Butcher, head of information technology at the Sanger Institute, said that the current situation is reminiscent of when he first joined the institute in 1993 and had “no clue what to do about all the data for the Human Genome Project.”
Now, next-generation sequencing has brought the field “back to square one,” he said. “We’re awaiting an unprecedented amount of data.”
Butcher’s counterparts at other genome centers agreed. Toby Bloom, director of sequencing informatics development at the Broad Institute, said that the arrival of next-generation sequencers has required a “huge scale-up” in the Broad’s IT infrastructure. Next-gen sequencing “impacts all aspects of informatics,” she said.
Kelly Carpenter, manager of technical services at the Wash U Genome Center, put the situation in perspective by noting that Wash U currently has 133 3730s from Applied Biosystems. One 454 instrument, he said, can generate three times as much data as the center’s current installation — the equivalent of 399 3730s — and if the Illumina instrument reaches the one-gigabase-per-run mark as planned, it will generate the equivalent of 2,660 3730s.
“These sequencers are going to totally screw you,” he said.
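Carpenter's comparison reduces to simple arithmetic. The sketch below checks his figures; the instrument count and equivalence claims are as quoted in his talk, not independently verified:

```python
# Throughput comparison from Carpenter's talk, expressed in
# ABI 3730 "equivalents". All figures are as quoted in the talk.
installed_3730s = 133

# One 454 instrument produces ~3x the data of the entire installation.
gs454_equivalents = 3 * installed_3730s
print(gs454_equivalents)                        # 399 3730-equivalents

# Illumina at its planned 1 Gb/run target, per Carpenter:
illumina_equivalents = 2660
# That works out to 20x the center's entire current capacity.
print(illumina_equivalents / installed_3730s)   # 20.0
```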
To Store or Not to Store?
One particular concern for genomics IT managers is storage. The Illumina (formerly Solexa) Genome Analyzer currently spits out half a terabyte of raw image data per run, a volume that could quickly reach petabyte and even exabyte scale once genome centers begin running these systems around the clock.
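A rough projection shows how quickly that volume accumulates. Everything below except the half-terabyte-per-run figure is an illustrative assumption, not a number from the article:

```python
# Rough raw-image storage projection for a large sequencing center.
# Only TB_PER_RUN comes from the article; fleet size and run cadence
# are assumptions chosen for illustration.
TB_PER_RUN = 0.5      # raw image data per Illumina run (from article)
instruments = 10      # assumed fleet size
runs_per_week = 2     # assumed, given multi-day run cycles

tb_per_year = TB_PER_RUN * instruments * runs_per_week * 52
print(f"{tb_per_year:.0f} TB/year")   # 520 TB/year -> petabytes within ~2 years
```

Under these assumptions a single center crosses the petabyte mark in about two years of around-the-clock operation, which is why retention policy dominates the discussion.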
The current practice in the genomics community is to store all the raw data from all sequencing runs. This is mainly done for quality-control purposes — if the processed data is questionable, researchers can always go back and reanalyze the raw electropherograms or trace files.
So far, genome centers are following the same procedure with data from next-generation instruments, but some are questioning whether it is practical, or even necessary.
The genomics community “needs to start a conversation about data retention practices” for next-gen sequencing data, said Eugene Clark, senior software architect at Harvard Medical School.
Clark, who is designing an IT infrastructure for six polony sequencers that are generating data for the Cancer Genome Atlas, said that “the ‘keep-everything-because-we-can’ mentality is not feasible for next-generation sequencing.”
Several speakers noted that as the cost of sequencing continues to fall, it is likely to reach a point where it is actually cheaper to resequence a sample than to store the raw image files.
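That trade-off can be framed as a break-even calculation. All of the dollar figures below are hypothetical; the article reports no actual prices:

```python
# Hypothetical break-even between archiving raw images and simply
# resequencing the sample later. Prices are invented for illustration.
storage_cost_per_tb_year = 1000.0   # assumed $/TB/year, disk plus admin
raw_tb_per_sample = 0.5             # raw image data per run (from article)
resequencing_cost = 3000.0          # assumed cost to rerun one sample

breakeven_years = resequencing_cost / (storage_cost_per_tb_year * raw_tb_per_sample)
print(f"Storing beats resequencing only within ~{breakeven_years:.0f} years")
```

As per-sample sequencing cost falls, the break-even horizon shrinks, which is the speakers' point: at some price, keeping the raw images stops making economic sense.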
Researchers are certainly loath to throw away those files now, partly out of custom and partly because the sequencing technologies are still so new that the raw data could come in handy for troubleshooting. Some, however, are beginning to question how long that practice should, and can, continue.
Chris Dwan, senior scientific consultant at the BioTeam, told BioInform that there are “a lot of questions” in the community about whether it’s necessary to retain raw image files. Dwan is working with the Navy Medical Research Center to create an IT system to support a new genomics lab that is currently running three 454 sequencers and expects to add additional next-gen systems in the future.
Dwan noted, however, that since these sequencing technologies are still unproven, it’s likely that labs will want to hold onto as much data as they can for the next several years until the platforms are perfected — good news for storage vendors, if not for IT managers at sequencing labs.
As for the best architectures for storing that data in the meantime, the jury is still out. Most speakers said that they are still evaluating different systems. “We’re looking at all of the options for growing our storage capacity,” said Sanger’s Butcher.
In the meantime, genomics labs are running into another challenge: These files are too massive to transfer over most networks. As a result, many groups are relying on what Dwan dubbed “SneakerNet” — researchers are physically moving data on disks down the hall or to another building for data analysis or backup.
Clark described just such a situation at Harvard, where the polony sequencing lab is not connected to the storage network, so researchers use removable drives to move data from workstation to workstation for analysis.
Butcher noted that it’s impractical to move terabytes of data around a network for analysis, no matter how robust that network may be. “The best way to move terabytes of data is still disk,” he said.
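Butcher's point is easy to check with transfer-time arithmetic. The link speeds below are common options of the era, chosen as assumptions, and the times assume ideal throughput with no protocol overhead:

```python
# Ideal-case time to move 1 TB over common network links.
# Real transfers would be slower due to protocol overhead.
TB_BYTES = 1e12  # 1 terabyte, decimal

for name, bits_per_sec in [("100 Mbit/s", 100e6),
                           ("1 Gbit/s", 1e9),
                           ("10 Gbit/s", 10e9)]:
    hours = TB_BYTES * 8 / bits_per_sec / 3600
    print(f"{name}: {hours:.1f} h per TB")
```

Even at a full gigabit per second, one terabyte ties up the link for more than two hours, so multi-terabyte daily output makes carrying disks down the hall the faster option.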
Butcher said his team at the Sanger Institute is exploring ways of processing the data “closer to the machine,” rather than sending the data all around the network. While noting that he’s “still in the process of finding the right tools for doing that,” he said that “data localization and minimal movement around the network are key.”
Dwan said that he is taking a similar approach with the system at the Navy Medical Research Center, with the goal of “bringing the compute to the data.”
The Broad Institute has already found one approach that appears to work for that problem. Bloom said that the Broad has found that Sun’s so-called “Thumper” data server — the Sun Fire X4500 — can process and store data from the Illumina instrument on a single box, which eliminates the need to transfer files across the network.
Each X4500 can process images for two Illumina Genome Analyzers and store two weeks’ to a month’s worth of data, she said, noting that while the approach appears to be a viable solution to one challenge, “we have solved only some of the problems” raised by next-gen sequencing.