The flood of data from next-generation sequencing instruments is straining the IT infrastructures of large genome centers, according to speakers at Cambridge Healthtech Institute’s Bio-IT World conference, held in Boston last week.
During the conference, IT directors from the Wellcome Trust Sanger Institute, the Broad Institute, and the Washington University Genome Sequencing Center said that their existing IT systems are insufficient to handle data from the new instruments.
As these platforms come online, IT managers are struggling to design new systems that can capture, analyze, manage, and store up to several terabytes of data per run.
Toby Bloom, director of sequencing informatics development at the Broad Institute, said that the arrival of next-generation sequencers has required a “huge scale-up” in the Broad’s IT infrastructure. Next-gen sequencing “impacts all aspects of informatics,” she said.
Kelly Carpenter, manager of technical services at the Wash U Genome Sequencing Center, put the situation in perspective by noting that Wash U currently runs 133 Applied Biosystems 3730s. A single 454 instrument, he said, can generate three times as much data as that entire fleet, and if the Illumina instrument routinely reaches one gigabase per run, it will generate 20 times as much data as the fleet of 3730s. “These sequencers are going to totally screw you,” he said.
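The multiples Carpenter cited can be put side by side in a back-of-envelope sketch; the fleet's combined output is normalized to 1 here, since absolute throughput figures were not given:

```python
# Relative data output, using only the ratios cited above.
# The fleet figure is a normalization, not a measured value.
FLEET_3730_OUTPUT = 1.0   # combined output of all 133 ABI 3730s (unit)

output_454 = 3 * FLEET_3730_OUTPUT        # one 454 ~ 3x the fleet
output_illumina = 20 * FLEET_3730_OUTPUT  # one Illumina at 1 Gb/run ~ 20x

# A single Illumina run would out-produce a single 454 by roughly 6.7x
print(output_illumina / output_454)
```

The point of the comparison is that one next-generation instrument displaces an entire room of capillary sequencers in data volume, which is why the downstream IT systems, not the sequencers themselves, become the bottleneck.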
To Store or Not to Store?
One concern for genomics IT managers is storage. The Illumina Genome Analyzer currently spits out at least half a terabyte of raw image data per run, a volume that could quickly reach the petabyte scale once genome centers begin running these systems around the clock.
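A rough projection shows how quickly that half-terabyte per run compounds; the run length and instrument count below are illustrative assumptions, not figures cited at the conference:

```python
# Rough projection of raw-image storage growth for round-the-clock
# Illumina operation. Run length and instrument count are assumed.
TB_PER_RUN = 0.5     # raw image data per run (cited above)
RUN_DAYS = 3         # assumed duration of one run, in days
INSTRUMENTS = 10     # hypothetical number of instruments at one center

runs_per_year = (365 / RUN_DAYS) * INSTRUMENTS
tb_per_year = runs_per_year * TB_PER_RUN
print(round(tb_per_year))  # ~608 TB of raw images per year
```

Under those assumptions a single center crosses the petabyte mark in under two years from raw images alone, before counting derived data or backups.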
But some are questioning whether it is practical, or even necessary, to store the raw images indefinitely.
The genomics community “needs to start a conversation about data retention practices” for next-gen sequencing data, said Eugene Clark, senior software architect at Harvard Medical School.
Clark, who is designing an IT infrastructure for six polony sequencers that are generating data for the Cancer Genome Atlas, said that “the ‘keep-everything-because-we-can’ mentality is not feasible for next-generation sequencing.”
Several speakers noted that as the cost of sequencing continues to fall, it is likely to reach a point where it is actually cheaper to resequence a sample than to store the raw image files.
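That trade-off can be framed as a simple break-even calculation; every price below is a hypothetical placeholder, not a figure quoted by the speakers:

```python
# Hypothetical break-even between archiving raw images and simply
# resequencing the sample later. All costs are illustrative assumptions.
STORAGE_COST_PER_TB_YEAR = 1000.0  # $/TB/year, assumed (disk + admin)
IMAGE_TB_PER_SAMPLE = 0.5          # raw image data per run (cited above)
RESEQUENCING_COST = 3000.0         # assumed cost to rerun the sample

years_to_break_even = RESEQUENCING_COST / (
    STORAGE_COST_PER_TB_YEAR * IMAGE_TB_PER_SAMPLE)
print(years_to_break_even)  # 6.0 years under these assumed prices
```

Because sequencing costs are falling while storage costs decline more slowly, the break-even horizon shortens over time, which is the crux of the speakers' argument against keeping raw images indefinitely.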
As for the best architectures for storing that data in the meantime, the jury is still out. Most speakers said that they are still evaluating different systems. “We’re looking at all of the options for growing our storage capacity,” said Phil Butcher, head of information technology at the Sanger Institute.
Meanwhile, genomics labs are running into another challenge: files from next-gen sequencers are too massive to transfer over most networks. As a result, many groups are relying on what one speaker dubbed “SneakerNet” — researchers physically moving data between workstations on disks.
Butcher noted that it’s impractical to move terabytes of data around a network for analysis, no matter how robust that network may be. “The best way to move terabytes of data is still disk,” he said.
He said his team at the Sanger Institute is exploring ways of processing the data “closer to the machine,” rather than sending the data all around the network.
The Broad Institute has found an approach that appears to address that problem. Bloom said that Sun Microsystems' Sun Fire X4500 data server can process and store data from the Illumina sequencer on a single system, which eliminates the need to transfer files across the network.
Each X4500 can process images for two Illumina Genome Analyzers and store two weeks’ to a month’s worth of data, she said.
A version of this article previously appeared in In Sequence sister publication BioInform.