Facing a Tangle of Tape, Scientists Seek New Options for Storing Sequencing Data

This article has been updated to clarify comments made by Vadim Sapiro of the J. Craig Venter Institute.
 
BOSTON — This week’s Bio-IT World Conference highlighted that while the task of harvesting data from second-generation sequencing instruments has only just begun, both large and small labs are already facing some big choices over how to store the terabytes of data that these tools generate.
 
“The short version of all of this is there is going to be a lot of data,” said Harvard Medical School’s George Church in a talk during a next-generation sequencing data-management workshop that took place ahead of the conference.
 
During the session, some scientists noted that next-gen sequencing vendors have concentrated on engineering their instruments, but may not have spent the same degree of effort thinking about where the data will go.
 
“At some level that is not their job,” William van Etten, director of consulting services at the BioTeam, told BioInform. “They need to decide whether they are going to be a computing company and a mass-storage company or not,” added BioTeam’s Chris Dwan.
 
A number of scientists said in their talks that they see value in a tiered approach to data storage. As van Etten explained, there are three types of next-gen data, each smaller than the last as the data moves off the instrument and into downstream analysis. The primary image data is on the scale of terabytes for one run; the secondary data set, which includes the base calls and other processed output, is in the vicinity of 100 gigabytes; and the final results are another order of magnitude smaller, in the tens of gigabytes.
 
This results in enormous storage requirements even for small labs. BioTeam’s Chris Dagdigian estimated that a single next-gen sequencer requires a minimum of 40 terabytes of storage. At the other end of the scale, Matthew Trunnell, group leader in the Broad Institute’s Application and Production Support Group, said that the institute is on track to reach 2 petabytes of storage capacity by June. Without its fleet of next-gen instruments, the Broad would have only required around 250 terabytes by that point, he said.
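Taken together, these figures lend themselves to a rough back-of-the-envelope estimate. The Python sketch below uses the approximate tier sizes van Etten described; the run counts are illustrative assumptions, not figures cited at the workshop.

```python
# Back-of-the-envelope storage estimate for one next-gen sequencer, using the
# approximate tier sizes described above. Run counts are illustrative
# assumptions, not figures from the workshop.

TIER_SIZES_GB = {
    "primary_images": 1_000,   # ~1 TB of image data per run
    "secondary_calls": 100,    # ~100 GB of base calls and quality scores
    "final_results": 10,       # ~10 GB of final result files
}

def storage_needed_gb(runs_retained: dict) -> int:
    """Total capacity needed given how many runs each tier must hold."""
    return sum(TIER_SIZES_GB[tier] * n for tier, n in runs_retained.items())

# Hypothetical retention: 30 runs of images, 100 runs of calls, 500 result sets.
example = {"primary_images": 30, "secondary_calls": 100, "final_results": 500}
print(f"{storage_needed_gb(example) / 1_000:.0f} TB")  # -> 45 TB, in the same
# range as Dagdigian's 40-terabyte-per-instrument minimum
```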
 
“We suffer the pain of being an early pioneer,” he said, adding that storage capacity at the Broad has increased fivefold over the past two years.
 
The data from each step in the analysis pipeline needs to be stored for a different period of time. Primary data should be kept “as long as you think the algorithms that do the base calling and quality scoring are going to change,” van Etten told BioInform. Trunnell said in his talk that at the Broad, the first tier of next-gen data streams off the sequencer to Sun Fire X4500 “Thumper” servers, where it is kept for around four to six weeks. Then, while analysis is underway, the middle tier of data is kept on high-speed Isilon clusters for six to nine months, sometimes longer.
 
“The base calling and quality scores, you want to keep those for as long as they are relevant to the experiment you are doing,” said van Etten. “The ultimate data sets that are two orders of magnitude smaller than the primary data, those are going to be kept forever. That is definitely a good candidate for tape whereas the middle [tier] and first [tier], I am not so sure,” he said.
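One way to picture the tiering that Trunnell and van Etten describe is as a simple set of retention rules, one per data type. The sketch below is hypothetical; the class and field names are invented for illustration, and the retention windows are the rough figures quoted above.

```python
# Hypothetical sketch of per-tier retention rules based on the rough windows
# described above; the structure and names are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetentionRule:
    data_type: str
    medium: str
    retention_days: Optional[int]   # None means keep indefinitely

RULES = [
    RetentionRule("primary image data", "direct-attached disk", 42),      # ~4-6 weeks
    RetentionRule("base calls / quality scores", "clustered disk", 270),  # ~6-9 months
    RetentionRule("final result sets", "tape or archive disk", None),     # keep forever
]

def can_expire(rule: RetentionRule, age_days: int) -> bool:
    """True once data of this type has outlived its retention window."""
    return rule.retention_days is not None and age_days > rule.retention_days
```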
 
As Trunnell explained in his presentation, the Broad is still wrestling with long-term archive possibilities. While it may not be possible to save all data and it could even be cheaper and easier to re-generate the data than to keep it, the idea of throwing away data is “very painful” and also “rare in this space,” he said.
 
Tape That
 
In his presentation, Trunnell said that the final tier of storage goes to tape.
 
Cold Spring Harbor’s Dick McCombie, who has eight next-gen sequencers in his lab, told BioInform that his group also relies on tape for long-term storage. “We keep everything,” he said.
 
Is tape a good ultimate holding pen for data? As BioTeam’s Dwan explained to BioInform, “once you get down to disk versus tape, that’s a technology question [like] VHS versus beta.”
 
“My observation is tape stays good a lot longer than disk, [and] you don’t have to power it, but you have to have a tape reader that works, that tends to be robotics with moving parts and such, whereas disks stay good if you are not using them,” he said. 
 
“One challenge about tape is ready access,” said Kyle Delcampo of Yardley, Penn.-based IT services and lab automation firm Xyntek. “It is sort of like accessing a VHS tape versus a DVD: one has chapters, the other you have to scan through the whole thing.”
 
Some labs have large racks of disks that may stay on for the duration of an experiment, after which they are quietly powered down. If anybody asks for that data, they are powered up again, said Dwan.
 
“For us tape is dead,” said Genentech’s scientific manager of research computing Reece Hart. “Tape doesn’t exist anymore.” Access is tape’s main challenge in his view. “Technically you can still put things to tape, but anything over a few terabytes is a hopeless endeavor; you could never restore that data; you can always back up anything but getting it restored is something else.”
 
During the workshop, Hart convened a ‘birds-of-a-feather’ session whose attendees identified storage as an urgent issue.
 


As Hart explained, when it comes to storage as a safety net, such as for disaster recovery, users have to keep in mind the recovery point objective and the recovery time objective, or the time from which one wants to recover data and how long the actual recovery takes, respectively.
 
“It is almost just a physics problem,” he said. “You can’t restore more than a few terabytes of data in a couple of hours. So if your recovery time objective is to have data back online in a couple of hours, that sets the limit of whether tape is going to work for you.
 
“We and most people who can afford to do so buy a secondary frame and keep it synched,” he said. Most storage vendors provide very fast ways of synching between frames of data and storage, even sometimes over distant geographic locations, he explained.
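Hart’s point reduces to simple arithmetic: the amount of data you can bring back inside the recovery time objective is bounded by restore throughput multiplied by that window. A minimal sketch, assuming a sustained restore rate of about 120 megabytes per second (an assumption, not a figure from the conference):

```python
# Recovery-time arithmetic along the lines Hart describes: the volume you can
# restore is bounded by restore throughput times the recovery time objective
# (RTO). The 120 MB/sec restore rate is an assumption for illustration.

def max_restorable_tb(throughput_mb_per_s: float, rto_hours: float) -> float:
    """Maximum data volume (in TB) recoverable within the RTO."""
    return throughput_mb_per_s * 3_600 * rto_hours / 1_000_000

print(f"{max_restorable_tb(120, 2):.2f} TB")  # ~0.86 TB in a two-hour window,
# well short of a multi-terabyte dataset, which is why Hart calls large
# tape restores a hopeless endeavor
```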
 
As Thomas Arlington, systems engineer with Isilon Systems, explained to BioInform, the cost of disks today is “close to what tape is.” He added that users need to heed the cost of managing tape as well as the costs of transporting tape to off-site storage facilities.
 
“If you do a total cost of ownership, you will see that disk is pretty close to what tape is, but I still think tape has a need in the storage market because it enables you to take your data offsite into a third location without having to deal with bandwidth issues and it’s just another way to tier out your data,” he said.
 
BlueArc’s director of research markets James Reaney sees tape slightly differently. “You can’t rely on spinning disks to run more than three years; there are other technologies, [and] tape is one,” he said. “I would argue tape isn’t good beyond three years, but if it is sitting on a shelf, are you doing confidence testing on that tape?” One idea may be to back up to disks and then power them off. “At that state they will last decades,” he said.
 
Vying for Big and Small
 
Large storage vendors such as Network Appliance and EMC are following developments in the next-gen sequencing market, as are other vendors.
 
Isilon’s Arlington said that the company’s X-Series clustered storage architecture can scale from 4 terabytes to 1.6 petabytes in a single file system. It is both faster and greener than other architectures and can save customers roughly $1,000 per five-node cluster annually in cooling expenses, he said.
 
This week, Isilon announced that the X-node, which previously used dual-core Intel Xeon processors, has moved to quad-core processors.
 
Isilon’s clustered storage architecture includes a distributed file system that the company describes as “cache coherent.” Arlington said this means that every node in the cluster shares cache metadata about the cluster and knows “where the data lives and how to access [it].”
 
“If you look under the hood it is simply FreeBSD. On top of that is our proprietary operating system called OneFS,” he said. OneFS combines three layers of storage architecture — the file system, the volume manager, and RAID — in one software layer.
 
Customers begin with a three-node cluster, with disk options ranging from 160-gigabyte to 1-terabyte drives. From there they can grow one node at a time up to 96 nodes, said Arlington, with each expansion taking less than 60 seconds, a concept the company calls sustainable linear scalability. Each node essentially has knowledge of the entire file system, and users can access a unified namespace.
 
“This enables you to grow on the fly with no downtime, unlike traditional storage,” which relies on “single-head architecture,” he said. As these systems reach about 70 percent use, there is a “downward spiral in performance,” said Arlington.
 
Arlington said that Isilon’s life-science customers are typically seeking scalability, ease of use, and performance, and he noted that they are not only to be found in large sequencing centers. “Our minimum requirements are 4 terabytes of usable [storage],” he said. “Any scientist I am talking to doing [next-gen] sequencing is going to be starting out with that.”
 
In an initial discussion with scientists, Arlington said he shapes an application profile covering the amount of storage required, the types of applications that will touch the cluster, and the number of users working off the cluster, and assesses the lab’s needs and future growth. “In today’s environment they’re growing so fast,” he said.
 
One scientist in a small lab with a next-gen sequencing machine stopped by the Isilon booth at the conference and explained that his storage needs were 300 percent higher than in the previous year. “If you asked [a year ago] how much storage someone would use today, they would be [a] hundred times off,” Arlington said.
 
Performance Hunting
 
Each node in the Isilon system provides 100 megabytes per second of aggregate read throughput. Isilon chooses not to talk about performance in terms of IOPS, or input/output operations per second, which is a standard performance benchmark for servers.
 
“That community needs throughput, not IOPS,” Arlington said, adding that IOPS is a better measure for the performance of a database, as a measure of transactions per second. “Throughput is what these sequencers need, massive throughput, and that is what this cluster gives them; it’s massive writes and then massive reads.”
 
This focus on throughput, he said, allows Isilon to characterize the performance of each node so the company can “dial in” and give the scientist a solution with a specified level of performance and storage capacity.
 
If performance dips, Isilon offers scientists extension nodes to add processing power, memory, and bandwidth. “If [the nodes] are read-intensive, you can go 3:1 for every node and if it’s write-intensive, it’s a 1:1 ratio,” said Arlington.
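Because performance is quoted per node, sizing a cluster for a sequencing workload is largely a matter of dividing the throughput an analysis pipeline needs by the per-node figure. The sketch below is hypothetical and uses the 100 megabyte-per-second per-node read throughput and three-node starting point mentioned above; the required-throughput figure is an assumption.

```python
# Hypothetical cluster-sizing arithmetic using the per-node read throughput
# and three-node starting point cited above. The required throughput in the
# example is an assumption for illustration.
import math

NODE_READ_MB_PER_S = 100   # aggregate read throughput per node, per Arlington
MIN_NODES = 3              # clusters start at three nodes

def nodes_for_throughput(required_mb_per_s: float) -> int:
    """Smallest node count that meets the required aggregate read throughput."""
    return max(MIN_NODES, math.ceil(required_mb_per_s / NODE_READ_MB_PER_S))

# Suppose an analysis cluster needs to stream reads at roughly 1 GB/sec:
print(nodes_for_throughput(1_000))   # -> 10 nodes
```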
 
However, BlueArc’s Reaney thinks IOPS are “a true measure” of performance, as opposed to bandwidth in megabytes or gigabytes per second. IOPS multiplied by the size of the data block sent equals bandwidth, he said, but the block size can change. “Take block size completely out of the equation and IOPS is the true measure of a storage platform’s performance,” he said.
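Reaney’s relationship is easy to state in arithmetic terms: bandwidth equals IOPS multiplied by block size, so the same bandwidth figure can hide very different IOPS numbers depending on the block size used. A small illustration (the block sizes and rates are assumptions):

```python
# Bandwidth = IOPS x block size, the relationship Reaney describes. The block
# sizes and rates below are assumptions chosen for illustration.

def bandwidth_mb_per_s(iops: float, block_size_kb: float) -> float:
    """Bandwidth implied by a given IOPS figure at a given block size."""
    return iops * block_size_kb / 1_024

# Streaming 1 MB blocks: 200 IOPS already delivers ~200 MB/sec ...
print(bandwidth_mb_per_s(200, 1_024))     # 200.0
# ... while 4 KB blocks need ~51,200 IOPS to reach the same 200 MB/sec.
print(bandwidth_mb_per_s(51_200, 4))      # 200.0
```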
 
He advises both large and small labs to practice tiered, hierarchical storage.
 
“There’s a little bit of storage directly on the [sequencing] instrument, which is sort of a collection point,” he said. “It is streamed off almost as quickly as it is collected, put on BlueArc, [and] it just sits there and the cluster hammers on it. When they’re finished they move it to archive.”
 
BlueArc customers include Washington University, the Wellcome Trust Sanger Institute, Cold Spring Harbor Laboratory, and the European Bioinformatics Institute. BlueArc’s modular Titan system has aggregate throughput of up to 20 gigabytes/second and supports storage pools of up to 4 petabytes. It offers dynamic read caching across the cluster.
 
In McCombie’s lab at CSHL, BlueArc helped to cut the analysis time of a sequencing experiment from 10 hours to three hours with the help of a fiber channel system. The company sent the lab three systems for evaluation, with the fiber channel being twelve times as fast but only three times as expensive, Reaney said. “It’s [McCombie’s] primary storage for analysis,” he said.
 
Many scientists are finding themselves pushed into the IT world after they buy a next-gen sequencer, said Reaney. “Often they don’t have enough resources or budget to address this,” he said. “A large percentage of the potential customer base of these machines would like to see an integrated solution” to handle what happens once the data is streamed off the machines, he added.
 
An important task for next-gen labs is planning. “If you are a climate modeler, you’re used to planning a sizable fraction of your budget for this purpose: You don’t just go out and buy the instrument that generates the data. You buy the instrument, you buy the storage, you buy software, you buy all the pieces you need,” Reaney said.
 
Other vendors are also eyeing the next-gen sequencing space. Deepak Thakkar, bioscience solutions manager at SGI, said the company plans to launch a scalable "appliance" for storing next-generation sequencing data "soon." He said that the system will be priced lower than other storage options and will allow users to start small and expand in line with their needs.  
 
Sharing the Space with Peers
 
Some labs are taking a social approach as well as a technical one to help alleviate their storage challenges. Vadim Sapiro, vice president for IT at the J. Craig Venter Institute, explained in a presentation at the conference that the institute has set up a grid structure for storage that includes EMC DMX-3 and NetApp systems all the way down to tape. He recently added Isilon machines and deduplication storage from Data Domain to deal with an “insane data explosion” at the institute.
 
“We are in the process of implementing those things right now” in order to start minimizing the reliance on tape, he said. Once the Isilon and Data Domain systems are plugged into the IP network, the plan is to leverage the institute’s Internet2 connection, place a gateway device at a collaborator’s site, be able to replicate content, and do disaster recovery that way.
 
The institute has just shifted away from treating all data as “tier 1” storage on NetApp disks. “Researchers had no motivation to move data or archive data,” Sapiro said. He said he is creating a high-performance tier with fiber channel for database and grid access; a second tier equipped with mid-range storage, primarily using serial advanced technology attachment, or SATA, drives; and a third tier with SATA and deduplication.
 
Sapiro said he plans to offer budgetary enticements for scientists to move data from more expensive to less expensive tiers. Jokingly, he said those not willing to play will be put on a “wall of shame” using a device called Kazeon, which analyzes storage for all kinds of access.
 
Sapiro said he expects the reporting system will help his department as well as the scientists assess their storage needs more efficiently.
 
The approach has already provided some valuable feedback: the first time he and his colleagues ran an access report on the system, they noticed the three top users of storage space were no longer at the institute.
