An informatics group at the Friedrich Miescher Institute in Basel, Switzerland, has combined a file system designed to manage large tape libraries with a relatively new approach to disk storage, creating an energy-efficient and scalable tiered storage system for second-generation sequencing data.
FMI installed the system, which uses Sun Microsystems’ Storage Archive Manager File System, or SAM-FS, in combination with high-density disk-based storage from Copan Systems, in anticipation of two Illumina Genome Analyzers that are expected to come online in the next few weeks, Dean Flanders, head of informatics at the institute, told BioInform.
FMI’s system is currently 40 terabytes, which Flanders conceded is “pretty tiny compared to somebody like the Broad Institute,” but he noted that it can “easily scale in one rack to 640 terabytes without redesign, without redoing power, or cooling consumption in the room.”
FMI plans to add another 80 terabytes before the end of the year. “We can give infinite storage with this approach and we don’t have to worry about backup or disaster recovery. It’s just built into the system,” he said.
Flanders said that one of the primary goals in designing the system was ensuring that the data could be backed up easily. Many life science research groups are generating so much data, so rapidly, that they don’t have time to back up their data properly, he said. As an example, he cited a colleague from an undisclosed “major institute in Boston” who told him that “it’s just not possible to back up” all the data generated in the lab because it is too time-consuming.
As a result, he said, his group wanted to create a system that would serve as primary storage as well as a backup system. Working with an IT services firm called HMK, FMI decided to use SAM-FS, which was designed to automatically back up data to tape.
SAM-FS has been around since the mid-90s. “It’s not this sexy new technology that everybody’s looking for, but it’s a working technology,” Flanders said. “When you write a file in, within a defined period of time the file is immediately archived to tape, and then backed up. So you don’t have to worry about running a backup because the backup happens as soon as you write the file.”
However, the team balked at the idea of writing to tape because it is notoriously slow and difficult to work with. Instead, it opted for Copan’s disk-based system, which uses a technology called MAID, or massive array of idle disks. MAID packs a large number of disks into a small space but has very low energy requirements because it powers only a fraction of the disks at any one time.
“You basically spin the disks down, spin them up when you need them, and then they turn off again,” Flanders said. “The users have a delay of 20 seconds, so they don’t even notice the difference. As soon as the disks are spun, all the rest of their data is immediate.”
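The spin-up behavior Flanders describes can be illustrated with a toy model. The sketch below is not Copan's implementation: the class name, the cap on simultaneously spinning disks, and the least-recently-used spin-down rule are all assumptions made for illustration. Only the 20-second delay comes from Flanders' description.

```python
class MaidShelf:
    """Toy model of a MAID shelf: only a bounded number of disks spin at once.

    Hypothetical sketch; the spin-down policy (least recently used) is an
    assumption, not a description of any vendor's firmware.
    """

    def __init__(self, n_disks, max_spinning, spin_up_seconds=20.0):
        self.n_disks = n_disks
        self.max_spinning = max_spinning
        self.spin_up_seconds = spin_up_seconds  # delay quoted in the article
        self.spinning = []  # disk ids, least recently used first

    def read(self, disk_id):
        """Return the delay in seconds before data starts to flow."""
        if not 0 <= disk_id < self.n_disks:
            raise ValueError(f"no such disk: {disk_id}")
        if disk_id in self.spinning:
            # Disk is already powered: refresh its position, no delay.
            self.spinning.remove(disk_id)
            self.spinning.append(disk_id)
            return 0.0
        if len(self.spinning) >= self.max_spinning:
            # Power budget reached: spin down the least recently used disk.
            self.spinning.pop(0)
        self.spinning.append(disk_id)
        return self.spin_up_seconds
```

In this model, the first read from an idle disk pays the spin-up delay and later reads from the same disk return immediately, which matches the user experience Flanders describes.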
This delay is much shorter than it would be with tape, but slightly longer than it would be with always-on disk-based systems from companies like Isilon or BlueArc. However, the Copan system requires far less power and cooling than most other disk-based systems because it isn’t running all the time, Flanders said.
“To add more storage, we just plug in another array and we don’t have to redesign our backup system, we don’t have to redefine our hierarchical storage management. So it’s a very, very scalable approach,” Flanders said. “It’s very dense, very scalable, and doesn’t have the power and cooling issues that you would normally have with that much storage.”
With the combined system, Copan “emulates a tape drive, or virtual tape drive, to SAM-FS, and then SAM-FS to us looks like a normal network share to a Windows computer,” he said. “People just store their files there from a normal computer and to the user it looks like they have 10 or 20 terabytes available to them, but actually they have a very limited amount on the server, and most of it is on Copan.”
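A minimal sketch of the arrangement Flanders outlines, in which users see one large share while the server holds only a small working set, might look like the following. Every name here is hypothetical, the archive is modeled as a plain dictionary rather than a virtual tape device, and the archive copy is made synchronously on write, which simplifies the "defined period of time" Flanders mentions. It does not reflect SAM-FS internals.

```python
class TieredShare:
    """Toy hierarchical store: small server cache in front of a large archive.

    Hypothetical illustration only; eviction is oldest-staged-first, which
    is an assumption rather than a documented SAM-FS policy.
    """

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity  # bytes available on the server
        self.cache = {}    # name -> bytes, oldest staged entry first
        self.archive = {}  # name -> bytes, stands in for the archive tier

    def write(self, name, data):
        # The archive copy is made at write time, so no separate backup run.
        self.archive[name] = data
        self._stage(name, data)

    def read(self, name):
        """Return (data, tier), where tier says where the bytes came from."""
        if name in self.cache:
            return self.cache[name], "cache"
        data = self.archive[name]  # slower path: recall from the archive
        self._stage(name, data)
        return data, "archive"

    def listing(self):
        # Users see every file, whether or not it is currently cached.
        return sorted(self.archive)

    def _stage(self, name, data):
        self.cache.pop(name, None)
        self.cache[name] = data
        # Release the oldest cached copies until the cache fits again.
        while (sum(len(v) for v in self.cache.values()) > self.cache_capacity
               and len(self.cache) > 1):
            oldest = next(iter(self.cache))
            del self.cache[oldest]
```

The point of the design is that `listing()` reflects the archive, not the cache, so the namespace users browse can be far larger than the disk space on the server.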
Martin Cooper, director of systems engineering for the EMEA region at Copan, said that the company’s technology was designed to complement “high-performance” systems like Isilon and BlueArc.
“Copan is a purpose-built storage platform for long-term retention of data,” he said. “Once the highly transactional stage of the information’s lifespan has finished, it’s often inappropriate to retain storing that information on what we call tier one data stores.”
Copan, he said, “is a highly scalable platform to economically and environmentally store that information that’s gone from being transactional to non-transactional. So really it’s a filing cabinet for all the information you have that you don’t necessarily need to have on your desk, but you still need to get to.”
But despite Copan’s benefits, it hasn’t caught on in the life science informatics market. “People have been very risk-averse, so they don’t want to go into this area because it’s a newer technology,” Flanders said.
Reece Hart, scientific manager of research computing at Genentech, agreed. Hart said that while he is aware of Copan, SAM-FS, and other file systems such as Sun’s ZFS and considers them to be “intriguing,” the FMI installation is the “first time I’ve heard of somebody deploying this and actually talking about it.”
Hart also agreed with Flanders’ assessment of the industry’s backup and recovery issues. “We have so much data that we can’t possibly restore it quickly enough if needed for disaster recovery,” he said.
The challenge for industry, he said, “is that you can spend a lot of time and money deploying solutions that take a lot of electricity and that you end up not needing, so it’s a pretty expensive safety net to have around.”
Nevertheless, while he said he is “impressed” by what he’s heard of the FMI system, it’s unlikely that his group would adopt an approach that is still largely unproven in the market. “Genentech is a relatively large company, so we’re not usually the first to pick up early-to-market storage solutions, and we don’t usually integrate those ourselves,” he said.
Matthew Trunnell, group leader in the Broad Institute’s Application and Production Support Group, said that his group has also looked into systems like SAM-FS and Copan, but hasn’t evaluated the technology yet. One challenge that he cited in the life science field is that “as an industry, we haven’t really thought about data lifecycle management,” particularly for next-generation sequencing.
“For conventional capillary sequencing, I think such a technology would work terrifically well because we understand the data lifecycle very well. We know which parts we want to keep for the long term, we know which parts of the data get thrown away — we have that very well mapped out. We’re not there yet with next generation sequencing,” he said.
For example, Trunnell noted that an automated file system like SAM-FS might not be the best choice for an instrument like the Illumina Genome Analyzer, which for each run generates around half a million files in a “relatively deep directory structure, and you almost certainly don’t want to keep all of that forever.”
Using SAM-FS, “that data may be immediately written to some permanent archive, but we wouldn’t want to do that with all of the data that we’re generating for next-generation sequencing — only with this small subset, which we’re going to have to pull out manually anyway,” he said.
“That’s been really the challenge of looking at any of these hierarchical storage systems,” Trunnell said. “In order to make use of them well we need to understand what our data lifecycle needs are, and it’s not simply the case that we need recent data immediately online, and data that’s a little bit older nearline, and the data that’s a lot older offline, which is the standard model for hierarchical storage.”
In actuality, he said, “we’re generating an enormous amount of data quickly, some of which is long-term archive, some of which can be thrown away immediately, some of which we may want to come back and analyze again over the next period, where that period is determined by the length of a research project — maybe six months to two years, for instance, but it’s very hard to assign an arbitrary age limit to these things.”
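The contrast Trunnell draws can be made concrete with two small placement functions. This is a sketch under stated assumptions: the function names, the 30-day and 365-day thresholds, and the three data categories are invented for illustration and are not policies used at the Broad Institute or defined by any storage product.

```python
from datetime import date


def tier_by_age(created, today, nearline_after_days=30, offline_after_days=365):
    """Standard hierarchical model: placement depends only on file age.

    The thresholds are hypothetical defaults.
    """
    age_days = (today - created).days
    if age_days < nearline_after_days:
        return "online"
    if age_days < offline_after_days:
        return "nearline"
    return "offline"


def tier_by_lifecycle(kind, project_end, today):
    """Alternative: placement depends on what the data is for, not its age.

    kind is one of the hypothetical labels "scratch", "archive",
    or "reanalysis".
    """
    if kind == "scratch":
        return "delete"   # can be thrown away immediately
    if kind == "archive":
        return "offline"  # long-term keep, rarely read
    if kind == "reanalysis":
        # Stay reachable until the research project closes.
        return "nearline" if today <= project_end else "offline"
    raise ValueError(f"unknown data kind: {kind}")
```

Under the first function, a six-month-old file is always nearline. Under the second, its placement depends on whether its project is still running, which is closer to the distinction Trunnell says age limits cannot capture.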
Trunnell conceded, however, that the volume of data that the Broad Institute is generating — the institute is rapidly approaching the 2-petabyte mark — is not typical for the life science research field.
“Scale really matters when looking at these systems,” he said. “On a much smaller scale, if I could afford it, I’d say sure, ‘Let’s archive everything just to be safe.’ And then SAM-FS or another hierarchical storage management system would be terrific. … But we just can’t do that.”
FMI’s Flanders said that he’s confident that the SAM-FS/Copan system will meet his lab’s needs. The institute is currently using it to store confocal microscopy data, and expects it to work just as well with data from the Illumina sequencers.
“I’m very happy with our strategy because what’s happening is that people are putting in a BlueArc, or they’re putting in an Isilon, and they go with it for six months and say, ‘Oh no, we can’t back this up anymore; oh no, how do we restore this; oh no, we’re filling it.’ We’ve already thought of these issues with our first system, so it will be nice to get it right the first time,” he said.