It’s common knowledge that proteomics leads to lots of data to stash and shuffle around. Here’s how some companies and institutes are preparing to handle it.
By Charles T. Clark
Without storage networking, the human genome would be lost. To efficiently store and shuffle the huge amounts of sequence data, companies and institutes have turned to two types of technologies: network-attached storage (NAS) and storage area networks (SAN).
Here’s the difference: NAS consists of a specialized server attached to a local area network, uses a streamlined operating system and file system, and is used to extract data from a database and serve files to clients. SAN, whose acronym is NAS spelled backward, is a dedicated storage network of storage arrays, fibre channel or gigabit Ethernet switches, host servers, and storage management software; it can move units of contiguous stored data quickly (100 MB/sec) between servers and storage arrays. Both NAS and SAN came about in the last three years to cope with the data overload and have gradually been replacing traditional direct-attached storage, where storage is attached locally to individual servers.
But now as proteomics researchers continue to generate even more data — identifying proteins, associating specific protein biomarkers with diseases, and mapping protein network interactions — the question is, will NAS and SAN be enough?
Proteomics company Oxford GlycoSciences of Oxford, UK, uses SAN technology because of its robustness and reliability. The company wanted to keep the computing system it had dedicated to proteomics research running 24 hours a day, and it needed an enterprise model to do so. The SAN is an integral part of this enterprise-class computing, and, more than any other component of OGS’s computing model, it has allowed the company to maximize the time when the computing system is available to users. “Obviously, in our environment, we don’t want to go down, ever,” says Andrew Lyall, OGS’s director of proteome discovery. “And SAN, with virtualization software, allows you to add new hardware or upgrade software without taking the system out of service.”
The storage network allows him to shrink file systems, move them around, and read databases, while the systems are running and users are connected. “And SAN is the only technology that you can do that with,” says Lyall.
OGS’s SAN supports 30 terabytes of available storage and is made up of components from Sun Microsystems, including Sun storage arrays and an E10000 Enterprise Server. The company uses Veritas Volume Manager and the Veritas file system for virtualization — a technique that allows the user to divide a disk logically into a storage pool and assign virtual volumes to individual users on the fly.
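The pooling idea behind virtualization can be sketched in a few lines of Python. This is a toy model only, not the Veritas software's interface: the class name `StoragePool`, its methods, and the disk and volume sizes are all hypothetical, chosen to show how capacity from several disks can be treated as one pool and handed out or resized per user while everything else stays in place.

```python
class StoragePool:
    """Toy model: capacity from several disks pooled and carved into volumes."""

    def __init__(self, disk_sizes_gb):
        # Physical disks are summed into one logical pool of capacity.
        self.capacity_gb = sum(disk_sizes_gb)
        self.volumes = {}  # user name -> allocated GB (hypothetical layout)

    @property
    def free_gb(self):
        """Capacity not yet assigned to any virtual volume."""
        return self.capacity_gb - sum(self.volumes.values())

    def assign(self, user, size_gb):
        """Create, grow, or shrink one user's volume without touching others."""
        current = self.volumes.get(user, 0)
        if size_gb - current > self.free_gb:
            raise ValueError("not enough free capacity in the pool")
        self.volumes[user] = size_gb

    def add_disk(self, size_gb):
        """Extend the pool with new hardware while existing volumes stay put."""
        self.capacity_gb += size_gb


if __name__ == "__main__":
    pool = StoragePool([500, 500])      # two hypothetical 500 GB disks
    pool.assign("mass_spec_lab", 600)   # one volume larger than either disk
    pool.assign("mass_spec_lab", 300)   # shrink it later
    pool.add_disk(1000)                 # add capacity on the fly
    print(pool.free_gb)
```

In this model a volume can be larger than any single disk, and adding a disk changes only the pool's free capacity, which is the property Lyall describes as adding hardware without taking the system out of service.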
Lyall says he heard a lot of horror stories about SANs — incompatibility between components, fragile fiber-optic cables, and exacting fibre channels that demand highly skilled personnel. So he took extra precautions when evaluating vendors two years ago. First, he hired two consultants. Next, on their advice, he sent a request for proposal to several vendors, including a detailed description of his company’s main concerns — reliability and uptime. “I gave them some indication of the kind of capacity we were after and got them to propose a solution,” he says.
Ultimately, Lyall chose a local value-added reseller to put the SAN together and take responsibility for it. And he says the system is now available 99.999 percent of the time.
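To put that availability figure in perspective, the short Python sketch below converts a percentage into the downtime it allows per year. It is illustrative arithmetic, not a measurement of OGS's system; the function name and the use of a 365-day year are assumptions.

```python
# Illustrative arithmetic only: converts an availability percentage
# into the downtime it permits over one year.

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% available: {minutes:.2f} minutes of downtime per year")
```

At 99.999 percent, the allowance works out to a little over five minutes of downtime in a year.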
At the Institute for Systems Biology in Seattle, cofounder Ruedi Aebersold generates immense amounts of data with a quantitative protein profiling technique using isotope-coded affinity tags, automated mass spectrometry, and sequence database searching.
Aebersold and his colleagues had been using a centralized file storage system. But with proteomics data production increasing at the rate of 50 GB per week and a fleet of new mass specs that promise even greater output, ISB has outgrown the file storage system as well as its relational database. So ISB purchased from IBM a 64-node Linux cluster, a storage area network, hierarchical storage management software to back up the SAN, and a new SQL Server relational database.
ISB looked at both NAS and SAN technology before deciding on the SAN architecture, says senior director of IT operations David Wilkins. The department’s relational database, Microsoft’s SQL Server, was certified to run over a storage area network but not on a NAS server connected to a local area network. “Nobody except Oracle certified their databases to run on a NAS server,” says Wilkins. Besides, applications run faster on SAN and can also scale more easily to accommodate more data, he says.
“We expect this SAN is going to be a solution for at least two to three years because, with it, we have the ability to scale to the data volumes we’ll be creating,” says Wilkins.
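A rough sense of the volumes involved comes from projecting ISB's reported 50 GB per week forward. The Python sketch below is a back-of-the-envelope estimate under stated assumptions, not ISB's own planning model: the function name, the 52-week year, decimal terabytes, and the optional `annual_growth` factor are all hypothetical.

```python
# Back-of-the-envelope projection of accumulated data.
# Only the 50 GB/week figure comes from the article; the rest is assumed.

WEEKS_PER_YEAR = 52
GB_PER_TB = 1000  # decimal terabytes, an assumption


def projected_terabytes(gb_per_week, years, annual_growth=0.0):
    """Total data accumulated over whole years, in terabytes.

    annual_growth is a hypothetical yearly increase in the weekly rate
    (0.0 means the rate stays flat, 1.0 means it doubles each year).
    """
    total_gb = 0.0
    rate = gb_per_week
    for _ in range(years):
        total_gb += rate * WEEKS_PER_YEAR
        rate *= 1.0 + annual_growth
    return total_gb / GB_PER_TB


if __name__ == "__main__":
    print(projected_terabytes(50, 3))                     # flat rate
    print(projected_terabytes(50, 3, annual_growth=1.0))  # rate doubles yearly
```

At a flat rate, three years of output comes to about 7.8 TB; if the new mass specs doubled the weekly rate each year, the same period would yield about 18.2 TB, which suggests why the ability to scale mattered in the choice.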
A little bit of this, a little bit of that
Incyte Genomics, of Palo Alto, Calif., uses both SAN and NAS technologies to support its proteomics projects. The company has a multi-terabyte SAN built from EMC equipment, and a very large local area network with attached NAS appliances made up of NAS filers, or specialized servers, says Stu Jackson, IT architect at Incyte. Each technology has a specific function. SAN gives customers high-speed access to data, while NAS can send out several files of spectral data to hundreds of clients from the company’s Linux server farm.
The difference between the two technologies is becoming blurred, says Jackson, because users can apply the same disk array for both NAS and SAN. And this capability has been very beneficial to his end-users. For instance, “You can extract data sets from a database using a SAN and then fan out access to that same file to a large number of clients using NAS technology,” he says.
There are still advantages to SAN, however. “When you look at the sort of storage management tools that are available for SANs, I would say that they are a little bit better than the sort of stuff that is available for network-attached storage because the SAN tools manage more functions more efficiently,” says Jackson. He has also found SAN to be about three times more scalable because of its fibre-channel technology. The fibre-channel standard permits thousands of nodes to be added in a mesh arrangement to a dedicated storage network.
Most labs need to consider cost when deciding to take the storage networking route, and SAN technology is twice as expensive as NAS. “Prospective users don’t want to go with SAN for everything; the expense of doing that would be very, very high, especially for a large Linux cluster,” Jackson says. That’s why he recommends the judicious use of NAS along with SAN technology. Using NAS, he says, “cuts the cost per node while allowing the user to get the job done in a reasonable fashion.”
Despite the storage demands generated by protein identification, quantification, data mining, and interaction mapping, NAS and SAN are still the storage technologies of choice. In fact, in many vendors’ newer systems (see “A Network of Vendors”), a NAS captures data and a SAN stores them, all within the same storage networking system. While biologists are busy cranking out data, the IT folks are hoping the fusion of the two technologies, combined with larger-capacity storage arrays, will make their lives easier. Designing elegant experiments may be the key to addressing some of biology’s biggest questions, but “the answers lie in the data analysis,” ISB’s Wilkins says. “You have no science without the technology.”
A Network of Vendors
NAS and SAN vendors are gearing up for feverish activity in proteomics. Network Appliance, in San Jose, Calif., a NAS pioneer, has for many years set the pace both in the computer industry and in life sciences with its family of filers. So it’s no surprise that the company’s NAS devices are well represented both in genomics and proteomics. Not resting on its laurels, however, Network Appliance is positioning itself to meet the future needs of proteomics by increasing the capacity of its filers and introducing gigabit Ethernet to power its NAS technology, according to life sciences senior manager Paul Mayes.
Similarly, EMC, a NAS and SAN vendor located in Hopkinton, Mass., foresees a huge demand for its equipment in the next 18 to 24 months. Roberta Katz, the company’s life sciences business group manager, says the data generated by proteomics “will be 30 to 1,000 times greater than was produced from the genomic mapping exercises.”
EMC is the current market leader in SAN technology and SAN management software, and is among the top suppliers of NAS equipment. So it is in a strong position to deliver a complete NAS-plus-SAN storage solution to proteomics researchers.
Hewlett-Packard also offers a complete line of NAS and SAN storage hardware, plus the software to manage both. For proteomics, HP is focusing on “fusing NAS and SAN technology,” says Roger Archibald, VP of infrastructure and NAS.
In the past, users tended to see NAS and SAN as competing technologies. But recently, particularly in the life sciences, that attitude has changed. “People have come to realize that most environments … will need both technologies,” says Archibald.
With its fusion, HP can offer customers a common pool of storage. “With one of our NAS appliances connected to the SAN,” says Archibald, “we can serve files for applications that need files and blocks for applications, like database applications, that need block storage.”
Hitachi Data Systems also offers a NAS-SAN combo for proteomics. Hitachi’s competitive edge, according to Darryl Kent, senior global marketing director, is the architecture of the storage array that forms the foundation.
Kent asserts that the company’s storage array family, the Lightning 9900 series, has the highest capacity of any storage array on the market: 71 terabytes in a single subsystem. So SANs based on the Lightning 9900 can accommodate the amount of data emanating from proteomics experiments.
The Lightning 9900 storage subsystem also has a fibre channel-based back end that allows it to switch data faster than any other storage unit, he says, which makes it well suited to the data volumes that proteomics experiments produce.
Kent Lindquist, a Hitachi territory manager, says that one company engaged in proteomics chose a 9900-based SAN specifically for this reason. “This company anticipated acquiring several hundred terabytes of data over the next three years and our 9960 had the greatest potential to store data in a single subsystem, which the company was going to plug into a fabric-based SAN.”
In addition to their in-house development, a number of these vendors have formed alliances with proteomics companies and research institutions in order to tailor their products. For example, IBM has teamed with the Institute for Systems Biology of Seattle, Hitachi with Myriad Proteomics, and EMC with Incyte.