A quick Google image search for "next-generation sequencing and Moore's Law" brings up a slew of logarithmic charts depicting the rate at which sequencing technology is outpacing both compute and storage technologies. But the current challenge of managing data storage is a result of both the advent of next-generation sequencing technology — illustrated on these charts by a sharp downward slope that begins somewhere around 2008 — and a principle called Kryder's Law, which pertains to hard-disk technology development. Kryder's Law states that storage disk density doubles annually and that the cost of storing one byte of data is halved every 14 months as a result. However, sequencing technology is evolving at a much faster pace. At present, the per-base cost of sequencing is dropping by about half every five months, and this trend shows no sign of slowing down. Also, by some estimates, there are 1,700 high-throughput sequencers in operation across the globe, sequencing more than a million human genomes — upwards of 13 petabases — per year. Factor in all of these variables, and the logarithmic graphs start looking like signs of a data-management doomsday.
"This is a substantial volume, on par with the volume of data collected in high-energy physics, astronomy, and other classic big-data disciplines. In fact, [it] is so great that if you wrote 13 petabases to DVDs, which can each store 4.7 GB of data and are 1.2 millimeters thick, the stack of DVDs would reach more than two miles tall," says Cold Spring Harbor Laboratory's Michael Schatz. "Furthermore, the worldwide sequencing capacity is growing at about four times or five times per year, which means that if this trend continues, soon the stack of DVDs could reach hundreds of miles into space."
At Cold Spring Harbor, where they have some 20 sequencers, most data comes off of nine HiSeq 2000 machines, Schatz says. The output of each machine is more than 20 terabytes of raw sequence data per month, and this number will only rise as additional machines are brought online and current instruments are upgraded. Cold Spring Harbor has roughly 1.5 petabytes of active usable storage between two large storage systems consisting of Bluearc and Isilon hardware, as well as several distributed clusters with a shared network file system and the Hadoop Distributed File System. But even with this hardware, short-read sequences and whole-genome data present a significant challenge. "The storage technology improvements certainly enable larger and larger volumes of data to be saved away, but now that the sequencing capacities are so large, data storage is becoming a much more significant fraction of the total cost of sequencing," Schatz says. "Storage is fundamentally different from computing — which can be readily shared and reused — in that storing a piece of data forever consumes a finite resource."
Until the pace of hard disk technology development catches up to the rate of DNA sequence data production, many IT administrators are looking beyond storage hardware to deal with this challenge. The psychology of data storage — changing the way researchers think and feel about what sequence data to save and for how long, or what data to delete and when — is what some see as the most practical solution in the fight to keep up with all the data.
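The size of that gap is easy to underestimate, and a back-of-the-envelope sketch helps. The Python below uses only the figures quoted above: a five-month halving time for sequencing cost, 14 months for storage cost, and Schatz's DVD numbers. The function names are illustrative, and treating one base as one byte is a simplifying assumption made here, not a figure from the article's sources.

```python
MM_PER_MILE = 1_609_344  # 1 mile = 1,609.344 meters


def cost_fraction(months: float, halving_months: float) -> float:
    """Fraction of the starting cost that remains after `months`."""
    return 0.5 ** (months / halving_months)


def storage_to_sequencing_ratio(months: float,
                                seq_halving: float = 5.0,
                                storage_halving: float = 14.0) -> float:
    """Factor by which storage cost has grown relative to sequencing cost."""
    return (cost_fraction(months, storage_halving)
            / cost_fraction(months, seq_halving))


def dvd_stack_miles(total_bytes: float,
                    dvd_bytes: float = 4.7e9,
                    dvd_thickness_mm: float = 1.2) -> float:
    """Height in miles of the DVD stack needed to hold `total_bytes`."""
    return total_bytes / dvd_bytes * dvd_thickness_mm / MM_PER_MILE


if __name__ == "__main__":
    for months in (12, 24, 70):
        ratio = storage_to_sequencing_ratio(months)
        print(f"after {months} months: relative storage cost x{ratio:.1f}")
    # Assumption: 13 petabases stored at one byte per base.
    print(f"DVD stack: {dvd_stack_miles(13e15):.2f} miles")
```

Under these assumptions, after 70 months storing a base costs 512 times more relative to sequencing it than it does today, and 13 petabytes on DVDs stacks to just over two miles, consistent with Schatz's estimate.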
But assigning a value to sequence data or a set of short reads, in order to decide whether to store them indefinitely, keep them in compressed form, or simply delete them, is not a trivial challenge. "The number one thing is to manage and look at what people are doing, and limit the amount of space researchers have. The more space that I allocate, the less efficient people will be," says David Craig, an investigator at the Translational Genomics Research Institute. "I will always be at 85 percent capacity because no matter how much space the informaticians have, they will immediately find a way to be at 85 percent. So the number one thing is to not make immediately available 200 terabytes per individual researcher, but to keep that lower because it creates best practices and you can get people to not be abusive with disk storage. That way they will code better and be forced to work in an appropriate environment."
Changing the psychology of data storage can be difficult. Putting policies in place that promote economical data storage for sequencing sometimes meets resistance. "Once you give people data, they will want to keep it. But after a while we realized that's not sustainable. However, we have had resistance from users because the data we often want to delete, they consider that raw data," says Dawei Lin, manager of the bioinformatics core at the University of California, Davis, Genome Center. "A few years back, we made a policy that if you wanted to save an image, you had to pay more, and after we began to adopt that procedure, nobody wanted to save the images. But now people are used to saving the data without images, so what is the raw data? And what raw data needs to be saved and archived, or deleted?"
Craig says he has also observed a change in the attitudes of researchers regarding what they expect to store indefinitely and what is not essential to their research. A few years ago, it was accepted practice that a researcher would want to keep everything that came off of a run, including the image files, which could easily take up 2 terabytes. But when Illumina updated its operating system software so that it automatically deleted the image files, no one complained. "We think we need to keep everything, but in four years we cannot possibly save things the way we are and keep up — it just doesn't make sense to take a 100 gigabyte genome, make backups, distribute that at four sites, and so on," Craig says. "At some level you let it go and you understand that the BAM format will have to shrink considerably or will have to be something we don't store anymore. Part of that is getting rid of the psychology of 'there will be another, better variant caller' or 'I will want to realign,' because the costs really aren't worth it."
Thankfully, putting good IT habits in place can be a simple matter of automation, and information lifecycle management can play a role in keeping costs down. David Dooling, assistant director of informatics at the Genome Institute at Washington University in St. Louis, uses a strict deletion strategy for his institute's laboratory information management system. All of the institute's sequencers generating data are tracked in the LIMS. When the run is complete, the LIMS automatically kicks off the primary analysis for processing. Two weeks after this primary analysis is completed, the LIMS deletes the run. The institute has roughly 400 terabytes dedicated to dealing with data from the sequencers during the run and primary analysis as well as more than 10 petabytes of online storage. The downstream analysis conducted by investigators roughly doubles the amount of data generated, so the total data generation each month is between 80 and 100 terabytes.
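The two-week rule described above is simple enough to sketch in a few lines of Python. This is only an illustration of a time-based retention policy, not the Genome Institute's actual LIMS code, which the article does not describe; the `Run` record, its field names, and `runs_to_purge` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# "Two weeks after this primary analysis is completed, the LIMS deletes the run."
RETENTION = timedelta(weeks=2)


@dataclass
class Run:
    run_id: str
    # None while the primary analysis is still in progress (hypothetical field).
    primary_analysis_done: Optional[datetime]


def runs_to_purge(runs: List[Run], now: datetime) -> List[str]:
    """Return the IDs of runs whose retention window has elapsed."""
    return [
        run.run_id
        for run in runs
        if run.primary_analysis_done is not None
        and now - run.primary_analysis_done >= RETENTION
    ]


if __name__ == "__main__":
    now = datetime(2012, 1, 31)
    runs = [
        Run("run-a", datetime(2012, 1, 1)),   # finished 30 days ago
        Run("run-b", datetime(2012, 1, 25)),  # finished 6 days ago
        Run("run-c", None),                   # still being analyzed
    ]
    print(runs_to_purge(runs, now))
```

A real system would hand the returned IDs to a job that removes the run's files; the point of the sketch is that the decision itself depends only on a timestamp the LIMS already tracks.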
"We retain the sequence data that was the result of the run, but all of the intermediate files generated by the instrument during the run and during the primary analysis that generate all the sequence data that you feed into the analysis, that all gets deleted," Dooling says. "We use this knowledge about when things are happening and when they're no longer relevant to remove data that doesn't need to be stored. That saves a lot of space."
At the Center for Pediatric Genomic Medicine, Children’s Mercy Hospital and Clinics, director of informatics Neil Miller regularly moves the primary analysis data onto tape. However, he is also tasked with figuring out a way to save secondary analysis data on spinning disk. "We have committed to storing all the results of the secondary analysis, so that it's online all the time, including alignments, variant calls, and anything downstream of that. We're not really finding a way around the simple problem of having to store large amounts of data at all," Miller says. "After being online for a couple of years, we'll probably move into some tiered solution where we move older projects off onto less expensive disk. But from my experience, projects never die. As much as you archive things, someone will come along and want to see the results for some re-analysis of it. So we just try to keep everything live."
As with other technology trends, not everyone is affected in the same way at the same time. The informatics cores at some sequencing centers are content to soldier on and do their best to save as much data as possible. "So far we are storing everything, and that's been a feasible strategy up to now; we just acquire enough spinning disks to put all the data online and for us that is the appropriate thing to do because we want to be able to compute over all of it," says Ian Foster, senior scientist at Argonne National Laboratory. "Cost is always a concern. However, for now, we believe the most cost-effective way to store and process this data is to acquire more storage hardware. Over time there will be too much data to store it all on disk, and then perhaps people will start deleting some of it. But that hopefully won't happen."
A few years ago, cloud computing was a nascent — or rehashed and remarketed, depending on who you talk to — computing technology trend looked at by many bioinformaticians with extreme skepticism, but it has slowly become a feasible compute solution for analyzing genome data. In the last year, both Illumina and Life Technologies launched services that allow customers to analyze their data in the cloud. In the same month, Google teamed up with Web-based bioinformatics provider DNAnexus to provide a freely available version of the National Center for Biotechnology Information's Sequence Read Archive on the cloud.
Despite the fervor surrounding cloud computing, some investigators say that the only application that makes sense is for the cloud to be a central repository for shared sequence data. Even with institutional or commercial-level network bandwidth, data transfer still takes a prohibitively long time for a single human genome, let alone for an entire data set from, for example, the 1000 Genomes Project, which at last count is close to eight terabytes. Craig, who participates in the project, says that it can take weeks for various groups to upload to and download from the servers. Moving data on that scale is, then, rather impractical.
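A rough calculation shows why. The Python sketch below estimates how long an eight-terabyte data set takes to move over a network link. The link speeds and the `efficiency` factor (the share of nominal bandwidth actually sustained) are assumptions chosen for illustration; the article does not say what bandwidth the groups involved have.

```python
def transfer_days(size_tb: float, link_mbit_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `size_tb` terabytes over a `link_mbit_s` link.

    `efficiency` is the assumed fraction of nominal bandwidth sustained
    in practice (1.0 means an ideal, uncontended link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 86_400                        # seconds per day


if __name__ == "__main__":
    for mbit in (100, 1000):                       # assumed link speeds
        for eff in (1.0, 0.3):                     # assumed sustained share
            days = transfer_days(8, mbit, eff)
            print(f"{mbit:>5} Mbit/s at {eff:.0%}: {days:5.1f} days")
```

On an ideal 100 Mbit/s link, eight terabytes takes about 7.4 days; if only 30 percent of that bandwidth is sustained, the figure is closer to 25 days, which is in line with the weeks Craig describes.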
If deletion isn't an option and the cloud isn't viable, one alternative is to make the files smaller. However, some common techniques, such as lossy compression (in which file sizes are reduced by discarding some of the data, much like saving a picture at a lower resolution), are not ideal for research. "Compression techniques are critical to make the best use of every available byte, and filtering techniques are similarly important for discarding non-informative data," CSHL's Schatz says. "The tradeoff is that lossy compression and filtering are not always viable options, especially for precious samples that can never be replaced."
While there are some commercial compression solutions out there, most of the useful compression solutions come from the academic community. Genomic Squeeze or "GSQueeZ" is a technique that was developed at TGen to encode genomic sequence-quality data into a compact binary format that can result in substantial storage and processing savings compared to conventional plain text formats like FASTQ and CSFASTA/QUAL. GSQueeZ preserves the order of the data and the indexed structure to allow for selective access to various parts of the file. Binary files can also be searched for information about the number of reads, base composition, and platform.
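To see why a binary encoding can beat plain text, consider the toy Python example below. It is not GSQueeZ's actual encoding, which the article does not detail. It is a minimal sketch of the general idea, under two assumptions made here: reads contain only the bases A, C, G, and T, and quality scores fall between 0 and 63. Each base-quality pair then fits in a single byte, with two bits for the base and six for the quality.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumption: no ambiguous bases such as N


def pack(seq: str, quals: List[int]) -> bytes:
    """Pack each (base, quality) pair into one byte, preserving read order."""
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities must have equal length")
    out = bytearray()
    for base, qual in zip(seq, quals):
        if not 0 <= qual < 64:
            raise ValueError("quality must fit in six bits (0-63)")
        # Top two bits hold the base, low six bits hold the quality.
        out.append((BASES.index(base) << 6) | qual)
    return bytes(out)


def unpack(data: bytes) -> Tuple[str, List[int]]:
    """Recover the sequence and quality scores from packed bytes."""
    seq = "".join(BASES[byte >> 6] for byte in data)
    quals = [byte & 0x3F for byte in data]
    return seq, quals


if __name__ == "__main__":
    packed = pack("ACGT", [0, 10, 40, 63])
    print(len(packed), unpack(packed))
```

In a text format such as FASTQ, the same pair occupies two characters, one on the sequence line and one on the quality line, so even this naive scheme halves the space needed for those two lines before any further compression is applied.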
The CRAM format was developed by a group at the European Bioinformatics Institute led by Ewan Birney as an alternative to the BAM format, the compressed binary version of the Sequence Alignment/Map, or SAM, format. CRAM compresses DNA sequences efficiently by storing only the differences between the aligned reads and the reference sequence. The method is also tunable: the way quality scores and unaligned sequences are stored can be adjusted either to conserve information or to minimize storage costs.
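The principle behind reference-based compression can be shown with a toy Python example. This is not the CRAM format itself. It is a minimal sketch that handles substitutions only and ignores insertions, deletions, quality scores, and unaligned reads. The record layout and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (start position, read length, list of (offset, differing base))
Record = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Store an aligned read as its position plus mismatches to the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base)
             for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the original read from the reference and its record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    record = encode_read(reference, "GTAGGT", pos=2)
    print(record)
    print(decode_read(reference, record))
```

A read that matches the reference exactly reduces to a position and a length, so the better the alignment, the less there is to store.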
Focusing on the BAM format as the primary storage vehicle, and not retaining FASTQ, BCL, or image files, is one of the major aspects of controlling the torrent of data, Craig says. While there are ongoing efforts to develop highly efficient compression algorithms for BAM files, the file format itself is not always used in the most efficient way. Additional information, such as the date of the run or the name of the lab technician, is often stored in a BAM file, but it does not compress nearly as well as the reads themselves. "Our focus has remained on optimizing BAM as a primary means of storage and just doing a better job by not keeping stuff that's not necessary," Craig adds.