One of the most frequent questions I’m asked when consulting with proteomics groups is how to store all their data. Grants written to establish proteomics projects often refer to the “vast” — or sometimes “enormous” — storage requirements. Various superlatives are used to describe the proposed archive, as well as estimates of how many terabytes or petabytes per year (or month or day) will be required. For the non-IT geeks who might be reading this column by accident, a terabyte is a thousand gigabytes and a petabyte is a million gigabytes.
One of the happiest realizations to come out of the various genome projects is that there really aren’t that many genes required by a human (or mouse or cow or chicken). Just to compare, the complete human genome is about the same size as the video file needed to store an episode of Buffy the Vampire Slayer. The human proteome requires about the same storage space as a long pop song (“Stairway to Heaven”). A bacterial genome is only the size of a short pop song (“Lose Yourself”) and its proteome is about the size of a digital photograph (such as the one of your humble author at the bottom of the next page).
If the underlying information would all fit on an iPod without taking up enough space to notice, where does the idea that we need “vast” storage space come from? The main reason for outlandish storage estimates has to do with the widespread use of mass spectrometry in proteomics. These machines can be gushing geysers of pseudo-random numbers. If a tandem mass spectrometer is set up for maximum sensitivity, it is difficult to avoid generating a stream of spectra from the heterogeneous chemical noise generated by any active ion source. Often, only five to 10 percent of the spectra stored will be generated from peptide parent ions. The other 90 to 95 percent will be uninterpretable, having been caused by the instrument triggering off this noise.
If this chemical noise were not bad enough, many instruments also generate files whose storage volume is completely out of proportion to their information content. A good spectrum may contain as much as 500 bytes of information: about 50 masses and their corresponding intensities. Current data systems may require several megabytes to store this information because they create a high-fidelity record of the analog output of the instrument. Considering that such a system can easily be configured to run 24/7, chewing up megabytes per second, the storage requirements can really start to add up.
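To make the mismatch concrete, here is a minimal sketch that packs a 50-peak table into bytes and compares it with the raw record. The byte layout (two 32-bit floats per peak), the example peak values, and the 2-megabyte raw size standing in for "several megabytes" are all assumptions for illustration, not any vendor's actual format.

```python
import struct

# Assumed layout: one peak = 32-bit float mass + 32-bit float intensity.
PEAK_FORMAT = "<ff"
PEAK_BYTES = struct.calcsize(PEAK_FORMAT)  # 8 bytes per peak

def pack_peak_table(peaks):
    """Serialize (mass, intensity) pairs into a compact byte string."""
    return b"".join(struct.pack(PEAK_FORMAT, mass, inten) for mass, inten in peaks)

def unpack_peak_table(blob):
    """Recover the list of (mass, intensity) pairs from the byte string."""
    return list(struct.iter_unpack(PEAK_FORMAT, blob))

def bloat_factor(raw_bytes, peaks):
    """How many times larger the raw record is than the packed peak table."""
    return raw_bytes / len(pack_peak_table(peaks))

# Hypothetical spectrum: 50 peaks with made-up masses and intensities.
example_peaks = [(200.0 + 25.0 * n, 1000.0 + n) for n in range(50)]

# Assumed raw file size for one spectrum ("several megabytes" in the text).
ASSUMED_RAW_BYTES = 2_000_000

print(len(pack_peak_table(example_peaks)), "bytes for the peak table")
print(bloat_factor(ASSUMED_RAW_BYTES, example_peaks), "times larger as raw data")
```

Under these assumptions the packed table fits comfortably inside the 500-byte budget, and the raw record is thousands of times larger than the information it carries.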
The maintenance of hi-fi analog output is necessary if you are designing an instrument or developing signal-processing software. For molecular biologists, geneticists, and clinical investigators — the end users of the information — these records are irrelevant (and even a little scary). These users need crisp answers such as four- to 16-byte accession numbers and possibly a few bytes to record the locations of post-translational modifications.
Understanding the Habit
When data system manufacturers are confronted with the logical dissonance of requiring the storage of 100 megabytes to confirm a few bytes of accession number, they will frequently dodge behind the United States Code of Federal Regulations (21 CFR Part 11). The Food and Drug Administration imposes compliance with these regulations on pharmaceutical industry laboratories so that an auditor can figure out what was done, when, why, and by whom. Confronted with the awesome authority of 21CFR11, scientists tend to grudgingly accept it as justifying the perpetual retention of every bit of electronic flotsam and jetsam all too easily dumped to disk. They assume such a law is written in language so Byzantine that it would make a credit card contract seem simple by comparison.
Regardless of what one might think of members of Congress, they (or at least their staffs) can be pretty lucid at times. While some sections of these regulations are a bit vague, the document is much easier to read than most scientific papers. Prying a little deeper into 21CFR11, section 11.10 describes the requirements for retaining electronic records. Specifically, point (c) says records should be protected to “enable their accurate and ready retrieval throughout the records retention period.”
Put another way: retain the information as long as you legitimately need it, but no longer. People in Congress certainly understand how valuable a shredder is once information has reached its “best before” date. With this in mind, estimating practical storage requirements is easy once you have thought through a reasonable data retention policy.
In my opinion, raw analog data should be retained online until a digitized record — a table of peak masses and intensities — has been constructed. The analog data can then go into a nearline store. To calculate the size of the nearline storage, only two things are needed: 1) how long it takes for peak tables to pass your QA/QC tests; and 2) how many spectra are being generated per day. Once a peak table passes your tests, the associated analog data can be placed offline for a limited period (no more than a year) or, better yet, simply discarded.
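The nearline calculation can be sketched in a few lines. The two inputs named above are the QA/QC turnaround and the daily spectrum count; the sketch also needs a raw size per spectrum, which is taken here as an assumed 2 megabytes. The function name and all example numbers are hypothetical.

```python
def nearline_bytes(qa_days, spectra_per_day, raw_bytes_per_spectrum):
    """Raw data that must sit nearline while peak tables await QA/QC.

    qa_days: how long a peak table takes to pass QA/QC, in days
    spectra_per_day: how many spectra the lab generates per day
    raw_bytes_per_spectrum: assumed size of one raw analog record
    """
    return qa_days * spectra_per_day * raw_bytes_per_spectrum

def to_gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9

# Hypothetical lab: one week of QA/QC, 20,000 spectra per day, 2 MB each.
estimate = nearline_bytes(qa_days=7, spectra_per_day=20_000,
                          raw_bytes_per_spectrum=2_000_000)
print(to_gigabytes(estimate), "GB of nearline storage")
```

With these made-up inputs the nearline store comes to 280 gigabytes, and it does not grow from year to year because raw records leave as soon as their peak tables pass.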
At this point someone in the room (usually the same guy who brought up 21CFR11) always says that users may want to reanalyze the data in the future to keep up with changes in gene models. As long as you keep the much smaller peak tables, you can redo the bioinformatics analysis against an updated genome at any time. You can also filter out most of the noise by storing only the peak tables needed to support your experimental conclusions.
If you follow this advice, one kilobyte per annotation will store everything that you need, which allows me to mash up a proteomics version of Apple’s slogan for the iPod: “15,000 songs. 25,000 photos. 60,000,000 peptides.” All of that without banks and banks of spinning RAID arrays, and enough room left over for some tunes.
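The arithmetic behind the slogan is simple enough to check. This sketch assumes a decimal kilobyte (1,000 bytes) per annotation; the function name is made up for illustration.

```python
# "One kilobyte per annotation", taken here as 1,000 bytes (an assumption).
BYTES_PER_ANNOTATION = 1_000

def annotation_store_gb(n_annotations):
    """Decimal gigabytes needed to hold the given number of annotations."""
    return n_annotations * BYTES_PER_ANNOTATION / 1e9

print(annotation_store_gb(60_000_000), "GB for 60,000,000 peptides")
```

Sixty million annotated peptides come to 60 gigabytes under this assumption.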
Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait’s group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.