It is a beautiful spring day and I have found myself in Washington, DC, with nothing to do for a few hours. The cherry trees are blooming, so I've decided to spend some time sitting in Lafayette Park watching the tourists watch the White House, while smoking a good cigar. The surroundings remind me of a rather unfortunate statement made by someone a little further down Pennsylvania Avenue:
"They want to deliver vast amounts of information over the Internet. And again, the Internet is not something that you just dump something on. It's not a big truck. It's a series of tubes."
— United States Senator Ted Stevens (R-Alaska)
This characterization of the Internet as a "series of tubes" has been the subject of many jokes and sly references by TV talk show hosts. It has become possibly the most heavily flogged Internet-as-described-by-politicians gaffe since Al Gore created his own unfortunate truth by mentioning his perceived central role in its creation. It conjures up the idea of politicians with a view of technology informed more by the 18th century than by the 21st.
My own experience with politicians has been at odds with this view. I have found them to be surprisingly good at coming to a working knowledge of the most abstract scientific issues, so long as they are properly briefed. In this case, I agree with Princeton's Edward Felten in his defense of Senator Stevens: after hearing experts prattle away about the need for bigger "pipes" for the Internet, who can blame the senator for using a synonymous term in his own metaphor? After all, Webster's defines a tube as "a hollow cylinder or pipe" and a pipe as "a cylindrical tube." The diminutive connotations associated with "tube" are a better cognitive fit for something constructed by computer scientists in a climate-controlled switch closet than for a big "pipe," which in the senator's experience would be constructed by oil workers in the wilderness of Alaska.
Proteomics in Particular
Proteomics has developed its own pipe-related terminology: the proteomics pipeline. This term has been so successful that structural genomics has been attempting to create its own pipelines to close the perceived pipeline gap that has emerged between the two fields. Everyone these days seems to be discussing their pipelines, particularly the bottlenecks generated when analyzing large volumes of experimental data.
A proteomics informatics pipeline isn't quite like the Internet's tubes. It is based on the notion that data from laboratory measurements can be processed in a sequential fashion as a series of tasks. A list of these pipeline tasks might be:
1) Converting the raw output of a mass spectrometer into files containing fragmentation spectra
2) Feeding the new files into a search engine to generate files associating the spectra with peptide sequences
3) Parsing these files to generate new files containing the statistical confidence of each assignment
4) Parsing these files to create even more files that contain the proteins that best represent those peptides — and so on until a summary file can be made to hand back to a biologist.
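The serial flavor of such a pipeline can be caricatured in a few lines. The stage names and file suffixes below are hypothetical stand-ins, not real tools; the point is only that each stage consumes the file the previous stage produced, so every file must traverse every stage in order.

```python
# Illustrative sketch of a serial proteomics pipeline. Each stage is a
# function that takes the previous stage's output file name and returns
# the name of the file it produces. All names here are invented.

def convert_raw(path):
    """Stand-in for converting raw instrument output to spectra."""
    return path.replace(".raw", ".spectra")

def search(path):
    """Stand-in for the search engine assigning peptide sequences."""
    return path.replace(".spectra", ".matches")

def score(path):
    """Stand-in for computing statistical confidence of assignments."""
    return path.replace(".matches", ".scores")

def summarize(path):
    """Stand-in for reducing peptides to a protein summary report."""
    return path.replace(".scores", ".summary")

PIPELINE = [convert_raw, search, score, summarize]

def run_pipeline(path):
    # The defining property of the design: strictly sequential, so the
    # whole run moves at the speed of the slowest stage.
    for stage in PIPELINE:
        path = stage(path)
    return path

print(run_pipeline("sample_001.raw"))  # sample_001.summary
```

Chaining functions this way is exactly what makes the shoemaker's-elf automation so easy to build, and, as discussed below, so awkward to modify.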
Pipeline processing is often performed using a style of automaton my friend Wade Hines calls a "shoemaker's elf" after a Brothers Grimm character. In the story, a shoemaker leaves the raw materials necessary to make a pair of shoes out in the evening and by dawn elves have turned the leather and thread into the finest shoes imaginable. By analogy, all that an experimentalist needs to do is to leave a set of raw data files in a directory in the evening and when he returns in the morning they have been converted into Nature papers by the clever elves that haunt his pipeline. Many such pipelines have been constructed, each with its own similar-sounding acronym — an unfortunate consequence of the peculiar profusion of p's that populate, punctuate, and plague the practice of proteomics.
So why all of the fuss about bottlenecks? Simply put, the elves (a.k.a. daemons) can only work so fast. The serial design of the pipeline means that files tend to build up at the input to the slowest algorithm: an evil, file-eating troll takes over where a busy elf used to preside. The accumulating data generates difficult-to-diagnose cascading failures that cause the whole system to behave unpredictably.
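The troll's arithmetic is easy to verify. Here is a toy simulation (the rates are invented for illustration) showing that whenever files arrive faster than the slowest stage can digest them, its input queue grows without bound:

```python
# Toy model of the bottleneck: files arrive at a fixed rate, and the
# slowest stage processes at a fixed (smaller) rate. The difference
# accumulates as a backlog at that stage's input.

def backlog(arrivals_per_hour, processed_per_hour, hours):
    queue = 0
    for _ in range(hours):
        queue += arrivals_per_hour                 # new files land
        queue -= min(queue, processed_per_hour)    # elf does what it can
    return queue

print(backlog(10, 8, 24))  # 48 files waiting after one day
print(backlog(8, 10, 24))  # 0 -- a fast enough elf keeps the tube clear
```

A net surplus of even two files an hour leaves a day's worth of work stranded in front of the troll, which is why a single slow algorithm can stall an entire laboratory's throughput.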
The serial design that makes it easy to implement these pipelines also makes it hard to change individual elements when newer, faster algorithms become available. The delicate file-parsing routines that weld the pipeline together must be retuned to accommodate any new component. Adding more computers to assist the slow algorithm can help, but in an eco-conscious world it is necessary to consider that a dual-processor server may generate as much as six tons of greenhouse gases in a year. And you can't simply shut down the pipeline to make repairs, as the lab data continues to flow, creating a messy backup of files that may take months to mop up.
What is the alternative? I would suggest getting yourself a copy of Finite State Automata for Dummies, a sketch pad, and a cigar (OK, you don't really need the cigar). Occupy a nice park bench and start drawing a state-machine-based system that uses a simple relational database to maintain its state information. Ensure that the system can run even if most of the components fail, and that incomplete jobs can be rescheduled after defined time-outs. And most importantly, make sure it is more like a series of tubes filled with happy elves than a pipe crammed full of undigested files and angry trolls.
Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait's group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.