Progress and Promise

Even though many in the proteomics community agree that the computational side of protein research is a few laps behind genomics in terms of data crunching, storage and management are not the big concerns. While the average proteomics lab can produce upwards of 2.5 terabytes of data per month, spitting out results in various file formats, it is the ever-increasing capacity of disk drive technology that makes efficient storage and handling of such data a relative non-issue. Instead, the biggest challenge, says Gordon Anderson, chief engineer at Pacific Northwest National Laboratory, is developing algorithms to process the raw data coming off of mass spectrometers. Anderson also facilitates PNNL's management of the National Center for Research Resources-funded Proteomics Resource. "Nobody cares about the data. What they care about is the information content of that data, and that's where our challenge lies," says Anderson. "We have to develop high-confidence algorithms that can identify proteins, say something about their abundance, and have a confidence value that's trusted that we can pass on to a biologist."

The proteomics community has indeed made considerable progress with data processing and reporting standardization, largely thanks to efforts like HUPO's Proteome Standards Initiative, but protein research is, by its very nature, a bit of a moving target. "To some level the problem is a lot easier in the case of just sequence data because the genome's static; you sequence it and put a file on a site," Anderson says. "But the proteome is quite a different problem. It is very dynamic and a function of the gross state of that organism, the state of its life cycle, so it's a much more difficult problem from the standpoint of being able to define and store one static piece of information that describes it."

Protein scoring

Protein identification algorithms such as Sequest and Mascot produce scores that reflect the similarity between experimentally acquired spectra and theoretical spectra predicted from a sequence database. These scores allow scientists to differentiate between correct and incorrect peptide sequence assignments. The challenge lies in controlling the false positive rate, because many of the scoring methods these search algorithms employ leave considerable overlap between the scores of correct and incorrect peptides.
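One common way to quantify that overlap, though not named explicitly here, is the target-decoy approach: spectra are searched against both the real protein database and a reversed or shuffled decoy database, and the number of decoy hits above a score threshold serves as an estimate of how many accepted target hits are wrong. A minimal sketch of that calculation, using made-up scores:

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the false discovery rate among peptide-spectrum matches
    scoring at or above `threshold`, using decoy hits as stand-ins for
    incorrect identifications.  Each PSM is a (score, is_decoy) pair."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Made-up scores from a combined target/decoy search.
psms = [(4.1, False), (3.8, False), (3.7, True), (2.9, False),
        (2.5, True), (2.4, False), (1.9, True), (1.8, True)]
print(fdr_at_threshold(psms, 2.4))  # 2 decoys / 4 targets = 0.5
```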

A team at the University of Washington, including Michael MacCoss, an assistant professor of genome sciences, and Bill Noble, an associate professor of genome sciences and computer science, recently released Percolator, an algorithm that improves the rate of confident peptide identifications from large collections of tandem mass spectra. Many existing database search algorithms work well, but there is still room for improvement.

Percolator is an automated machine-learning algorithm that uses a semi-supervised approach to tell the difference between correct and incorrect peptide-spectrum matches, improving the rate of peptide identifications from a collection of tandem mass spectra. Whereas tools such as PeptideProphet may use only four features to sort through peptide-spectrum matches, Percolator incorporates more than 20, including the length of the peptide, the accuracy of the precursor ion mass, and tryptic specificity, to discriminate between right and wrong answers. Percolator can also be applied to the output of any database search algorithm and is currently available for download from the Noble lab website. MacCoss and Noble have more tweaks on the way for release later this summer, including support for post-translational modifications.
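The semi-supervised idea can be summarized as a loop: decoy matches are treated as known negatives, the best-scoring target matches are provisionally treated as positives, a classifier is trained on many features at once, and every match is then rescored before the process repeats. The sketch below only illustrates that loop, not Percolator's actual implementation; the scikit-learn linear SVM and the selection heuristic are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rescore(features, initial_scores, is_decoy, n_iter=3, top_fraction=0.1):
    """Toy semi-supervised rescoring of peptide-spectrum matches (PSMs).

    features       : (n_psms, n_features) array of attributes such as
                     peptide length, precursor mass error, tryptic termini
    initial_scores : search-engine scores used to pick the first positives
    is_decoy       : boolean array marking decoy PSMs (known negatives)
    """
    scores = np.asarray(initial_scores, dtype=float)
    is_decoy = np.asarray(is_decoy)
    target_idx = np.flatnonzero(~is_decoy)
    decoy_idx = np.flatnonzero(is_decoy)
    for _ in range(n_iter):
        # Provisional positives: the best-scoring target PSMs.
        n_pos = max(1, int(top_fraction * len(target_idx)))
        pos = target_idx[np.argsort(scores[target_idx])[::-1][:n_pos]]
        # Train a linear classifier to separate them from the decoys.
        X = np.vstack([features[pos], features[decoy_idx]])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(decoy_idx))])
        clf = LinearSVC().fit(X, y)
        # Rescore every PSM with the learned combination of features.
        scores = clf.decision_function(features)
    return scores
```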

Hardware acceleration

There has been a steady stream of high-speed hardware platforms offered to the genomics community for accelerating search algorithms. The same technology is now gaining traction in the proteomics community as well, in the form of customizable field-programmable gate array (FPGA) chips. "One of the things we've been looking at is FPGAs," says Anderson. "A lot of our analysis is embarrassingly parallel in that we don't need high-speed interconnects between these computers. We just need a lot of CPUs that can operate autonomously, and some of the core components of an algorithm could benefit from reconfigurable computing."
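"Embarrassingly parallel" here means each spectrum, or batch of spectra, can be processed independently, with no communication between workers. A minimal sketch of that pattern on a single multicore machine, with a stand-in for the real search step, might look like this:

```python
from multiprocessing import Pool

def search_spectrum(spectrum):
    # Stand-in for a real database search of one spectrum; in practice this
    # would call a search engine and return the best peptide-spectrum match.
    return max(spectrum) if spectrum else None

def search_all(spectra, n_workers=8):
    # Each spectrum is scored independently, so the work can simply be
    # farmed out across CPUs (or cluster nodes) with no interconnect.
    with Pool(processes=n_workers) as pool:
        return pool.map(search_spectrum, spectra)

if __name__ == "__main__":
    spectra = [[114.1, 250.7, 389.2], [120.0, 240.3], [90.5, 310.8, 402.2]]
    print(search_all(spectra, n_workers=2))
```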

Recently, a research group composed of two proteomics scientists and a computer scientist at Mississippi State University ported the Aho-Corasick algorithm, a string-set matching method widely used in computational biology, to a Xilinx Virtex-4 FPGA board. In this case, the researchers used the algorithm to match peptide sequences against a genome for annotation purposes, a process known as proteogenomic mapping. While Aho-Corasick has been ported to application-specific integrated circuits, the development complexity and costs can be prohibitive, according to the researchers, making reconfigurable, energy-efficient FPGAs an attractive alternative. Compared to a software implementation on a Windows XP workstation with a 2.67 GHz Intel Core Duo processor and 2 GB of RAM, the FPGA-enabled run was on average 20 times faster.
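For readers unfamiliar with it, Aho-Corasick builds a keyword trie with failure links so that any number of peptide patterns can be matched in a single pass over a sequence. The following is a plain software sketch of the algorithm applied to peptide matching, not the FPGA implementation described by the Mississippi State group; the peptide and sequence strings are invented for the example.

```python
from collections import deque

def build_automaton(peptides):
    """Build an Aho-Corasick automaton: a trie of the peptide patterns plus
    failure links, so all patterns are found in one pass over the text."""
    goto, fail, out = [{}], [0], [[]]          # one entry per trie state
    for pep in peptides:
        state = 0
        for aa in pep:
            if aa not in goto[state]:
                goto[state][aa] = len(goto)
                goto.append({}); fail.append(0); out.append([])
            state = goto[state][aa]
        out[state].append(pep)
    # Breadth-first traversal to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for aa, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and aa not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(aa, 0)
            out[nxt].extend(out[fail[nxt]])
    return goto, fail, out

def find_peptides(sequence, peptides):
    """Return (position, peptide) for every peptide occurring in `sequence`."""
    goto, fail, out = build_automaton(peptides)
    state, hits = 0, []
    for i, aa in enumerate(sequence):
        while state and aa not in goto[state]:
            state = fail[state]
        state = goto[state].get(aa, 0)
        for pep in out[state]:
            hits.append((i - len(pep) + 1, pep))
    return hits

# Invented example: peptides located within a translated genomic sequence.
print(find_peptides("MKTAYIAKQRQISFVK", ["AKQR", "ISFVK", "TAY"]))
```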

But is acceleration as much of a priority in proteomics as it is in genomics? One member of the team, Susan Bridges, a professor of computer science and engineering at Mississippi State, says that while acceleration may not be a major concern at present, that will change with time. "I think [acceleration] is important and will become more important as many of the things that are now in a laboratory become part of standard medical testing," says Bridges. "It may not be important right now for day-to-day operations in the lab, but it's getting there — especially when these things move to commercialization, when it's something done on a routine, diagnostic basis."

Teaming up

Bridges and her colleagues feel that their work is also notable as an example of a successful collaboration between computer scientists and proteomics researchers, a key to helping algorithm development catch up with genomics. But these partnerships are not as simple as one might think. "It is hard to make [these collaborations] happen due to the huge learning curve just to learn the language of the biology and the technical aspects of a proteomics experiment and mass spectrometry," says Bridges. "It's also difficult for biologists to express the problems they're trying to solve in a way that's accessible to computer scientists. So there's a big mismatch in the way they tend to approach problems, as computer scientists have a very logical approach, whereas biologists are accustomed to all the messiness."

MacCoss' work with Noble is another example of a successful collaboration between mass spec experts and computer scientists. But MacCoss too says there is more to be done to make these types of collaborations a regular event. "The people who are really good at these computer science problems haven't yet begun to truly understand much about this mass spectrometry data," he says. "That's why Percolator was such a great project, because it was truly a joint effort between my lab, which was a mass spectrometry and proteomics lab, and a computer science machine-learning lab." MacCoss says that everything clicked once he and Noble began meeting often enough that each felt comfortable generating ideas about the other's area of expertise.

While these collaborations between proteomics researchers and computer scientists are good examples of the kind of synergy needed to give computational proteomics a boost, collaborations between two or more mass spec or proteomics labs are not as straightforward as they may seem, either. The sheer size of mass spec datasets often gets in the way. If two collaborators want to share a mass spec data set that is larger than a few hundred gigabytes, shipping a hard drive back and forth between labs or using an FTP site is not always the most secure or reliable option, says Philip Andrews, professor and director of the National Resource for Proteomics and Pathways at the University of Michigan. "If you look at the raw data sets that were available a couple of years ago, most of those were on FTP sites set up by a postdoc or a grad student in a lab, running on a desktop or a server," says Andrews. "But when they left, there was no one there to support it, so you've got a ghost URL that no longer has the data behind it, so there's that persistence issue for data on the Web."

Launched just over a year ago, Tranche is an open-source, peer-to-peer network that offers the proteomics community a unique venue for data sharing and project collaboration. Originally developed in Andrews' lab, Tranche is a file-sharing tool and data repository based on standard P2P concepts but with an added encryption layer of the kind used by online banking sites. This encryption allows users to keep track of who has deposited a particular data set to the network, thus ensuring a file's pedigree, says Andrews.
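As a toy illustration of the general idea of tracking a file's pedigree, the sketch below identifies a data set by its content hash and attaches a depositor signature that can later be verified. This is not Tranche's actual protocol; the HMAC-based signing and record format are assumptions made purely for the example.

```python
import hashlib, hmac, time

def deposit(data: bytes, depositor: str, secret_key: bytes) -> dict:
    """Toy provenance record: identify the file by its content hash and sign
    that hash, so later users can check who uploaded it and that the bytes
    have not changed."""
    content_hash = hashlib.sha256(data).hexdigest()
    signature = hmac.new(secret_key, content_hash.encode(), hashlib.sha256).hexdigest()
    return {"hash": content_hash, "depositor": depositor,
            "timestamp": time.time(), "signature": signature}

def verify(data: bytes, record: dict, secret_key: bytes) -> bool:
    # Recompute the hash and signature; both must match the stored record.
    content_hash = hashlib.sha256(data).hexdigest()
    expected = hmac.new(secret_key, content_hash.encode(), hashlib.sha256).hexdigest()
    return content_hash == record["hash"] and hmac.compare_digest(expected, record["signature"])
```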

In late April, Tranche was chosen to host and make publicly accessible all of the National Cancer Institute's mouse model proteomics data collected by the Mouse Proteomics Technologies Initiative. Andrews says he hopes that NCI's adoption of Tranche will make the proteomics community more aware of the tool. "Both scientists and developers have been enthusiastic when they find out about Tranche and typically start using it right away," says Andrews. "We only have so much time that we can dedicate to putting out information about Tranche so we have been relying on word of mouth, but hopefully that will change."
