
The Arithmetic of Proteomics


Earlier this year I attended a workshop in Paris convened by the most influential journals in the field of proteomics. Oddly enough, the topic was bioinformatics. I say “oddly” because the people assembled in the room were names more normally associated with breakthroughs in instrument design or experimental techniques.

They were all members of the editorial boards of Proteomics, the Journal of Proteome Research, and Molecular & Cellular Proteomics, as well as representatives from major proteomics equipment and software companies, such as Applied Biosystems, Thermo Electron, and Matrix Science. We set about our task in a beautiful, historic, 18th-century mansion, the Maison de la Chimie, on the Left Bank in central Paris, with Ralph Bradshaw serving as chair and taskmaster. The goal was simple: to establish criteria for reporting proteomics-related data for publication.

This meeting was meant to build on the recommendations of a previous working group, organized by Steve Carr to put together a set of guidelines for publication. The ensuing debate made it clear that more work (and a much wider consensus) was necessary to produce something both useful and enforceable. It was also clear that the most controversial parts of the initial guidelines had nothing to do with laboratory technique or instrument usage. Instead, the hot-button issues were related to informatics: how to group protein homologues; how to report sequence accession numbers; and how to deal consistently with the uncertainties involved in comparing large amounts of very noisy data to large and varied lists of protein sequences.

Proteins Lost & Found

The emphasis on these informatics-related issues comes largely out of the trajectory of proteomics over the last few years. More and more papers have been published with supplementary tables listing large numbers of peptides “identified.” The bragging rights associated with these data have become increasingly focused on how many peptides were identified, or how many proteins were found, in samples ranging from blood plasma to egg whites. A host of improvised methods has been applied to deciding which peptides and proteins to include on these lists, leading to some heated exchanges between doyens (and doyennes) of the field at workshops and conferences.

As Alex Nesvizhskii from the Institute for Systems Biology is fond of pointing out, the real trick to proteomics is not so much collecting the data as finding the fraction of it relevant to the question you are trying to answer. It was Alex’s position (shared by your humble author) that held sway in Paris. All of the proposals were motivated by the fundamental belief that improving the quality of reported identification and quantification results was much more important than attempting to record everything.

Proteomics is certainly not the first high-throughput field that has been forced to deal with the quality issue: cDNA microarrays, quantitative PCR, and shotgun DNA sequencing have all been struggling with quality for most of this century.

The counting of homologous proteins has become particularly contentious. In a proteomics experiment, it is common for only a few tryptic peptides associated with a protein sequence to be positively detected. If the protein belongs to a large family of paralogous genes, those peptides may actually be shared among many protein sequences. Similarly, if a proteomics search engine is run against a large collection of protein sequences from many organisms, multiple orthologous sequences may also contain the same small set of peptides.

Unfortunately, this situation has led to a disturbing number of publications that list human, mouse, and rat protein accession numbers for the same set of identified peptides, even though the experiments were done on human cell lines. This may have been a reasonable practice in the 20th century, when genomic information was very incomplete. However, now that most model organisms have sequenced and annotated genomes, this practice unnecessarily inflates the raw count of proteins “found” — and any bioinformatics professional should be properly skeptical of the value of this type of redundant reporting.

Paris Speaks

The consensus view of the Paris meeting was to apply Occam’s Razor whenever possible. If you have a sample from Homo sapiens, then report H. sapiens sequences. Only count a particular set of peptides once, assigning them to the best-fit protein sequence. Whenever possible, use statistical significance measures rather than arbitrary heuristics to justify the reporting of data. Consistently report protein sequences using stable accession numbers obtained from curated data sources.
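The “count a particular set of peptides once, assigning them to the best-fit protein sequence” rule amounts to a parsimony (minimal set cover) calculation over the observed peptides. As a rough illustration only — the function and accession names below are hypothetical, not taken from the Paris recommendations or any particular search engine — a greedy version can be sketched as:

```python
# Hypothetical sketch of the "count each peptide once" rule: a greedy
# set-cover assignment of shared peptides to a minimal protein list.
# Names (parsimonious_proteins, P_HUMAN, etc.) are illustrative only.

def parsimonious_proteins(protein_peptides):
    """Greedily choose proteins until every observed peptide is explained.

    protein_peptides: dict mapping protein accession -> set of peptide
    sequences matched to it. Returns a dict mapping each chosen accession
    to the peptides assigned to it, with each peptide counted exactly once.
    """
    unexplained = set().union(*protein_peptides.values())
    assignment = {}
    while unexplained:
        # Best-fit protein: the one explaining the most still-unexplained
        # peptides (ties broken by accession for deterministic output).
        best = max(protein_peptides,
                   key=lambda acc: (len(protein_peptides[acc] & unexplained), acc))
        covered = protein_peptides[best] & unexplained
        if not covered:
            break
        assignment[best] = covered
        unexplained -= covered
    return assignment

peptides = {
    "P_HUMAN": {"LVNELTEFAK", "YLYEIAR", "AEFAEVSK"},
    "P_MOUSE": {"LVNELTEFAK", "YLYEIAR"},      # subset of the human matches
    "Q_HUMAN": {"AEFAEVSK", "HLVDEPQNLIK"},
}
result = parsimonious_proteins(peptides)
# The mouse orthologue is never reported: its peptides are already
# explained by the human sequence, so the protein count is not inflated.
```

Under this scheme the redundant human/mouse/rat triple-counting described above disappears automatically, because a homologue whose peptides are all shared with a better-supported sequence contributes nothing new and is dropped from the list.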

There was also a recognition that stable accession numbers could be a problem for some groups: the genomic sequences of some important organisms appear to have taken up long-term residence on lab FTP sites, even though they have been complete for quite a while.

The object of the exercise was not to be popular, and there is no doubt that trying to apply a set of standards to a previously wide-open field will instigate some aggressive feedback from the community. Starting that dialogue is the purpose of going through the exercise, as Ralph Bradshaw points out in his cover letter to the current draft of the recommendations:

“The purpose of this endeavor was to formulate standards that would give practitioners and editors alike an appreciation of what a diverse group of stakeholders in this activity considered to be the necessary level of information to reasonably insure the integrity of assignments and thus preserve the accuracy of the scientific record.”

The complete set of recommendations can be found on participating journal websites with contact information and e-mail addresses for comments and suggestions. These recommendations will be incorporated into the instructions for authors and reviewers in proteomics journals for 2006. It is the intention of the Paris group to maintain standing committees to update the consensus recommendations on a yearly basis. But if you want to make yourself heard, don’t wait: do it now.

Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait’s group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.

