NEW YORK (GenomeWeb) – A group from the Spanish National Cancer Research Centre has published a critique raising questions about two recent studies in which researchers generated the first relatively comprehensive maps of the human proteome.
The critique, published this week in the Journal of Proteome Research, suggests that the proteome maps, which were detailed this May in two separate papers in Nature, may have substantially overstated the number of quality protein identifications supported by the studies' data.
The maps were generated by two independent groups, one led by Johns Hopkins University researcher Akhilesh Pandey and the other by Technical University of Munich researcher Bernhard Kuster. Both declined to comment to ProteoMonitor on the JPR letter but said they plan to present their own analyses of the critique and of their map data.
The JHU-led study identified proteins coded by 17,294 genes, or roughly 84 percent of the 20,493 human genes annotated in UniProt as protein coding. This number includes proteins from 2,535 genes for which there was previously no protein evidence.
The TUM-led project detected proteins from 18,097 human genes, approximately 88 percent of the protein-coding genome. It also detected 19,376 of the 86,771 protein isoforms currently listed in UniProt.
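The coverage figures above can be reproduced with simple arithmetic. The sketch below is only a sanity check: it assumes the TUM percentage is measured against the same 20,493-gene UniProt total quoted for the JHU study, which the article does not state explicitly, and the function name is illustrative.

```python
# Counts quoted in the article.
UNIPROT_CODING_GENES = 20_493   # human genes annotated as protein coding
UNIPROT_ISOFORMS = 86_771       # protein isoforms listed in UniProt

JHU_GENES = 17_294
TUM_GENES = 18_097              # ASSUMPTION: same denominator as the JHU figure
TUM_ISOFORMS = 19_376


def coverage_percent(found: int, total: int) -> float:
    """Return the share of `total` covered by `found`, as a percentage."""
    return 100.0 * found / total


if __name__ == "__main__":
    print(f"JHU gene coverage:    {coverage_percent(JHU_GENES, UNIPROT_CODING_GENES):.1f}%")
    print(f"TUM gene coverage:    {coverage_percent(TUM_GENES, UNIPROT_CODING_GENES):.1f}%")
    print(f"TUM isoform coverage: {coverage_percent(TUM_ISOFORMS, UNIPROT_ISOFORMS):.1f}%")
```

Under that assumption, both gene-level figures round to the 84 and 88 percent cited, while the isoform count covers a little over a fifth of the UniProt isoform list.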
Compared to other large-scale proteome characterization efforts, the two studies represented significant leaps forward. For instance, as of last year, the Human Proteome Organization's Chromosome-Centric Human Proteome Project, which involves dozens of researchers around the world, claimed to have identified roughly 14,500 proteins. Meanwhile, human proteome characterizations undertaken by single labs in recent years have typically topped out in the range of 11,000 to 12,000 protein identifications.
It was this large jump in protein IDs that first drew the suspicions of the Spanish group, CNIO researcher Michael Tress told ProteoMonitor.
Tress and his colleagues are currently at work annotating the human genome and, as such, he said, they "have been looking at quite a lot of proteomics data from different sources."
The two Nature studies, he said, represented "kind of a quantum leap beyond the number of [proteins] that had been identified" in any of the other datasets they had looked at.
As a quick, rough check of the maps' data quality, the CNIO team looked at how many olfactory receptor proteins were represented in the two datasets. As transmembrane proteins, olfactory receptors have traditionally been difficult to detect via shotgun mass spec experiments. Therefore, Tress said, the thinking was that if the two maps claimed to have identified a large number of these proteins, it was a sign that their identification numbers were likely inflated.
Searching the two sets, the CNIO researchers found peptide evidence for 108 olfactory receptors in the Pandey dataset and for 200 in the Kuster dataset. Adding to their suspicions, when they examined the tissue distribution of these IDs, they found that the two studies most commonly found these olfactory receptors in lung – a seemingly curious finding given their role in taste and smell.
However, as Thermo Fisher Scientific Senior Applications Scientist Ben Orsburn noted this week on his blog, Proteomics News, such a finding is not necessarily strong evidence that the identification numbers in the two maps are inflated.
For instance, Orsburn said, given the shoddiness of many protein annotations, there is a reasonable chance that some of the proteins annotated as olfactory receptors are not, in fact, olfactory receptors. Additionally, he noted, it is not uncommon for proteins to have more than one function, or to have different functions in different tissues, so it is not implausible that the researchers found large numbers of them in tissues like lung that would seem to have little to do with taste and smell.
Nonetheless, several independent proteomics experts contacted by ProteoMonitor did suggest that the CNIO critique likely had merit. In particular, these experts pointed to the difficulty of keeping down false positive identifications in analyses that combine multiple datasets, and noted that this was likely a contributing factor to the large numbers of protein IDs claimed by the two studies.
For instance, Paul Rudnick, formerly a researcher at the National Institute of Standards and Technology and currently the owner of proteomics informatics firm Spectragen-Informatics, told ProteoMonitor that while the studies' "analytical methods and targeting of specific tissues are a good way to find proteins expressed in a small number of tissues or cell types," he believes that their numbers "need a second look."
This skepticism, he said, is based on his own work with large datasets in which he has observed that "basically, once you reach a certain level of coverage, the only new identifications that appear are likely false positives."
This accumulation of false positives, he said, can create as much as a 10-fold difference between the false discovery rate at the peptide-spectrum match level – which was used in the Nature studies – and the false discovery rate at the protein identification level.
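A toy calculation can illustrate why an error rate controlled at the peptide-spectrum match (PSM) level need not hold at the protein level. The model below is a deliberately simplified assumption, not the method used in either Nature study or in Rudnick's work: correct PSMs keep landing on a fixed, already saturated set of truly present proteins, while incorrect PSMs fall uniformly at random on database entries that are absent from the sample. The function name and all parameter values are hypothetical.

```python
def protein_level_fdr(n_psms: int, psm_fdr: float,
                      n_true_proteins: int, n_absent_proteins: int) -> float:
    """Expected protein-level FDR under a simple saturation model.

    ASSUMPTIONS (illustrative only):
      * all `n_true_proteins` have already been identified (coverage is saturated);
      * each false PSM hits one of `n_absent_proteins` wrong entries uniformly at random;
      * a single PSM is enough to report a protein.
    """
    false_psms = n_psms * psm_fdr
    # Expected number of distinct wrong proteins hit at least once.
    miss_probability = (1.0 - 1.0 / n_absent_proteins) ** false_psms
    false_proteins = n_absent_proteins * (1.0 - miss_probability)
    return false_proteins / (n_true_proteins + false_proteins)


if __name__ == "__main__":
    # Hypothetical numbers: 12,000 proteins truly present, 8,000 absent entries,
    # and a 1 percent PSM-level FDR held constant as more spectra are added.
    for n_psms in (10_000, 100_000, 1_000_000):
        fdr = protein_level_fdr(n_psms, 0.01, 12_000, 8_000)
        print(f"{n_psms:>9,} PSMs -> protein-level FDR {fdr:.1%}")
```

In this sketch the PSM-level rate stays fixed while the protein-level rate climbs as spectra accumulate, because the pool of true proteins stops growing and each additional false match can still add a new wrong protein.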
Rudnick added that the short length of the peptide sequences deemed acceptable in the Pandey study also raised questions for him about the protein IDs.
"I would have increased the minimum length for an accepted peptide sequence to eight [amino acids] from six as was used in the Pandey paper," he said. "The short peptides can account for many of the problems with the protein inference."
He went on to say that Ron Beavis of the Global Proteome Machine Database has re-processed both Nature datasets and obtained protein identification numbers roughly 20 percent to 25 percent lower than those in the original papers.
"To me, this discrepancy indicates a problem that needs follow-up," he said, adding that he felt that "more appropriate filters should have been enforced at the review stage for these publications."
Swiss Federal Institute of Technology Zurich researcher Ruedi Aebersold told ProteoMonitor that to an extent the questions surrounding the Nature maps are a matter of what standards a group chooses for making its protein IDs.
Such lists, he said, "depend on some assumptions, particularly at the protein inference stage." And, depending on the assumptions made, a group may end up with a more- or less-restrictive set of IDs.
"To illustrate, for the PeptideAtlas project we have used the principle of Occam's razor – we define the minimal set of proteins explainable by the available peptide evidence," he said. "This is likely too constrictive."
On the other hand, he added, "others are using the most expansive assumption – they state the most expansive set [of] proteins that could be explained by the available peptide data, and this is likely too expansive. Neither is right or wrong. They just reflect the assumptions made. The root problem of protein inference is the fact that many peptides – presumably correctly identified – can be associated with more than one protein."
This problem, Aebersold noted, may not have a "single right solution."
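The two assumptions Aebersold describes can be contrasted on a small made-up example. The sketch below is not PeptideAtlas code or either group's pipeline: the "minimal" set is approximated with a greedy set-cover heuristic, one common way to apply parsimony, and the peptide and protein names are hypothetical.

```python
def expansive_set(peptide_to_proteins: dict[str, set[str]]) -> set[str]:
    """Every protein that any observed peptide could have come from."""
    return set().union(*peptide_to_proteins.values())


def minimal_set(peptide_to_proteins: dict[str, set[str]]) -> set[str]:
    """Greedy approximation of the smallest protein set explaining all peptides."""
    # Invert the mapping: which peptides does each protein explain?
    protein_to_peptides: dict[str, set[str]] = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    chosen: set[str] = set()
    while unexplained and protein_to_peptides:
        # Pick the protein explaining the most still-unexplained peptides
        # (ties broken alphabetically so the result is deterministic).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides map to no protein at all
        chosen.add(best)
        unexplained -= covered
    return chosen


if __name__ == "__main__":
    # Hypothetical evidence: three of four peptides are shared between proteins.
    evidence = {
        "PEP1": {"A"},
        "PEP2": {"A", "B"},
        "PEP3": {"B", "C"},
        "PEP4": {"C", "D"},
    }
    print("expansive:", sorted(expansive_set(evidence)))
    print("minimal:  ", sorted(minimal_set(evidence)))
```

On this toy input the expansive rule reports all four proteins, while the parsimonious rule reports only A and C, even though B or D may well be present. Both lists are consistent with the same peptide evidence, which is the point Aebersold makes about differing assumptions.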
Indeed, Tress said, one purpose of his group's critique, and one of the opportunities raised by the Nature papers, is to establish standards in the field for dealing with such datasets.
And to an extent this process is already ongoing. For instance, Aebersold said, groups like the C-HPP have set forth principles on how such large datasets should be analyzed to achieve consistency across studies.
Albert Heck, chair of the Biomolecular Mass Spectrometry and Proteomics group at Utrecht University, told ProteoMonitor that the CNIO team's criticisms were most likely correct, but, he said, it wasn't necessarily something to "make a fuss of."
"The early genomes were full of mis-annotations and have seen thousands of corrections since [their initial release]," he said. "This is to be expected."
Moreover, he said, the Pandey and Kuster teams should be lauded for making all their data publicly available so that they can be scrutinized by other groups.
Aebersold agreed, noting that "the good thing about the situation is that the data are generally high quality and they are accessible… [which] allows specialists like the [CNIO] group to reevaluate the interpretation of the data using different analysis strategies."
Aebersold also questioned, more generally, the emphasis in proteomics on generating large numbers of protein identifications, noting that a significant portion of the field has, in his opinion, focused "way too strongly" on this goal.
While he acknowledged that it was good to have benchmarks to track the field's progress, such lists, he noted, are difficult to compare in a one-to-one manner given the different experimental variables and assumptions involved.