Researchers at the Institute for Systems Biology have published a study summarizing the current state of efforts to fully characterize the human proteome.
Published this month in the Journal of Proteome Research, the study presented an analysis of the most recent build of the ISB's PeptideAtlas database, a collection of human LC-MS/MS proteomics data generated by researchers around the world.
Using a one percent false discovery rate, the authors found that the database, which to date includes data from 470 separate proteomics experiments, contains at least one peptide for roughly 12,500 of the approximately 20,000 entries in the Swiss-Prot database, leaving roughly 7,500 gene products that have not yet been confidently identified.
The study serves as a status report of sorts for the PeptideAtlas database and for the proteomics field in general, Robert Moritz, director of proteomics at ISB, told ProteoMonitor. In addition to assessing the proteins catalogued by the database, the researchers provided an analysis of the missing proteins along with techniques and sample types that might facilitate their identification.
The ISB team used Gene Ontology analyses to try to define the properties of the proteins being missed, Moritz said.
"You can see a bit of a trend," he said, noting that more hydrophobic and very basic proteins tended to be poorly covered, as well as very integral membrane proteins like G-protein coupled receptors.
The impetus for the effort was a series of studies performed last year by the labs of Utrecht University researcher Albert Heck, Swiss Federal Institute of Technology Zurich researcher Ruedi Aebersold, and Max Planck Institute researcher Matthias Mann. Each group performed mass spec analyses of the human proteome using state-of-the-art workflows and equipment, and each arrived at around 12,000 protein IDs.
"They all came up with pretty much the same answers – 50 percent of the proteome, around 11,000 to 12,000 proteins," Moritz said. "So we were interested to see what the overlap between each of those [datasets] was."
At the same time, he noted, the ISB team was working on a large plasma dataset contributed to PeptideAtlas by Roche. "Combining that on top of the [Heck, Aebersold, and Mann] datasets, we just really wanted to see where things were," he said.
Upon publication of his team's 2011 analysis, Aebersold suggested to ProteoMonitor that he, along with Heck and Mann, had essentially saturated the proteome accessible via the standard LC-MS/MS workflows they used (PM 11/18/2011).
“That’s not to say that there are not other types of proteins in these cells,” he said. “But we would claim that with this workflow — this particular type of cell lysis, this particular type of digestion, this particular type of LC-MS/MS — we would be unlikely to discover many additional proteins even if we kept sequencing.”
Identifying the remaining 7,500 or so gene products would likely require different techniques, he added.
"If instead of trypsin you used a different protease, you might uncover a different protein, or if you used harsh lysis conditions to extract membrane proteins, you could certainly uncover additional proteins," Aebersold said.
Moritz agreed, noting that the ISB's recent analysis is "a bit of a guide to say that if you haven't found the protein [you are looking for], have a look at its sequence and make the decision from there whether you have to use a different enzyme or a different approach to purify it."
He added that the study also used transcript analyses to determine what tissues might be the best places to look for certain proteins. In all, PeptideAtlas includes data on 52 human sample types, with blood plasma being the most common.
Moritz said that the analysis was intended in part as a tool for research teams working on the Chromosome-Centric Human Proteome Project, which aims to characterize one representative protein for each gene located on each of the human chromosomes (PM 9/14/2012). That project is being co-led by Northeastern University researcher William Hancock, who is the editor-in-chief of JPR and suggested the journal as a venue for the analysis, Moritz said.
"In the paper we put suggestions to all of the chromosome-centric [project] groups around the world to show them: Here is what we found, and if you are working on the proteins on this chromosome that haven't been found yet, then this is what you should look for," he said.
Moritz noted that in addition to providing a snapshot of the state of investigations into the human proteome, the project offered a look at the state of proteomics data sharing and the challenges involved.
False discovery rates were one such issue, he said, noting that to maintain a one percent FDR across the entire database, it was necessary to recalculate FDRs for the complete collection of data rather than relying on the FDRs of the individual experiments.
This is because while protein IDs in most mass spec datasets have a large degree of overlap, the false identifications don't overlap, Eric Deutsch, leader of the ISB's PeptideAtlas project, told ProteoMonitor.
"If [for instance] you have two similar datasets and you cut each off at a one percent [FDR] and just merge the two, you still have about 1,000 protein identifications," he said. Because the false identifications are distributed randomly throughout the proteome, however, they are less likely to overlap – leaving you with twice the number of false identifications in the combined data set and therefore twice as high an FDR.
"So you end up with a two percent FDR, and if you do that for three, four, five [experiments], by the time you've merged them all you can have a five percent FDR or higher," Deutsch said.
To get around this problem, the ISB team calculates its one percent FDR for the full combined dataset, a process that Deutsch said is not particularly complicated but which, given the complete set's roughly 40 million protein IDs, requires a significant amount of computing power.
Beyond the FDR issues, the effort also ran into challenges in terms of obtaining outside datasets. Data sharing within the proteomics community is "an absolute disaster at the moment," Moritz said, noting difficulties caused by cutbacks at the University of Michigan-based Tranche data repository (PM 9/30/2011).
Researchers continue to upload raw mass spec files from proteomics experiments to this repository, but reduced maintenance of the resource means that the data often can't be retrieved, Moritz said, adding that "we've had to go back to all of these original authors" for that data.
PeptideAtlas and the European Bioinformatics Institute's Proteomics Identifications Database, PRIDE, also accept raw mass spec data, but many researchers are still accustomed to using Tranche, he said.
Moritz noted that one encouraging sign on the data-sharing front is that the PeptideAtlas team was able to obtain datasets from some researchers – Scripps' John Yates, in particular – before they published their results.
"They knew [the data] was going to be used [by the ISB researchers] for a different purpose from what they originally had done it for, and that you wouldn't be able to pick the data out," he said. "So it was nice that these people were submitting their data to us that way."
Moving forward, Moritz said, the ISB is considering putting out such a status report on an annual basis. Particularly as the C-HPP progresses, the researchers would "like to get data from all the chromosome groups and feed that through the PeptideAtlas," he said.
This would allow for valuable cross-correlation of protein IDs across the different teams, he said. "You'll see these proteins being done in different samples, and even though the different chromosome groups have different [research] priorities, you'll probably still see many of the same proteins, so you'll get that iterative increase in quality" of IDs.