NEW YORK (GenomeWeb News) – Two independent research groups this week published mass spec-based draft maps of the human proteome.
Detailed in a pair of papers in Nature, the maps, one from a team led by Johns Hopkins University researcher Akhilesh Pandey and the other from a team led by Technical University of Munich researcher Bernhard Kuster, are among the most comprehensive profiles of the human proteome generated to date.
The JHU-led study identified proteins encoded by 17,294 genes, or roughly 84 percent of the 20,493 human genes annotated in UniProt as protein coding. This number includes proteins for 2,535 genes for which there was previously no protein evidence.
The TUM-led project detected proteins for 18,097 human genes, approximately 88 percent of the protein-coding genome. It also detected 19,376 of the 86,771 protein isoforms currently listed in UniProt.
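As a rough check on the coverage figures above, the percentages can be reproduced with simple division. The sketch below assumes the 20,493-gene UniProt total cited for the JHU study also serves as the denominator for the TUM figure, which the article does not state explicitly; the function name is illustrative only.

```python
def coverage_percent(detected: int, total: int) -> float:
    """Return the share of `total` entries that were detected, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total


# Assumption: both studies are measured against the same UniProt gene count.
UNIPROT_PROTEIN_CODING_GENES = 20_493
UNIPROT_ISOFORMS = 86_771

jhu_gene_coverage = coverage_percent(17_294, UNIPROT_PROTEIN_CODING_GENES)
tum_gene_coverage = coverage_percent(18_097, UNIPROT_PROTEIN_CODING_GENES)
tum_isoform_coverage = coverage_percent(19_376, UNIPROT_ISOFORMS)

print(f"JHU gene coverage:    {jhu_gene_coverage:.1f}%")
print(f"TUM gene coverage:    {tum_gene_coverage:.1f}%")
print(f"TUM isoform coverage: {tum_isoform_coverage:.1f}%")
```

Under that assumption the gene-level figures round to the 84 and 88 percent reported, while isoform-level coverage works out to a much smaller share of the UniProt list.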
With these efforts, researchers are closing in on a complete profile of the human proteome, Kuster suggested to ProteoMonitor.
"Our study is [88] percent of everything, and Pandey's is about the same size," he said, adding that by combining the two the researchers could perhaps push their coverage into the 90 percent range.
The two groups have agreed to do "some kind of protein exchange," Kuster said, noting that this could help fill gaps in their respective maps. The TUM researchers have also added what Kuster called an "adopt-a-protein" function to their database, allowing outside scientists to contribute data on proteins currently missing from the map.
He added, however, that the Nature studies suggest that a number of presumed protein-encoding genes don't, in fact, code for proteins.
"They are annotated as [protein-coding], but they don't make proteins anymore," he said. "So we could look as long as we want, and we would never find them."
Kuster said he based this notion in part on comparisons of his team's findings to UniProt gene annotations. UniProt annotations are graded according to how certain the evidence is that a given gene is protein coding. At the highest level of evidence, proteins have been observed for a given gene. At lower levels, a gene's protein-coding status is based not on direct protein evidence but on factors like gene homology or prediction.
According to Kuster, the TUM researchers were able to identify proteins for 97 percent of the genes graded at the highest level of evidence. For genes annotated as protein coding on the basis of the lower grades of evidence, however, their coverage dropped to around 50 percent.
"Because our coverage is so different for these different grades, we conclude that a lot of those proteins [based on lesser evidence] are probably no longer [made in humans]," he said.
He cited the example of olfactory G protein-coupled receptors, a class of proteins linked to taste and smell that proteomics researchers have struggled to detect.
"There are more than 800 [genes] for these in the genome, but more than half of those, no one has ever seen," Kuster said. "The biological interpretation is that modern humans don't actually rely so much on their senses of smell and taste as they might have a long time ago, and so therefore [those genes] have been turned off."
The flip side of this, he noted, is that the studies also identified proteins from genes not previously identified as protein-coding. The JHU-led project, in particular, focused on using proteomics data to reannotate existing genomics data, with the Bangalore, India-based Institute of Bioinformatics, of which Pandey is founder and director, playing a major role in this effort.
After comparing their mass spectra to conventional reference databases, Pandey and his colleagues searched the spectra left unmatched by this process against a series of unconventional databases that included a 6-frame reference genome, 3-frame pseudogenes, 3-frame RefSeq transcripts, 3-frame non-coding RNA, N-terminal sequences, and signal peptides. Through this analysis they identified 808 novel annotations of the human genome, including translation of 140 pseudogenes, 44 novel ORFs, 9 non-coding RNAs, 106 novel regions within annotated genes, 110 gene/protein/exon extension events, 198 novel protein N-termini, and 201 novel signal peptide cleavage sites.
The researchers confirmed these findings through manual analysis of these spectra and, in certain cases, validation via use of synthetic peptides.
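For readers unfamiliar with the "6-frame" databases mentioned above, the idea is to translate a nucleotide sequence in all three reading frames on both strands, so that peptides can be matched even where no gene has been annotated. The following is a minimal, generic sketch of that translation step using the standard genetic code. It is not the JHU team's pipeline, which the article does not describe in detail, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate a DNA sequence in three frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a real search, each translated frame would typically be split at stop codons and digested in silico into peptides before spectra are matched against it; those steps are omitted here.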
While Pandey and his team didn't use synthetic peptides in all cases, he said he believed this should become standard for validating unusual matches, particularly as the cost of generating these reagents continues to come down.
"With an unusual finding, I think the bar should be a little bit higher," he said.
The JHU map was generated entirely by Pandey and his co-authors, who analyzed 30 normal human samples, including 17 adult tissues, seven fetal tissues, and six purified primary haematopoietic cells, on either a Thermo Fisher Scientific LTQ-Orbitrap Velos or an Orbitrap Elite mass spectrometer.
The TUM map consists of 40 percent data generated by Kuster and his co-authors and 60 percent data gathered from outside researchers and repositories. To simplify processing of the data, the researchers collected only data that had been generated on Orbitrap instruments.
Both maps are accessible via web-based interfaces – Human Proteome Map, in the case of the JHU profile, and ProteomicsDB, in the case of the TUM profile – that Pandey and Kuster both said they hope will provide easy access for specialists and general biologists alike.
"We have [proteomic] repositories, but the issue is that for the average biologist who is not a programmer, how are we helping them?" Pandey said. "We want people over a cup of coffee to be able to browse and look for their favorite protein and where it is found and what peptide [can be used to measure it], and I think this resource allows them to do that in an easy way."