NEW YORK (GenomeWeb) – While still in its early stages, proteogenomics, the integration of genomic and proteomic analyses, has in recent years become an area of significant research activity, with projects like the ongoing and upcoming stages of the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) taking up such an approach.
One issue proteogenomics aims to tackle is identifying which variants observed at the genomic level are actually important biologically, the idea being that variants that are ultimately translated into proteins are most likely to have a significant impact.
And, indeed, early proteogenomic studies indicate that the majority of genomic variants never make it to the protein level. For instance, in a study published in December in Molecular & Cellular Proteomics, CPTAC researchers looked at polymorphisms, mutations, and splice variants in cancer cells at the genomic and proteomic level and found that only around 10 percent of the single nucleotide variants detected by both DNA and RNA sequencing were detected at the peptide level (though to an extent this could also reflect the greater coverage provided by genomic sequencing methods compared to proteomic analyses).
Even so, said David Fenyö, a researcher at New York University School of Medicine and author on the paper, many of the mutations that show up at the protein level aren't particularly interesting either.
"They aren't drivers of tumors, for example," he said. "They actually don't have that much of an effect" on a person's biology.
And so, in an effort to further enrich for variants likely to have a biological effect, Fenyö along with Ronald Beavis and John Cortens of the University of Manitoba and Sarah Keegan at NYU have developed a software tool for mapping protein post-translational modifications to genomic coordinates.
Presented in a paper published last week in the Journal of Proteome Research, the software tool is based on data in the Global Proteome Machine Database (GPMDB), which is run by Beavis. Named g2pDB, it allows researchers to identify genomic variants that affect protein post-translational modification sites, enabling them, for instance, to pinpoint a mutation that leads to the loss of a protein phosphorylation.
"The big problem is all the noise," Fenyö told GenomeWeb. "There are so many [genetic] changes that you measure over time. So what we thought with this paper is that an additional piece of information would be whether [a mutation] changes, for instance, the ability for phosphorylation."
Given the importance of post-translational modifications in processes like cell signaling or protein degradation, a mutation that renders a modification site unable to host that modification would be potentially interesting biologically.
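The basic idea of relating a modification site to genomic coordinates can be illustrated with a minimal sketch. This is not the g2pDB implementation; it assumes a plus-strand gene whose coding exons are given as inclusive, 1-based genomic intervals, and every name and coordinate in it is hypothetical. It reports only which variants fall within the codon of a modified residue; whether such a variant actually changes the amino acid (rather than being synonymous) would require the sequence itself.

```python
from typing import Dict, Iterable, List, Set, Tuple


def codon_genomic_positions(exons: List[Tuple[int, int]], residue: int) -> List[int]:
    """Return the three genomic positions encoding a 1-based protein residue.

    Assumption: `exons` lists coding segments as (start, end), inclusive and
    1-based, in transcription order on the plus strand.
    """
    # Flatten the coding sequence into an ordered list of genomic positions.
    cds = [pos for start, end in exons for pos in range(start, end + 1)]
    first = (residue - 1) * 3
    if residue < 1 or first + 3 > len(cds):
        raise ValueError("residue lies outside the coding sequence")
    return cds[first:first + 3]


def variants_hitting_sites(
    exons: List[Tuple[int, int]],
    modified_residues: Iterable[int],
    variant_positions: Set[int],
) -> Dict[int, int]:
    """Map each variant position to the modified residue whose codon it overlaps."""
    hits: Dict[int, int] = {}
    for residue in modified_residues:
        for pos in codon_genomic_positions(exons, residue):
            if pos in variant_positions:
                hits[pos] = residue
    return hits


# Hypothetical toy gene: two coding exons, 12 coding bases, 4 residues.
exons = [(100, 105), (200, 205)]
# Hypothetical phosphorylation site at residue 3; three hypothetical variants.
hits = variants_hitting_sites(exons, [3], {104, 150, 200})
print(hits)
```

In this toy case only the variant at position 200 lies in the codon for residue 3; position 104 belongs to the codon of an unmodified residue and position 150 is intronic.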
Even this isn't a perfect approach, Fenyö said, noting that "there are many modification sites that are not interesting." Nonetheless, he said, "it still adds some more information."
Fenyö said that, to his knowledge, this was the first tool aimed specifically at such analyses.
"Combining genomic and proteomic data is still pretty new and it is really only a few groups" doing it, he said. One advantage he and his colleagues had was access to the GPMDB, which, Fenyö said, has collected essentially all the high-quality public proteomics datasets generated over the last decade.
"Having access to that and having developed that system gave us the opportunity to do this relatively easily," he said, noting that the advantage this provided was primarily a matter of organization. "The data was searched to identify the peptides in a consistent way with the same search engine, and all the information from those results was stored in a consistent way."
"In principle you could download all the public data from different sources," he said. "But to redo all the searches and so on is a big undertaking. This [g2pDB tool] is based on … something like 350,000 LC-MS/MS experiments, so many hundreds of millions of spectra. So it is a big dataset, and we have it organized in a way where it was pretty straightforward to do this."
Fenyö cited CPTAC's work as an example of where the software could potentially add value.
"What the [CPTAC] groups have been doing is to collect both proteomics and phosphoproteomics data on a lot of tumors and then, from the [NCI's Cancer Genome Atlas project], there was both exome sequencing and RNA seq available for the same tumors," he said. "So then the idea is how do you analyze this dataset? How do you combine these different types of data?"
"That is still an open question, and people are still experimenting with different ways of doing this," he said. "So one thing that one could do is use this database and focus in on what variants make or abolish the possibility [of a post-translational modification] and then look at how these modifications figure in different pathways."
"These big [studies], they apply different kinds of analyses, and this could add another way of looking at the data," he said.
While in the JPR paper the researchers focused on using the g2pDB tool to identify mutations that abolished protein modification sites, it can also be used to identify mutations that create new modification sites.
"Both are interesting and we definitely want to do both," Fenyö said, noting that the reason they didn't look for site-creating mutations in the paper is that the bulk of the datasets in the GPMDB were searched using standard reference databases, which do not contain variant sequences.
"What we are planning to do is to redo the searches with a database that includes all the [known] variants," Fenyö said, which will give the researchers a better look at variants leading to new modification sites.
He said he and his colleagues plan to update the tool several times a year to incorporate new datasets as they are added to the GPMDB.