NEW YORK (GenomeWeb) – A team led by researchers from the Human Proteome Organization's Chromosome-Centric Human Proteome Project has completed a study using data from the National Human Genome Research Institute's Encyclopedia of DNA Elements Consortium (ENCODE) to aid in identification of previously undetected proteins and proteoforms.
Published this week in the Journal of Proteome Research, the paper is another example of the growing interest in proteogenomics research, and is a step toward the sort of collaborative efforts between the C-HPP and ENCODE projects that C-HPP leaders suggested two years ago.
Using the ENCODE data, the researchers were able to identify novel splice forms detected at the transcript level that were, in fact, translated to proteins, including those of potential clinical interest. For instance, said Carol Nilsson, a researcher at MD Anderson Cancer Center and first author on the paper, among the new proteoforms identified was a single amino acid variant (SAV) that could be linked to metabolic phenotypes and invasiveness in glioblastoma multiforme.
"We could never have found this through [genome-wide association studies] because selection for this allele is at the post-transcriptional level," Nilsson told ProteoMonitor, noting that exploring SAVs in the ENCODE data "opened up a bigger area of research than we initially expected."
The project grew out of past contact between Nilsson and Lund University researcher György Marko-Varga, a co-author on the JPR paper and one of the leaders of the C-HPP effort.
"At MD Anderson [Nilsson and her colleagues] had this study that they did with 36 brain tumor patients where they did surgery and isolated the tumors, and so that was the starting point for the project," Marko-Varga told ProteoMonitor.
The ENCODE and C-HPP researchers, meanwhile, were also discussing potential ways to collaborate, with members of the ENCODE team attending the 2013 Human Proteome Organization meeting in Yokohama, Japan, and ultimately helping the C-HPP access the ENCODE data, a task that "was not trivial," Marko-Varga noted.
"Their main interest [where proteomics is concerned] is actually the functional aspect," he said. "So you have this variant, but what happens? What does it actually do?"
The C-HPP initiative expressed interest in collaborating with ENCODE two years ago in an editorial in Nature Biotechnology.
At the time, Northeastern University researcher William Hancock, a C-HPP chair and co-author on the JPR paper, told ProteoMonitor that a collaboration between the two projects could allow researchers to better understand the protein outputs generated by the genomic machinations characterized in the ENCODE work.
"For a large part, I think it's safe to say, biology is mediated by individual protein structure, and if you know [that in] enough detail then you can really understand what is the [output] of all this genomic manipulation," he said.
The JPR study analyzed glioma stem cells from the 36 glioblastoma multiforme patients from MD Anderson using a variety of proteomic techniques including shotgun mass spec, nucleic acid-programmable protein arrays, and antigen arrays like protein epitope signature tags and antibody arrays from the Human Protein Atlas Project. The researchers also developed workflows to craft custom databases using RNA-seq data, and used the ENCODE data to build a searchable proteomics database named proteoENCODEdb, out of which they identified 80 previously unpredicted proteins.
While such proteogenomic analyses have drawn much interest for their ability to identify novel proteins and reclassify previously non-coding regions of the genome as likely protein coding, some efforts have also been criticized due to the questionable accuracy of their identifications.
As Wellcome Trust Sanger Institute researcher Jennifer Harrow, one of the leaders of the GENCODE consortium, told ProteoMonitor last week, such efforts "don't help the [genome] annotation because people then expect that all these proteins have been verified ... but actually if you drill down in the data it's not correct at all, and then we have to spend time convincing people that these are not protein coding, and that's a problem."
Nilsson echoed these sentiments, noting that she and her fellow authors on the JPR study were "very careful about only reporting true positives."
"We want to be extra sure that it is correct, because once something incorrect is reported in the literature it makes its way into data and other knowledge bases and that causes trouble," she said.
In the case of the novel proteoforms like SAVs, the researchers validated these hits via a variety of methods including manual inspection of the spectra, checking for evidence at the transcriptional level, and detecting the target proteins via selected-reaction monitoring mass spec, Nilsson said.
She said she expects there to be corresponding proteins for most of the novel splice variants detected at the transcript level, noting that "just from a biological point of view it wouldn't make sense to have these transcripts if they weren't doing something. It's not energetically favorable for the cell to construct the nucleotides [to no purpose]."
As a researcher specializing in glioblastoma multiforme, Nilsson said she was particularly excited to explore such variants' potential roles in cancer, though, she noted, "it's going to take a long time to evaluate on a large scale what these new ENCODE peptides and proteins mean for biology."
"It's going to take decades to figure out what they are doing and in what types of tissue they are expressed," she added.
For instance, Nilsson said, thus far the researchers have looked at the variants identified in the JPR study only in cancer cells. "So now we need to go back to normal glioma cells and see what the difference [in expression] is."
"A lot of this can be figured out through differential proteomics," she said.