In an editorial published this month in Nature Biotechnology, the heads of the Human Proteome Organization's Chromosome-Centric Human Proteome Project proposed a collaboration between that initiative and the National Human Genome Research Institute's Encyclopedia of DNA Elements Consortium.
Such a partnership could help unravel how the actions and interactions of the genomic elements identified via the ENCODE project are manifested at the protein level, noted C-HPP leaders Young-Ki Paik and William Hancock. Points of synergy between the two efforts might include identifying proteins suggested by genomics but not previously detected in proteomics studies as well as cataloging the protein products of various gene variants, they wrote.
The pair's proposal is another example of the trend in proteomics research – as embodied by efforts such as the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (8/26/2011) – toward better integrating genomics and proteomics data. Despite the growing interest in such integrations, however, significant technical and philosophical challenges remain.
Formally launched at this year's HUPO meeting, the C-HPP calls for participating countries to take one of the human chromosomes and characterize one representative protein for each gene located on the chromosome (PM 9/14/2012). The ENCODE project, meanwhile, aims to identify all the functional elements in the human genome.
Combining the two, said Hancock, who is chair of bioanalytical chemistry at Northeastern University and editor of the Journal of Proteome Research, could allow researchers to better understand the protein outputs generated by the genomic machinations characterized in the ENCODE work.
"For a large part, I think it's safe to say, biology is mediated by individual protein structure, and if you know [that in] enough detail then you can really understand what is the [output] of all this genomic manipulation," he told ProteoMonitor.
Stanford University researcher Michael Snyder, who is an ENCODE investigator as well as a member of the Human Proteome Project's senior scientific advisory board, told ProteoMonitor that such a collaboration could prove particularly useful in helping scientists identify the functional consequences of genetic variation.
One of the "big issues is mapping out if these things you see as transcripts really exist as proteins," he said. A second, "and perhaps more important," issue, he noted, "is that as we sequence our own genomes there are going to be a lot of [genetic] variants. And the question is: Which ones disrupt function? [Which] are really getting made?"
The typical genome, Snyder noted, has many inactivating mutations, leading to proteins that are rapidly degraded or never expressed at all. "People care about whether particular variants are expressed," he said. "And proteomics can really help resolve [that question]."
Boise State University researcher Morgan Giddings agreed. "If we find [in ENCODE] RNA that is expressed or find a gene that there is some evidence for, then the question is: Is that a protein, and if so, under what conditions? That's the fundamental question," she told ProteoMonitor.
In fact, as part of ENCODE, Giddings has been doing proteomics work similar to that proposed by Hancock and Paik in their editorial, collaborating with University of North Carolina researcher Xian Chen to perform proteomic characterizations of several cell lines being investigated under the project.
"There is a lot of alternative [gene] splicing going on, and it's pretty clear that some alt splices have a regulatory function, whereas other alt splices may translate into a protein," she said. "And you can't really know which is which until you have the genomic data that is directly aligned with the proteomic data."
Beyond providing insights into specific variants, "displaying proteomics in a genomics context" allows biologists to obtain a more holistic view of the processes involved, Giddings said.
That, she noted, "can be very powerful for a biological researcher who can then say, 'Here is what the transcription factors say, and here is what the RNA-seq says, and here is what the proteomics says.' You can see it all in one place and get a sense of what all the data is saying together."
Giddings' and Chen's work has been referenced in several publications that emerged this year from the ENCODE project, she said, but the first major publication detailing the work is still to come. She added that her team also plans to publish soon on a new proteogenomics search engine called Peppy that enables better integration of proteomic and genomic analyses.
The tool is designed for quick searching of large genomic and mass spec datasets, Giddings said. One of its key features, she added, is the ability to do progressive searches of multiple databases. Researchers could, for example, begin by searching their data against standard proteomic databases, then search against a genomic database, first without and then with SNP data.
"So the search goes progressively deeper and deeper, hitting multiple databases to get the best insight of where [a protein] comes from," she said.
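The progressive approach Giddings describes can be illustrated with a minimal sketch. It is not Peppy's code or interface, which the article does not detail. The stage names, the toy peptide sets, and the `progressive_search` function are all hypothetical, and real spectrum-to-peptide matching is replaced here by a simple set lookup.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical stages and toy peptide sets, searched in order.
# These are illustrative assumptions, not Peppy's databases or API.
STAGES: List[Tuple[str, Set[str]]] = [
    ("standard_proteome", {"PEPTIDEK", "SAMPLER"}),
    ("genome_without_snps", {"NOVELEXONK"}),
    ("genome_with_snps", {"VARIANTPEPR"}),
]


def progressive_search(
    peptides: Iterable[str],
    stages: List[Tuple[str, Set[str]]],
) -> Dict[str, Optional[str]]:
    """Assign each peptide to the first stage whose database contains it.

    Only peptides left unmatched by one stage are passed on to the next,
    so each later, larger search handles a smaller set of queries.
    """
    hits: Dict[str, Optional[str]] = {}
    remaining = list(peptides)
    for stage_name, database in stages:
        unmatched = []
        for peptide in remaining:
            if peptide in database:
                hits[peptide] = stage_name
            else:
                unmatched.append(peptide)
        remaining = unmatched
    # Anything still unmatched after the last stage has no assignment.
    for peptide in remaining:
        hits[peptide] = None
    return hits


if __name__ == "__main__":
    queries = ["PEPTIDEK", "NOVELEXONK", "VARIANTPEPR", "UNKNOWNK"]
    for peptide, stage in progressive_search(queries, STAGES).items():
        print(peptide, "->", stage)
```

In this toy version, a peptide found in the standard proteome never reaches the genomic stages, which is one plausible reading of a search that goes "progressively deeper."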
In addition to her ENCODE work, Giddings is also part of the CPTAC initiative, where, she said, "we have brought the tools and insights [developed in] the ENCODE project and have really pushed on doing proteogenomics."
Yet, despite the potential benefits of large-scale integration of proteomic and genomic data, such efforts come with significant challenges.
For instance, Giddings said, her team recently ran into a problem caused by the different false discovery rates considered acceptable by the proteomics and genomics communities.
"In the proteomics community you're typically expected to have a one percent or better [FDR]," she said. Large genomics projects, on the other hand, typically report results at FDRs of 5 percent to 10 percent.
Given this discrepancy, Giddings and her colleagues sought in a forthcoming paper to report their ENCODE proteomics results at a 5 percent FDR with the idea that researchers interested in only the more stringent data could filter the results to limit them to the desired FDR.
Their reviewers, though, "didn't buy it," she said. "[They] thought [5 percent] was just way too loose."
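The idea of reporting at a looser FDR and letting readers filter down to a stricter one can be shown with a short, hypothetical sketch. It assumes each reported identification carries a q-value (the lowest FDR at which it would be accepted). The article does not say how the team's results were annotated, and the `Identification` record, the `filter_by_fdr` function, and the example values are illustrative only.

```python
from typing import List, NamedTuple


class Identification(NamedTuple):
    """Hypothetical record for one reported peptide identification."""
    peptide: str
    q_value: float  # assumed: lowest FDR at which this hit is accepted


def filter_by_fdr(
    results: List[Identification], max_fdr: float
) -> List[Identification]:
    """Keep only identifications whose q-value is at or below max_fdr."""
    return [r for r in results if r.q_value <= max_fdr]


if __name__ == "__main__":
    # Made-up values, as if reported at a 5 percent FDR threshold.
    reported = [
        Identification("PEPTIDEK", 0.002),
        Identification("SAMPLER", 0.010),
        Identification("NOVELEXONK", 0.034),
        Identification("VARIANTPEPR", 0.049),
    ]
    strict = filter_by_fdr(reported, max_fdr=0.01)
    print([r.peptide for r in strict])
```

Under these assumptions, a reader who wants only the more stringent set would apply the 1 percent cutoff to the published 5 percent list rather than rerunning the search.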
Perhaps even more important than such cultural differences are the technical differences between the two fields. Indeed, Giddings said, she didn't apply for grant money to continue her proteogenomics work as part of the second phase of the ENCODE project due largely to the current limitations of proteomics technology.
"This new phase of ENCODE is very much about extremely high-throughput analysis of full genomes in many cell lines," she said. "And even though we were able to cover several cell lines with our proteomic data, I'm not convinced, and I don't think NHGRI is convinced, that proteomics techniques are mature enough to [work] on the same scale as genomics technologies can right now.
"Further technology development is necessary for proteomics to get the depth and coverage to really contribute to the genomic understanding that we have," she added.
With these limitations in mind, C-HPP participation in the project could prove useful, Giddings said, but, she suggested, only if it were done in a highly systematized and organized way.
"It would be helpful if they were to do it in the same cell types and specifically within the way that the ENCODE consortium operates," she said. "I think it would be hard for some proteomics people to just say, 'Well, we're generating a bunch of data, let's add this to ENCODE.'
"ENCODE is trying to be very systematic and organized in its exploration of the genome, so it would require that same level of systemization and organization among the proteomic people," Giddings said.
Given the deliberately loose structure of the C-HPP, wherein each chromosome group has a considerable amount of autonomy in terms of structuring its programs and obtaining funding, such a level of systemization seems unlikely.
"One thing about a global collaboration is that you can't mandate everything," Hancock said. "I think what we can do is have a structure by which the groups communicate, but they are going to follow their [own] priorities."
That said, he noted, "we do have the structure, I think, to keep the communication going."