Researchers at Utrecht University have completed a proteogenomics study of rat liver tissue, integrating whole genome sequencing, RNA-seq, and mass spec-based proteomics.
In the study, detailed in a paper published last week in Cell Reports, the researchers identified 13,088 proteins, making it one of the most comprehensive proteome analyses performed to date. Integrating their genomics data, they were also able to validate 1,195 gene predictions, 83 splice events, 120 proteins with nonsynonymous variants, and 20 protein isoforms with nonsynonymous RNA editing.
The effort also provided several biological insights, Albert Heck, chair of the Biomolecular Mass Spectrometry and Proteomics group at Utrecht University and author on the study, told ProteoMonitor.
In particular, the researchers were able to investigate the question of RNA editing – a process in which the sequence of an RNA molecule is modified after it is transcribed.
Past studies, Heck said, have suggested that cells undergo a substantial amount of RNA editing. Integrating their genomic, transcriptomic, and proteomic data, however, he and his colleagues found that while such editing appeared to occur fairly frequently, few of these modified transcripts ultimately produced stable proteins.
"When we looked at RNA editing at the transcript level, we saw quite a bit," he said. "But then at the peptide level we found very little [evidence of this editing]. So it's not definite proof, but from our data we are pretty sure that RNA editing may happen quite a bit, but that it doesn't often lead to a stable protein."
The Utrecht team was also able to identify a genomic variant potentially linked to hypertension. Their study looked at two rat strains – BN-Lx, a strain closely related to the Brown Norway rat, and SHR, a spontaneously hypertensive rat strain commonly used as a model in blood pressure studies – and by comparing the different levels of omics data across the two strains, they identified 41 differentially expressed genes whose expression changes also remained consistent at the transcriptome and proteome level.
Four of those genes had been linked in previous studies to hypertension, including Cyp17a1, which exhibited the greatest protein-level downregulation in the SHR rat compared to the BN-Lx. Investigating their RNA-seq data, the researchers discovered that the transcription start site of this gene had been annotated incorrectly. They also found that in the SHR rats the promoter for this gene contains a mutation that disrupts transcription – likely causing the difference in expression between the two strains.
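A minimal sketch of such a cross-omics consistency filter, assuming invented log2 fold changes and a threshold chosen purely for illustration (the study's actual statistical criteria were more involved):

    # Rough sketch of a cross-omics consistency filter (hypothetical values):
    # keep genes whose SHR-vs-BN-Lx change points the same way, and is large
    # enough, at both the transcript and protein level.
    log2fc = {
        # gene: (RNA log2 fold change, protein log2 fold change)
        "Cyp17a1": (-2.1, -3.4),
        "GeneX":   ( 1.8,  0.1),   # transcript up, protein flat -> dropped
        "GeneY":   ( 1.2,  1.5),
    }
    THRESHOLD = 1.0  # assumed cutoff for illustration

    consistent = [
        gene for gene, (rna, protein) in log2fc.items()
        if abs(rna) >= THRESHOLD and abs(protein) >= THRESHOLD
        and (rna > 0) == (protein > 0)
    ]
    print(consistent)  # ['Cyp17a1', 'GeneY']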
Beyond the specific findings of the work, Heck said he viewed it as a proof of concept for what he suggested would in the future become a standard proteomics workflow – the use of sample-specific mass spec reference databases.
In the Cell Reports study, Heck and his colleagues used a variety of approaches, including genome sequencing, gene prediction algorithms, and RNA-seq, to construct comprehensive reference databases specific to the two rat strains. By employing these databases, along with proteomic techniques like the use of multiple proteases to expand their protein coverage, they were able to achieve extremely deep proteome coverage, identifying 13,088 proteins – roughly 40 percent of the total rat proteins entered in the Ensembl database – at a false discovery rate of zero percent and with a median sequence coverage of 15.6 percent per protein.
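To make the idea of a sample-specific database concrete, the sketch below applies nonsynonymous variants to reference protein sequences and writes both the reference and variant forms out as FASTA entries for the search engine to use. The sequences, the variant, and the file name are all invented for the example; the study's actual databases were built from far richer genomic and transcriptomic evidence.

    # Sketch of building a sample-specific search database (invented data):
    # substitute each strain-specific nonsynonymous variant into the reference
    # protein and emit both forms as FASTA entries.
    reference = {"PROT1": "MKTAYIAKQRQISFVK", "PROT2": "MSDNELRQAGLLK"}
    variants = {"PROT1": [(5, "Y", "C")]}  # 1-based position, ref residue, alt residue

    with open("sample_specific.fasta", "w") as out:
        for prot_id, seq in reference.items():
            out.write(f">{prot_id}\n{seq}\n")
            for pos, ref_aa, alt_aa in variants.get(prot_id, []):
                assert seq[pos - 1] == ref_aa, "variant disagrees with reference"
                mutated = seq[: pos - 1] + alt_aa + seq[pos:]
                out.write(f">{prot_id}_{ref_aa}{pos}{alt_aa}\n{mutated}\n")

Searching mass spec data against such a database lets variant-bearing peptides be identified that a generic reference database would miss entirely.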
Heck noted that while he didn't necessarily expect the genome sequencing portion of the workflow to become standard, he did envision that use of RNA-seq-based sample-specific databases would become widespread.
"For me, the bottom line of the study is that I think this is how proteomics experiments will be done," Heck said. "It's still too expensive at the moment [for regular implementation], but this is how it will be done in a few years, [as the] standard."
This view echoes a sentiment that appears to be emerging in the field. For instance, in a May interview, University of Wisconsin-Madison researcher Lloyd Smith likewise told ProteoMonitor that he thought construction of sample-specific databases would in the future become the standard for mass spec-based proteomics.
The approach's rise was also clearly on display last month at the Clinical Proteomic Tumor Analysis Consortium's first annual scientific symposium, where a significant number of the presenting researchers had used such databases in their work.
This approach should prove particularly useful in studies of samples – like cancer tissue – with high mutation rates, Heck noted, given that mutations specific to a particular sample may not be captured in the generic reference databases commonly used today.
It is also necessary for proteogenomics work like that presented in the Cell Reports study, where researchers integrate genomic, transcriptomic, and proteomic data from the same sample.
Long cited as a promising approach to omics research, proteogenomics has recently emerged as a significant area of activity in proteomics, with a number of researchers pursuing such studies both independently and as part of large-scale initiatives like CPTAC, the Cancer Genome Atlas, and the National Human Genome Research Institute's Encyclopedia of DNA Elements (ENCODE) consortium.
To a significant extent, this activity on the proteogenomics front stems from advances within proteomics that have made it feasible to obtain coverage comparable to that achieved by genomics and transcriptomics, said Janne Lehtiö, platform manager for mass spectrometry at the Stockholm SciLife laboratory and author on a proteogenomics study published last month in Nature Methods.
"You need to get good enough coverage in proteomics in order to come up with good systems biology conclusions," he told ProteoMonitor. "If you are only looking at 5,000 or 6,000 proteins, that doesn't really allow you to do a comparison with the transcriptomics or genomics."
In the last few years, obtaining coverage in the range of 10,000 to 12,000 proteins has become routine for top labs. In their recent work, for instance, Lehtiö and his colleagues identified 13,078 human and 10,637 mouse proteins.
Lehtiö noted, however, that achieving this level of coverage typically requires extensive fractionation, making it difficult to analyze a large number of samples. He also observed that while researchers can now cover a significant portion of the proteome, coverage of individual proteins is still somewhat spotty.
"We have very little peptide coverage per gene, which limits us in terms of talking about variants," he said. "If you look at the Heck paper, they have on average 15 percent sequence coverage per protein, and that is good sequence coverage. But to turn it around, that means that you still aren't seeing 85 percent of the protein. And if you are going to look at variants you need to get more than 15 percent."
Heck noted that in his lab's experience perhaps the most significant challenge to such work remains integrating the different levels of omics data.
"If you ask me what is the major bottleneck in this [Cell Reports] study, my conclusion is that the most difficult part was to get these platforms to talk to each other, to make the data from one directly comparable with the data from another platform," he said.
One development that could prove helpful to this end is Thermo Fisher Scientific's recent acquisition of Life Technologies, which will add Life Tech's sequencing capabilities and expertise to Thermo Fisher's mass spec-based proteomics offerings. The company has not publicly commented on whether it has any plans to develop integrated workflows for either proteogenomics studies or the construction of sample-specific mass spec reference databases, but Heck said that he has been in contact with people at Thermo Fisher regarding the Cell Reports paper and that he thinks they are interested in such applications.
In a statement to ProteoMonitor, Andreas Huhmer, Thermo Fisher's marketing director for proteomics, did not comment specifically on the company's aims regarding development of tools for proteogenomic workflows, but he did note the increasing interest in such research and the need for methods to better facilitate it.
"It is particularly apparent from recent cancer research that cells harbor individual genomic changes that are not detected in protein expression studies by searching against standard protein databases," Huhmer noted, adding that "those genomics changes are potential markers of disease by affecting protein expression specific to a particular population of cells" and that "the obvious solution is to utilize modern proteomics techniques in conjunction with [next-generation sequencing] to investigate those sample specific changes."
"Standardizing the experimental workflow and creating a robust data analysis pipeline are most likely requirements to make the proteogenomic technique widely accessible and commercially viable," he said.