Demonstrating the value of combining genomic and proteomic data, researchers in the Netherlands have published a proof-of-concept "proteogenomics" study that analyzes whole-genome, RNA-seq, and proteome data from two rat strains.
Matching proteomic data to a database enhanced by strain-specific genome and transcriptome data, for example, enabled the scientists to identify thousands of peptides they would otherwise have missed.
Their study also showed that RNA editing has only a limited effect at the protein level. Finally, by focusing on genes that were differentially expressed at both the transcriptome and the protein level, they discovered a promoter variant that may help explain the hypertension phenotype of one of the rat strains.
The results "demonstrate the power of and need for integrative analysis for understanding genetic control of molecular dynamics and phenotypic diversity in a system-wide manner," the scientists wrote in their study, which was published online in Cell Reports last month.
According to Edward Cuppen, a professor of genome biology and human genetics at the Hubrecht Institute in Utrecht, and one of the senior authors of the study, no single data set – be it genomic, epigenomic, proteomic, or metabolomic – will provide a complete answer to researchers' questions. "It's basically just one snapshot from one angle," he said.
"We typically live in our own worlds and think that next-gen sequencing can do everything, and is so powerful to measure a lot of things, but when you're a biologist, you know it's only part of the whole story," Cuppen told In Sequence.
With both mass spectrometry-based proteomics and next-gen sequencing advancing over the past few years, "it now becomes feasible to start combining these types of techniques."
While their study is not the first to integrate different data types, such as transcriptome and proteome data, he said, it is one of the most comprehensive to date.
For their study, the researchers analyzed liver tissue from two inbred rat strains, one of which, the spontaneously hypertensive rat, is widely used in hypertension studies. The genomes of both strains had previously been sequenced.
To analyze protein mass spec data, scientists usually map the spectra back to a peptide database that is derived from the reference genome of the organism they are studying. However, Cuppen explained, that database does not include strain-specific genetic variation, so peptides with amino acid changes due to non-synonymous variants will not be matched.
In order to map those peptides, he and his colleagues enhanced the existing rat peptide database by incorporating strain-specific non-synonymous genetic variants, which affect more than 6,000 protein isoforms.
They further improved the database by adding transcript splice events from their RNA-seq data. "We used the transcriptomics data to get the peptide reference database more complete, so we can assign more of the sequenced peptides to a protein," Cuppen said.
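The idea of extending a peptide database with strain-specific variants can be illustrated with a minimal sketch. It is not the authors' pipeline: the protein sequence, the variant, the helper names, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made for illustration, since the article does not describe the digestion or search settings used.

```python
def tryptic_peptides(protein, min_len=6):
    """In-silico digest: cut after K or R unless the next residue is P.

    Simplified rule with no missed cleavages (an assumption for this sketch).
    """
    peptides, start = set(), 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (aa in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:
                peptides.add(peptide)
            start = i + 1
    return peptides


def apply_substitution(protein, pos, ref, alt):
    """Apply a non-synonymous variant at a 1-based protein position."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}")
    return protein[:pos - 1] + alt + protein[pos:]


# Toy reference protein and a hypothetical strain-specific N17D variant.
reference = "MASLTEKGDWLLPRAVNQEFGHKSTTPLLER"
strain_protein = apply_substitution(reference, 17, "N", "D")

reference_db = tryptic_peptides(reference)
strain_db = tryptic_peptides(strain_protein)

# Peptides that a reference-only database could never match.
strain_only = strain_db - reference_db
print(sorted(strain_only))
```

In this toy case the variant changes exactly one tryptic peptide, so a spectrum from that peptide would go unassigned unless the strain-specific sequence were added to the search database.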
The researchers also examined what fraction of the RNA editing events observed in the RNA-seq data could be detected at the protein level, and found that only about 10 percent of non-synonymous editing events could. "It's a tiny percentage," Cuppen said, "so one could wonder what the relevance is."
Finally, the team compared gene expression levels at the RNA and protein level, which Cuppen said often do not correlate well.
By focusing on a set of genes that showed strain-specific expression changes in both the transcriptome and proteome data, the researchers were drawn to a cytochrome P450 gene that had previously been linked to hypertension.
They identified a genetic variant in the promoter of that gene that likely alters its expression in the hypertensive rat strain and might contribute to the strain's phenotype.
"If you look critically at the data, you could have picked it up from just RNA-sequencing or just proteomics," Cuppen acknowledged, "but then, it would have been one of many events in a cloud of outliers. By combining these techniques, this protein very clearly stood out as the most prominent candidate that is differentially regulated, both at the RNA level and at the protein level."
Combining RNA-seq and proteome data, he said, allowed the researchers to quickly eliminate a number of candidate genes that looked promising by RNA data alone but were not differentially expressed at the protein level.
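The filtering step Cuppen describes, keeping only candidates that change at both levels, can be sketched as a simple intersection. The gene names, log2 fold-change values, the threshold of 1.0, and the requirement that both changes point in the same direction are all hypothetical choices for illustration and are not taken from the study.

```python
# Hypothetical strain-vs-strain log2 fold changes (toy values only).
rna_log2fc = {"geneA": 2.1, "geneB": 1.8, "geneC": -1.5, "geneD": 0.2}
protein_log2fc = {"geneA": 1.9, "geneB": 0.1, "geneC": 1.2, "geneD": 0.0}


def concordant_hits(rna, protein, threshold=1.0):
    """Return genes changed at both levels, in the same direction.

    The threshold and the same-direction rule are assumptions of this sketch.
    """
    hits = []
    for gene in rna.keys() & protein.keys():  # genes measured in both data sets
        r, p = rna[gene], protein[gene]
        both_changed = abs(r) >= threshold and abs(p) >= threshold
        same_direction = (r > 0) == (p > 0)
        if both_changed and same_direction:
            hits.append(gene)
    return sorted(hits)


candidates = concordant_hits(rna_log2fc, protein_log2fc)
print(candidates)
```

Here geneB looks promising from RNA data alone but drops out because its protein level barely changes, which mirrors the kind of candidate the combined analysis let the researchers set aside.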
However, there is still value in RNA-seq data, he said, for example, to look at the classes of genes affected by an experimental perturbation. "Transcriptome measurements are much easier and much more quantitative than proteome measurements at this moment," Cuppen said.
"RNA-seq is a fully good measurement to see what type of transcriptomic programs you are triggering by inducing a change, so it's very powerful for that," but if researchers want to know the effect on cell function, they need to do proteomics, and maybe also metabolomics, he said.
Going forward, Cuppen and his colleagues want to apply the integration of genome, transcriptome, and proteome data to experimental cancer systems, with the goal of better understanding tumor induction, the effects of treatment, and drug resistance mechanisms. "One of the major challenges is to have sufficient material from controlled environments to do these studies properly," he said.
A number of cancer research studies have combined DNA-based data, such as genome or exome sequences, and RNA-seq, but few have incorporated protein mass spec data as well.
While Cuppen believes the approach to be powerful, and likely to be used more often in the future, he cautioned that it is not easy.
For example, interpreting data from two different technologies is hard because the methods have different sensitivities and accuracies, and because the bioinformatic approaches used to analyze each data type differ.
"I just want to give a warning that it is challenging," he said. "It's not just measuring two things and putting them together in a computer, and then better solutions come out. There is still work that needs to be done."