NEW YORK (GenomeWeb) – The announcement this month of AB Sciex and Illumina's OneOmics partnership marked a significant moment for ongoing efforts to marry proteomic and genomic data as two of the largest vendors in the respective spaces signaled their interest and investment in this goal.
Though still in its early stages, the integration of proteomic and genomic data has become an area of significant research activity in recent years. Projects like the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium are combining protein biomarker discovery with genomic characterization, and high-profile papers, like a pair of human proteome maps published this summer in Nature, have used proteogenomic analyses to search deeper into the proteome and contribute to genome annotations.
Beyond its scientific merits, the growing interest in integrating proteomic and genomic data also represents a potential opportunity for mass spec vendors as it could generate increased demand among genomics researchers for proteomic analyses and instrumentation.
Given the much larger size of the genomics field relative to proteomics, this would prove a significant new market for proteomics tools. Thus far, though, much of the interest driving efforts to combine genomic and proteomic data has come from the proteomics side, and, according to several researchers, proteomics has yet to fully win the trust of the genomics community.
"I think there has probably been more interest from the proteomics guys teaming with the genomics [researchers] rather than the other way around," Stanford University research Michael Snyder told ProteoMonitor, adding that there has been not as much interest in proteomics from the genomics side "as you would have hoped."
A leader in developing and performing multi-omic analyses, Snyder said he thought this lack of interest is due in large part to the fact that many current researchers came up during an era when genetic research was prevalent but proteomic work was much less so.
"So they don't really have much experience with proteins, and mass spec is very foreign," he said.
Mass spec's daunting reputation is, in part, what collaborations like the OneOmics partnership are intended to tackle.
"This is not just rehashing the same software and putting it into the cloud. It's about trying to solve biological problems," Aaron Hudson, senior director of the academic and omics business at AB Sciex, told ProteoMonitor upon announcement of the project.
"If you look at the apps, you don't actually see a mass spectrum anymore – you see up and down regulation of the proteins and some biological insights as well," Hudson added.
Yale University researcher Christopher Colangelo told ProteoMonitor that his "ultimate goal is [to] make proteomics more streamlined and accessible to genomics people … to improve the number of people who use proteomics in general." Colangelo and his colleagues created one of the first outside apps for the OneOmics project – a program that helps researchers more easily integrate RNA-seq and mass spec data.
"The genomics market dwarfs [proteomics]," he said. "So the goal is, everybody who does a genomics study uses these standardized tools to do their analysis, and if we can interface with those and they can more easily do proteomics, then it will create a lot more research for proteins."
"I think all mass spec companies realize that the need to integrate their data with genomic data is paramount to their success," Colangelo added, noting that he expected to see all the major vendors pursue arrangements similar to the AB Sciex-Illumina deal.
With its acquisition earlier this year of Life Technologies and its next-gen sequencing business, Thermo Fisher Scientific would seem to be particularly well positioned for such an endeavor. Indeed, the company "is very interested in merging these two types of data since our acquisition of Life [Technologies]," said Mary Lopez, director of the company's Biomarker Research Initiatives in Mass Spectrometry (BRIMS) Center.
In terms of integrating its mass spec and sequencing systems, Lopez told ProteoMonitor that the company is currently only discussing the idea, "trying to understand what are the most effective ways of integrating this data. I think you will see much more activity over the next year, because it's a focus of interest not just for us but I think across the community."
The AB Sciex-Illumina partnership is "tremendously exciting and I think emblematic of the direction [the field is] moving in," Lopez said, adding that she and her team have received interest from the genomics community in their proteomics capabilities – particularly with regard to integrating RNA-seq data.
"From a big picture perspective there are a lot of things that proteomics can provide to expand on the data that people in genomics are exploring," she said. "The phenomenon of protein isoforms and post-translational modifications, for instance, can certainly add a dimension to transcriptomics."
There remains the question, though, of the extent to which genomics researchers see proteomics as a field ready to contribute to their own analyses.
"It's critical that we can trust [the proteomic data]," Wellcome Trust Sanger Institute Jennifer Harrow, one of the leaders of the GENCODE consortium, told ProteoMonitor. "This is the issue and always has been the issue – what is the source of the proteomics data? How has it been generated? What is the underlying database?"
That researchers must ask such questions perhaps hits on a significant cause of the genomics field's hesitation thus far to get too involved in proteomics. To an extent, though, such questions are still unavoidable at this point in proteomic technology development, Stanford's Snyder suggested.
"Mass spec is becoming more and more turnkey," he said. However, "people do need to know what the strengths and weaknesses of some of these analyses are, because there can be artifacts if you don't really understand what is going on."
Beyond that, Harrow noted that questionable analyses on the proteomics side – even by experienced proteomics researchers publishing in high profile journals – can make this data less useful to genomics researchers than it otherwise might be.
She cited the example of the two Nature human proteome map publications. These studies, one led by Johns Hopkins University researcher Akhilesh Pandey and the other led by Technical University of Munich researcher Bernhard Kuster, have come under criticism from a variety of quarters since their publication, and Harrow too noted her reservations about the work.
For instance, in the Pandey study, the researchers used as their reference set an older version of the RefSeq database, which, Harrow said, led them to identify as novel protein coding regions some parts of the genome that were in fact already annotated as protein coding in more recent versions of the database.
"I don't think that was very reassuring," she said.
Harrow also said the analysis had a relatively high false discovery rate and questioned whether the researchers' data in fact solidly backed up all the peptide IDs they claimed.
"These kinds of papers don't help the [genome] annotation because people then expect that all these proteins have been verified because these two [Nature] papers have shown it," she said. "But actually if you drill down in the data it's not correct at all, and then we have to spend time convincing people that these are not protein coding, and that's a problem."
In a review published this week in Nature Methods, University of Michigan researcher Alexey Nesvizhskii highlighted various challenges facing proteogenomic analyses and cautioned that not enough attention was being paid to these issues within the field.
"I have been interested in this field of proteogenomics for a long time and have built my career by developing methods for dealing with error rates in proteomics and false discovery rates, and it's kind of disappointing to see that the field is taking a step backwards with regard to rigor in processing large scale datasets and, specifically, with regard to identifying novel peptides," Nesvizhskii told ProteoMonitor.
He said that in the past, proteomics had been limited in how much it could contribute to genomics by its relatively low sensitivity and the small size of its datasets. With the emergence of better mass spec technology and large datasets in repositories like ProteomeXchange and PeptideAtlas, however, the field is ready "to make a contribution by using proteomics data in parallel with genomics data to identify novel peptides and reannotate genomic models," Nesvizhskii said.
His excitement at this possibility has been tempered, however, by the fact that "the way the data analysis is [often] done is not adequate, and all the mistakes that we were making 10 years ago in the field of proteomics are now being repeated in the proteogenomics context," he said.
One basic example, he noted, was in the matter of scoring novel peptides. Because hits on such peptides are much more likely to be false positives than for known, previously identified peptides, "you have to apply a much more stringent cutoff to the novel peptides to get the same 1 percent false discovery rate as you would typically do for known peptides," he said.
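To illustrate the point about cutoffs, here is a minimal Python sketch of a target-decoy calculation run separately on known and novel peptide matches. It is a toy example under stated assumptions, not the method used by Nesvizhskii or by either Nature paper: the function names, the scores, and the simple decoys-to-targets ratio used as the FDR estimate are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# A peptide-spectrum match (PSM): (score, is_decoy). Higher score = better match.
PSM = Tuple[float, bool]


def score_cutoff_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Lowest target score at which the estimated FDR (decoys / targets
    at or above that score) is still within max_fdr. None if no such score."""
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def estimated_fdr(psms: List[PSM], cutoff: float) -> float:
    """Decoy-to-target ratio among PSMs scoring at or above the cutoff."""
    targets = sum(1 for s, d in psms if s >= cutoff and not d)
    decoys = sum(1 for s, d in psms if s >= cutoff and d)
    return decoys / targets if targets else 0.0


# Made-up data: many known-peptide matches with few decoy hits ...
known: List[PSM] = [(float(s), False) for s in range(101, 301)]
known += [(100.5, True), (90.0, True)]

# ... and far fewer novel-peptide matches, with decoys scoring among them.
novel: List[PSM] = [(float(s), False) for s in range(151, 171)]
novel += [(160.5, True), (155.5, True)]

pooled_cutoff = score_cutoff_at_fdr(known + novel)  # one cutoff for everything
novel_cutoff = score_cutoff_at_fdr(novel)           # cutoff for novel hits only

print("pooled cutoff:", pooled_cutoff)
print("novel-only cutoff:", novel_cutoff)
print("novel FDR at pooled cutoff:", estimated_fdr(novel, pooled_cutoff))
print("novel FDR at novel-only cutoff:", estimated_fdr(novel, novel_cutoff))
```

In this toy data, the single cutoff that holds the pooled set to 1 percent leaves the novel subset with a much higher estimated error rate, because the large number of known matches masks the decoys that fall among the novel ones. Computing the cutoff on the novel class alone pushes the threshold up, which is the more stringent treatment Nesvizhskii describes.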
Another commonly ignored issue, Nesvizhskii said, is the fact that the traditional method of constructing decoy databases – randomizing or reversing the protein sequences – does not allow for good estimation of FDRs for novel peptide variants differing from the reference peptide by only one or two amino acids.
Like Harrow, he cited the recent Nature papers, saying that he "would argue that in those two papers the majority of reported novel peptides, especially in the categories of pseudogenes and non-coding RNAs, are false positives."
"There are all these issues that we need to understand and acknowledge and modify our strategies for these kinds of proteogenomic studies," he said, "because right now it's kind of all over the place."
Nesvizhskii added that he worried this could hurt proteomics' chances of making inroads into genomics research.
"A year ago my hope was that with the very large datasets, the situation would change and the genomics groups would start using proteomics data more," he said. "Now I worry that they will see these papers and look at them and look at the peptides that are reported and say, 'Well, maybe we shouldn't rush into this.'"
One consolation? The RNA-seq field isn't faring much better, said Michael Tress, a researcher with a group from the Spanish National Cancer Research Centre (CNIO) that in July published a critique of the two Nature proteome map papers in the Journal of Proteome Research.
"From the RNA-seq side, I know that many researchers wish that proteomics would have a breakthrough that improved the coverage of genes and isoforms," Tress told ProteoMonitor in a recent email, noting that "the possibility of marrying RNA-seq data directly to proteomic data would definitely be interesting to the RNA-seq side."
However, he added, "if any groups do try to work with these [RNA-seq] instruments, I would definitely urge them not to trust RNA-seq data for abundances or splice isoforms. I know we did criticize the excesses of the Pandey and Kuster papers, but on the whole, the proteomics community generates more reliable data than the RNA-seq community!"