SciLifeLab Team Develops Improved Proteogenomic Workflow


NEW YORK (GenomeWeb) – Researchers at Sweden's Science for Life Laboratory (SciLifeLab) have developed a workflow for proteogenomics that offers improved identification of protein variants, including single amino acid variant (SAAV) peptides.

Described in a study published last week in Nature Communications, the workflow uses automated inspection of MS/MS data for spectral evidence supporting SAAV identifications, which significantly improves the accuracy of these identifications compared to conventional proteogenomic approaches, said Janne Lehtiö, scientific director of the SciLifeLab and senior author of the paper.

Driven by technologies like next-generation sequencing and improvements in the breadth and quality of proteomic data, proteogenomics has seen increasing adoption in recent years. The approach integrates protein and nucleic acid data in the hope that combining multiple levels of molecular information will enable better understanding of biological and disease processes and improve biomarker discovery and development.

One area where the approach has shown promise is finding protein evidence of genetic mutations, with researchers looking, for instance, for SAAV peptides corresponding to known single nucleotide polymorphisms.

Lehtiö said, though, that conventional proteomic workflows are not well suited to this approach and can return large numbers of false positive hits for SAAVs.

Standard algorithms for mass spec-based peptide identification aim to match the observed spectra from peptides run through the mass spec to a database of theoretical spectra predicted from genomic sequences. Lehtiö said, however, that when his group set out to validate SAAVs identified via proteogenomic work, they found that, despite using an overall false discovery rate (FDR) of 1 percent, they had an FDR for these SAAVs of around 40 percent.
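The gap between the two rates is easier to see with numbers. The short Python sketch below uses entirely hypothetical counts (they are not taken from the study) to show how a small class of identifications, such as SAAV peptides, can carry a 40 percent error rate while the overall FDR across all identifications stays at 1 percent.

```python
# Hypothetical counts chosen only to illustrate the arithmetic;
# none of these numbers come from the SciLifeLab study.

def fdr(false_hits: int, total_hits: int) -> float:
    """Fraction of reported identifications that are false."""
    return false_hits / total_hits

# Assumed: a large pool of canonical peptide identifications ...
canonical_total, canonical_false = 99_000, 600
# ... and a much smaller pool of SAAV peptide identifications.
saav_total, saav_false = 1_000, 400

global_fdr = fdr(canonical_false + saav_false, canonical_total + saav_total)
saav_fdr = fdr(saav_false, saav_total)

print(f"global FDR: {global_fdr:.1%}")
print(f"SAAV-only FDR: {saav_fdr:.1%}")
```

Because the SAAV class is small relative to the full result set, its errors barely move the global figure, which is why a global threshold alone does not control the error rate within that class.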

"We trimmed the [list of SAAVs identified by] our pipeline to be very stringent, and then we ordered a couple of hundred synthetic peptides and tested them, and we found that the error rate was very high," he said.

This problem of false discovery of SAAVs in proteogenomic data is well known, so much so that, Lehtiö and his co-authors noted, some guidelines recommend against reporting any such findings. However, they added, this approach would result in "novel peptides with single substitutions being ignored, even though the proteins they originate from could play important roles in cellular processes."

To address the issue, the SciLifeLab researchers developed an informatics tool called SpectrumAI that analyzes spectral data to look specifically for ions directly supporting the specific residue substitution identified.

"It actually goes to the spectra and looks for the peak evidence for the [putative] amino acid switch," Lehtiö said. He noted the FDR for SAAVs using this approach was probably still higher than the stated 1 percent — most likely in the 1 to 5 percent range. "But it is definitely a major improvement compared to standard pipelines."
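The idea of checking for peaks that directly support a substitution can be sketched in a few lines of Python. This is a simplified illustration, not the SpectrumAI code: the function names, the 0.02 m/z tolerance, the restriction to singly charged b and y ions, and the acceptance rule (both fragment ions flanking the substituted residue must be present in at least one ion series) are all assumptions made for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def b_ions(peptide: str) -> list[float]:
    """Singly charged b1..b(n-1) m/z values."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += MONO[aa]
        out.append(total + PROTON)
    return out


def y_ions(peptide: str) -> list[float]:
    """Singly charged y1..y(n-1) m/z values."""
    total, out = 0.0, []
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        out.append(total + WATER + PROTON)
    return out


def flanking_ions(peptide: str, pos: int) -> dict[str, tuple[float, float]]:
    """Ion pairs on either side of residue `pos` (1-based, interior only)."""
    n = len(peptide)
    if not 2 <= pos <= n - 1:
        return {}  # terminal residues lack a full flanking pair here
    b, y = b_ions(peptide), y_ions(peptide)
    return {
        "b": (b[pos - 2], b[pos - 1]),      # b(pos-1), b(pos)
        "y": (y[n - pos - 1], y[n - pos]),  # y(n-pos), y(n-pos+1)
    }


def has_peak(mz: float, peaks: list[float], tol: float) -> bool:
    return any(abs(mz - p) <= tol for p in peaks)


def substitution_supported(peptide: str, pos: int,
                           peaks: list[float], tol: float = 0.02) -> bool:
    """True if both flanking ions of at least one series are observed."""
    pairs = flanking_ions(peptide, pos)
    return any(all(has_peak(mz, peaks, tol) for mz in pair)
               for pair in pairs.values())
```

The two ions in each pair differ by exactly the mass of the residue at `pos`, so observing both pins down the identity of that residue rather than just the overall peptide mass.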

The SciLifeLab researchers' proteogenomic pipeline, which they named IPAW (integrated proteogenomics analysis workflow), also includes an improved high-resolution isoelectric focusing (HRIEF) step, which they use for pre-fractionation ahead of mass spec analysis.

Lehtiö and his team have been using HRIEF for pre-fractionation since they began developing their proteogenomic workflow some five years ago. One of the challenges of proteogenomic analyses is the large search space involved in matching mass spec data to spectral databases, particularly when searching for variants and peptides produced by regions of the genome traditionally thought to be non-coding.

HRIEF is one way to limit this search space. The technique pre-fractionates the proteome by peptides' isoelectric points, the pH at which a molecule carries no net electrical charge. Because a peptide's isoelectric point depends on its sequence, the researchers are then able to similarly fractionate their mass spec reference database according to the included sequences' theoretical isoelectric points. In this way, they can search only the specific portion of that database featuring isoelectric points corresponding to that of a given experimental peptide.
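As a rough illustration of how a sequence database can be narrowed by isoelectric point, the Python sketch below estimates a peptide's theoretical pI from the Henderson-Hasselbalch equation and keeps only the candidates whose pI falls within a fraction's pH window. The pKa values, the 0.2 pH margin, and the function names are assumptions for the example; the article does not describe how the IPAW pipeline predicts pI.

```python
# Approximate side-chain and terminal pKa values (assumed for illustration;
# published pKa sets differ, and predicted pI shifts with the set chosen).
POS_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEG_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(peptide: str, ph: float) -> float:
    """Net charge at a given pH via Henderson-Hasselbalch."""
    pos = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    neg = 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in peptide:
        if aa in POS_PKA:
            pos += 1.0 / (1.0 + 10 ** (ph - POS_PKA[aa]))
        elif aa in NEG_PKA:
            neg += 1.0 / (1.0 + 10 ** (NEG_PKA[aa] - ph))
    return pos - neg


def isoelectric_point(peptide: str, tol: float = 1e-4) -> float:
    """Bisect on pH until the net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(peptide, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def candidates_for_fraction(database: list[str], ph_low: float,
                            ph_high: float, margin: float = 0.2) -> list[str]:
    """Keep database peptides whose predicted pI fits the fraction window."""
    return [p for p in database
            if ph_low - margin <= isoelectric_point(p) <= ph_high + margin]


# Usage with made-up sequences: only the acidic peptide survives a pH 3-5 window.
database = ["DEDEDE", "KKRKK"]
print(candidates_for_fraction(database, 3.0, 5.0))
```

The margin allows for error in the predicted pI; a wider margin keeps more true candidates at the cost of a larger search space.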

A problem with initial versions of the workflow was that the HRIEF method was optimized for acidic peptides, which, Lehtiö said, is fine for traditional proteomic approaches where peptides are intended as proxies for the protein-level data.

In proteogenomic work, on the other hand, researchers want to look at peptide sequences with as much depth and detail as possible.

"The more peptides [per protein] you see, the better chance you have to see the region of the protein where the mutation or single amino acid variant is," Lehtiö said. To improve this breadth of coverage, he and his colleagues optimized their HRIEF method to work across a wider pH range.

In the Nature Communications paper, the researchers used the IPAW pipeline to analyze two proteomic datasets generated by analysis of A431 cancer cells and five normal human tissue samples. Their analysis found 426 novel peptides in the A431 cells and 155 novel peptides in the normal tissue, including a variety of peptides encoded by what were thought to be non-coding regions of the genome, among them, the authors wrote, "pseudogenes, 5′ or 3′ untranslated regions (UTR) of mRNAs, antisense transcripts, dual-coding transcripts, lncRNAs, intergenic, and intronic sequences."

Of 117 novel peptides they selected for validation using synthetic peptides, they confirmed 110. In a test of the SpectrumAI tool, they validated 30 SAAV peptides, 19 of which were deemed true positive hits by SpectrumAI and 11 of which were deemed false positives. Their validation confirmed the SpectrumAI results.

Lehtiö said he believes proteogenomics is beginning to find "interesting applications," among them the use of peptide-level data to improve genome annotation, the layering of peptide and protein data atop genomic and transcriptomic data to further disease research, and the discovery of cancer-specific protein variants that could be used for targeting immunotherapies.

His lab is using the IPAW pipeline for breast cancer studies to investigate how copy number variants influence the proteome as well as to look for tumor-specific proteins for use in immunotherapies.