CNIO Study Suggests Single Dominant Splice Form for Most Protein-Coding Genes

NEW YORK (GenomeWeb) – With efforts to integrate proteomics and genomics gaining steam, a number of recent proteogenomic studies have sought to leverage this combination to delve deeper into samples in search of protein-level evidence for features like splice variants.

Solid evidence exists that alternative splicing of messenger RNA can lead to the expression of a variety of different RNA transcripts from a single gene. Determining whether these transcripts are translated into stable proteins, however, has been a more challenging task.

In part, this is due to the technological limitations of mass spec-based proteomic platforms, which identify only a small proportion of the proteins in any complex sample and tend, in particular, to miss low-abundance peptides, the category into which alternative splice forms most likely fall.

There is also the possibility that few alternative splice forms actually exist at the protein level – that most genes are expressed as a single dominant isoform.

Taking up the question in a paper published last week in the Journal of Proteome Research, scientists from the Spanish National Cancer Research Centre (CNIO) have completed an analysis that would seem to bolster the latter viewpoint.

Reanalyzing eight different large-scale mass spec proteomic datasets – the PeptideAtlas and National Institute of Standards and Technology datasets as well as sets from six published large-scale studies – the researchers were able to map at least two peptides to roughly 64 percent of the human genome. However, they identified alternative splice forms for only 246 genes (1.2 percent of the genome) – a finding that, they noted, "clearly suggests that the vast majority of genes express a single main protein isoform."

This observation was not especially surprising in and of itself, said CNIO researcher Michael Tress, senior author on the paper. He noted that the researchers initially did a similar analysis looking at several smaller-scale experiments and found no alternative splice forms at the protein level. They also put together datasets from PeptideAtlas and the Global Proteome Machine Database from which they found around 150 alternative forms.

"So we have known that they are rare for a while," he said.

More surprising, Tress told GenomeWeb, was how well the dominant isoforms identified by their proteomic analysis corresponded to the dominant isoforms identified by independent genomic and structural analyses.

In the JPR paper, the researchers identified 149,954 highly reliable peptides across the eight combined datasets. Of these, they noted, 111,382 discriminated between isoforms of the same gene.

Using these peptides, they determined the dominant isoform for each of the 12,716 genes for which they had peptide evidence by totaling the number of peptides that mapped to each splice form annotated for a given gene and identifying the isoform with the most mapped peptides as the dominant isoform for that gene.

This process identified a single primary protein isoform for 5,011 genes, along with 25 genes with evidence of alternative splicing in which the two splice isoforms tied in the number of mapped peptides. Of the remaining 7,680 genes, 3,977 were annotated with only one protein-coding isoform, while 3,703 did not have enough isoform-discriminating peptides to identify specific isoforms.
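The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the idea as reported here, not the CNIO group's actual pipeline: the function name `call_main_isoform`, the `(peptide, isoform_id)` input format, and the status labels are all hypothetical, and the sketch ignores the peptide reliability filtering the paper applied.

```python
from collections import Counter

def call_main_isoform(peptide_hits):
    """Pick the isoform of one gene with the most mapped peptides.

    peptide_hits: iterable of (peptide, isoform_id) pairs for a single gene
    (hypothetical input format). Returns (status, isoform_ids), where status
    is "single", "tie", or "no_evidence".
    """
    counts = Counter()
    # Count each distinct peptide at most once per isoform.
    for _peptide, isoform in set(peptide_hits):
        counts[isoform] += 1
    if not counts:
        return ("no_evidence", [])
    best = max(counts.values())
    top = sorted(iso for iso, n in counts.items() if n == best)
    # One isoform with the most peptides is the main isoform; otherwise a tie.
    return ("single", top) if len(top) == 1 else ("tie", top)

hits = [("PEPA", "iso1"), ("PEPA", "iso2"), ("PEPB", "iso1"), ("PEPC", "iso1")]
print(call_main_isoform(hits))
```

In this toy input, a peptide shared by both isoforms adds to both counts, so only the two peptides unique to `iso1` separate the candidates. That is why genes without isoform-discriminating peptides cannot be resolved by this kind of count.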

The CNIO team then compared the 5,011 genes for which they identified a single primary protein isoform to several independent isoform reference sets – the unique consensus CCDS variants, the APPRIS principal isoforms, and the isoforms with the longest sequence.

The CCDS set consists of variants based on genomic evidence that have been mutually agreed upon by manual annotators. It annotates 13,297 of the Gencode 20 genes as having a single variant.

Of these 13,297 genes, 3,331 overlap with the 5,011 genes identified by the CNIO researchers as having a single primary protein isoform. And for these 3,331 overlapping genes, both sets identified the same isoform as primary for 98.6 percent of them.

The APPRIS set calls principal isoforms based on conservation of structure and function and identifies a primary isoform for 15,172 of the Gencode 20 genes. Comparing this set with their 5,011 genes, the researchers found 4,186 in common, and of these, the primary protein isoform they identified matched the most conserved isoform in the APPRIS set for 97.8 percent of them.

Looking at the 3,015 genes for which all three sets identified a primary isoform, the researchers found that their main isoform matched the CCDS main isoform for 99.4 percent of these genes and matched the APPRIS isoform for 99.5 percent of these genes.
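Each of these comparisons comes down to the same arithmetic: restrict two sets of isoform calls to the genes they share, then count how often they name the same isoform. A minimal Python sketch, assuming each set is a dictionary from gene to chosen isoform (the function name `agreement` and the toy identifiers are hypothetical, not data from the paper):

```python
def agreement(calls_a, calls_b):
    """Compare two gene -> isoform mappings over the genes they share.

    Returns (shared_gene_count, matching_gene_count, percent_agreement).
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return (0, 0, float("nan"))
    same = sum(1 for gene in shared if calls_a[gene] == calls_b[gene])
    return (len(shared), same, 100.0 * same / len(shared))

# Toy calls only; identifiers are made up for illustration.
proteomics = {"G1": "iso1", "G2": "iso2", "G3": "iso1"}
reference = {"G1": "iso1", "G2": "iso1", "G4": "iso3"}
print(agreement(proteomics, reference))
```

The percentage is computed only over the overlap, which is why the article reports both the number of shared genes and the agreement rate for each reference set.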

Characterizing this level of agreement as "very surprising," Tress noted that it suggests real biology underpinning the identification of the primary isoforms given that the three methods arrived at their IDs via entirely different routes.

The researchers also compared the primary isoforms they identified through their proteomics analysis to dominant isoforms identified through the Human BodyMap RNA-seq study. The JPR set had 1,038 genes in common with genes from the BodyMap study in which a dominant isoform (defined as one with five-fold higher expression than all other variants for that gene) was identified. In this comparison, however, the two datasets agreed on the dominant isoform only 77.2 percent of the time.

This would seem to suggest that isoform expression at the transcript level is substantially different from expression at the protein level. However, Tress said he suspected that the more likely cause of the discrepancy is poor performance of the RNA-seq deconvolution methods used to identify the primary isoforms at that level.

The researchers noted that when they performed the comparison by looking at genes where one transcript had twice – as opposed to five-fold – as many reads as the next most common one, the agreement between the RNA-seq and proteomic sets rose to 95 percent, demonstrating, they wrote, "that RNA-seq reads do indeed contain a signal that can be used to select the main isoform."
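The two dominance cutoffs mentioned above, five-fold and two-fold, can be expressed as a single threshold test on per-transcript read counts. The following Python sketch is an assumption-laden illustration: the function name `dominant_transcript`, the use of raw read counts, and the choice of "greater than or equal to" at the boundary are not taken from the paper or the BodyMap study.

```python
def dominant_transcript(read_counts, fold=5.0):
    """Return the transcript whose count is at least `fold` times the
    next most common one, or None if no transcript clears the bar.

    read_counts: dict of transcript_id -> read count for a single gene
    (hypothetical input format).
    """
    if not read_counts:
        return None
    ranked = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top = ranked[0]
    # With a single annotated transcript there is no runner-up to beat.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top > 0 and top >= fold * runner_up:
        return top_id
    return None

counts = {"t1": 60, "t2": 20, "t3": 5}
print(dominant_transcript(counts, fold=5.0))  # no transcript is five-fold ahead
print(dominant_transcript(counts, fold=2.0))  # t1 is at least two-fold ahead
```

In this toy case the stricter cutoff makes no call for the gene, while the looser one calls `t1`, so changing the cutoff changes which genes enter the comparison with the proteomic set.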

Tress said that, based on the results, he and his colleagues are following up to try to determine the extent to which the discrepancy is due to the deconvolution methods versus actual differences between expression at the RNA and protein levels.

They are also working on a paper detailing the types of alternative splicing they found in their experiments, he said.

One trend they observed is that "the alternative isoforms for which there is more detectable evidence tend to be highly conserved," Tress said. "We were finding things that went back to when we actually split from bony fish.

"The most [biologically] interesting [variants] seem to be those that are conserved and don't have catastrophic effects on the structure or the function of the protein," he said. "So from that point of view, you could almost set up an order of which splice forms are most interesting that could allow people to concentrate on the most interesting ones."
