NEW YORK (GenomeWeb) – A team led by researchers at Baylor College of Medicine and Vanderbilt University has found that the use of trypsin digestion in proteomics could limit the identification of alternative splice forms.
In a study published last week in Molecular & Cellular Proteomics, the researchers found that nucleotide triplets encoding lysines and arginines are enriched at exon boundaries. Because lysines and arginines are the cleavage sites for trypsin, digestion with this enzyme is more likely to cut proteins at those boundaries, reducing the detection of the junction-spanning peptides needed to identify splice isoforms.
Given that trypsin is the standard enzyme used for protein digestion in shotgun proteomic experiments, this suggests such experiments undercount these splice forms.
This is of particular interest given the emergence in recent years of proteogenomics, wherein genomic and proteomic data are combined in the expectation that the two levels of data together will be more informative than either would be separately. One use of such analyses is looking for evidence at the protein level of splice forms and other genetic variants, the idea being that such variants are more likely to be biologically relevant if they are ultimately translated into proteins.
"In order to understand splice forms, the most important thing you want to understand is which exons are connected," said Bing Zhang, a BCM professor and senior author of the study. "For example, you have exon one, two, and three, in that order, right? If there is an exon exclusion, you have exon one and three connected to each other. But if the trypsin cuts at the end of exon one, then you don't know if it is connected with exon two or exon three."
Trypsin could always cleave at such a site by chance, but the finding by Zhang and his colleagues that exon boundaries are actually enriched for codons encoding the lysines and arginines where trypsin cuts heightens the issue.
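The ambiguity Zhang describes can be illustrated with a minimal Python sketch. The exon sequences below are invented for illustration, and the cleavage rule is a simplification (cut after every lysine or arginine, with no proline exception and no missed cleavages); none of this is taken from the study's software.

```python
import re

def tryptic_peptides(protein):
    """Split a protein after every K or R (simplified trypsin rule, an assumption)."""
    # The zero-width lookbehind splits *after* K/R; requires Python 3.7+.
    return [p for p in re.split(r"(?<=[KR])", protein) if p]

# Hypothetical exon-encoded sequences; exon 1 ends in a lysine.
exon1, exon2, exon3 = "MSTLEDAK", "GFVNQLER", "ADTWYLGK"

full_length = set(tryptic_peptides(exon1 + exon2 + exon3))
exon2_skipped = set(tryptic_peptides(exon1 + exon3))

# Peptides that could prove exon 1 is joined directly to exon 3.
unique_to_skipped = exon2_skipped - full_length
print(unique_to_skipped)  # empty: the skipped isoform leaves no distinguishing peptide

# Same exons, but exon 1 now ends two residues past the lysine.
exon1_alt = "MSTLEDAKVL"
full_alt = set(tryptic_peptides(exon1_alt + exon2 + exon3))
skipped_alt = set(tryptic_peptides(exon1_alt + exon3))
unique_alt = skipped_alt - full_alt
print(unique_alt)  # a peptide spanning the exon 1-exon 3 junction
```

In the first case every peptide from the exon-skipping isoform also appears in the full-length digest, so the two isoforms cannot be told apart. In the second, one peptide crosses the junction and is specific to the skipped form.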
Zhang said the observation stemmed from work he and his colleagues were doing using transcriptomic and proteomic data to analyze transcript and protein isoforms. Using software his lab developed to map identified peptides to the genome, the researchers observed that a large number of peptides detected in proteomic studies end exactly at exon-exon junctions.
"That was kind of confusing," he said. "It was maybe two-fold higher than you could expect."
Looking more closely at exon-exon junctions in the genomes of humans and other organisms such as mice, the researchers found that these regions were enriched for codons that are translated into lysines and arginines. While the genome-wide frequency of these two amino acids is around 5 percent, they found that it is around 15 percent at exon-exon junctions and that around 25 percent of these junctions include a lysine or arginine.
This suggests that using enzymes other than trypsin could help researchers identify more splice variants. To get at this question, the researchers did an in silico digestion of the human proteome using six different proteases. In total these enzymes created 161,125 detectable junctions, with 1,029 common across all six enzyme digestions. Chymotrypsin generated the largest number of junctions.
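A rough Python sketch of what such an in silico comparison involves is shown below. It is not the researchers' pipeline: the four enzymes and their cleavage rules are simplified assumptions (the article does not name the six proteases used), the protein is invented, and the 7-to-30-residue "detectable" length window is a hypothetical cutoff.

```python
# Simplified rules: each enzyme cuts on the C-terminal side of these residues.
# Real specificities are more nuanced; these are illustrative assumptions.
CLEAVE_AFTER = {
    "trypsin": "KR",
    "lys_c": "K",
    "chymotrypsin": "FWY",
    "glu_c": "E",
}

def peptide_spans(protein, residues):
    """Return half-open (start, end) spans from complete cleavage after `residues`."""
    spans, start = [], 0
    for i, aa in enumerate(protein):
        if aa in residues:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(protein):
        spans.append((start, len(protein)))
    return spans

def detectable_junctions(protein, junctions, residues, min_len=7, max_len=30):
    """Junctions lying strictly inside a peptide whose length is in the assumed window.

    A junction is given as the index of the first residue of the downstream exon.
    """
    found = set()
    for start, end in peptide_spans(protein, residues):
        if min_len <= end - start <= max_len:
            found.update(j for j in junctions if start < j < end)
    return found

# Hypothetical four-exon protein; junctions fall at residues 8, 16, and 24.
exons = ["MSTLEDAK", "GVNQLSAR", "ADTWLGSE", "LNPAGTVK"]
protein = "".join(exons)
junctions = {8, 16, 24}

covered = {
    enzyme: detectable_junctions(protein, junctions, residues)
    for enzyme, residues in CLEAVE_AFTER.items()
}
union = set().union(*covered.values())
common = set.intersection(*covered.values())

for enzyme, found in covered.items():
    print(enzyme, sorted(found))
print("any enzyme:", sorted(union), "all enzymes:", sorted(common))
```

In this toy case trypsin covers only one of the three junctions because two exons end in lysine or arginine, while the enzymes together cover all three and no junction is covered by every enzyme.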
The motifs recognized by chymotrypsin are actually slightly underrepresented at exon-exon junctions, Zhang noted. In an experiment analyzing RKO cell proteomes with parallel trypsin and chymotrypsin digestions, the researchers detected 37 percent more junction-spanning peptides than they could with trypsin alone and identified more than 1,000 junctions that were not identified in trypsin-only digests.
The findings, Zhang said, indicate that for experiments where detecting splice forms is a priority, researchers should consider using additional enzymes besides trypsin.
"We used six common enzymes that have been used in proteomics studies and we created a total of around 150,000 detectable junctions," he said. "But if you look at the overlap, only 1,000 junctions are covered by all six, which means that they are very complementary. So, ideally, it might be best to use them all, because you can do the digestion separate and combine the samples and then still do just one proteomics measurement."