Combining isoelectric focusing and mass spectrometry, researchers at Sweden's Science for Life Laboratory have devised an unbiased proteogenomics approach and used it to investigate the full six-reading-frame translation of the human and mouse genomes.
Detailed in a study published this week in Nature Methods, their analysis identified 98 previously undiscovered protein-coding loci in humans and 52 in mice, including a number of refined gene models and pseudogenes.
Proteogenomics – the integration of proteomic and genomic data – has emerged of late as a key area of focus within proteomics. The second stage of the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium, for instance, is at heart a proteogenomic effort, with participants seeking to generate proteomic data on tumors previously characterized genomically by the NCI's Cancer Genome Atlas initiative. Likewise, the National Human Genome Research Institute's Encyclopedia of DNA Elements Consortium, ENCODE, has added proteomic data to some of its genomic and transcriptomic analyses, and researchers from the Human Proteome Organization's Chromosome-Centric Human Proteome Project have proposed proteogenomic collaborations between the two initiatives.
Underlying this interest in proteogenomics is the notion that protein data can, perhaps, shed light on the importance or consequence of various genomic features, allowing researchers to determine, for example, whether or not a specific genetic variant ever actually becomes a functional protein.
As the SciLife researchers demonstrated, the technique could also be useful for improving current genome annotation. A major impediment to such efforts, however, is the vast size of the genomes of higher eukaryotes like humans, said Janne Lehtiö, platform manager for mass spectrometry at the Stockholm SciLife laboratory and an author on the Nature Methods paper.
This is particularly the case in efforts looking at regions of these genomes traditionally thought of as non-coding, he noted.
"In humans, for example, we have about 1.5 percent known protein-coding [genome] sequence, and the rest of the 98.5 percent is considered non-protein-coding," Lehtiö said. "And when we do mass spectrometry experiments, we tend to search [for protein identifications] against just this 1.5 percent."
This is because the sensitivity of such searches is dependent on the size of the database being searched against, he noted. "So if we try to search against the whole sequence, the error rate becomes very high."
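The effect Lehtiö describes can be illustrated with a toy calculation. The sketch below is not the statistical model used in the study; it assumes, purely for illustration, that every candidate sequence in the database has the same small, independent probability of matching a spectrum by chance. The function name and the numbers in the usage note are hypothetical.

```python
import math


def prob_spurious_match(p_random: float, n_candidates: int) -> float:
    """Chance that at least one of n_candidates matches a spectrum at random.

    Toy model (an assumption, not the study's method): each candidate
    matches independently with probability p_random, so the result is
    1 - (1 - p_random) ** n_candidates, computed in a numerically
    stable way with log1p/expm1.
    """
    if not 0.0 <= p_random < 1.0:
        raise ValueError("p_random must be in [0, 1)")
    if n_candidates < 0:
        raise ValueError("n_candidates must be non-negative")
    return -math.expm1(n_candidates * math.log1p(-p_random))
```

Under this model, comparing a database restricted to the roughly 1.5 percent of annotated coding sequence with one about 67 times larger, for example `prob_spurious_match(1e-6, 10_000)` against `prob_spurious_match(1e-6, 670_000)`, shows how quickly chance matches accumulate as the search space grows.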
To get around this problem, Lehtiö and his colleagues turned to high-resolution isoelectric focusing, prefractionating the sample's peptides by their isoelectric points, the pH at which a molecule carries no net electrical charge.
Because a peptide's isoelectric point depends on its sequence, the researchers were then able to similarly fractionate their mass spec reference database according to the included sequences' theoretical isoelectric points. In this way, they could search only the specific portion of that database featuring isoelectric points corresponding to that of a given experimental peptide – thus greatly reducing their search space.
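As a rough illustration of how a theoretical isoelectric point can be derived from sequence and used to bin a peptide database, consider the minimal sketch below. It is not the SciLife team's software: the pKa values are illustrative assumptions (published scales differ), the charge model is a simple Henderson-Hasselbalch sum, and the function names and the pH range passed to `fractionate` are hypothetical.

```python
from collections import defaultdict

# Illustrative pKa values (an assumption; published scales differ slightly).
PKA_POSITIVE = {"N_TERM": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"C_TERM": 3.1, "D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}


def net_charge(peptide: str, ph: float) -> float:
    """Henderson-Hasselbalch estimate of a peptide's net charge at a pH."""
    # Termini: protonated amine is +1, deprotonated carboxyl is -1.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["N_TERM"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_TERM"] - ph))
    for aa in peptide:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(peptide: str, tol: float = 1e-4) -> float:
    """Find the pH where net charge is zero by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Net charge falls monotonically as pH rises.
        if net_charge(peptide, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def fraction_index(pi: float, ph_min: float, ph_max: float, n_fractions: int):
    """Return the 0-based pI bin, or None if pI is outside the gradient."""
    if not (ph_min <= pi < ph_max):
        return None
    width = (ph_max - ph_min) / n_fractions
    return min(int((pi - ph_min) / width), n_fractions - 1)


def fractionate(peptides, ph_min: float, ph_max: float, n_fractions: int):
    """Group database peptides by the pI bin they would focus into."""
    bins = defaultdict(list)
    for pep in peptides:
        idx = fraction_index(isoelectric_point(pep), ph_min, ph_max, n_fractions)
        if idx is not None:
            bins[idx].append(pep)
    return bins
```

With the database binned this way, the spectra from one experimental fraction need only be compared against the peptides in the matching bin, for instance `fractionate(peptides, 3.0, 5.0, 10)[4]`, where the pH range and bin count are placeholders rather than values from the study.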
Using this workflow, the SciLife team divided the six-reading-frame translation database of human and mouse tryptic peptides into 360 isoelectric point-restricted fractions. Generating peptide spectra on a Thermo Fisher Scientific Orbitrap Velos instrument, they then searched the spectra from each experimental fraction against only the corresponding database fraction.
In all, the researchers identified 13,078 human and 10,637 mouse proteins, including 39,941 peptides not previously present in the PeptideAtlas human dataset. They also identified 224 novel human and 122 novel mouse peptides, which mapped to 164 and 101 genomic loci, respectively. These included refined models of known genes – 47 in human and 32 in mouse; pseudogenes and long noncoding RNA genes – 51 in human and 20 in mouse; and intronic and intergenic loci that were not associated with existing gene annotations – 66 in human and 49 in mouse.
One finding the researchers validated with follow-up mass spec analysis was the pseudogene MYH16, which, they noted, had previously been thought to have lost its protein-coding ability due to a double base deletion "during divergence of the human lineage from other primates."
Their data demonstrated, however, that in the human A431 cell line the MYH16 gene in fact produces a shortened protein isoform.
The authors also noted that they were unable to find evidence in their RNA-level data linking the peptides identified as products of novel unconnected intronic and intergenic loci to Ensembl gene annotations. "This category," they wrote, "most likely contains the majority of false positive hits in the proteogenomics search."
The extent to which the presumed non-protein-coding regions of the genome do, in fact, code for proteins is currently something of an open question, Lehtiö noted, particularly in light of transcriptomic data presented last year by ENCODE researchers suggesting that as much as 75 percent of the human genome is transcribed.
"That [finding], of course, poses the question of how much of that [RNA] is converted to protein," he said. He observed that in this initial experiment he and his colleagues had found more than 100 potentially new coding regions, and that they would likely find additional ones were they to repeat the work in different cell types.
"This was a proof-of-principle, but now that we have the technology up and running, we are interested to see [what might exist] in different cell lines and different materials," Lehtiö said. "We are definitely going to try to do a more systematic study."
Beyond analyzing additional cell types, the SciLife researchers might also look to add to the number of peptides analyzed, Lehtiö said. In the Nature Methods study, they investigated only peptides with acidic isoelectric points. Adding peptides in the basic range could be "an extension possibility" for the work, he said.
Given the extensive fractionation involved, the method is fairly time-consuming, Lehtiö noted, though he said the human analysis could be done in around a month and the mouse analysis in as little as one week.