Utrecht University researchers have devised a new method for database-independent proteomics.
The technique enables the identification of peptides and proteins without the use of a protein database. It could prove particularly useful in the analysis of post-translational modifications and protein variants, as well as in studies of organisms for which no protein databases exist, said Albert Heck, scientific director of the Netherlands Proteomics Centre and leader of the effort.
Mass spec-based proteomics typically involves fragmenting unknown proteolytic peptides to produce fragmentation spectra, which can then be searched against protein databases based on predicted spectra generated from genomic data. Peptide identifications are made and scored based on agreement between the experimental and predicted spectra.
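The database-search principle described above can be sketched in a few lines: predicted fragment masses for each candidate peptide are compared with the peaks in the experimental spectrum, and candidates are scored by how many peaks they explain. This is a toy illustration, not any production search engine; the candidate "database," spectrum, and scoring scheme are all invented for the example.

```python
# Monoisotopic residue masses (Da) for a few amino acids (toy subset)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
PROTON = 1.00728

def b_ion_masses(peptide):
    """Predicted singly charged b-ion series (prefix fragment masses)."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:          # b-ions cover prefixes of up to n-1 residues
        total += RESIDUE_MASS[aa]
        masses.append(round(total, 4))
    return masses

def score(peptide, spectrum, tol=0.02):
    """Count experimental peaks within `tol` Da of a predicted b-ion."""
    predicted = b_ion_masses(peptide)
    return sum(any(abs(peak - m) <= tol for m in predicted) for peak in spectrum)

# Toy "database" of candidate peptides and a toy experimental spectrum
database = ["GASK", "GAVK", "PEKS"]
spectrum = [58.029, 129.066, 216.098]   # peaks matching the prefixes of "GASK"
best = max(database, key=lambda p: score(p, spectrum))
print(best)  # -> GASK
```

Real search engines such as Mascot use far more sophisticated probability-based scoring over multiple ion series, but the match-and-score structure is the same.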
This has proven a powerful technique, and, Heck noted, as genome sequencing becomes cheaper and easier such databases will likely proliferate. However, he said, such databases are less useful for analyzing phenomena such as post-translational modifications and hypervariable protein regions like those found in many antibodies.
“At the moment, of course, genome sequencing is faster, quicker, and cheaper than proteomics, so anything you can solve by genome sequencing you will not [approach] by proteomics,” Heck told ProteoMonitor. “But if you are talking about [post-translational] modifications, if you are talking about mutations, you cannot rely just on [genomics-based protein] databases, and so you have to use methods like this one.”
The new method, which the Utrecht researchers described last month in a paper in the Proceedings of the National Academy of Sciences, relies on peptide digestion with the protease Lys-N paired with a combination of electron transfer dissociation and collision-induced dissociation mass spec to generate peptide sequence ladders that allow for identifications without reference to a protein database.
The Lys-N digestion, Heck said, is used to simplify the fragmentation spectra to produce easily interpretable sequence ladders. “When you combine it with ETD, you get these very nice sequence ladders where you can really read off the amino acid sequence going from the C-terminus to the N-terminus,” he said.
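The "reading off" Heck describes amounts to taking the mass differences between consecutive ladder ions: each gap corresponds to the mass of one amino acid residue. The following is an illustrative sketch, not the authors' code; the ladder masses and tolerance are toy values, and unresolvable gaps are simply marked with an "X".

```python
# Monoisotopic residue masses (Da) for a toy subset of amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "V": 99.06841, "L": 113.08406, "K": 128.09496,
}

def read_ladder(ladder, tol=0.02):
    """Infer residues from the mass differences between adjacent ladder ions."""
    sequence = []
    for lo, hi in zip(ladder, ladder[1:]):
        diff = hi - lo
        match = next((aa for aa, m in RESIDUE_MASS.items()
                      if abs(m - diff) <= tol), None)
        sequence.append(match or "X")   # "X" marks an unreadable gap
    return "".join(sequence)

# Toy ladder whose successive gaps correspond to K, L, A
ladder = [146.0, 274.09496, 387.17902, 458.21613]
print(read_ladder(ladder))  # -> KLA
```

In practice not every ladder ion is observed, which is exactly the sequence-gap problem the next paragraphs address.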
This method, however, doesn’t provide full coverage of all unknown peptides, and, Heck noted, filling in sequence gaps presents a significant challenge given the thousands of possible amino acid combinations that could complete a given gap.
To tackle this issue, the researchers sequenced each peptide by both ETD and CID. They then created a library of the ETD sequences that included all possible solutions to any sequence gaps and fed it, along with a decoy library, into the identification software Mascot, which scored the various solutions. Finally, they used Mascot to score the CID spectra against the ETD solutions, taking the highest-ranking combined ETD and CID ion score for each precursor.
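The gap-filling step above can be sketched as a two-stage procedure: enumerate every residue combination whose summed mass matches the gap, then rank the resulting candidates against the second (CID) spectrum. This is a minimal stand-in for illustration only; the `cid_score` function is a toy substitute for Mascot scoring, and all masses and peaks are invented.

```python
from itertools import product

# Toy subset of monoisotopic residue masses (Da)
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "N": 114.04293}

def fill_gap(gap_mass, max_len=3, tol=0.02):
    """All residue combinations whose summed mass matches the gap mass."""
    hits = []
    for n in range(1, max_len + 1):
        for combo in product(RESIDUE_MASS, repeat=n):
            if abs(sum(RESIDUE_MASS[aa] for aa in combo) - gap_mass) <= tol:
                hits.append("".join(combo))
    return hits

def cid_score(candidate, cid_peaks, prefix_mass=0.0, tol=0.02):
    """Toy stand-in for Mascot: count CID peaks the candidate explains."""
    masses, total = [], prefix_mass
    for aa in candidate:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return sum(any(abs(p - m) <= tol for m in masses) for p in cid_peaks)

# A gap of ~128.059 Da fits both "GA" and "AG" (57.021 + 71.037)
candidates = fill_gap(128.05857)
cid_peaks = [57.021]                      # a CID peak supporting G before A
best = max(candidates, key=lambda c: cid_score(c, cid_peaks))
print(candidates, best)  # -> ['GA', 'AG'] GA
```

The combinatorics grow quickly with gap size, which is why scoring the enumerated solutions against an independent fragmentation spectrum, as the Utrecht team did with CID, is what makes the approach tractable.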
Applying the method to an ostrich meat sample, Heck’s team identified 2,744 peptides. Because no protein database currently exists for ostrich, the exact accuracy of the experiment could not be determined. However, a phylogenetic analysis of ostrich based on the identified proteins produced a tree with exactly the same topology as the established tree provided by the National Center for Biotechnology Information, suggesting the data’s reliability.
They also tested the method by comparing its performance to a conventional database-searching approach for the analysis of human HEK293 cells. The database-independent method resulted in 1,097 CID/ETD queries and 1,029 unique peptide sequences, while the conventional method generated 2,904 CID/ETD queries and 1,492 unique peptides. The two methods generated 745 CID/ETD queries in common. Of these 745 queries, 183 peptide sequences agreed fully.
“The overlap [between the methods] was not full, but it was reasonable,” Heck said. He added that using the database-free method the researchers detected peptides that were not present in the International Protein Index database used for the conventional approach as well as post-translational modifications like phosphorylation and acetylation.
The results, Heck said, “show the quality of the de novo [sequencing] approach and that [by] relying on the IPI database you may miss peptides that you would find with our [database-independent] approach.”
The technique is obviously useful in the case of organisms that are extinct or for which researchers have not yet compiled protein databases. More broadly, such approaches could prove handy as proteomics moves toward taking more into account phenomena like post-translational modifications, mutations, and highly variable protein regions.
“I think there are two areas where this could directly have great benefit: In the analysis of post-translational modifications and in the analysis of antibodies, which have highly variable regions that actually determine their affinity for antigens,” Heck said. “Also, [events] like single polymorphisms. Some of the peptides we identified that were not in the IPI [database] were actually caused by single mutations in the database or in the sequence compared to the database.”
He added that he expects advances in mass spec technology will lead to improvements in the technique, noting that as mass specs continue to become faster and more sensitive, the method could prove more and more useful. The PNAS study was done using a Thermo Scientific Orbitrap XL, but, Heck said, newer machines like the Orbitrap Velos or the Orbitrap Elite could both speed up the method and improve the accuracy of its IDs.
“Right now we rely very much on genome databases for proteomic experiments, and we are getting more and more of them,” Heck said. “But many are not that well annotated, and in the future it might be that you just want to rely on the proteomics data by itself.”
Have topics you'd like to see covered in ProteoMonitor? Contact the editor at abonislawski [at] genomeweb [.] com.