Mass Spec, Informatics Advances Give Boost to De Novo Peptide Sequencing


NEW YORK – Proteomics researchers have explored de novo peptide sequencing for decades, but technical challenges have limited the approach's usefulness.

In recent years, however, improvements in data analysis and mass spec performance have made such approaches more feasible, with implications for a number of applications including antibody discovery and characterization, immunopeptidomics, and forensics.

Most conventional mass spec-based proteomic workflows rely on a reference database of the mass spectra that are expected to be generated in an analysis of a given sample. Researchers identify peptides in a sample by matching the mass spectra produced experimentally to the expected spectra in the reference database.

This approach has proved widely useful, but it has certain limitations. For instance, reference databases often don't account for sample-specific features like protein variants due to single nucleotide polymorphisms or other genetic alterations. In the case of immunopeptidomic applications like the discovery of cancer-linked neoantigens, reference databases may likewise be unavailable. The same limitation can apply in areas like biosurveillance where users may be looking for proteins from novel or uncharacterized organisms.

In such cases, de novo peptide sequencing is a potentially useful tool as it allows researchers to identify the amino acids that compose a detected peptide simply by analyzing the experimental spectra and without comparison to a reference database.
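The core idea can be illustrated with a toy sketch: in an idealized spectrum, the mass difference between consecutive fragment peaks corresponds to one amino acid residue. The example below assumes a complete, noise-free series of singly charged b-ions, which real spectra rarely provide. The function names and the 0.01 Da tolerance are illustrative choices, not taken from any tool mentioned in this article.

```python
# Toy illustration of de novo sequencing from an idealized b-ion ladder.
# Assumptions: complete, noise-free, singly charged b-ion series.
# Function names and the default tolerance are hypothetical.

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of the charge-carrying proton


def b_ion_ladder(peptide):
    """Return theoretical singly charged b-ion m/z values for a peptide."""
    total = PROTON
    ladder = []
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def read_ladder(peaks, tol=0.01):
    """Infer residues from mass differences between consecutive peaks.

    Returns one entry per peak: the matching residue, several residues
    joined by "/" when they cannot be told apart by mass, or "?" when
    no single residue explains the gap.
    """
    sequence = []
    previous = PROTON
    for mz in sorted(peaks):
        delta = mz - previous
        matches = sorted(
            aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol
        )
        sequence.append("/".join(matches) if matches else "?")
        previous = mz
    return sequence


print(read_ladder(b_ion_ladder("PEPTIDE")))
# ['P', 'E', 'P', 'T', 'I/L', 'D', 'E']
```

Leucine and isoleucine have identical masses, so the sketch reports them as "I/L". A missing peak shows up as "?", which hints at why poor fragmentation is so damaging to de novo work.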

The approach is technically challenging, though. Use of a reference database restricts the possible peptides an experimental spectrum can be matched to. This allows researchers to make peptide identifications with high confidence and also makes for a less computationally intensive process, as the number of potential peptide matches that must be searched is limited.

With de novo sequencing, on the other hand, "the statistical space is much bigger," said Eric Merkley, a proteomics researcher at Pacific Northwest National Laboratory who uses de novo peptide sequencing for biosecurity and forensics work. He noted that this makes de novo sequencing data "way more likely to have errors in it than database searching [data]."

It also means that de novo approaches require very high-quality mass spectra to function well.

"If you have low signal to noise, if you have poor fragmentation, you are kind of out of luck for de novo," Merkley said. "Data quality is quite important."
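A back-of-the-envelope calculation gives a sense of how much larger the statistical space Merkley describes can be. The sketch below simply counts all sequences of a given length over the 20 standard amino acids and compares that with a reference database size. The database figure is a made-up number for illustration only, and the count ignores the precursor mass measurement, which in practice rules out most candidates.

```python
# Rough comparison of de novo and database search spaces.
# Assumption: HYPOTHETICAL_DB_PEPTIDES is an invented figure, not a
# measurement from any real reference database.

HYPOTHETICAL_DB_PEPTIDES = 1_000_000


def de_novo_candidates(length, alphabet_size=20):
    """Count every possible sequence of the given length."""
    return alphabet_size ** length


for length in (8, 10, 12):
    candidates = de_novo_candidates(length)
    ratio = candidates / HYPOTHETICAL_DB_PEPTIDES
    print(
        f"length {length}: {candidates:.2e} possible sequences, "
        f"{ratio:.1e} times the assumed database"
    )
```

Even for short peptides, the unconstrained count exceeds the assumed database by many orders of magnitude, which is consistent with de novo results being more error-prone and more dependent on spectrum quality.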

As mass spec performance has advanced, improvements in data quality have made de novo sequencing experiments more tractable. Merkley pointed to Thermo Fisher Scientific's launch a decade ago of its Q Exactive instrument as something of a turning point in this regard. Since then, newer Thermo Fisher instruments as well as releases like Bruker's timsTOF platform and Sciex's ZenoTOF 7600 have continued to push forward the speed and quality of mass spec analyses.

"I think high resolution on both the timsTOF and Orbitrap has been a game changer for immunopeptidomics [in terms of] the speed at which you can sequence peptides and the accuracy of the information you can obtain," said Pierre Thibault, a principal investigator in the proteomics and bioanalytical mass spectrometry research unit at the University of Montreal.

In one sign of vendor investment in the area, Bruker in March launched a new de novo peptide sequencing software package, Paser Novor, for use on its timsTOF platform. Bruker, which developed the software in collaboration with antibody sequencing firm Rapid Novor, is targeting it particularly at immunopeptidomic work on its timsTOF SCP instrument.

Chris Adams, director of bioinformatics at Bruker, said the new software combines Bruker's Paser (Parallel Database Search Engine in Real-time) real-time search capability with Novor's de novo sequencing tools to enable faster, more streamlined analyses.

Bruker launched the Paser software for conventional reference database workflows two years ago, but Adams said that uptake of the company's timsTOF SCP for immunopeptidomic work convinced it that it needed software for de novo applications, as well.

"It really spoke to us that we needed to have modern software tools to be able to keep up with this," he said.

Thibault said that the growing prevalence of ion mobility as an additional form of sample separation is also improving the sensitivity and specificity of analyses and could help researchers distinguish between analytes, like isomeric peptides, that are challenging to differentiate.

Advances on the informatics side have also made de novo analyses more feasible. For instance, AI-based approaches to predicting fragmentation patterns and intensities for particular peptides and amino acids have improved the confidence with which algorithms are able to identify peptide sequences. The ability to train these algorithms on increasingly large datasets has also been key to their refinement, said Anthony Purcell, a professor of biochemistry and molecular biology at Monash University who uses de novo peptide sequencing for research into cancer immunology.

Where de novo sequencing stands as a technology depends on the experiment, Merkley said. For instance, he said, the approach is well worked out for applications like sequencing of purified antibodies for biopharma work. Immunopeptidomics, on the other hand, is an area where, while de novo sequencing is useful, further advances and refinement are needed.

Purcell said that his lab frequently uses a hybrid approach in its immunopeptidomic work, generating reference databases specific to cancer cells or samples they are studying but using de novo sequencing to identify potential peptides of interest not present in the reference database.

In immunopeptidomics, "we're dealing with a lot of atypical peptides (non-tryptics, potential mutations, etc.), so we tend to use a combination of de novo and more traditional database searches," he said.

"When we look at a dataset, there are many good-looking spectra that we can't assign [to a reference spectrum] easily," Purcell said, noting that de novo sequencing allows researchers to investigate what those peptides might be, even if they don't appear to be a good match to whatever reference database they are using.

"It's maximizing what you can get out of your data," he said. "And it is hypothesis generating to some extent, as well. For instance, if there's a peptide [in a clinical cancer sample] that looks like it might come from Pseudomonas, well, why might that be there? So you get this novel information that looks interesting and doesn't match what you're maybe expecting to see, and then you can compare it against a database you might think it would come from and try to confirm it."

Merkley said that despite the field's advances, de novo peptide sequencing still suffers somewhat from a lack of software tools and noted that Bruker's new de novo software release is interesting in this regard. The Peaks software from Bioinformatics Solutions is the most commonly used tool for de novo work, and options remain limited, particularly when it comes to free software packages, Merkley said.

"There are academic labs that have made software tools, but many aren't well supported because [often] the grad student or postdoc who developed [them] moves on and goes somewhere else," he said.

Echoing Purcell, Merkley noted the development in recent years of many machine learning models for predicting peptides from mass spectra and suggested that these would push de novo sequencing efforts forward.

"Making those [models] accessible to people will be really good," he said. "I don't know that right now any of those are present in user-friendly, GUI-type software. They are still more for your data scientists right now. I think getting them into the hands of people who just want to do biology with mass spectrometry and don't want to become data scientists, don't want to become machine learning experts, will help."

Merkley also suggested that combining machine learning with established models of peptide fragmentation might improve these tools.

"There was a lot of research before proteomics started to get big in terms of understanding the basic science of peptide gas phase fragmentation from the point of view of physical chemistry," he said. "You can see that machine learning models are recapitulating stuff we knew from those physical chemistry experiments from years ago. So how could you put that machine learning data into a more formal chemical and model structure? I think that would be really cool."
