NEW YORK (GenomeWeb) – Shotgun proteomics experiments are now able to profile large chunks of the proteome, with some analyses capable of identifying upwards of 13,000 or 14,000 proteins in a single go.
At the same time, the majority of mass spectra generated in a typical proteomics experiment are never matched to a corresponding peptide. In fact, according to a Nature Methods study published last week by researchers at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), an average of 75 percent of the spectra generated in a shotgun mass spec experiment are never identified.
This percentage has decreased only slightly over the years, despite improving instrumentation and informatics tools, Juan Antonio Vizcaíno, leader of the EMBL-EBI's proteomics team and senior author of the Nature Methods study, told GenomeWeb.
"Based on the data we have in the PRIDE [PRoteomics IDEntifications] database the proportion [of unidentified spectra] has dropped a little bit over time, but not by much," he said.
In their recent paper, Vizcaíno and his colleagues put forth an approach to this problem based on clustering unidentified mass spectra that appear consistently across different datasets. They can then work to identify these spectra via various matching approaches, the idea being that an identification made for one spectrum can be applied to all the spectra in its cluster.
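As a rough illustration of the general idea, and not the EMBL-EBI team's actual algorithm, the Python sketch below groups spectra greedily by precursor m/z and cosine similarity of binned peaks, then suggests a cluster's dominant peptide for its unidentified members. The function names, the dictionary layout for a spectrum, and all thresholds are assumptions made for this example.

```python
import math
from collections import Counter

def binned(peaks, bin_width=1.0):
    """Turn (m/z, intensity) peaks into a sparse unit-length vector keyed by m/z bin."""
    vec = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else vec

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def cluster_spectra(spectra, precursor_tol=1.0, min_similarity=0.8):
    """Greedy clustering: a spectrum joins the first cluster whose
    representative (its first member) is close enough, else starts a new one."""
    clusters = []
    for spec in spectra:
        vec = binned(spec["peaks"])
        for cluster in clusters:
            close_mass = abs(cluster["precursor_mz"] - spec["precursor_mz"]) <= precursor_tol
            if close_mass and cosine(cluster["vector"], vec) >= min_similarity:
                cluster["members"].append(spec)
                break
        else:
            clusters.append({
                "precursor_mz": spec["precursor_mz"],
                "vector": vec,
                "members": [spec],
            })
    return clusters

def propagate_ids(clusters, min_agreement=0.7):
    """Suggest the dominant peptide of each cluster for its unidentified members.
    Suggestions are candidates to confirm with a search, not final IDs."""
    suggestions = {}
    for cluster in clusters:
        labels = Counter(m["peptide"] for m in cluster["members"] if m.get("peptide"))
        if not labels:
            continue  # nothing identified in this cluster yet
        peptide, count = labels.most_common(1)[0]
        if count / sum(labels.values()) >= min_agreement:
            for member in cluster["members"]:
                if not member.get("peptide"):
                    suggestions[member["id"]] = peptide
    return suggestions
```

A production tool would need a far more scalable strategy than this pairwise comparison to handle hundreds of millions of spectra, and clusters containing no identified member would still have to be searched separately.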
The effort builds on previous work by EMBL-EBI researchers in which they also developed an approach to clustering unidentified spectra. That work was done in 2011, and since then, Vizcaíno noted, new proteomics methodologies and instrumentation have allowed researchers to generate significantly more data, meaning that a new algorithm capable of clustering much larger proteomic datasets was needed.
With the original clustering method, the researchers could process around 20 million spectra, he said. In the recent study, they clustered a total of 256 million spectra — 190 million unidentified and 66 million identified. Using the new approach, they were able to identify roughly 20 percent of the 190 million unidentified spectra in the PRIDE database.
Vizcaíno cautioned that identifications made using the clustering approach should not be considered the last word, noting that researchers should confirm his group's results with searches of their own. The method does, though, offer a way to narrow subsequent searches by providing a likely identification for spectra in question.
This is significant in that one reason for the high rate of unidentified spectra in mass spec experiments is the wide variety of variants or post-translational modifications that can prevent researchers from making a match. If researchers search without accounting for the mass changes caused by such modifications, they likely won't make an ID.
As Vizcaíno and his co-authors note, there are a variety of search methods that can address this challenge, including precursor mass-tolerant searches and open modification searches. However, such approaches vastly expand the search space for a given experiment, which can significantly increase the computational time and resources required. They can also leave researchers with ambiguous identifications where several peptides appear equally good matches to a spectrum.
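To make the trade-off concrete, here is a minimal Python sketch of the reasoning behind a precursor mass-tolerant search: candidates within a wide mass window are kept, and the mass difference is then checked against a short list of well-known modification masses. The function name, the candidate labels, the window, and the tolerance are illustrative assumptions; real search engines also score fragment ions, which this sketch ignores.

```python
# Monoisotopic mass shifts (Da) for a few common modifications.
KNOWN_MODS = {
    "oxidation": 15.994915,
    "acetylation": 42.010565,
    "carbamidomethylation": 57.021464,
    "phosphorylation": 79.966331,
}

def explain_mass_shift(observed_mass, candidates, window=100.0, tol=0.01):
    """Return (peptide, explanation) pairs for candidates whose mass difference
    from the observed precursor is zero or matches a known modification.

    candidates: dict mapping a peptide label to its unmodified mass in Da.
    window: the wide precursor tolerance; a larger window admits more candidates.
    """
    hits = []
    for peptide, mass in candidates.items():
        delta = observed_mass - mass
        if abs(delta) > window:
            continue  # outside even the wide search window
        if abs(delta) <= tol:
            hits.append((peptide, "unmodified"))
            continue
        for name, shift in KNOWN_MODS.items():
            if abs(delta - shift) <= tol:
                hits.append((peptide, name))
    return hits

# Hypothetical candidates: two different peptides can explain the same precursor.
candidates = {
    "PEPTIDE_A": 1000.500000,
    "PEPTIDE_B": 1064.471416,
    "PEPTIDE_C": 1500.000000,
}
hits = explain_mass_shift(1080.466331, candidates)
```

In this made-up example both PEPTIDE_A (with a phosphorylation) and PEPTIDE_B (with an oxidation) fit the observed mass, which is the kind of ambiguity that grows as the search space widens.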
Another emerging approach that could help improve identification rates is the use of sample-specific search databases, in which methods like RNA-seq are used to generate a database tailored to the sample being analyzed by mass spec. In theory, this would help account for variants particular to that sample that might not be present in more general databases. However, Vizcaíno said, in practice, such approaches appear thus far to offer only slight improvements in IDs.
The clustering technique allows researchers to take spectra that are unidentified but found consistently across many datasets and target them using these less commonly used and computationally intensive search approaches. In the Nature Methods paper, the researchers used methods including open modification searching and spectral library-based searching to make their identifications, and, Vizcaíno said, in the future, researchers could use a variety of other techniques for making IDs.
"One could analyze the data in many different ways to try to [improve] the number of identifications," he said. "What we are showing here is just the tip of the iceberg."
The unidentified spectra issue remains a challenging one, though, Vizcaíno said, noting various reasons why spectra could be difficult to identify.
For one, a number of spectra might not even come from peptides, he said. Others could be chimeric, deriving from different peptides that eluted at the same time. Additionally, he noted, many unidentified spectra may come from peptide variants, which are usually present only in low proportions, so the resulting spectra are not good enough for making peptide identifications. "We are at the limit of what the mass spectrometer can see in a consistent way," he said.
The researchers also found a significant number of spectra in the PRIDE database that are due to contaminants.
"When people do searches, it is a good idea to include a contaminant database with the main contaminants," Vizcaíno said. Otherwise, for instance, "if the sequence for trypsin is not in the database, then that spectra will be identified as something completely random and different."
The number of known contaminants identified in the study "will be used as the basis for a future service in the PRIDE Archive, which will automatically warn submitters if their datasets contain a high proportion of such potentially incorrect identifications," the study authors wrote.
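The planned service has not been described in detail, but the underlying check can be sketched in a few lines of Python: compute the share of a dataset's identifications that map to known contaminants and warn above a cutoff. The function names, the contaminant identifiers, and the 20 percent threshold are all hypothetical and are not taken from the study or from PRIDE.

```python
def contaminant_fraction(identifications, contaminant_ids):
    """Fraction of identifications whose protein is a known contaminant."""
    if not identifications:
        return 0.0
    hits = sum(1 for protein in identifications if protein in contaminant_ids)
    return hits / len(identifications)

def contaminant_warning(identifications, contaminant_ids, threshold=0.2):
    """Return a warning string when the contaminant share exceeds the
    (assumed) threshold, otherwise None."""
    fraction = contaminant_fraction(identifications, contaminant_ids)
    if fraction > threshold:
        return f"Warning: {fraction:.0%} of identifications map to known contaminants"
    return None

# Hypothetical identifiers; a real check would use a curated contaminant list.
contaminants = {"CONT_TRYPSIN", "CONT_KERATIN"}
dataset = ["CONT_TRYPSIN", "PROT_1", "CONT_KERATIN", "PROT_2"]
message = contaminant_warning(dataset, contaminants)
```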
The researchers' initial analysis focused on the PRIDE database, but now they are working with several groups to apply the method to their datasets, Vizcaíno said.
"We are [working] with a couple of groups to cluster their own datasets or groups of datasets that were all generated in the same experimental setting," he said, adding that this work was focusing on difficult samples where the percentage of identified spectra was even lower than the 25 percent average they found across the PRIDE datasets.
In the future, Vizcaíno said, he and his colleagues would like to develop the tool "so that when people submit their data to [PRIDE] we could compare their spectra to spectra submitted by other people, and that way, we could suggest to these researchers that maybe some of their unidentified spectra could correspond to this or this or this peptide."