Researchers at the National Institute of Standards and Technology are building a reference library of mass spectra that they hope will help improve peptide identification in proteomics experiments.
The team introduced an updated version of the library at this week’s American Society for Mass Spectrometry conference in Denver, along with a new “hybrid” search tool that combines NIST’s spectral search algorithm, NIST MS Search, with the National Center for Biotechnology Information’s Open Mass Spectrometry Search Algorithm, known as OMSSA.
Paul Rudnick, a researcher in NIST’s Mass Spectrometry Data Center, told ProteoMonitor’s sister publication BioInform that the expanded library and the hybrid search tool should be broadly available to researchers within a few weeks after ASMS.
Steve Stein, who directs the NIST Mass Spectrometry Data Center, began developing the spectral library a little over five years ago as an offshoot of NIST’s Electron Ionization Library, a collection of mass spectra for small molecules that has been in use for around 25 years.
Spectral libraries are commonly used to identify molecules in chemical applications, but Rudnick noted that since proteomics is still a relatively new field, similar collections haven’t been widely available for peptide spectra.
Rather than matching measured MS/MS spectra directly against a spectral library of known peptides, the most commonly used peptide-identification algorithms, such as Mascot, Sequest, X!Tandem, and Phenyx, match measured spectra against pre-calculated “theoretical” spectra derived from peptide sequences.
This approach is effective, but it has a number of limitations, said Rudnick. For example, it’s well known that there is very little overlap in the protein lists that these sequence-search algorithms identify — a characteristic that has led to a recent trend in the field to merge results from multiple search algorithms into a single consensus protein list.
In addition, sequence-based searching is computationally intensive and time-consuming, especially for large-scale proteomics experiments, because much of the effort goes into identifying the same peptides over and over again.
Finally, the “theoretical” spectra that underlie these algorithms don’t account for many informative features of the experimental spectra, such as relative abundances and ratios of products with different charge states, which could be used to identify peptides more accurately.
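To make the idea of a “theoretical” spectrum concrete, the following is a minimal Python sketch of how singly charged b- and y-ion masses can be computed from a peptide sequence. It is a generic illustration, not the method used by Mascot, Sequest, X!Tandem, or Phenyx; the function name `theoretical_fragments` is hypothetical, and the sketch assumes an unmodified peptide and standard monoisotopic residue masses.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def theoretical_fragments(peptide):
    """Return (b_ions, y_ions): singly charged fragment m/z values.

    Hypothetical helper for illustration; assumes no modifications.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    # Break the backbone after each residue except the last one.
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)            # N-terminal piece
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)   # C-terminal piece
    return b_ions, y_ions


b_ions, y_ions = theoretical_fragments("GAK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

Note that the output is a list of masses only. Nothing in it says how intense each peak should be, which is the kind of information Rudnick said the theoretical spectra leave out.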
Spectral library searching promises a more precise and efficient option than sequence searching because it directly matches an acquired MS/MS spectrum to a library of previously observed and identified peptide spectra. This method is at least twice as sensitive and up to several-hundred-fold faster than the sequence-based approach, Rudnick said. However, the challenge is building the library and ensuring that it is optimized for searching, he added.
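As a rough illustration of what matching an acquired spectrum against a library can look like, the sketch below bins peaks by mass and scores each library entry with a cosine (normalized dot-product) similarity. This is a generic technique chosen for illustration and should not be read as the scoring used by NIST MS Search. The function names, the 0.7 score cutoff, the peptide labels, and the peak values are all made up for the example, and a real search would also filter candidates by precursor mass.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_library(query_peaks, library, min_score=0.7):
    """Return (best_name, best_score); best_name is None below the cutoff."""
    query = bin_spectrum(query_peaks)
    best_name, best_score = None, 0.0
    for name, peaks in library.items():
        score = cosine_similarity(query, bin_spectrum(peaks))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Illustrative data only: labels and peaks are invented for this example.
library = {
    "PEPTIDE_A": [(100.2, 50.0), (200.7, 100.0), (350.1, 25.0)],
    "PEPTIDE_B": [(120.4, 80.0), (260.3, 40.0), (410.9, 100.0)],
}
query = [(100.3, 48.0), (200.6, 105.0), (350.2, 20.0)]
print(search_library(query, library))
```

Because the library entries carry measured intensities, the score rewards agreement in peak heights as well as peak positions. A query with no counterpart in the library simply comes back unidentified.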
NIST’s library is compiled from in-house MS experiments and collaborations with several large proteomics data repositories, such as PeptideAtlas.org, Tranche, the European Bioinformatics Institute’s PRIDE (Proteomics Identifications) database, and the Global Proteome Machine Organization.
The NIST library includes peptide spectra for human, mouse, rat, fly, and yeast. For human, there are nearly 224,000 spectra from around 138,000 peptides, which represents around 14.5 percent of the human proteome, Rudnick said. Coverage of the yeast proteome is the highest, at around 22 percent, while less than 1 percent of the rat proteome is represented in the current library.
Rudnick noted that this low coverage is the biggest drawback of the spectral-searching approach because the algorithm can only identify spectra that are in the library.
Rudnick said that his group is looking to increase the coverage level of the library, but noted that there are a “lot of considerations” to determine which data sets to target first.
“You need to catch the right diversity of tissue samples, of blood samples, and you want to make sure you’re catching all the proteins in a developmental or disease pathway,” he said. “We still don’t know when and where all these proteins are expressed.”
One goal of the NIST group, therefore, is “to get involved with people who really know how to do protein and peptide fractionation and who have access to the right samples” in order to build the library in a more systematic fashion, said Rudnick. In the meantime, “we’re taking a really wide-reaching approach.”
The aim, he noted, is to create something that’s “usable for most applications,” including 1D or 2D separations and a range of mass-spec platforms. So far, he said, spectral searches perform better than current practice, but “now we’d like to see the library include all biologically relevant peptides, including modified forms.”
For now, the NIST team is trying to address that challenge by using a hybrid approach that blends its NIST MS Search spectral search tool with the OMSSA sequence search algorithm. The NCBI algorithm “piggybacks our library searching” to identify peptides that are not yet in the library, Rudnick said.
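One plausible way to read the hybrid idea is as a fallback: try the spectral library first and hand the spectrum to a sequence search only when the library has no confident match. The Python sketch below illustrates that control flow under this assumption. The article does not describe how NIST MS Search and OMSSA are actually wired together, so the function names, the score cutoff, and the toy search functions here are hypothetical stand-ins rather than either tool's interface.

```python
def hybrid_identify(spectrum, library_search, sequence_search,
                    min_library_score=0.7):
    """Try the spectral library first; fall back to sequence search on a miss.

    Assumed interfaces (hypothetical):
      library_search(spectrum)  -> (peptide or None, score)
      sequence_search(spectrum) -> peptide or None
    Returns (peptide, source), where source is "library" or "sequence".
    """
    peptide, score = library_search(spectrum)
    if peptide is not None and score >= min_library_score:
        return peptide, "library"
    # Library miss: the peptide may simply not be in the library yet.
    return sequence_search(spectrum), "sequence"


# Toy stand-ins for the two search engines, for illustration only.
def toy_library_search(spectrum):
    known = {"spectrum-1": ("PEPTIDE_A", 0.95)}
    return known.get(spectrum, (None, 0.0))


def toy_sequence_search(spectrum):
    return "PEPTIDE_FROM_SEQUENCE_DB"


print(hybrid_identify("spectrum-1", toy_library_search, toy_sequence_search))
print(hybrid_identify("spectrum-2", toy_library_search, toy_sequence_search))
```

In this arrangement the slower sequence search runs only on the spectra the library cannot explain, which keeps most of the speed advantage while covering peptides the library lacks.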
An Alternative, Not a Replacement
Despite its promise, spectral searching is unlikely ever to replace sequence-based approaches completely, according to Ron Beavis, a professor at the University of British Columbia’s Biomedical Research Centre and a founder of the Global Proteome Machine project.
Beavis, who developed the X!Tandem algorithm for sequence searching as well as its spectral-searching counterpart, X!Hunter, said he doesn’t see spectral searching as “totally replacing the existing search engines — just changing the way people think about analyzing the data.”
The main benefit of the spectra-based approach is speed, Beavis said. “What takes the most time in sequence-based searching is calculating the theoretical mass spectrum, but in the case of the library, you already have that,” he said.
Compared to sequence-searching methods such as X!Tandem, the spectral library search method is typically 200 to 500 times faster, and is sometimes up to 1,000 times faster, Beavis said. In addition, “you can usually find data that’s further down in the noise than you could with sequence searching,” he noted. “It tends to dig down deeper and faster.”
However, the downside to the approach is that “you can’t find anything [the library] doesn’t already know about,” which makes it less suitable for exploratory experiments, or for organisms that are not well characterized.
In practice, Beavis said he typically runs proteomics data through X!Hunter first “to find out what it looks like.” Then he uses the results from several X!Hunter runs with different parameter settings to “tune” X!Tandem for a more complete search. Because X!Hunter is so fast and can run on a laptop, that initial processing step is nearly negligible, he said.
Ultimately, Beavis said that the choice of algorithm will probably depend on the type of experiment a researcher is conducting. In a case where an investigator is looking for a particular set of known peptides, as in some clinical applications, “there’s not much point in using a conventional search engine,” he said.
However, “if you want to do a proteome-wide study and you’re looking for post-translational modifications or splice variants,” it would probably be best to use a sequence-based approach.
Beavis acknowledged that even as groups like GPM and NIST rapidly expand their peptide-spectra libraries in order to improve the performance of their algorithms, adoption of spectral searching in the proteomics community has been sluggish.
“I know from personal experience in this world that it takes people about five years to adopt anything, and we’re still pretty early in that cycle” due to the inherently conservative nature of the proteomics community, he said.
“Because of the large investment that people make in equipment in this area, they’re used to taking quite a bit of time to evaluate things,” he added. “But when you’re spending millions of dollars on equipment, it’s probably good to be conservative.”
A version of this article originally appeared in the May 30 edition of ProteoMonitor’s sister publication BioInform.