PNNL Team Releases Software Package for Top-Down Proteomics


NEW YORK (GenomeWeb) – Researchers from Pacific Northwest National Laboratory have developed a new open-source software package for top-down proteomics that offers a middle way between commonly used existing packages.

Detailed in a paper published this week in Nature Methods, the software adopts a moderately restrictive search technique to identify intact proteoforms, an approach that Samuel Payne, a senior scientist at PNNL and an author on the study, likened to the search methods traditionally used in conventional bottom-up proteomics workflows.

He suggested that the package, called Informed-Proteomics, could help fill what he currently sees as a gap in top-down informatics tools.

As opposed to bottom-up proteomics, where proteins are digested into peptides prior to mass spec analysis, top-down proteomics aims to analyze intact molecules. Studying intact proteins has potential advantages, as information such as the type and location of a molecule's post-translational modifications is retained during this kind of analysis. However, as Payne noted, it is significantly more complicated from a technical standpoint, and researchers are still working to develop effective and streamlined methods addressing all steps of the process from sample preparation and separation to mass spec analysis to the backend informatics.

While continued improvement on the front end will aid informatics efforts by upping data quality, top-down data analysis is an inherently challenging business, due in large part, Payne said, to the massive search space researchers have to query. Humans have roughly 20,000 protein-coding genes, and, even without accounting for splice variants and other genetic modifications, each of these 20,000 proteins can contain multiple post-translational modifications in numerous combinations and at various locations. That, the authors noted, makes for a search space in humans consisting of more than a billion possible proteoforms.
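
As a rough illustration of how quickly such a search space grows, consider some back-of-the-envelope arithmetic. This is not the authors' calculation, and the symbols n (candidate sites on one protein) and k (maximum number of modifications present at once) are assumptions introduced here for illustration.

```latex
% n = candidate sites, k = maximum simultaneous modifications (both hypothetical)
N(n,k) = \sum_{i=0}^{k} \binom{n}{i}, \qquad N(10,3) = 1 + 10 + 45 + 120 = 176

% If each of n sites can independently be modified or unmodified:
N = 2^{n}, \qquad 2^{30} = 1{,}073{,}741{,}824 > 10^{9}
```

Under the second, deliberately permissive assumption, a single protein with 30 independent candidate sites would already account for more than a billion combinations.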

Given this vast search space, top-down researchers must determine how to balance the challenges of a more restrictive search, which will produce more confident identifications but likely miss more unexpected or unknown proteoforms, against a more open search, which could identify unknown proteoforms but is likely to be computationally expensive and result in a higher rate of false positives. (Bottom-up researchers likewise struggle with this tradeoff.)


The Informed-Proteomics package allows for a different balance of these competing concerns than several currently popular top-down software programs, Payne said.

On the more restrictive side, he said, is the ProSightPC software developed by Northwestern University's Neil Kelleher and sold by Thermo Fisher Scientific. That program, Payne and his coauthors wrote, "restricts the search space to a limited set of proteoforms in a 'proteome warehouse', a curated collection derived from known PTMs, splice variants, and single-nucleotide variants."

At the other end of the spectrum is MS-Align+, a package developed by researchers at the University of California, San Diego and PNNL scientists including Richard Smith, a coauthor on the Informed-Proteomics study. That software supports blind searching, which, as Payne noted, lets researchers identify unexpected proteoforms but also raises the chance of false-positive identifications.

"Both of those tools have a good place in [top-down] research," Payne said, but with their own release, he and his colleagues aimed to strike a balance between the two, requiring that researchers specify which PTMs they are interested in searching for, but not where those PTMs are located.

"You have some modifications that you know may exist, but you don't know where they exist," he said. "So, a researcher knows they care about phosphorylation, for instance, and they want [the software] to look for it anywhere there is [a potential phosphosite], not just in the 100 places where you told it [the modification] might be."

"It fits a middle ground [between open and closed searches]," he added. "And in that sense, it's more like what people are used to from bottom-up proteomics."

The software uses a graph-based approach to searching that takes advantage of the fact that many proteoforms differ not in what modification is present but only in the placement of that modification, Payne said.

"Whether you place a lysine methylation on lysine number one or on lysine number five, those two proteoforms are similar," he said. "They might have divergent paths [in terms of the placement], but lots of the components of them are the same. So, if you have a protein which has 10 different placements for [a modification], the graph [method] allows you to explore all of those at one time, as opposed to a separate computation of events, and that gives us some real efficiency savings."
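
The efficiency argument Payne describes can be illustrated with a small sketch. This is not the Informed-Proteomics implementation; the function name, the (position, modifications used) node layout, and the example sequence are all assumptions made here for illustration. The idea is that a graph node is shared by every placement that reaches it, so the graph has far fewer nodes than the residue-by-residue steps needed to walk each proteoform separately.

```python
def count_graph_nodes_and_paths(sequence, target, max_mods):
    """Walk a hypothetical sequence graph with nodes (position, mods_used).

    Returns (node_count, proteoform_count), where proteoform_count is the
    number of distinct placements of up to max_mods modifications on
    residues equal to `target`.
    """
    layer = {0: 1}   # mods_used -> number of placements reaching this node
    node_count = 1   # the start node
    for residue in sequence:
        next_layer = {}
        for used, paths in layer.items():
            # Edge for the unmodified residue.
            next_layer[used] = next_layer.get(used, 0) + paths
            # Edge for the modified residue, only at candidate sites.
            if residue == target and used < max_mods:
                next_layer[used + 1] = next_layer.get(used + 1, 0) + paths
        layer = next_layer
        node_count += len(layer)
    return node_count, sum(layer.values())


if __name__ == "__main__":
    seq = "KAKAKAKAKAKAKAKAKAKA"  # made-up sequence with 10 lysines
    nodes, proteoforms = count_graph_nodes_and_paths(seq, "K", max_mods=3)
    separate_steps = proteoforms * len(seq)  # one full walk per proteoform
    print(nodes, proteoforms, separate_steps)
```

In this toy setup the graph visits each shared node once, whereas scoring every placement on its own repeats the common prefixes and suffixes for each proteoform.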

Payne noted that while in theory users could specify as many modifications to search as they want, "there's a practical limit in the amount of time you want to spend."

"I think what you specify to search for will be driven by your biological questions," he added, citing the example of an analysis of patient-derived xenograft (PDX) breast tumors he and his colleagues included in the Nature Methods paper. "There we were really interested in the common mutations that are associated with cancer dysfunction, so phosphorylation or methylation or acetylation."

In addition to the search approach, the Informed-Proteomics package includes an LC-MS feature-finding algorithm that the authors said improves feature detection by aggregating signals across different charge states and across LC elution times. It also features a new set of visualization tools to aid in manual validation of results.
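
To make the idea of aggregating signal across charge states and elution times concrete, here is a minimal sketch. It is not the algorithm from the paper: the `Peak` fields, the 10 ppm tolerance, and the greedy grouping are assumptions chosen for illustration, and real feature finding would also handle isotopic envelopes and elution profiles. The sketch converts each peak to a neutral mass and merges peaks whose masses agree within the tolerance.

```python
from dataclasses import dataclass

PROTON_MASS = 1.007276  # Da, mass of a proton


@dataclass
class Peak:
    mz: float         # observed mass-to-charge ratio
    charge: int       # charge state z
    intensity: float  # signal intensity
    rt: float         # LC elution time in minutes


def neutral_mass(peak):
    """Neutral mass implied by an m/z value and charge state."""
    return peak.charge * (peak.mz - PROTON_MASS)


def aggregate_features(peaks, ppm_tol=10.0):
    """Greedily merge peaks with matching neutral mass into features."""
    features = []
    for peak in sorted(peaks, key=neutral_mass):
        mass = neutral_mass(peak)
        last = features[-1] if features else None
        if last and abs(mass - last["mass"]) / last["mass"] * 1e6 <= ppm_tol:
            total = last["intensity"] + peak.intensity
            # Intensity-weighted running mean of the feature mass.
            last["mass"] = (last["mass"] * last["intensity"]
                            + mass * peak.intensity) / total
            last["intensity"] = total
            last["charges"].add(peak.charge)
            last["rt_min"] = min(last["rt_min"], peak.rt)
            last["rt_max"] = max(last["rt_max"], peak.rt)
        else:
            features.append({"mass": mass, "intensity": peak.intensity,
                             "charges": {peak.charge},
                             "rt_min": peak.rt, "rt_max": peak.rt})
    return features
```

Summing intensity over charge states and elution times in this way means a proteoform whose signal is spread thinly across many peaks can still be recognized as a single feature.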

In their analysis of the PDX breast tumors, the PNNL researchers examined five technical replicates of two breast cancer subtypes, basal-like and luminal B, identifying a total of 3,207 proteoforms in the two subtypes, 1,636 of which they found to be differentially expressed.

This, they noted, was tenfold more differentially expressed proteoforms than a recent top-down analysis of the same tumor subtypes found.