NEW YORK (GenomeWeb) – Aiming to bridge what they've identified as a gap in informatics workflows for label-free multiple-reaction monitoring mass spec, Yale University researchers have devised a new pipeline for preprocessing of MRM-MS datasets.
Detailed in a paper published last week in Biology, the pipeline automates preprocessing steps including data quality assessment, outlier detection, identification of inaccurate transitions, and data normalization.
While open-source tools for such steps exist for MRM-MS workflows using stable isotope-labeled (SIL) peptide standards, tools for label-free experiments were lacking, Chris Colangelo, director of Yale's protein profiling resource and an author on the study, told ProteoMonitor.
He suggested that such a tool could prove particularly useful given the uptick in large label-free MRM experiments and in the use of data-independent acquisition (DIA) mass spec methods like Swath and parallel-reaction monitoring.
While software like the University of Washington's Skyline program allows users to develop targeted proteomics assays and packages like MSstats provide tools for analysis of quantitative protein data, Colangelo said he and his colleagues felt there was a need for software to aid with data preprocessing.
"You get your Skyline data and you want to [analyze] it with something like MSstats, but you [first] have to [determine] is the [data] good or not," he said. "You can do that by hand, but when you start talking about these larger experiments where you have thousands of transitions, it's tough to do by hand."
It can also be cumbersome to generate SIL peptides for every target peptide in large MRM or DIA experiments, Colangelo noted, and while another option might be using a smaller subset of SIL peptides as standards for the entire experiment, such an approach often does a poor job of taking into account the full complexity of a sample, he said.
"The problem with using a small subset is that it doesn't always accurately describe the entire matrix," Colangelo said. "We showed that in the [Biology] manuscript. We had a small subset of only four [SIL] peptides, and they performed as a very poor metric for the variability of our assay."
"So this is a way for people who want to do quality control without having to make standards," he said.
Many of the techniques used in the Yale team's data preprocessing package have their origins in microarray analysis, which, Colangelo noted, presents a similar problem to label-free quantitative proteomics.
"With microarrays you had thousands of targets you were probing, and you had to find a way to normalize them between experiments, because you didn't have a standard," he said.
The first preprocessing steps assess the quality of the transitions and samples used. The Yale researchers tackled this by checking the consistency of retention times and peak areas across all the samples in an experiment, flagging transitions with large variations in retention time and using linear regression to identify peaks that could be quantitative outliers.
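For readers who want a concrete picture of these two checks, the following is a minimal Python sketch and not the Yale package's implementation. The function names, the transition names, the 0.5-minute tolerance, and the residual cutoff `k` are all hypothetical choices made for illustration; the sketch also assumes peak areas have already been log-transformed.

```python
from math import sqrt
from statistics import median


def flag_unstable_transitions(retention_times, max_shift=0.5):
    """Return names of transitions whose retention time drifts too far.

    retention_times maps a transition name to its retention times
    (minutes), one value per sample. max_shift is an assumed tolerance.
    """
    flagged = []
    for name, times in retention_times.items():
        center = median(times)
        # Largest deviation of any sample from the transition's median.
        if max(abs(t - center) for t in times) > max_shift:
            flagged.append(name)
    return flagged


def regression_outliers(reference, sample, k=2.0):
    """Return indices of peaks that sit far from a least-squares line.

    reference and sample hold log peak areas for the same transitions,
    for example a reference run and the run being checked. A peak is
    flagged when its residual exceeds k residual standard deviations.
    """
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(sample) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, sample))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(reference, sample)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = sqrt(sum(r * r for r in residuals) / (n - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * resid_sd]


# Hypothetical example data: two transitions measured in four samples.
rt = {
    "PEPTIDEK.y5": [20.1, 20.2, 20.0, 20.1],
    "PEPTIDEK.y6": [20.1, 22.4, 20.0, 20.2],
}
print(flag_unstable_transitions(rt))
print(regression_outliers([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 11, 5, 6, 7]))
```

A single large outlier inflates the residual standard deviation, so a production tool would likely use a more robust cutoff than the one assumed here.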
Such approaches are "standard for any assay," Colangelo said. More difficult, he noted, was determining the best approach for normalizing the MRM data.
A key challenge to this process is the fact that, in a typical MRM dataset, unlike a global proteomics dataset, most proteins are expected to show some change. For instance, if a researcher is using MRM to analyze a panel of biomarkers, then, Colangelo said, "if everything is a marker and everything is changing, then you have nothing really to normalize to."
"With MRM you aren't looking at the full picture, and when you are normalizing without looking at the full picture, you have to be really careful that you're not doing something to introduce bias or that loses the real differences that you are seeing," he said.
This, Colangelo noted, means that different normalization methods will work better or worse with different datasets.
With this in mind, the researchers incorporated a number of normalization approaches into their package. In the Biology paper they compared five methods (global median, quantile, cyclic loess, IS.median, and invariant set) across three datasets: an MRM analysis of 112 target proteins in the rat brain post-synaptic density; an MRM analysis of 24 mouse brain proteins from a study of mice lacking the presynaptic vesicle protein cysteine string protein α; and an MRM study of 29 Streptococcus pyogenes proteins.
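To illustrate how two of these methods differ, here is a short Python sketch of global median and quantile normalization in their textbook forms. It is an assumption-laden illustration rather than the code from the Yale package: the function names are hypothetical, each inner list is assumed to hold log-scale peak areas for one sample with transitions in the same order, and ties in quantile normalization are broken by position rather than averaged.

```python
from statistics import mean, median


def global_median_normalize(samples):
    """Shift each sample so that all samples share the same median.

    samples is a list of samples; each sample is a list of log-scale
    peak areas. The shared target is the mean of the sample medians.
    """
    medians = [median(s) for s in samples]
    target = mean(medians)
    return [[v - m + target for v in s] for s, m in zip(samples, medians)]


def quantile_normalize(samples):
    """Force every sample to share the same distribution of values.

    The value at each rank is replaced by the mean, across samples,
    of the values holding that rank.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [mean([s[i] for s in sorted_samples]) for i in range(n)]
    normalized = []
    for s in samples:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical example: two samples, three transitions each.
data = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(global_median_normalize(data))
print(quantile_normalize(data))
```

Global median normalization only shifts each sample, while quantile normalization reshapes the whole distribution, which is one reason the two can behave differently on a small targeted panel where many proteins change.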
In most sets, the quantile and invariant set methods offered the best results, Colangelo said, adding that this "wasn't unexpected, since these are the most common ones in microarray analysis."
He noted that all of the methods, though, provided an improvement over the original non-normalized data. "So [the study] shows that by performing normalizing, you clearly improve your data."
Colangelo said that commercially available software packages like Waters' Progenesis and Proteome Software's Scaffold offer preprocessing features like those developed by him and his colleagues. Their aim, he said, was to build an open-source tool for proteomics researchers.
He said they are now aiming to integrate the package as a plug-in to Skyline, which is perhaps the most popular open-source program for developing targeted proteomics assays.