Study Looks at Suitability of Different Missing Data Imputation Approaches in Label-Free Proteomics

NEW YORK (GenomeWeb) – Label-free quantitation is a commonly used method for measuring protein expression levels in a sample and comparing those levels across different samples and biological states. A major challenge for users of this approach, though, is the high rate of missing values in label-free datasets, which can significantly affect the reliability of peptide and protein quantitation.

At root, this problem arises because in traditional mass spec assays the instrument selects only a sampling of precursor ions for fragmentation and generation of MS/MS spectra. Mass spectrometers cannot scan quickly enough to acquire all the precursors entering the instrument at a given moment, so many ions are never selected for MS/MS fragmentation and are therefore never quantified.

To address this problem, proteomics researchers have devised methods for "imputing" missing values. And while these methods are commonly used in the field, a paper published last month in the Journal of Proteome Research by a team from the University of Grenoble and the Cambridge Center for Proteomics suggests that researchers are not necessarily choosing the optimal imputation methods for their experiments.

As the authors write, there are a variety of different reasons why a value might be missing, ranging from biochemical issues like miscleavages and ion suppression to bioinformatic issues like peptide misidentifications and poor matching of precursors in quantification. From a statistical perspective, they note, the most important distinction is between values missing due to random effects not tied to the nature of the peptide (classified as Missing Completely At Random, or MCAR), and values missing due to non-random factors such as peptide abundance (classified as Missing Not At Random, or MNAR).
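The distinction between the two mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes log-scale intensities drawn from a normal distribution, models MNAR as a hard detection threshold (a deliberate simplification of abundance-dependent missingness), and models MCAR as a fixed random dropout rate. The function name, the threshold, and the rates are all hypothetical.

```python
import random
import statistics


def simulate_missingness(values, mcar_rate, mnar_threshold, seed=0):
    """Label each value as observed (None), 'MCAR', or 'MNAR'.

    Assumption: MNAR is modelled as a hard cutoff on abundance, and
    MCAR as an independent random dropout applied to everything else.
    """
    rng = random.Random(seed)
    labelled = []
    for v in values:
        if v < mnar_threshold:
            labelled.append((v, "MNAR"))   # lost because it is low-abundance
        elif rng.random() < mcar_rate:
            labelled.append((v, "MCAR"))   # lost for reasons unrelated to its value
        else:
            labelled.append((v, None))     # observed
    return labelled


# Hypothetical log-intensities; the distribution parameters are made up.
rng = random.Random(42)
log_intensities = [rng.gauss(25.0, 2.0) for _ in range(2000)]
labelled = simulate_missingness(log_intensities, mcar_rate=0.1, mnar_threshold=22.5)

observed = [v for v, reason in labelled if reason is None]
mcar = [v for v, reason in labelled if reason == "MCAR"]
mnar = [v for v, reason in labelled if reason == "MNAR"]

print("mean observed:", round(statistics.mean(observed), 2))
print("mean of MCAR-missing values:", round(statistics.mean(mcar), 2))
print("mean of MNAR-missing values:", round(statistics.mean(mnar), 2))
```

In this toy setup the MCAR-missing values resemble the observed ones, while the MNAR-missing values sit systematically below them, which is why the two cases call for different imputation strategies.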

MCAR imputation methods are more widely used than MNAR methods, but the JPR study indicates that MNAR methods may be more appropriate for datasets in which the majority of missing values are non-random in nature.

Given the broad use of MCAR methods, this suggests that MNAR approaches are likely underused, said Thomas Burger, a University of Grenoble researcher and the first author on the paper.

While trained statisticians will typically apply MNAR methods where appropriate, "many biologists may be confronted with missing values while not having the necessary background in imputation algorithms, so the best they can do is test various methods and try to make a more or less educated guess," Burger told GenomeWeb.

"I [would] guess that very few people wonder on the [underlying nature] of missing values, and most just try to find a 'good' or a 'not too bad' algorithm to impute," he said. "As MCAR algorithms are more widespread in the literature, I would guess that MNAR approaches are underused."

In the study, the researchers looked at five imputation algorithms, three designed for use with MCAR values and two for use with MNAR values, and applied these algorithms to a simulated dataset and a set consisting of label-free quantitative data from an analysis of adenocarcinoma and squamous cell carcinoma.

Their analysis found that, averaged across all the datasets, the MCAR-based methods performed best. This suggests, the authors noted, that when researchers do not know what percentage of their missing data is MCAR versus MNAR, an MCAR method is the best choice.

That said, when a large portion of the missing values are MNAR, MNAR methods work significantly better than MCAR methods, they observed.
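Why the mechanism matters for the choice of method can be illustrated with two deliberately simple stand-ins. These are not the five algorithms evaluated in the study: the sketch assumes mean imputation as a crude proxy for an MCAR-oriented method and minimum-value imputation as a crude proxy for an MNAR-oriented one, applied to hypothetical simulated log-intensities.

```python
import random
import statistics


def rmse(truth, imputed):
    """Root-mean-square error between the true and imputed values."""
    return (sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth)) ** 0.5


def impute_mean(observed, n_missing):
    """Stand-in for an MCAR-style method: fill with the observed mean."""
    return [statistics.mean(observed)] * n_missing


def impute_min(observed, n_missing):
    """Stand-in for an MNAR-style method: fill with the lowest observed value."""
    return [min(observed)] * n_missing


# Hypothetical log-intensities; parameters are made up for illustration.
rng = random.Random(1)
values = [rng.gauss(25.0, 2.0) for _ in range(2000)]

# MNAR scenario: everything below an assumed detection threshold is lost.
mnar_missing = [v for v in values if v < 22.5]
mnar_observed = [v for v in values if v >= 22.5]

# MCAR scenario: a random 10 percent of values is lost.
hidden = set(rng.sample(range(len(values)), 200))
mcar_missing = [v for i, v in enumerate(values) if i in hidden]
mcar_observed = [v for i, v in enumerate(values) if i not in hidden]

results = {
    ("MNAR", "mean"): rmse(mnar_missing, impute_mean(mnar_observed, len(mnar_missing))),
    ("MNAR", "min"): rmse(mnar_missing, impute_min(mnar_observed, len(mnar_missing))),
    ("MCAR", "mean"): rmse(mcar_missing, impute_mean(mcar_observed, len(mcar_missing))),
    ("MCAR", "min"): rmse(mcar_missing, impute_min(mcar_observed, len(mcar_missing))),
}

for (scenario, method), error in results.items():
    print(f"{scenario} data, {method} imputation: RMSE = {error:.2f}")
```

In this toy setting each stand-in does better on the kind of missingness it was built for, which mirrors the qualitative pattern the authors describe without reproducing their benchmark.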

This, the authors noted, suggests the need for "diagnosis tools that are capable of categorizing the missing values according to the mechanism that generated them," a problem that Burger said his lab is currently working on and hopes to publish on this spring.

Such tools would also potentially allow for the development of hybrid approaches to missing value imputation, using, for instance, MCAR for the portions of a dataset where it is most appropriate and MNAR for the portions to which it is best suited.
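A hybrid scheme of this kind might look like the following sketch. It assumes that a diagnosis step, which does not yet exist in the form described, has already labelled each missing value as MCAR or MNAR. The function `hybrid_impute` and the two fill rules (observed mean for MCAR, observed minimum for MNAR) are hypothetical placeholders, not the approach under development in Burger's lab.

```python
import statistics


def hybrid_impute(values, mechanisms):
    """Fill each missing value according to its assumed missingness mechanism.

    values:     list of floats, with None marking a missing value
    mechanisms: parallel list holding 'MCAR' or 'MNAR' for each missing entry
                (ignored for observed entries)
    """
    observed = [v for v in values if v is not None]
    fill = {
        "MCAR": statistics.mean(observed),  # placeholder MCAR-style rule
        "MNAR": min(observed),              # placeholder MNAR-style rule
    }
    return [v if v is not None else fill[m] for v, m in zip(values, mechanisms)]


# Made-up example: two missing values with different assumed mechanisms.
intensities = [24.0, None, 26.0, None, 22.0]
labels = [None, "MCAR", None, "MNAR", None]
completed = hybrid_impute(intensities, labels)
print(completed)
```

The point of the design is only that the fill rule is chosen per value rather than per dataset; any real implementation would substitute proper imputation algorithms for the two placeholder rules.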

"I am sure that it would be better to impute each type of missing value according to its missing-ness mechanism," Burger said, noting that his lab is also working on such an approach.