Researchers Address Issue of Inflated False Discovery in Large-Scale DIA Datasets


NEW YORK (GenomeWeb) – An international group of researchers with expertise in data-independent acquisition (DIA) mass spectrometry has developed guidelines and software to address the issue of false positive protein identifications in large-scale DIA datasets.

Presented in a paper published this week in Nature Methods, the effort grew out of a large multi-lab study looking at variability in DIA mass spec workflows, said Ruedi Aebersold, a professor at the Swiss Federal Institute of Technology (ETH) Zurich and senior author on the paper.

In that initial study, Aebersold said, the researchers sent an aliquoted sample to 11 different laboratories and had them run it using Swath DIA seven times a day on Monday, Wednesday, and Friday of one week, making for 21 runs per lab across 11 labs, or 231 mass spec runs.

They then analyzed these runs in aggregate to assess how reproducible their workflow was both within and between labs. When they did, they saw that as they increased the number of runs they analyzed, the number of protein identifications they made also increased.

While additional runs of the same sample might lead to some increase in identified peptides, at a certain number of runs you would expect to hit a saturation point, Aebersold said, noting that the fact that he and his colleagues saw peptide identifications continue to increase as they added runs to the analysis suggested that they were likely false positives.

"When you keep adding data to a data set and the identified peptides or proteins keep creeping up, that is a telltale sign of a problem with your FDR [false discovery rate] control," he said. "That is what we observed, and so then we held back publication of this large dataset and resolved this [FDR] issue."

Aebersold said that the potential for such a problem is widely acknowledged in the field but has only become an issue as DIA researchers have begun doing larger numbers of runs and compiling larger datasets.

"There have been very few datasets that have hundreds of runs, and if you just have a few runs, it doesn't really happen," he said.

Aebersold noted that a similar issue arose several years ago in the case of two high-profile mass spec experiments published in Nature that used conventional data-dependent acquisition (DDA) analyses to put together what the authors described at the time as nearly complete profiles of the human proteome.

Those analyses, one performed by a team led by Johns Hopkins University researcher Akhilesh Pandey and the other by a team led by Technical University of Munich researcher Bernhard Kuster, came under criticism by many in the proteomics community for using methods that may have led them to overstate the number of quality protein identifications supported by their data.

Among the issues cited by outside researchers was the fact that combining many different datasets can lead to a significant increase in false positive identification rates that must be accounted for, a situation more recently observed by Aebersold and his colleagues in their DIA reproducibility study.

Aebersold gave as an example a theoretical mass spec run that identified 10,000 peptides with a false discovery rate of 1 percent. In such an experiment, you would expect 9,900 of these identified peptides to be accurately identified and 100 of the 10,000 identifications to be incorrect. If you were to do another run also at an FDR of 1 percent, you would likewise expect to make 9,900 true identifications and 100 false ones. However, while the 9,900 true identifications would by and large be the same given that these are the peptides actually present in the sample, the 100 false identifications would likely be different from the 100 falsely identified peptides from the first run.

"If you have just three or five runs, this is not really an issue," Aebersold said. On the other hand, "if you have hundreds of runs and there are 100 false positives in every run, then that will accumulate into a very large number."
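The arithmetic in Aebersold's example can be sketched in a few lines of Python. This is only an illustrative toy model, not the method or software described in the Nature Methods paper: the function name `cumulative_fdr`, the assumption that false hits are drawn at random from a pool of absent peptides, and the pool size of 1 million are all hypothetical choices made for the illustration.

```python
import random


def cumulative_fdr(n_runs, n_true=9_900, n_false_per_run=100,
                   false_pool_size=1_000_000, seed=0):
    """Toy model: pool identifications across runs and track the aggregate FDR.

    Assumptions (illustrative only): every run finds the same n_true peptides,
    and each run's false hits are drawn at random from a large pool of
    peptides that are not actually in the sample.
    """
    rng = random.Random(seed)
    true_ids = set(range(n_true))   # identical true peptides in every run
    false_ids = set()               # union of false hits across runs
    results = []
    for run in range(1, n_runs + 1):
        # Each run contributes its own, mostly new, set of false hits.
        draws = rng.sample(range(false_pool_size), n_false_per_run)
        false_ids.update(n_true + d for d in draws)
        total = len(true_ids) + len(false_ids)
        results.append((run, len(false_ids), len(false_ids) / total))
    return results


if __name__ == "__main__":
    results = cumulative_fdr(231)   # 21 runs per lab across 11 labs
    for run, n_false, fdr in (results[0], results[4], results[-1]):
        print(f"runs={run:3d}  false IDs={n_false:6d}  aggregate FDR={fdr:.1%}")
```

In this toy model the set of true identifications stays at 9,900 however many runs are pooled, while the set of false identifications grows by close to 100 per run, so the aggregate FDR of the combined dataset climbs well above the 1 percent applied to each individual run. Lowering the hypothetical `false_pool_size` caps how far the false set can grow.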

Aebersold noted that this problem is actually somewhat less challenging for DIA than it is for DDA approaches because DIA matches peptides to a discrete library while DDA infers proteins based on the peptides the mass spectrometer stochastically selects in a given run.

As such, for DIA "you are not going to have in every run an independent set of peptides, because in every run that you compile you assume the peptides identified are a subset of those in the library," he said.

There is the question, though, of how expansive a library a study uses, Aebersold said, noting that the more expansive a library, the more stringently researchers need to control an experiment's FDR.

If, for instance, a DIA study uses a library generated from only the specific sample being studied, false positives are less of a problem, he said. "There you have a library that only consists of peptides that you know are in your sample, so the issue of false positive identifications is drastically reduced."

But using this sort of narrow or "local" library, as Aebersold and his co-authors termed it, could lead researchers to miss proteins.

"If, say, you have a clinical cohort of controls and cancer cases and you generated your library from the control tissue, then some of the cancer cases might express a protein or several proteins that are not in your library, and you would miss these proteins," he said.

Aebersold suggested that, broadly speaking, the two approaches provide largely similar results.

"You can have a very expansive library and you can control the FDR and basically the inflation of false positives but you just have to be more conservative in the scoring [of identifications]," he said. Whereas, "if you have a library that only consists of peptides that you know are in your sample, then the issue of false positive identifications is drastically reduced, but you might miss proteins."

The Nature Methods authors suggested that additional work could help further refine DIA search approaches.

"It might be interesting for future applications to consider strategies for reducing the query space to provide an optimal tradeoff between proteome coverage and the fraction of undetectable targets," they wrote.

More immediately, Aebersold said the work is aimed at getting out in front of what will likely become a more common issue as researchers put together larger and larger DIA datasets.

"We thought it would be useful now that an increasing number of people use DIA methods to analyze the issues and discuss the problems and provide solutions," he said.