NEW YORK – A team led by researchers at the Francis Crick Institute and the University of Cambridge has developed a new software package for analyzing data from data-independent acquisition (DIA) proteomics experiments.
Detailed in a study published last month in Nature Methods, the software, called DIA-NN, uses a combination of signal correction strategies to reduce interferences and neural networks to assign confidence values to peak identifications, allowing for high performance even when using very short liquid chromatography (LC) gradients, said Markus Ralser, a group leader at Francis Crick and senior author on the paper.
He said that by using the approach he and his colleagues were able to quantify several hundred proteins in five-minute DIA analyses of undepleted plasma. He added that the approach had shown particular promise in combination with the Scanning Swath DIA method introduced by Sciex at the most recent American Society for Mass Spectrometry annual meeting.
In a DIA mass spec experiment, the instrument steps through wide m/z ranges, fragmenting all the ions in a given window. This avoids the stochastic sampling problems facing conventional shotgun proteomics and consequently improves reproducibility. However, it leads to complex, overlapping spectra that must be deconvoluted.
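As a rough illustration of why DIA spectra are chimeric, the sketch below steps through fixed-width isolation windows and fragments every precursor that falls inside each one (the window bounds and precursor masses are illustrative values, not any vendor's acquisition settings):

```python
# A minimal sketch of a DIA acquisition cycle: every precursor inside an
# isolation window is fragmented together, producing the overlapping MS2
# spectra that downstream software must deconvolute.

def dia_windows(mz_start=400.0, mz_end=1000.0, width=25.0):
    """Yield (low, high) bounds of fixed-width DIA isolation windows."""
    low = mz_start
    while low < mz_end:
        yield low, min(low + width, mz_end)
        low += width

precursors = [421.8, 428.3, 433.1, 512.7, 518.2, 733.9]  # illustrative m/z values

for low, high in dia_windows():
    cofragmented = [mz for mz in precursors if low <= mz < high]
    if cofragmented:
        # These ions are fragmented in a single MS2 scan, so their
        # fragments land in one chimeric spectrum.
        print(f"window {low:.1f}-{high:.1f} m/z co-fragments {cofragmented}")
```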
While a number of DIA software packages exist to do this work, Ralser said his lab decided to develop its own to support a yeast project in which he and his colleagues generated knockouts for every gene in the yeast genome and quantified the resulting proteomes.
"This required the measurement of something like 8,000 proteomes, and when we came into it we realized that there was no analytical capability in proteomics that would allow us to easily measure such a large number of proteomes," he said.
The researchers turned to DIA mass spec because of its high reproducibility, which was necessary to compare results across the many proteomes they were measuring, but Ralser said they found that, at the time, existing DIA analysis software was not well suited to handling the thousands of samples they were analyzing.
They decided to develop their own software focused specifically on high-throughput applications. In doing so, they incorporated a number of strategies to improve the deconvolution of DIA data collected using very short LC gradients, including, Ralser said, "algorithms for correcting for signal interferences and the use of deep neural networks to assign confidence values to the peaks we identify."
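DIA-NN's published interference-correction algorithms are more involved, but the general idea behind this class of strategies can be sketched simply: fragment traces that genuinely belong to a peptide should co-elute with the consensus profile of its other fragments, so traces that correlate poorly can be down-weighted before quantification. The sketch below is an illustration of that idea only, not DIA-NN's actual code:

```python
# Illustrative interference handling: down-weight fragment extracted ion
# chromatograms (XICs) that do not co-elute with the consensus profile.
import numpy as np

def consensus_profile(xics: np.ndarray) -> np.ndarray:
    """Median elution profile across fragment XICs (rows = fragments)."""
    return np.median(xics, axis=0)

def interference_weights(xics: np.ndarray) -> np.ndarray:
    """Pearson correlation of each fragment trace with the consensus,
    clipped at zero so anti-correlated (interfered) traces get no weight."""
    ref = consensus_profile(xics)
    corrs = np.array([np.corrcoef(x, ref)[0, 1] for x in xics])
    return np.clip(corrs, 0.0, None)

# Synthetic data: three clean fragment traces plus one distorted by a
# co-eluting interference.
rt = np.linspace(-1, 1, 21)
peak = np.exp(-rt**2 / 0.08)
xics = np.vstack([
    1.0 * peak,
    0.6 * peak,
    0.8 * peak,
    0.5 * peak + np.exp(-(rt - 0.6)**2 / 0.02),  # interfered trace
])

w = interference_weights(xics)
quant = np.sum(w[:, None] * xics)  # interference-weighted summed signal
print("fragment weights:", np.round(w, 2))
```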
Shorter LC gradients mean less separation and compressed chromatographic peaks, which make it more difficult to confidently identify large numbers of proteins, noted Vadim Demichev, a researcher in Ralser's lab and the primary developer of the software.
Using the DIA-NN package "we can now get more information out of shorter gradients and this allows us to speed up our analysis," Ralser said.
One key to the approach, Demichev said, is the ability of neural networks to synthesize more of the information available in generated spectra.
"We have very complex data with lots of information encoded, but [conventional machine learning] was not able to efficiently utilize all of this information," he said. "And that can be done with neural networks."
Ralser said that by using DIA-NN he and his colleagues are able to quantify between 200 and 400 proteins in plasma samples using a five-minute LC gradient.
Lukas Reiter, chief technology officer at proteomics firm Biognosys, whose Spectronaut software is widely used for DIA analysis, said that the strong performance of the DIA-NN software at very short gradients was "an interesting observation."
"That's not necessarily what I would have expected," he said, noting that a shorter gradient typically makes extracting protein identifications more difficult.
Reiter said that Biognosys was also exploring the use of neural networks in its software, noting that it has become fairly straightforward to plug and play different analytical strategies for classifying peaks, allowing researchers to easily apply different machine learning approaches or neural networks.
Reiter said the company is also using deep learning to predict peptide properties like retention time, fragmentation patterns, and collisional cross section.
He said that the main benefit of this approach currently is that it could allow researchers to run DIA experiments without having to generate the spectral libraries typically used for such analyses.
His comments echoed remarks from Jürgen Cox, group leader in computational systems biochemistry at the Max Planck Institute of Biochemistry, who said earlier this year that he believed deep learning tools would make it possible to replace experimentally generated libraries with spectral prediction tools.
"In the long run I don't think that we will be generating libraries anymore, which is actually the part that makes DIA a little bit work intensive, especially for smaller labs," he said.
Ralser's emphasis on short mass spec experiments fits with the broad direction taken by much of the proteomics field in recent years as many prominent researchers and vendors have moved to emphasize the number of samples that can be reproducibly analyzed, rather than the depth of proteomic coverage a particular workflow enables.
Reiter likewise noted the move toward higher-throughput workflows and highlighted the partnership the company announced this week with oncology firm Indivumed to run large-scale mass spec experiments generating proteomic and phosphoproteomic data from that company's cancer sample database.
He said, though, that from a commercial standpoint, the company had seen little demand for runs shorter than 30 minutes, noting that that amount of mass spec time would cost a customer only around $20 to $30.
"That's not much money, so you would have to have a special reason to go below half an hour of measurement time," he said, observing that demand for greater throughput could be met by running samples on multiple instruments in parallel.
"People always want to push the boundaries, but on the commercial side there is not that huge of a need to go below half an hour at the moment," he said.
Ralser said that DIA-NN's ability to deconvolute data from very short gradients makes it potentially useful in combination with the Scanning Swath workflow Sciex recently introduced.
The Scanning Swath workflow uses a sliding quadrupole window as opposed to stepped isolation windows, which boosts performance by improving ion accumulation and providing additional information for matching precursor and fragment ions.
The additional information comes from the fact that, as the isolation window slides across the mass range of interest, precursor and product ions can be seen entering and leaving the window. Those entry and exit times provide another parameter that can be used to resolve complex spectra.
"The first quadrupole is no longer acquiring mass windows but is continuously scanning," Ralser said. "With Scanning Swath you can have even faster duty cycles, and so I see the possibility of using DIA-NN to deconvolute these complex spectra.
Ralser and co-authors, including several Sciex researchers, published a bioRxiv preprint in May describing the use of the Scanning Swath method in combination with the DIA-NN software, running five-minute LC gradients using standard high-flow chromatography and quantifying around 3,000 proteins in cell digest samples.
"The combination of DIA-NN with Scanning Swath and high flow chromatography opens the door to lots of applications that proteomics before just couldn't do," he said. "We can go now, for instance, into the epidemiological space, recording tens of thousands of proteomes."