NEW YORK (GenomeWeb) – Researchers at the Institute for Systems Biology have developed cloud computing functionality for the Trans-Proteomic Pipeline mass spec informatics suite.
Presented in a paper published last month in Molecular & Cellular Proteomics, the cloud-compatible version of the TPP is one of several recent proteomics tool releases suggesting a growing interest within the field in using cloud computing for processing and searching mass spec proteomics data.
While cloud computing is fairly common in genomics, the approach is less widely used in proteomics, said ISB researcher Eric Deutsch, an author on the MCP paper and one of the developers of the TPP.
This is largely because proteomics has traditionally dealt with smaller datasets than genomics, Deutsch told GenomeWeb. However, as proteomics techniques become capable of higher throughput and new instruments generate massive amounts of information, moving analysis to the cloud has become a more attractive option.
"Datasets are just getting very large," he said. "[For instance], for some of the Swath datasets, the files are tens of gigabytes, and so that becomes hard for desktop computers to handle."
One option, Deutsch said, is to move the processing to local computing clusters, but, he noted, many researchers don't have access to such facilities. And given the decreasing cost and improved ease of use of cloud computing services, the decision between establishing a new local computing cluster for proteomics research or moving to the cloud "has now really tipped in the favor of cloud computing," he said.
"A few years ago [cloud computing] was a complex option and not really that widely used," Deutsch said. "But driven by [uptake in] many other industries, cloud computing... is finally getting to a price point where it makes a lot of sense and is becoming easy and widely used."
Indeed, he said, as the ISB researchers prepared the MCP paper, they had to change the costs they quoted several times to account for falling prices from Amazon Web Services, the service the TPP uses. In the demonstration of the cloud-based TPP system they provided in the paper, Deutsch and his colleagues processed 1,100 mass spec runs through four different search engines in 9.5 hours and at a total cost of less than $100. As the authors noted, the TPP suite includes applications for mass spec data representation and visualization as well as peptide identification and validation, protein inference, quantification, spectral library building and searching, and biological inference.
"When you start comparing those kinds of costs with trying to install or maintain your own local computer cluster, it's very competitive depending on how those costs are managed locally," he said. "If you already have the system administrator and the hardware, then maybe it's not that competitive. But if you don't already have personnel able to do this, if you don't already have the back-up systems, then this becomes a very attractive scenario."
Another potential advantage, Deutsch added, is the elimination of wait times. As opposed to local clusters where researchers may have to wait for other researchers ahead of them to process their data, in a cloud environment "you are allocated the machines you want to use right away, and when you are done they are returned to the pool," he said. "So if you have four post-docs who are feverishly trying to analyze some data in advance of ASMS, they won't be competing against each other for the same resources."
The MCP authors noted as well that with the rise of proteogenomics, mass spec experiments will likely become much more computationally expensive, providing further impetus for a shift toward cloud computing. Because such experiments involve searching not just for generic forms of proteins but for specific variants or modified forms, they often require much more processing power.
"Each search can take many hours or even days," Deutsch said. "And then suddenly cloud computing becomes much more attractive."
In fact, another recent example of proteomics' move into cloud computing – Illumina and AB Sciex's OneOmics collaboration to add Swath mass spec data processing to Illumina's BaseSpace cloud environment – is explicitly designed to tackle proteogenomic applications.
Under the partnership, AB Sciex will place its Swath Proteomics Cloud Tool Kit into BaseSpace. Outside researchers have also created apps for the environment, such as one developed by Yale University researcher Christopher Colangelo for integrating RNA-seq and Swath data to enable, among other things, generation of sample-specific proteomic search databases from RNA-seq data.
Deutsch and his ISB colleagues have added their SwathAtlas tool, which helps with planning Swath experiments and depositing and searching Swath datasets.
In addition to the TPP and OneOmics developments, a team led by University of Washington researcher Michael MacCoss last year launched the Chorus cloud application, which allows researchers to store, analyze, and share mass spec data of any file type.
Scientists including MacCoss, the University of Pittsburgh's Nathan Yates, and InfoClinika President and CEO Andrey Bondarenko founded the non-profit Stratus Biosciences to manage the Chorus project, which, like the TPP, uses Amazon Web Services.
According to a poster presented at the American Society for Mass Spectrometry annual meeting in June, since the system was introduced at the previous year's ASMS meeting, more than 550 user accounts from more than 150 labs had been created and more than 7 terabytes of data had been added.
Chorus is currently optimized for proteomics data, but, according to the developers, the ultimate goal is that the system will serve as a complete and open access resource for all the world's mass spec data.