By Uduak Grace Thomas
Researchers at Stanford University have proposed a cloud-based model for bioinformatics resources that they claim could help improve reproducibility of experiments that rely on computational analysis.
In a commentary published last week in Nature Biotechnology, Stanford's Joel Dudley and Atul Butte recommend storing on the cloud snapshots of entire computational environments that are used to generate published research results. This approach, they argue, would make it easier for the scientific community to replicate and validate research findings from computational studies.
Butte and Dudley refer to the approach as the Whole System Snapshot Exchange, or WSSE. In this approach, researchers would create digital images of the complete computer system or systems used to produce experimental results, including the operating system, application software, and databases.
These images can then be uploaded to the cloud, where other groups can access them. As a result, "researchers would be able to obtain precise replicas of a computational system used to produce the published results and have the ability to restore this system to the precise state … when the experimental results were generated," the authors write.
Reproducibility of computational experiments has been a longstanding issue in the bioinformatics community. While "it was once thought that computers would improve reproducibility because they yield repeatable results given the same set of inputs, most software tools do not provide mechanisms to package a computational analysis such that it can be easily shared and reproduced," Dudley and Butte note.
This issue took center stage during the last year after two biostatisticians at MD Anderson Cancer Center were unable to replicate the results of a 2006 Duke University study led by Anil Potti. The findings of the Duke study, which claimed to identify gene expression signatures correlated with response to different cancer therapies, were used to determine treatment for several clinical trials. Duke halted the trials and then resumed them after a review, while a colleague of Potti's has since called for the 2006 paper to be retracted.
One attempt to address the reproducible research problem has focused on developing software that lets users create and share standardized research pipelines and workflows, such as the open source Taverna project and Microsoft's Trident platform. The Broad Institute has even suggested a modification for its GenePattern software that would embed access to computational workflows directly into online papers (BI 1/22/2010).
However, Butte and Dudley believe that software tools haven't solved the problem because "human nature" and "the realities of data-driven science" have created a research environment in which "efforts are not rewarded by the current academic research and funding environment; commercial software vendors tend to protect their markets through proprietary formats and interfaces; investigators tend to want to own and control their research tools; the most generalized software will not be able to meet the needs of every researcher; and the need to derive and publish results as quickly as possible precludes the often slower standards-based development path."
In light of these challenges, Butte told BioInform that the WSSE model would provide a more flexible approach because researchers wouldn't have to standardize on a specific program. Instead, they would simply download the image files using programs like Eucalyptus and run the analysis locally or in the cloud.
In addition, since datasets and services such as the National Center for Biotechnology Information's Entrez Utilities can be stored and shared in the cloud, even if the underlying infrastructure of these services is changed, users would still have access to the previous versions.
This method also addresses another barrier to reproducibility, according to the authors. Because virtual images of databases and other bioinformatics resources can be stored indefinitely on the cloud, they would remain available to users even if funding sources dry up, as was the case with the Arabidopsis Information Resource, which lost its funding when the National Science Foundation withdrew its support last year (BI 12/04/2009).
Based on the current cost of $0.10 per gigabyte of storage per month in Amazon's cloud, the authors estimate that one terabyte of data could be stored for less than $100 a month, and they anticipate that the cost per gigabyte will continue to decrease. At that rate, they estimate that storing a one-terabyte database in the cloud for more than a decade could cost less than $10,000.
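The arithmetic behind these figures can be sketched in a few lines. This is only a back-of-the-envelope calculation: the $0.10-per-gigabyte monthly rate comes from the article, while the 15 percent annual price decline is an illustrative assumption standing in for the authors' expectation that storage prices will keep falling.

```python
# Back-of-the-envelope estimate of cloud storage costs for a 1 TB dataset.
# The $0.10/GB/month rate is the figure cited in the article; the 15%
# annual price decline is a hypothetical assumption for illustration.

GB_PER_TB = 1000
RATE_PER_GB_MONTH = 0.10       # USD, Amazon storage pricing cited in the article
ANNUAL_PRICE_DECLINE = 0.15    # hypothetical yearly drop in the per-GB price

monthly_cost = GB_PER_TB * RATE_PER_GB_MONTH
print(f"1 TB for one month: ${monthly_cost:.2f}")

# Cumulative cost over a decade if the per-GB price falls each year.
total = 0.0
rate = RATE_PER_GB_MONTH
for year in range(10):
    total += GB_PER_TB * rate * 12   # twelve months at this year's rate
    rate *= 1 - ANNUAL_PRICE_DECLINE
print(f"1 TB for ten years with falling prices: ${total:,.0f}")
```

With these placeholder assumptions the ten-year total lands well under the $10,000 ceiling the authors cite; a flat price with no decline would instead total about $12,000.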
An added benefit, they note, is that it costs less to perform data analysis within the cloud as opposed to downloading the image files, which may be several gigabytes or terabytes in size, and performing the analysis on a local compute infrastructure.
A study published by Butte and Dudley this summer found that although analyzing a large genomic dataset in Amazon's cloud cost about three times more and took about 12 hours longer than running the same analysis on a local compute cluster, cloud computing is cheaper and more sustainable in the long run because local clusters carry additional expenses for in-house hardware, software, and personnel that make them more costly over time (BI 08/27/2010).
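The long-run argument can be illustrated with a toy cost comparison. Every dollar figure below is a hypothetical placeholder (the article reports only the roughly three-fold per-run cost ratio); the point is simply to show how a per-run premium in the cloud can still come out cheaper once a local cluster's fixed costs are counted.

```python
# Toy cost model: cloud vs. local cluster over several years.
# All dollar figures are hypothetical placeholders for illustration;
# only the ~3x per-run cost ratio comes from the cited study.

runs_per_year = 50
cloud_cost_per_run = 300.0        # USD per analysis, hypothetical
local_cost_per_run = 100.0        # ~1/3 of the cloud cost, per the cited ratio
local_fixed_per_year = 40_000.0   # hardware, software, admin staff (hypothetical)

def total_cost(years, per_run, fixed_per_year=0.0):
    """Cumulative cost of runs_per_year analyses a year plus fixed overhead."""
    return years * (runs_per_year * per_run + fixed_per_year)

for years in (1, 3, 5):
    cloud = total_cost(years, cloud_cost_per_run)
    local = total_cost(years, local_cost_per_run, local_fixed_per_year)
    print(f"{years} yr: cloud ${cloud:,.0f} vs. local ${local:,.0f}")
```

Under these placeholder numbers the cloud is cheaper in every year despite costing three times as much per run, because the cluster's fixed overhead dominates; with different assumptions the break-even point shifts, which is the trade-off the authors describe.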
Butte noted that as cloud computing vendors such as Google and IBM begin competing more aggressively with Amazon, researchers will have multiple cloud options with varying prices, features, and services. This scenario would prevent vendor lock-in and further drive down the price of cloud computing and storage.
Rather than developing a publicly funded cloud computing infrastructure, the authors suggest that funds and development efforts should focus on developing software that works on existing commercial cloud infrastructures.
"I think the industry is incredibly efficient at creating these kinds of solutions," Butte said. "A public resource [that would be] competing against these [cloud computing] companies wouldn’t make much sense."
Dudley agreed, pointing out that a publicly funded and managed cloud resource might also face restrictions on the kinds of software it could provide, because it would have to obtain licenses from the commercial groups that supply these resources, which could end up limiting the software options available to researchers.
Although the authors describe WSSE as a "pragmatic and substantial" first step toward making research reproducible, it is not without its challenges.
Software licenses, for example, could impose constraints on sharing software in the cloud. Because most commercial software is licensed on a per-machine basis, "if I buy a commercial software package that I have paid a license [for] and it's what I use to derive my results, it's difficult for me to hand over a machine image with that software because [another] person doesn't have a license for that software," Dudley told BioInform.
The authors also note that WSSE does not address issues with data organization, systematic provenance tracking, standardization, and annotation.
Have topics you'd like to see covered in BioInform? Contact the editor at uthomas [at] genomeweb [.] com