By Uduak Grace Thomas
A recent study by bioinformaticians at the Stanford University School of Medicine found that analyzing a large genomic dataset in Amazon's cloud cost about three times more than running the same analysis on a local compute cluster and took about 12 hours longer, but the team still determined that in the long run, cloud computing is a cheaper and more sustainable option for clinical researchers who need to analyze large datasets sporadically.
The study factored in the costs of in-house hardware, software, and personnel as part of the comparison with the cloud-based system, and the researchers determined that the additional expenses associated with local clusters make them too costly over time for most clinical researchers.
"What we wanted to do was a simple … comparison of the cost/benefit analysis of running a computation on the cloud versus running it on our own quarter-of-a-million dollar [Hewlett Packard] cluster," Atul Butte, an assistant professor of medicine at Stanford and one of the authors of the study, told BioInform.
The study, which Butte said is the first of its kind, was published in Genome Medicine earlier this month. It showed that while cloud computing may be significantly more expensive at first blush, "it compares favorably in both performance and cost in comparison to a local computational cluster." Furthermore, it might just be cheaper in the long run because researchers don't have to pay for overhead costs such as electricity, cooling, and maintenance, not to mention the "non-monetary opportunity costs" associated with the months it can take to purchase and install a local cluster.
It may seem a bit counterintuitive to argue that paying on demand is better than investing in an in-house system (after all, most people choose to purchase cell phone plans rather than pay per call), and Butte conceded that a local cluster may be a better option for researchers and institutions who can afford one and will use it regularly.
"If you are going to run it and if you are going to use it, you are going to benefit from it … you can save money in the long run by having it," he said.
But, according to Butte, even for researchers with the financial resources to purchase and support a local cluster, that option may turn out to be a "short term win." As an example, he noted that his group's 240-core cluster is already a year-and-a-half old. "So, am I going to spend another quarter-of-a-million [dollars] to upgrade it or will I benefit from the cloud, which those service providers will continue to upgrade?"
Furthermore, the authors note that the informatics needs of clinician scientists occupy a unique niche that may be better served by cloud computing. While traditional bioinformatics groups generally have access to large computational resources, translational bioinformatics, a discipline that aims to integrate molecular and clinical data to gain biomedical knowledge, does not have the same computational underpinnings.
"Bioinformaticians can whip up anything in the cloud," Joel Dudley, a bioinformatics programmer at Stanford and first author of the paper, told BioInform. Clinician scientists, on the other hand, "need to request the right amount of computing power they need and they need to easily access these tools that help them utilize that power towards some clinical hypothesis."
In the paper, the authors write that "rather than present the clinical investigator with a collection of bioinformatics tools (i.e. the 'toolbox' approach), we believe clinician-oriented, cloud-based translational bioinformatics systems are key to facilitating data-driven translational research using cloud computing."
Given all the buzz around cloud computing in the bioinformatics community lately, it seems surprising that this kind of cost/benefit analysis hasn't been done before. Butte suggested that one reason is that the majority of cloud users in bioinformatics work at pharmaceutical companies and are therefore the "least likely to write papers about what they are doing."
Even though they may not be writing papers on the subject, pharma companies are at least talking about their cloud computing adventures. At last year's Bio-IT World Conference, companies such as Pfizer, Eli Lilly, and Johnson & Johnson shared some of their early experiences with the approach (BI 05/22/2009).
And while the Stanford team's study may be the first cost/benefit analysis of cloud computing for life science research, its findings echo those of a University of California, Berkeley, computer science group, which concluded last year that even though Amazon's pay-as-you-go pricing "could be more expensive than buying and depreciating a comparable server over the same period, we argue that the cost is outweighed by the extremely important cloud computing economic benefits of elasticity and transference of risk, especially the risks of overprovisioning (underutilization) and underprovisioning (saturation)."
Intensive Computations
Dudley said that the Stanford team was interested in evaluating cloud computing for use by scientists who want to analyze large amounts of genomic data from clinical studies but who wouldn’t normally have access to the "raw computing power" of a cluster or have the "computational savvy" to make sense of the data.
For the study, the team decided to compute expression quantitative trait loci, or eQTLs, which are genomic loci that are associated with gene transcript abundance. Butte noted that eQTL analysis, in which the genotype of each measured SNP is compared to the expression level of each measured gene expression probe, is one of the "most intensive computations" in translational bioinformatics.
The team used a caBIG dataset of genetic variation and gene expression on a group of 311 cancer cell lines, which Dudley said would be analogous to a clinician conducting a clinical trial with several hundred cancer patients.
In the paper, the authors write that the genotyping platform they used measured more than 500,000 SNPs and the gene expression microarray measured expression levels across more than 50,000 probes. Crunching these two datasets for the 311 cell lines required more than 13 billion pairwise comparisons, which would likely take nearly 14 years to run on a single CPU.
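To give a concrete sense of that workload, the sketch below runs the kind of exhaustive SNP-by-probe scan described above. It is a minimal Python illustration rather than the authors' actual pipeline; the genotype coding (0/1/2 dosages) and the use of a simple per-pair linear regression are assumptions.

```python
# Minimal sketch of an exhaustive eQTL scan; an illustration of the
# workload described above, not the authors' actual code.  Every SNP is
# tested against every expression probe with a simple linear regression
# of expression level on genotype dosage (coded 0/1/2, an assumption).
import numpy as np
from scipy import stats

def eqtl_scan(genotypes, expression):
    """genotypes:  (n_snps, n_samples) array of 0/1/2 dosages
       expression: (n_probes, n_samples) array of expression values
       returns an (n_snps, n_probes) array of association p-values"""
    n_snps = genotypes.shape[0]
    n_probes = expression.shape[0]
    pvals = np.empty((n_snps, n_probes))
    for i in range(n_snps):        # in the cluster and cloud runs, this
        for j in range(n_probes):  # outer loop is what gets split across CPUs
            _, _, _, p, _ = stats.linregress(genotypes[i], expression[j])
            pvals[i, j] = p
    return pvals

# Scale check: at roughly 30 regressions per second on a single core,
# the more than 13 billion SNP-probe tests quoted above come to about
# 14 years, in line with the single-CPU estimate in the paper.
```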
For the study, the team ran the same code on 198 CPUs of its in-house HP compute cluster and on a 198-CPU virtual cluster in the Amazon cloud.
Both systems took about six days to complete the eQTL analysis, with the local cluster finishing 12 hours earlier than the virtual cluster, a difference that Dudley said he found "surprising" because he had expected the cloud to run even slower, since it would incur some "performance hits" due to the virtualization software that acts as a "middleman" between the researcher and the cloud.
Since the study was designed to mimic what a researcher with limited computational knowledge and resources would do, Butte and Dudley said they didn't "optimize" their Amazon setup as much as they could have to squeeze out additional performance.
Butte explained that even a "naïve translation of the code" would have yielded some improvements in how the cloud handles the data.
Comparing Apples to Apples
Butte noted that many of the drawbacks of an in-house cluster are incurred before the system is even installed, because buying and deploying a local compute cluster is a long and expensive process. For starters, a prospective buyer has to draw up specifications, solicit proposals from vendors, and then compare those proposals to determine the best bang for the buck before placing an order.
In Butte's lab, once the order was placed, it took almost three months to get the cluster deployed and installed. "It's non-trivial to get a huge cluster like this," he said. "Having seen these results … I don't know how to defend a huge cluster."
To calculate a per-CPU-hour cost for the local cluster, the team amortized hardware and software costs, as well as operational expenses such as server hosting and personnel, over a three-year period.
Based on their calculations, performing the analysis on the Amazon cloud cost $0.19 per CPU-hour, or around $5,400 in total, while the in-house system cost $0.06 per CPU-hour, or around $1,700.
While the paper includes a comprehensive cost breakdown for installing and maintaining a local cluster, the researchers didn’t do a similar breakdown for cloud computing. According to Butte, that was because the team was trying to come up with a way to make sure that they were comparing "apples to apples."
"What does a CPU-hour cost on a local cluster when you’ve bought the machine? That's where we folded in the hosting cost, the personnel to manage the cluster," he said. "On Amazon, you are … just paying per hour … for the actual CPU time you use and you pay for the cost to move the data to the virtual cluster."
Aside from cost considerations, the researchers note in the paper that the cloud-based system offers many benefits over a local cluster, such as its “elastic” nature, which allows it to scale the number of server instances based on need.
"If we needed this [eQTL] analysis done in a day, we can pay more per hour for CPU for a rush job whereas when we [have] already purchased a local cluster, we cannot scale past that," Butte said. "In that way, the hardware, when it's virtual, can change at will depending on exactly the needs."
Dudley noted that an added advantage of the cloud is that researchers have access to it "on demand as needed" without the kind of limitations that can occur on a local cluster if, for example, it is shared by multiple groups.
The researchers also cited the cloud's ability to archive entire systems for subsequent reuse, as well as Amazon's "spot instances," which allow instances to be launched during off-peak periods at a lower cost. "Although this feature may have increased the total execution time of our analysis, it might also reduce the cost of the cloud-based analysis by half depending on market conditions," the authors wrote.
While computing on the cloud costs more per CPU-hour and takes longer, according to Butte, "it's in the ball park" of the cost of doing the same analysis on a local compute cluster. He pointed out that as more vendors begin to offer cloud computing services, the cost will likely drop, narrowing the seemingly large price gap.
Amazon currently dominates the life science cloud computing arena, though companies like IBM, HP, Google, and Microsoft are starting to make inroads into the space.
Next Step: Make Tools Cloud Enabled
Butte said that even though some bioinformatics tools for sequence analysis and assembly are already cloud enabled, there is still a lot of work to be done, especially since R and several other commonly used bioinformatics tools aren't "that easy" to run on the cloud.
Butte noted, for example, that a number of “well known and well loved” cluster analysis tools could run much faster on the cloud.
Another example Butte mentioned is the Broad Institute's GenePattern, a platform that gives users access to more than 125 tools for gene expression analysis, proteomics, SNP analysis, and data processing, which he said has not, to the best of his knowledge, been cloud-enabled yet.
There has been some progress in porting bioinformatics tools to the cloud, however. For example, the developers of the open-source Galaxy genome analysis system have enabled the platform to be instantiated on the Amazon and Eucalyptus cloud systems.
“I know people are working on this but I would just put a call in for more such folks to get their tools cloud enabled or at least to take better advantage of the cloud, specifically bioinformatics folks,” Butte said.
Dudley added that one way to get tools onto the cloud would be to create a "generic cloud-enabled bioinformatics library" that would serve as a platform for developers to build on, rather than retooling each of the many "one-off" bioinformatics tools to work in the cloud environment.