A recent study from a group at the University of Maryland's Institute for Genome Sciences suggests that researchers can eliminate bioinformatics bottlenecks and lower data analysis costs by using a combination of cloud computing and virtual machines.
In the study, published in a recent issue of PLoS One, the researchers compared the costs, infrastructure, and time it took to run several microbial genome analysis applications on a desktop computer and Amazon's Elastic Compute Cloud.
For the analysis, they used the Cloud Virtual Resource, or CloVR, a virtualization tool they developed that includes pre-configured sequence analysis tools that have been bundled into automated pipelines for analyzing microbial genomes.
A separate paper describing the CloVR architecture was published in BMC Bioinformatics earlier this year.
CloVR includes pipelines for 16S rRNA-based analysis; taxonomic and functional analysis of metagenomic whole-genome shotgun sequence data; bacterial single-genome sequence assembly and annotation; and large-scale Blast searches of sequence data.
It runs on single and multi-core computers as well as on the Amazon EC2 platform.
With CloVR, "we realized that ... we have a standardized version of sequence analysis" that, in combination with "a commercial cloud platform like Amazon that charges an amount per computational hour," can be used to calculate the run times and costs associated with microbial analysis, Florian Fricke, an assistant professor of microbiology and immunology at UMD and one of the study's authors, explained to BioInform.
In the PLoS One study, the team calculated the costs of running CloVR pipelines on datasets from Roche/454 and Illumina sequencers on both a desktop computer — a 64-bit quad core with 4 gigabytes of RAM — and various configurations of Amazon's cloud.
For example, they ran the CloVR-Microbe pipeline on a 500,000-read 454 dataset on various cluster sizes on the EC2 in order to determine the configuration that would provide the lowest runtimes and costs. They found the optimal cluster size was between 72 CPUs and 120 CPUs, with the former taking 23 hours and costing $58 and the latter requiring 20 hours and costing $60.
"These numbers represent a runtime and cost improvement of up to 36 hours and $16 compared to a 56 hour-run with 16 CPUs for $74," the authors wrote. Furthermore, increasing the cluster size to 172 CPUs "did not result in runtime improvements but resulted in increased cost ($82) due to payment for under-utilized instances."
Running the same data on a single-CPU machine "was canceled after 14 days and was extrapolated to require in excess of 24 days runtime," the authors wrote.
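The quoted savings can be checked directly from the benchmark figures above. The following is a minimal sketch rather than anything from the study itself: the dictionary layout and the function name `savings_vs_baseline` are illustrative assumptions, and the 172-CPU configuration is left out because the article gives its cost ($82) but not an exact runtime.

```python
# CloVR-Microbe benchmark figures quoted in the article for a
# 500,000-read 454 dataset on Amazon EC2.
# Layout (an assumption for this sketch): CPUs -> (runtime in hours, cost in USD).
benchmarks = {
    16: (56, 74),
    72: (23, 58),
    120: (20, 60),
}


def savings_vs_baseline(benchmarks, baseline_cpus=16):
    """Return hours and dollars saved by each cluster size relative to the baseline."""
    base_hours, base_cost = benchmarks[baseline_cpus]
    return {
        cpus: (base_hours - hours, base_cost - cost)
        for cpus, (hours, cost) in benchmarks.items()
        if cpus != baseline_cpus
    }


savings = savings_vs_baseline(benchmarks)
for cpus, (hours_saved, dollars_saved) in sorted(savings.items()):
    print(f"{cpus} CPUs: {hours_saved} h faster, ${dollars_saved} cheaper than 16 CPUs")
```

The largest runtime saving (36 hours) comes from the 120-CPU cluster and the largest cost saving ($16) from the 72-CPU cluster, which is why the authors phrase the improvement as "up to" 36 hours and $16.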
They also determined the amount of sequence analysis they could perform on the cloud for the price of purchasing and maintaining a 240-CPU cluster, which they estimated to be $130,000 per year for three years.
They found that for the cost of a local cluster, they could process 43,333 runs of CloVR-16S; 5,416 runs of CloVR-Metagenomics; and 2,166 runs of CloVR-Microbe each year on Amazon EC2.
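These run counts follow from dividing the annual cluster cost by a per-run cloud cost. The sketch below reproduces that arithmetic under stated assumptions: only the $60 figure for CloVR-Microbe appears in the article (the 120-CPU benchmark), while the $3 and $24 per-run costs for the other two pipelines are back-calculated from the reported run counts and are not quoted figures. The names `ANNUAL_CLUSTER_COST`, `assumed_cost_per_run`, and `runs_per_year` are hypothetical.

```python
# Estimated yearly cost of buying and maintaining a 240-CPU local cluster (from the article).
ANNUAL_CLUSTER_COST = 130_000  # USD

# Per-run EC2 costs in USD. ASSUMPTION: only the CloVR-Microbe value ($60)
# is quoted in the article; the other two are inferred from the run counts.
assumed_cost_per_run = {
    "CloVR-16S": 3,
    "CloVR-Metagenomics": 24,
    "CloVR-Microbe": 60,
}


def runs_per_year(budget, cost_per_run):
    """Number of whole pipeline runs a yearly budget covers."""
    return budget // cost_per_run


runs = {
    pipeline: runs_per_year(ANNUAL_CLUSTER_COST, cost)
    for pipeline, cost in assumed_cost_per_run.items()
}
for pipeline, count in runs.items():
    print(f"{pipeline}: {count:,} runs per year")
```

Under these assumed per-run costs, the integer division gives the same counts as the article: 43,333, 5,416, and 2,166 runs per year.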
For single whole-genome microbial sequencing projects, they calculated that "up to three [454] sequencing machines can be supported using Amazon EC2 at current prices, using CloVR-Microbe benchmark protocols, before the estimated cost of a local cluster is reached."
They note, however, that any comparison between a local cluster setup and a cloud-based model would need to account for average utilization rates for an in-house system. "In cases where a local resource achieves a very high utilization rate, the benefits and cost savings of an on-demand model may disappear," they wrote.
Furthermore, "as multi-core CPUs are increasingly becoming accessible on the desktop computer market, the ability to process larger data on local desktops is also likely to increase in the future."
These calculations provide real dollar costs that could help guide grant application budgets for sequence analysis costs, which have been difficult to estimate in the past, Fricke said.
A Virtual Improvement
CloVR is implemented as a virtual machine — an operating system with pre-configured software in a single executable file that can be distributed and run elsewhere.
According to its developers, this approach meets the computational challenges of analyzing next-generation sequence data and also addresses the difficulties associated with installing, operating, and maintaining current open source computational tools.
Current attempts to address these challenges include systems like Galaxy and Taverna, which provide user interfaces that simplify the execution of tools and pipelines; and the IGS Annotation engine, which offers centralized web-based services specifically for microbial genome analysis, the researchers note in the BMC Bioinformatics paper. Other efforts bundle tools into software packages that can be installed on a local computer.
But even with these systems, researchers still have to decide between multiple analysis tools and protocols, the authors wrote, adding that "the complexities of analysis pipelines and lack of transparent protocols can limit reproducibility of computed results."
Furthermore, while there is a lot of enthusiasm about cloud computing, it still requires "technical expertise" to use "bioinformatics tools and pipelines on such distributed systems" and to "achieve robust operation and intended performance gains," they wrote.
CloVR tackles these issues because its applications don’t require further configuration by users, thus eliminating "complex software installations and adaptations for portable execution," the researchers explained in the paper.
Furthermore, Samuel Angiuoli, director of bioinformatics software engineering at IGS, told BioInform that tools like Galaxy focus on providing a general set of resources that researchers configure to build their own analysis tools. CloVR, by contrast, offers pipelines that are generally accepted by the microbial genomics community as standardized analysis protocols, which makes it easier for researchers new to the space, for example, to select the most useful tools for their analysis projects.
CloVR is still in beta testing, but its developers plan to provide a "polished" release later this year. They expect that it will be used by smaller microbial genomics labs, although they have seen some interest from larger sequencing centers.
Meanwhile, the researchers are working on additional pipelines for CloVR including one for RNA-seq analysis of both microbial and eukaryotic microbial genomes. They are also planning to extend its comparative analysis pipeline.
Furthermore, they plan to enable the virtual machine to support open source cloud platforms including one being developed at the UMD School of Medicine called the Data Intensive Academic Grid and an initiative led by Indiana University's Pervasive Technology Institute called FutureGrid.