The open source bioinformatics community is increasingly leveraging cloud computing resources for managing and analyzing genomic data.
This week, the Galaxy development team at Pennsylvania State University and Emory University launched a version of the web-based open-source sequence analysis platform that runs on Amazon Web Services and other cloud platforms.
The release of the cloud-enabled version of Galaxy followed last week's Genome Informatics conference at Cold Spring Harbor Laboratory, where a number of presentations highlighted bioinformatics tools that are now being offered on cloud computing resources, as well as efforts to build open source alternatives to commercial clouds.
In an article published in Nature Biotechnology this week, Galaxy's developers said the cloud-based version of the tool "allows anyone to run a private Galaxy installation on the cloud exactly replicating the functionality of the main site but without the need to share computing resources with other users."
It also includes tools like BLAST and de novo assembly software that are too computationally demanding to run on the web-based version of the platform, James Taylor, an assistant professor in Emory University's biology department and one of Galaxy's developers, told BioInform.
Moreover, users can "customize their deployment" and "retain complete control over their instance and associated data," the developers said.
"Rather than run Galaxy on one's own computer or use Penn State's servers to access Galaxy, now a researcher can harness the power of the cloud, which allows almost unlimited computing power," Anton Nekrutenko, an associate professor of biochemistry and molecular biology at PSU and one of the tool's developers, said in a statement.
The cloud version of Galaxy offers an alternative to commercial cloud-based sequence analysis services, which often "contain limited sets of analysis tools," the authors wrote. Furthermore, "because they are proprietary solutions, users must give up some control over their own data and risk becoming dependent on a single commercial service" for their analysis needs, the developers said.
The cloud version of Galaxy relies on a tool called CloudMan that the Galaxy team developed to "automate management of the underlying infrastructure cloud resources."
CloudMan handles resource acquisition, configuration, and data persistence, and allocates storage for user data, among other tasks. It also includes an "autoscaling feature" that adapts as workflow demands change to provide users with the shortest total run time and lowest cost for their analysis.
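The general idea behind autoscaling can be illustrated with a minimal Python sketch. This is not CloudMan's actual algorithm, which the article does not describe; the function names, the jobs-per-worker ratio, and the node limits are all hypothetical and chosen only for illustration. The sketch sizes the worker pool to the job queue, so nodes are added when work piles up and released when the queue drains.

```python
import math


def workers_needed(queued_jobs, jobs_per_worker=4, min_workers=0, max_workers=20):
    """Return how many worker nodes to run for the current queue length.

    All parameters are illustrative assumptions, not CloudMan settings.
    """
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp to the configured limits so cost stays bounded.
    return max(min_workers, min(max_workers, wanted))


def scaling_action(current_workers, queued_jobs, **limits):
    """Return nodes to add (positive), remove (negative), or 0 for no change."""
    return workers_needed(queued_jobs, **limits) - current_workers
```

Under these assumptions, a cluster with two workers and nine queued jobs would add one node, while an idle cluster would release all of its workers.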
Currently, Galaxy Cloud is deployed on AWS but users can move it to other clouds, such as the Eucalyptus platform, if they want.
Taylor told BioInform that it would be easier to deploy the tool on clouds that "use an infrastructure model like the Rackspace cloud" as opposed to platform-as-a-service clouds like Google's.
Bioinformatics in the Cloud
The Galaxy release joins a number of other efforts within the open source bioinformatics community to harness cloud computing resources.
At last week's Genome Informatics conference, Scott Cain, the bioinformatics software manager in the Ontario Institute for Cancer Research, discussed efforts by some participants in the Generic Model Organism Database project — a collection of open source tools for creating and managing genome-scale biological databases — to build cloud versions of their applications.
Cain, who is the GMOD project coordinator, discussed a cloud version of GBrowse2, the community's genome viewer.
GBrowse2 includes a new feature that lets users parallelize genome tracks, as well as a set of standard datasets from several common organisms that are stored in the cloud, he said.
In addition, one of the posters at the meeting described an effort led by researchers at the University of Maryland to develop a freely available cloud resource for bioinformatics analysis.
The system, called the Data Intensive Academic Grid, or DIAG, is built on the Nimbus open source framework, but it can also be accessed through the Amazon EC2 application programming interface as well as the Web Services Resource Framework-based cloud-client API.
According to the poster, DIAG includes 1,500 cores for high-throughput analysis and 160 cores connected via a low latency InfiniBand network, as well as a storage network with more than 400 TB of shared parallel storage and 400 TB of local storage. Users also have access to genomic datasets that were generated by mining public sequence repositories.
DIAG's developers said the bioinformatics community can access the resource as a platform-as-a-service using Ergatis, a web-based analysis pipeline creation and management tool that includes support for several commonly used software packages, or via an infrastructure-as-a-service model using virtual machines such as UMD's Cloud Virtual Resource, or CloVR.
In addition to CloVR, DIAG supports bioinformatics tools such as the IGS annotation engine from UMD, which provides tools for prokaryotic annotation, and the Viral Informatics Resource for Metagenome Exploration, or VIROME, a web application for exploring viral metagenome sequence data.
Another poster at the conference highlighted the use of cloud infrastructure built by the Plant Science Cyberinfrastructure Collaborative, or iPlant, program to help its members utilize cloud computing services.
The tool, dubbed Atmosphere, is built on the Eucalyptus cloud platform and lets users launch their own private virtual working environments and associated software.
It gives researchers access to a catalogue of plant data analysis tools bundled into preconfigured virtual machine images, which are launched from icons in its portal, as well as data storage through an application called iRODS.
Christos Noutsos, one of the collaborators on the project, told BioInform that the group launched the cloud infrastructure late last year.
Meanwhile, another poster described efforts by researchers from UMD's Institute for Genome Sciences, the Argonne National Laboratory, and the University of Colorado to develop a resource called the Open Science Data Framework, or OSDF, which will provide microbiome datasets and analysis tools.
According to the conference abstract, the system comprises a database, a data exchange format, and an API that will enable community members to submit and retrieve data. Initially, the developers plan to offer approximately 700 human microbiome samples, as well as samples from model organisms and the environment. Users will also have access to tools to query, analyze, and display the data.
OSDF relies on a combination of cloud computing infrastructure and distributed data storage.
Another cloud computing-based tool described in a poster at the meeting was DawgPack, developed by a team from the University of Georgia. The tool, which runs on an Amazon EC2 Cluster Compute machine, uses a variant of the bi-directional Burrows-Wheeler transform to align next-generation sequences to a reference genome using multiple nodes and then aggregates the results.
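For readers unfamiliar with the technique, the following Python sketch shows the plain, one-directional Burrows-Wheeler transform and the backward search that lets an aligner count exact matches of a read in a reference. It is a textbook simplification, not DawgPack's bi-directional variant or its multi-node code, and the function names are hypothetical. Production aligners build the transform from a suffix array and precompute the occurrence counts rather than sorting rotations and rescanning the string as this sketch does.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text, using '$' as terminator."""
    s = text + "$"
    # Sort all cyclic rotations and keep the last character of each.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c]: row of the first rotation that starts with character c.
    first = {}
    for row, char in enumerate(sorted(bwt_str)):
        first.setdefault(char, row)

    def occ(char, row):
        # Number of times char appears in the BWT before the given row.
        return bwt_str[:row].count(char)

    lo, hi = 0, len(bwt_str)
    # Extend the match one character at a time, right to left.
    for char in reversed(pattern):
        if char not in first:
            return 0
        lo = first[char] + occ(char, lo)
        hi = first[char] + occ(char, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_matches(bwt("ACGTACGT"), "CGT")` finds both copies of the read without scanning the reference itself.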
Another tool, Jnomics, developed by researchers at CSHL and Stony Brook University, is a cloud-scale sequence analysis suite based on Google's MapReduce framework that allows users to deploy and execute genome analysis pipelines that are distributed across multiple resources.
A poster on Jnomics at the meeting described a case study in which the suite was used for stepwise paired read alignment and to detect structural variations related to cancer in several genomes.
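The MapReduce model that underlies tools of this kind can be sketched in a few lines of single-process Python. This is not Jnomics code; the function names and the `(chromosome, position)` record layout are assumptions made for illustration. A map step emits key-value pairs for each aligned read, a shuffle step groups the pairs by key, and a reduce step aggregates each group. In a real framework, the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_read(alignment):
    """Map step: emit a (chromosome, 1) pair for one aligned read."""
    chrom, _position = alignment
    yield chrom, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one chromosome."""
    return key, sum(values)


def run_job(alignments):
    """Run map, shuffle, and reduce in sequence over all alignments."""
    pairs = (pair for aln in alignments for pair in map_read(aln))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

Given three hypothetical alignments, two on chr1 and one on chr2, `run_job` returns the read count per chromosome.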
A Word of Caution
The obvious attraction of cloud computing is that it offers researchers hardware on an as-needed basis, which is often a much cheaper alternative to in-house compute clusters — especially for smaller research labs.
For example, in the Galaxy paper in Nature Biotechnology, the authors reported that they were able to analyze 45 gigabytes of sequence data in 15 hours for $25, "using nothing but a web browser."
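The reported figures imply per-unit rates that are easy to work out. The short Python sketch below does the arithmetic using only the three numbers quoted from the paper; the function name is hypothetical, and the results describe that single example rather than cloud pricing in general.

```python
def per_unit_rates(gigabytes, hours, dollars):
    """Derive simple cost and throughput rates from one reported run."""
    return {
        "dollars_per_gb": dollars / gigabytes,
        "dollars_per_hour": dollars / hours,
        "gb_per_hour": gigabytes / hours,
    }


# Figures quoted from the Galaxy paper: 45 GB, 15 hours, $25.
rates = per_unit_rates(gigabytes=45, hours=15, dollars=25)
```

That works out to roughly $0.56 per gigabyte and $1.67 per hour, at a throughput of 3 GB per hour.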
Other studies, such as one published last year by bioinformaticians at Stanford University School of Medicine, have found cloud-based analysis to be a low-cost and sustainable computational option for labs that don't have their own clusters (BI 8/27/2010).
Others have noted that labs with large in-house computational infrastructures would not benefit from cloud-based analysis. For example, Toby Bloom, director of informatics for the Broad Institute's genome sequencing platform, said earlier this year that the cloud does not appear to be a cost-effective option for large sequencing centers that require almost constant compute power to manage and move files up to 1 terabyte in size (BI 4/15/2010).
The Galaxy authors acknowledge in their paper that cloud computing resources may not be cost effective "for all usage scenarios," noting that the workflow used in their example was already developed and ready to be executed. That isn't the case for many analysis workflows, which have to be refined before they are ready for use, a process that can significantly increase the cost and time necessary for the analysis.
Additionally, there are limitations with services from current cloud providers, the Galaxy team said. For example, they noted that the largest memory instances that Amazon provides aren't sufficient to run some de novo assemblers.