Whether you’re a scientist who only occasionally needs access to high-performance compute resources or you manage a megacluster and see demand spiking beyond your capacity, cloud computing could be the answer for you.
First, let’s dispense with the fanciful notions of Beowulf clusters aloft in the stratosphere and the rest of the high-flying hype surrounding this increasingly popular computing model. A compute “cloud” is essentially a large network of servers running in parallel and accessible to users via an Internet connection. Data, applications, and computation all reside in the so-called compute cloud. This is the type of network architecture that powers Google’s services, including its Web-based applications, such as Google Documents. Users need only be concerned with the particular application they are using, not the software behind it or where the data is being stored. The server cloud itself is managed by software that handles provisioning and scheduling, and monitors client usage.
Despite its ephemeral imagery, the compute cloud has become a very real business for companies such as Google, Yahoo, IBM, and Amazon. Some of these companies are taking a utility computing approach by offering pay-as-you-go services, such as Amazon’s Elastic Compute Cloud (EC2). As with many compute clouds, EC2 makes use of virtualization software in order to maximize server utilization. In much the same way that virtualization software allows desktop users to launch multiple operating systems within the native operating system on their desktop, such as running Windows or Linux inside Mac OS X, virtualization enables compute cloud hosts to create multiple virtual machines on a single server.
EC2 users can log on to the cloud and create as many virtual Linux compute nodes as they need, complete with the desired processor speed and memory specs. According to the EC2 website, each node has the equivalent CPU power of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Various pricing configurations range from $0.10 per hour with 1.7 GB of memory and one virtual computer running on a 32-bit platform with 160 GB of storage, to $0.80 per hour with 15 GB of memory, eight virtual machines on a 64-bit platform, and almost 1.5 terabytes of storage capacity. When you have finished, the cloud simply tallies your charges for the duration and type of usage. Think of it as renting your very own Beowulf cluster that always has your data and your programs, but without the mess of having to maintain it.
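The pay-as-you-go arithmetic behind those rates is simple to sketch. The snippet below is a hypothetical illustration, not an Amazon tool; it uses only the two price points quoted above and ignores real-world details such as billing granularity and data-transfer charges:

```python
# Hypothetical cost estimator built from the two EC2 rates quoted above.
# Rates are dollars per instance-hour.
RATES = {
    "small": 0.10,        # 1.7 GB RAM, one virtual machine, 32-bit, 160 GB disk
    "extra_large": 0.80,  # 15 GB RAM, eight virtual machines, 64-bit, ~1.5 TB disk
}

def estimate_cost(instance_type, hours, count=1):
    """Return the total charge, in dollars, for running
    `count` instances of `instance_type` for `hours` hours."""
    return round(RATES[instance_type] * hours * count, 2)

# A weekend-long (48-hour) run on one extra-large node:
print(estimate_cost("extra_large", 48))       # 38.4
# The same 48 hours across ten small nodes:
print(estimate_cost("small", 48, count=10))   # 48.0
```

Even a multi-day run on the largest configuration comes to tens of dollars, which is the economic argument for renting rather than buying a cluster.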
Cloud computing may serve as a viable alternative for bioinformatics researchers and biologists by addressing the age-old problem of the costs associated with setting up and managing supercomputers or clusters and the need for ever-expanding data storage. If all of the hardware and storage is no longer your problem, and using bioinformatics tools becomes as easy as setting up a Facebook account, why not? And for the researcher who only occasionally needs large computing resources, an on-demand model becomes far more cost-effective than buying a system.
Last fall, Google and IBM launched a university initiative to encourage Web-based application development and parallel programming for large-scale compute architectures such as clouds. The University of Washington is serving as the pilot program, and Carnegie Mellon University, Stanford University, and the Massachusetts Institute of Technology have also been named as participants. The companies have established a dedicated data center composed of IBM servers that will contain more than 1,600 processors.
The program aims to introduce computer science students to the Java software framework Hadoop, an open source implementation of Google’s own computing infrastructure, including MapReduce, a software framework that supports parallel computing across large clusters of computers, and the Google File System.
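The MapReduce model itself is easy to sketch: a map step emits key-value pairs from raw input, and a reduce step aggregates the values for each key. The toy word-count below mimics those two phases in plain Python rather than the actual Hadoop API (which is Java-based); in Hadoop, the map and reduce calls would be distributed across the cluster’s nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: group the pairs by key and sum the counts per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the cluster in the cloud"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'cloud': 2, 'cluster': 1, 'in': 1}
```

Because each document can be mapped independently and each word’s counts can be reduced independently, the same pattern scales to petabytes of Gmail messages or gene sequences spread over thousands of machines.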
“These cloud systems are very information-intensive, and they are able to process petabytes’ worth of data — be it Gmail messages or gene sequences,” says Dennis Quan, chief technology officer of on-demand computing at IBM. “I think you’re going to see use of cloud approaches like MapReduce to solve some of these [large-scale] problems moving forward.” IBM is slated to kick off its own cloud offering later this year, which will be geared toward providing clients with the tools to set up a compute cloud within their own data centers.
Randal Bryant, dean of the School of Computer Science at Carnegie Mellon University, says that many biologists’ ever-growing need for storage capacity can be addressed by the use of Web applications and compute cloud technology. “I think that Google and the other search engine companies have devised ways of constructing, organizing, and operating large-scale systems that could revolutionize many scientific disciplines, including the life sciences,” according to Bryant. “I’m glad to see these companies making facilities available to university researchers and students.”
Carnegie Mellon is also participating in the M45 initiative sponsored by Yahoo, another academic program aimed at inculcating young programmers with cloud computing know-how. M45 is a 4,000-processor cluster with 1.5 petabytes of storage, 3 terabytes of memory, and a peak performance of 27 trillion calculations per second.
Dennis Gannon, a computer science professor at Indiana University, and graduate student Jong Youl Choi recently demonstrated a virtual lab application specifically designed for use on compute clouds such as EC2. Their Virtual Collaborative Lab (V-Lab Protein) is a Web-based virtual lab with a graphical user interface for protein sequence analysis that allows users to create compute nodes on the fly, drawing on multiple databases and analysis tools.
“The idea is that if you have a group of people that want to collaborate on something but they’re not on the same campus or even in the same country,” says Gannon. “What users can do with this cloud model is very easily throw together a collaboration where they have secure, shared access to data and enormous computing power.” The V-Lab system is still in development, but the group aims to have an official release later in the year.
In December, Michael Cariaso, a biotech consultant with BioTeam, released RunBlast, a version of BLAST specifically tailored for EC2. According to Cariaso, RunBlast gives researchers a much cheaper alternative to dropping $50,000 to get a cluster up and running. Cariaso has also demonstrated that it is possible to run mpiBlast and ClustalW on the Watson and Venter genome sequences, all for less than $50 on EC2.
Gannon says that many of his bioinformatics colleagues have a real need for high-throughput genomics, at low cost, for massive-scale molecular modeling or pathway analysis. “They are very excited about doing this on a very large scale. Once you configure one of these virtual machines to do some of this analysis, you can upload a hundred images all working in parallel on different parts of a data set. You have quite a bit of computing power without having to buy a cluster, so that’s an enormously enabling thing for a lot of people who do not have the access or funding,” he says.
One of the biggest concerns about cloud computing is privacy and security. Critics argue that many users will feel uneasy about having their precious data stored offsite. But Gannon says there is very little difference between a compute cloud’s security and that of a research institute or university. “The security issue is definitely one that is significant, but I find it amusing that people would think their campus cluster is more secure than Amazon’s system or any other large-scale cloud computing system,” he says. “You can put lots of different levels of security on this stuff, including encrypted data or a type of protection you’d get on a campus machine.”