Like the beginning of a rainstorm as the first few drops go from lonely pitter-pats to a torrential downpour, cloud computing has been steadily picking up momentum and announcing itself loud and clear. While undoubtedly the biggest player is still Amazon Web Services — which includes both the pay-to-play Elastic Compute Cloud (EC2) and the Amazon Simple Storage Service (S3) — teams from Google, Microsoft, IBM, Sun Microsystems, and even NASA are all jockeying for position in this growing market.
The popularity of cloud computing in the life sciences community was on full display at the Bio-IT World conference back in April, where IT folks from pharmas such as Pfizer, Genentech, and Johnson & Johnson, as well as clinicians and academic researchers, were all cautiously optimistic about the potential of cloud computing to ramp up their research. Other signs of interest include the National Science Foundation's $5 million in grant funding allowing 14 universities to participate in the IBM/Google Cloud Computing University Initiative, a partnership that kicked off in 2007 to familiarize up-and-coming computer science students with cloud technology. In late April of this year, Amazon announced that it would start accepting grant applications for cloud computing projects and would provide researchers on winning projects with free access to its services.
When it comes to preaching the gospel of cloud computing to the average bench biologist, proponents tout access to readily available, cheap, seemingly limitless amounts of compute power and storage without ever having to leave the lab or worry about IT considerations. The idea is tantalizing: a researcher can have high-grade computing power for only as long as it's needed, then log out of the cloud computing account and no longer be responsible for the upkeep of a server or cluster room, the networking infrastructure, or power consumption. It also helps level the playing field for those without large IT departments and huge grant budgets.
"Cloud computing democratizes HPC, so the smallest organization or even an individual can have a cluster that's on the same scale as a large pharma, if only for a few hours to run a calculation. … It really makes it a lot more accessible to everybody," says Jason Stowe, CEO of Cycle Computing, a cloud computing startup. "The big thing that's really going to affect us here are these next-generation sequencers and all the various forms of simulation in genomics and proteomics, so it's helpful for individual researchers to not have to worry about the fact that their IT infrastructure may not be utilized all the time. … That really changes the way people consume computation."
Cycle Computing is focused on being the middleman between the average researcher and the big cloud providers like Amazon or Google. The company's offering, called Cycle Cloud, is a software solution that claims to make provisioning compute instances on Amazon's EC2 a seamless endeavor.
Varian, a scientific toolmaker, recently used Cycle Cloud to simulate the design of a mass spectrometer. Usually a simulation like this would take more than six weeks on the company's internal cluster, but the designers were able to complete the simulation in a day using compute instances on EC2. "The question is, you can rent 1,000 instances, but now what? Imagine I drop 1,000 servers into your backyard and they were magically all powered and networked, but you don't have the operating system setup, the authentication, you don't have any of the standard software stacks — all the stuff that's involved there is not trivial to get working," says Stowe. "So the problem is how to do this in a way that doesn't involve the scientists having to know anything about all the complicated stuff that's going on in the back."
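Stowe's "1,000 servers in your backyard" problem — operating system setup, authentication, software stacks — is exactly the boilerplate that provisioning tools script away. A minimal sketch of the idea, generating a user-data bootstrap script for each freshly rented node (the package names and paths here are illustrative, not Cycle Cloud's actual implementation):

```python
def build_user_data(packages, nfs_server, queue_master):
    """Assemble a bootstrap shell script that turns a bare cloud node
    into a cluster member: install the software stack, mount shared
    storage, and point the node at the job scheduler.
    All names and paths are hypothetical."""
    lines = [
        "#!/bin/bash",
        # Install the scientific software stack the jobs expect.
        "yum install -y " + " ".join(packages),
        # Mount the shared filesystem so every node sees the same data.
        "mount -t nfs %s:/data /mnt/data" % nfs_server,
        # Register with the scheduler so the node can accept work.
        "echo 'master = %s' >> /etc/cluster.conf" % queue_master,
    ]
    return "\n".join(lines)

script = build_user_data(["blast", "openmpi"], "10.0.0.5", "10.0.0.1")
print(script)
```

The same script can then be handed to every instance at launch, so a thousand nodes come up identically configured without any per-machine work.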
Cloud computing is even bringing out an unusual spirit of collaboration among pharmaceutical companies. For example, Rick Franckowiak, director of the technology office at Johnson & Johnson Pharmaceutical Research & Development, has worked closely with his counterparts at Eli Lilly and Merck to explore cloud issues. "We're looking to actively partner with a lot of other pharmas around how to do [cloud computing] as an industry ... Right now, part of the challenge is just around education, and the fact that there's so much being written and so much being discussed on the cloud. The first thing we have to do is educate people on what cloud computing is and what it is not, because there are a lot of perceptions, and not all of them are accurate," says Franckowiak. "I think the industry as a whole profits by collectively bringing these capabilities to the masses, and we wouldn't want to do it by ourselves."
For pharmas, cloud computing can solve the problem of overtaxed internal resources. "Currently, we have a pretty decent-size grid infrastructure here, but the one problem we run into is when we have a certain job coming up on submission time, we get a peak demand and the grid gets flooded, and we don't have enough compute capacity," Franckowiak says. "So we thought it would be great to have a spillover environment that can handle these peak capacities ... and take advantage of a pay-per-use capacity."
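The spillover model Franckowiak describes comes down to a simple scheduling decision: absorb what the in-house grid can, and rent only enough pay-per-use nodes for the overflow. A toy policy sketch (the thresholds and the policy itself are illustrative, not J&J's actual setup):

```python
def spillover(queued_jobs, local_slots, jobs_per_node, max_cloud_nodes):
    """Decide how many pay-per-use cloud nodes to rent when the
    internal grid saturates. Illustrative burst policy only."""
    overflow = max(0, queued_jobs - local_slots)
    if overflow == 0:
        return 0  # the in-house grid can absorb the whole load
    # Round up: a partially used node still costs a full node-hour.
    needed = -(-overflow // jobs_per_node)
    # Cap the spend no matter how deep the queue gets.
    return min(needed, max_cloud_nodes)

# At submission-deadline peaks the queue jumps past local capacity:
print(spillover(queued_jobs=500, local_slots=400,
                jobs_per_node=8, max_cloud_nodes=20))
```

Off-peak, the function returns zero and nothing is rented — which is the whole economic point of pay-per-use capacity.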
Security and latency
But cloud computing isn't all puffy, floating happiness: the question of how secure this environment really is still raises some eyebrows.
Right now, Franckowiak and his colleagues are just testing the water with non-critical data. "We're not taking the big risks that could jeopardize IP, so we're working with our security folks to get an understanding of these environments," he says. "It's not like we could traditionally put a wall across our network and assume that everything inside was safe."
The very idea that one's private data does not live onsite is not only a concern for those working in IP-sensitive environments like pharma, but for academic institutions as well. "At Stanford, we would like to keep everything in-house where we have firewalls," says Baback Gharizadeh, a research associate at the Genome Technology Center. "We've had a lot of hacking before so … we prefer to manage our data ourselves."
The Stanford center gives high priority to tight IT security, which is why the team opted for an increasingly popular model of the technology called private cloud computing. In the private cloud, the site owns and maintains hardware upon which a cloud is hosted — all within the confines of its own firewall — but the users' experience is essentially the same as a "public" cloud like that of Google or Amazon.
The private cloud model is attractive not only because of security, but also because it may help address some of the networking and latency problems involved in getting large amounts of data onto a public storage cloud. ParaScale, another cloud computing startup, aims to address these issues with a software solution that lets a site build its own internal cloud storage. "I have hundreds of terabytes, if not petabytes of data, and there's no way I can push that over an Internet connection, so from our perspective, this is an architecture for a platform that can be leveraged in life sciences to simplify management and scale to data sets that are common," says Mike Maxey, a product manager at ParaScale. "With storage it's very rare that you spin up a 100 terabytes and then throw that data away, so it's a little bit of a different paradigm when it comes to storage versus CPU."
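Maxey's point about pushing data over an Internet connection is easy to make concrete with back-of-the-envelope arithmetic. A quick estimate of wall-clock transfer time for a given data set and link speed (the 70% efficiency factor is an assumed discount for protocol overhead and contention):

```python
def transfer_days(terabytes, link_gbps, efficiency=0.7):
    """Rough wall-clock days to push a data set over a WAN link.
    `efficiency` is an assumed factor for protocol overhead."""
    bits = terabytes * 8 * 10**12              # decimal TB -> bits
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 86400                     # seconds -> days

print(round(transfer_days(100, 1), 1))  # 100 TB over a 1 Gbps link
```

For 100 TB over a well-utilized 1 Gbps link this works out to roughly two weeks, which is why petabyte-scale sequence data tends to stay behind the firewall where it was generated.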
At first glance, it may seem that private clouds negate the cost-saving argument of cloud computing since they require users to maintain hardware, but this is often not the case. It is possible to get a cloud environment up and running by recycling old PCs and servers that are just collecting dust somewhere in an institution. "A lot of it boils down to what do you have in house," Maxey says. "We have a lot of bioinformatics customers that have older equipment coming out of their compute farm or different storage usages that they want to repurpose and load their software on and build a cloud around, so the hardware is always an open question."
Stanford's Genome Technology Center, which can create more than 15 terabytes of sequence data per day, is using ParaScale's cloud storage solution as a scalable and cheap alternative to hardware storage. Gharizadeh says that relying on servers or hard drives to keep next-generation sequencing data can be risky. "If your server fails or hard drive crashes, then you lose all your data. … The cost of sequencing, if you include everything, all the labor, it's about $15,000 [for] each run — so if you have a lot of runs, you could suddenly lose a lot of data," he says. Cloud computing lets you store "the data on different computers, so if one of them crashes, you can have it on another server."
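The redundancy Gharizadeh describes — the same data living on several machines so one crash loses nothing — is typically automated by a placement function that deterministically maps each object to a set of storage nodes. A minimal sketch of hash-based replica placement (illustrative only, not ParaScale's actual algorithm):

```python
import hashlib

def replica_nodes(object_name, nodes, copies=3):
    """Pick which storage nodes hold copies of an object.
    Hashing the name spreads objects evenly across the cluster;
    keeping several copies survives any single node failure.
    Illustrative placement scheme, not a real product's."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    start = int(digest, 16) % len(nodes)
    # Place copies on consecutive distinct nodes after the hash slot.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
print(replica_nodes("run42.fastq", nodes))
```

Because the mapping is deterministic, any client can recompute where a sequencing run lives without consulting a central index.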
Software coming along
Not surprisingly, bioinformatics software developers are designing and porting analysis tools to take advantage of cloud computing. At the start of the year, cloud computing startup Cumulo announced a turnkey Blast solution that allows individual researchers to have their own Blast servers on the Amazon EC2 cloud. Cumulo Blast offers researchers an alternative to the often-congested Blast servers at the National Center for Biotechnology Information.
Meanwhile, developers at the Medical College of Wisconsin Proteomics Center recently announced the release of the Virtual Proteomics Data Analysis Cluster, a proteomics data analysis tool that runs on Amazon's EC2 and S3 services. Users simply log into their EC2 accounts to start up their own personal copies of the ViPDAC server, select a set of preconfigured analysis parameters or create their own, choose a data file type, and submit their job. Results can be stored on S3 or downloaded to the user's desktop. "We have a proteomics center and we have a cluster and that's great … but one of the challenges we've had is if you want to expand that and do something different with it, it's expensive to buy new nodes [and then] licensing fees for the software often become an issue when you really want to put a big cluster together," says Simon Twigger, an assistant professor of physiology at Wisconsin.
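The preset-or-custom workflow described above — pick a preconfigured parameter set, optionally tweak it, and get results written back to S3 — can be sketched in a few lines. The parameter names, preset values, and S3 key layout here are hypothetical stand-ins, not ViPDAC's actual options:

```python
PRESETS = {
    # Hypothetical preconfigured parameter sets, in the spirit of
    # a preset menu; the real tool's options differ.
    "quick":    {"missed_cleavages": 1, "mass_tolerance_ppm": 50},
    "thorough": {"missed_cleavages": 3, "mass_tolerance_ppm": 10},
}

def build_job(user, data_file, preset="quick", overrides=None):
    """Assemble a job description: start from a preset, apply any
    user tweaks, and lay out an S3 key for the results."""
    params = dict(PRESETS[preset])       # copy, so presets stay intact
    params.update(overrides or {})       # user-supplied tweaks win
    return {
        "input": data_file,
        "params": params,
        "s3_result_key": "%s/results/%s.out" % (user, data_file),
    }

job = build_job("alice", "run1.mzXML", "thorough",
                overrides={"mass_tolerance_ppm": 5})
print(job["s3_result_key"])
```

Keeping results under a per-user S3 prefix is what lets each researcher's personal server instance stay stateless: the instance can be shut down the moment the job finishes.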
The ViPDAC instances on EC2 are publicly available, further underscoring the idea of democratizing HPC and giving everyone access to robust analysis tools, says Twigger. "Anyone with a credit card can fire it up and start doing proteomics analysis in 10 minutes' time — high school students, anybody, and that's the cool element of this almost as a software distribution mechanism," he says. "The fact that we can take such a complex set of software packages and put it out there for anybody on the planet to just push the button and make it work, that's not something that we really had the ability to do. … That availability and capacity of computing and complexity of computing when delivered that way is cool as a way to think about the democratization of computing resources."