Genomes, Clouds, and No Headaches

By Matthew Dublin

Probably the best sound bite from day two of the Bio-IT World Expo in Boston was provided by Nicholas Socci, assistant director of the Bioinformatics Core at Memorial Sloan Kettering Cancer Center: “Either the computers are ready for me to use, in the way that I want to use them, or they’re not ready, and those are the only real pros and cons.”

The other pros and cons that Socci was dismissing are the often-cited default debating points about what cloud computing brings to the table for researchers (scalability, no hardware ownership costs, etc.) and what it lacks (security concerns, application porting issues, etc.). But for Socci, cloud computing is only worth using if it requires absolutely nothing from him or his IT staff. “If I have to worry about getting data up into all kinds of clouds I will never get anything done,” said Socci. “Up until this point, I have completely resisted using the cloud because if the cloud doesn’t allow me to run what I’m already running, then it’s no use to me.” The turning point for Socci was a solution whereby Life Technologies’ LifeScope Genomics Analysis Software is hosted on Penguin Computing’s POD (Penguin On Demand) cloud computing service.

Socci went on to impress upon the audience that what next-generation sequencing analysis really needs is cloud computing. He meant this not only in the sense that the cloud, as an elastic storage option, could provide relief from the massive amount of data being generated by NGS platforms, but also in the sense that it gives investigators and IT staff a better way to manage an increasingly diverse range of data types from new applications. For Socci, the cloud is essentially about harnessing people power because, as he exclaimed during his talk: “We have too many things to do and too many new things to do with all this next-generation sequencing data!” He pointed out that NGS is creating an environment wherein collaboration is the name of the game, as people with a wide range of expertise are increasingly called upon to handle and analyze the data.

Following up on the idea that a cloud that makes you think very hard about getting things to work properly is not worth the trouble, Angel Pizarro, director of the ITMAT bioinformatics facility at the University of Pennsylvania School of Medicine, said the three pillars of cloud computing for life science research are: automatic provisioning of compute instances, automatic configuration of those instances with your applications of choice, and automatic execution (i.e., it should just work when you need it to, no excuses). Without a seamless automated workflow that can be initialized at a moment's notice, the cloud is pointless. This is why Pizarro is a big champion of the Chef project, an open source systems integration framework built to bring the benefits of configuration management to an entire infrastructure. The basic idea behind Chef, which Pizarro admitted takes some training to use, is that users write source code that constructs an automated infrastructure on any server (or any cloud computing infrastructure) they like.
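The three pillars described above can be illustrated with a minimal Python sketch. It is not code from the talk and it does not use any real cloud or configuration management API: the `Instance` class, the function names, and the package names are all hypothetical stand-ins that exist only in memory. The one idea it borrows from configuration management in general is that the configuration step describes a desired state and only applies what is missing, so it can be rerun safely.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud compute instance (in-memory only)."""
    name: str
    packages: set = field(default_factory=set)


def provision(count):
    """Pillar 1: create the requested number of instances."""
    return [Instance(name=f"node-{i}") for i in range(count)]


def configure(instance, desired_packages):
    """Pillar 2: bring an instance to the desired state.

    Only packages that are missing get added, so calling this twice
    changes nothing the second time. Returns what was added on this call.
    """
    missing = set(desired_packages) - instance.packages
    instance.packages |= missing
    return missing


def execute(instances, job):
    """Pillar 3: run the named job on every configured instance."""
    return [f"{inst.name}: ran {job}" for inst in instances]


def run_workflow(count, desired_packages, job):
    """Chain the three pillars with no manual step in between."""
    instances = provision(count)
    for inst in instances:
        configure(inst, desired_packages)
    return execute(instances, job)


if __name__ == "__main__":
    # Package and job names are made up for illustration.
    for line in run_workflow(2, ["aligner", "assembler"], "assembly"):
        print(line)
```

In a real deployment each function would call out to a cloud provider or a configuration management tool; the point of the sketch is only that the whole chain runs from a single call.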

A measured talk presented by Victor Jongeneel, a senior research scientist at both the Institute for Genomic Biology (IGB) and the National Center for Supercomputing Applications (NCSA), further explored the role of cloud computing in genomics by examining whether the cloud is a good environment for genome assembly. Jongeneel reported performance benchmarks of the genome assembly algorithms Velvet, ABySS, and Contrail on Amazon EC2 instances and on large-memory local compute clusters, assembling E. coli and S. pombe genomes. While the cloud performs as well as some of the NCSA’s local compute clusters on smaller genome assemblies, Jongeneel pointed out that there are currently no available software implementations of highly parallel assembly for large genomes that would make using the cloud worthwhile. In the future, he and his colleagues plan to develop genome assembly methods that can use large numbers of cores for high-throughput assembly.