During the Bio-IT World conference in Boston this week, researchers from the Broad Institute and the University of Illinois at Urbana-Champaign discussed two separate evaluations of the ability of cloud computing to serve as an alternative to in-house clusters for analyzing next-generation sequencing data.
Specifically, the teams explored the cloud's ability to address two vital areas in the genomics research space: whether the cloud infrastructure can support high-throughput data analysis pipelines designed for next-generation sequencing data; and whether genome assemblers are able to scale effectively for the cloud environment.
In these presentations and others throughout the conference, cloud proponents stated the oft-cited benefits of the approach — namely that it is a valuable option for small research centers that lack the resources to purchase and maintain in-house infrastructure, as well as for researchers who don’t require sustained compute power but need just enough to handle spikes in their data generation.
Most speakers agreed that for large centers like the Broad, which require almost constant compute power to manage and move files ranging from 1 gigabyte to 1 terabyte in size, the cloud, at least for the present, does not seem to be a cost-effective option.
Nevertheless, the institute decided to explore the approach. In an interview with BioInform conducted prior to the conference, Toby Bloom, director of informatics for the Broad's genome sequencing platform, noted that there are things that the cloud "does well and things it doesn’t do so well," and her evaluation was aimed at addressing both sides of this coin.
She added that the institute also wanted to look into the technology because it is well-suited for large collaborative projects, such as the 1000 Genomes Project, which involve many centers contributing and analyzing sequence data. Relying on a shared resource like the cloud for such efforts is likely to reduce the number of copies of data files — typically tens and sometimes hundreds of gigabytes in size — that are generated and could also reduce expenses associated with storing the data, she said.
In its evaluation, the Broad team performed primary analysis on exome data from the 1000 Genomes Project on Amazon Web Services, doing everything from raw reads through to alignments as well as some secondary analysis of the data.
Bloom said the evaluation showed that research groups must take several factors into account before making the move to the cloud.
For example, she noted that the cloud currently offers a "somewhat narrow range" of computational architectures. This can lead to problems because software that runs on local clusters may not run on the architectures available in the cloud unless some tweaks are made.
The difficulty with this, Bloom said, is that researchers end up running two different versions of the same software; when updates or changes are made to one version, they must be duplicated in the other to maintain consistency, which adds overhead costs.
Furthermore, running high-throughput analysis pipelines on large volumes of data in the cloud presents challenges that are not immediately apparent with smaller datasets.
For example, Bloom noted that as data moves through the Broad's analysis pipeline, each step requires a different number of cores — a scenario that works well with an in-house cluster but runs into trouble in the cloud environment.
The institute's data is stored in large file systems that are accessible to an in-house compute farm, making it easy to access the data no matter where it is. "If you are running a pipeline and you have fifteen steps to do on this one piece of data and you need different amounts of memory and compute and disk space at each different step, you can assign step one to this node and it goes and gets the data it needs and runs, writes out the results, and step two can run somewhere else using the data from step one," she told BioInform. All this is done without having to physically move the data at each step.
Conversely, "on the cloud, I have to move the data onto the compute node before I start computing," she said. "That means that [if I use a] node that has only two cores in step one and in step two I need eight cores, I have to go move [the data] again."
Both options have downsides, she noted: accessing data over a networked file system in the cloud carries a performance hit compared with a node's much faster local storage, while moving the data onto each node at every step of the analysis can be a slow and expensive process.
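To make the contrast concrete, the following minimal Python sketch compares how much data has to be copied when every node can mount a shared file system versus when each step's input must first be staged onto its compute node. It is not Broad code; the step names, core counts, and data sizes are hypothetical illustrations.

```python
# Minimal sketch of the data-movement difference described above.
# Step names, core counts, and data sizes are hypothetical, not the
# Broad's actual pipeline.

from dataclasses import dataclass

@dataclass
class Step:
    name: str
    cores_needed: int
    input_gb: float  # size of the data this step must read

PIPELINE = [
    Step("align_reads", 2, 200.0),
    Step("mark_duplicates", 8, 150.0),
    Step("recalibrate", 4, 150.0),
    Step("call_variants", 8, 50.0),
]

def data_moved_shared_fs(steps):
    """In-house model: every node mounts the same file system, so the
    scheduler can place each step on a node with the right core count
    and no input ever has to be copied between nodes."""
    return 0.0

def data_moved_cloud(steps):
    """Cloud model described by Bloom: input must sit on the compute
    node, so switching to a differently sized node between steps means
    staging the data onto the new node again."""
    moved_gb = 0.0
    current_node_cores = None
    for step in steps:
        if step.cores_needed != current_node_cores:
            moved_gb += step.input_gb   # copy the input onto the new node
            current_node_cores = step.cores_needed
    return moved_gb

if __name__ == "__main__":
    print("GB copied between nodes, shared file system:", data_moved_shared_fs(PIPELINE))
    print("GB copied between nodes, per-node staging:  ", data_moved_cloud(PIPELINE))
```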
In her talk at Bio-IT World, Bloom mentioned some potential solutions for handling data in uneven pipelines. These included network file system servers and Gluster, a scale-out network attached storage solution.
Bloom said that it is difficult to do a cost comparison between cloud computing and a local infrastructure due to extenuating factors such as the cost of maintaining "extra headroom" on local clusters to ensure that enough compute power is available at all times, even though it often goes unused. Long-term data storage costs also make a head-to-head comparison difficult, she said.
Other groups have looked into the comparative costs of cloud-based computing and in-house systems. Last summer, researchers at Stanford University published a paper in Genome Medicine in which they compared the costs of maintaining a local cluster at their lab versus purchasing compute power on an as-needed basis (BI 08/10/2010).
For their test project, they determined that the cloud-based model cost about three times more and took 12 hours longer, but once they factored in the costs of hardware, software, and personnel to manage a local cluster, they came to the conclusion that cloud computing is a cheaper and more sustainable method for researchers who need to analyze large datasets sporadically.
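As a back-of-the-envelope illustration of why utilization, or "headroom," dominates such comparisons, the sketch below amortizes a local cluster's cost over its usable core-hours and compares it with on-demand instance pricing. Every figure is a made-up placeholder rather than a number from the Broad or Stanford studies.

```python
# Back-of-the-envelope cost comparison; every figure below is a
# hypothetical placeholder, not data from either study.

def local_cost_per_core_hour(hardware_usd, admin_usd_per_year, lifetime_years,
                             cores, utilization):
    """Amortized cost of a core-hour on an in-house cluster.
    Low utilization ('extra headroom') raises the effective price."""
    total_usd = hardware_usd + admin_usd_per_year * lifetime_years
    usable_core_hours = cores * 24 * 365 * lifetime_years * utilization
    return total_usd / usable_core_hours

def cloud_cost_per_core_hour(instance_usd_per_hour, cores_per_instance,
                             data_transfer_usd_per_run, core_hours_per_run):
    """On-demand cost of a core-hour, with data-transfer charges folded in."""
    compute = instance_usd_per_hour / cores_per_instance
    transfer = data_transfer_usd_per_run / core_hours_per_run
    return compute + transfer

if __name__ == "__main__":
    # A heavily used cluster versus one kept mostly idle as headroom.
    for utilization in (0.9, 0.3):
        local = local_cost_per_core_hour(
            hardware_usd=250_000, admin_usd_per_year=100_000,
            lifetime_years=3, cores=512, utilization=utilization)
        print(f"local @ {utilization:.0%} utilization: ${local:.3f}/core-hour")

    cloud = cloud_cost_per_core_hour(
        instance_usd_per_hour=1.60, cores_per_instance=8,
        data_transfer_usd_per_run=90.0, core_hours_per_run=1_000)
    print(f"cloud on-demand:           ${cloud:.3f}/core-hour")
```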
Assembling Sequences in the Cloud
In a separate study discussed at Bio-IT World, Victor Jongeneel, senior research scientist at the Institute for Genomic Biology at UI Urbana-Champaign, evaluated three de novo genome assembly software packages — Velvet, ABySS, and Contrail — on both Amazon's EC2 and a local compute infrastructure in order to compare how well the software runs in each environment.
For the study, Jongeneel's team evaluated the algorithms on data from Escherichia coli, Schizosaccharomyces pombe, and human chromosome 10 using compute capabilities on Amazon's cloud that were comparable to those of their in-house cluster.
The study found that, in general, all three software packages worked as well on the cloud as on the local cluster, though there were performance differences among the three algorithms.
When Velvet was run on EC2 and on the local infrastructure, it assembled genomes in roughly the same amount of time and produced good-quality assemblies in both environments, Jongeneel said.
For example, Velvet produced a 2.6-megabase E. coli assembly in roughly 10 minutes on both EC2 and the local cluster, with N50 values of 186,062 and 187,371, respectively. For S. pombe, Velvet produced a 4.9-megabase assembly on EC2 in 21 minutes and a 5.2-megabase assembly on the local cluster in 24 minutes.
Tests run with ABySS showed run times similar to Velvet's on both EC2 and the local infrastructure, though the assemblies were of lower quality: the software produced a 2.7-megabase E. coli assembly with an N50 value of 167,018 on EC2 and a 2.5-megabase assembly with an N50 of 173,158 on the local cluster.
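For readers unfamiliar with the metric, N50 is the contig length at which contigs of that size or longer account for at least half of the total assembled bases. A minimal computation, using made-up contig lengths rather than data from the study, looks like this:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

if __name__ == "__main__":
    # Hypothetical contig lengths in bases, not from either study.
    print(n50([450_000, 200_000, 150_000, 90_000, 60_000, 50_000]))  # -> 200000
```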
Tests with Contrail showed that it took longer to assemble the genomes: roughly three hours for both the E. coli and S. pombe genomes on both the cloud and the local infrastructure. However, Jongeneel noted that details about Contrail haven't been published yet, and as such there are several unknown parameters that might account for the longer assembly times.
For instance, he said, Contrail has five phases in its assembly pipeline. Information about how many nodes are required for each phase could make it possible to optimize the compute infrastructure for each phase and speed up the software.
Contrail and Velvet are currently being evaluated by researchers from the University of Maryland, Cold Spring Harbor Laboratory, and the National Biodefense Analysis and Countermeasures Center as part of a "bake-off" to evaluate gold standard algorithms for genome sequence assembly (BI 01/04/2011).
Have topics you'd like to see covered in BioInform? Contact the editor at uthomas [at] genomeweb [.] com.