The fact that cloud computing has yet to become widely regarded as a no-brainer for genomics researchers has not stopped the onslaught of marketing hype aimed at convincing them that the cloud is the salve to soothe all of their computational woes. Commercial and open-source academic efforts demonstrate that the most realistic strategy for converting genomics researchers to the cloud is to offer solutions where those users do not have to think about using the cloud at all, or, at the very least, to recreate or retain as many of the comforts of home as possible. This means making sure all the favorite analytical tools, software, and pipelines are readily available, just as they would be on a local cluster, so that the actual mechanics of establishing an analytics pipeline on the cloud stay out of sight and out of mind.
"For the bioinformatician, the cloud is great, and may be ready to use today, but for everyone else, what you need is software so you don't even know you're using the cloud," says Florian Fricke, an assistant professor at the University of Maryland School of Medicine in Baltimore. "From our perspective, you don't want to have to log into Amazon — you want it to be seamless, and you want those resources to just be utilized or interfaced through a Web page or desktop application. I just have not seen anything yet that makes the cloud completely seamless."
In 2011, a number of cloud computing vendors rolled out offerings that integrate next-generation sequencing analysis pipelines into the cloud to entice users with the promise of easy access to its power. These include both firms solely dedicated to exploring bioinformatics and cloud computing as well as established sequencing technology vendors.
DNAnexus, a Stanford University technology spinout, started 2011 by continuing to flesh out its offerings and expanding its next-generation sequencing analysis platform to include a cloud-based variant analysis workflow. A few months later, in May 2011, Life Technologies announced an online portal for its customers called LifeScope Genomic Analysis, which allows researchers to analyze data generated from the company's 5500 Series SOLiD systems. Then Samsung SDS, a subsidiary of the Samsung Group, began testing its next-generation sequencing analysis platform in September, and cloud computing solutions firm Appistry rolled out a release of its CloudIQ platform, another sequence analysis pipeline.
Eagle Genomics, a bioinformatics outsourcing service that uses cloud computing to offer scalable compute resources for next-generation sequencing data analysis, recently partnered with Cycle Computing, a cloud computing service provider that has been targeting the life sciences market for some time. The companies plan to release a commercial solution that addresses some of the security concerns associated with the cloud sometime in mid-2012.
Pacific Biosciences also partnered with Cycle Computing last September to offer a cloud-based version of its open-source SMRT Analysis software suite. And in October, Illumina began offering an open-source analysis platform for its MiSeq system that also makes use of the cloud.
Finally, Scale Genomics, also a Stanford spinout, is one of the newest outfits specializing in bioinformatics and cloud computing to arrive on the scene. In addition to offering genomics virtual instances that come prepackaged with a slew of bioinformatics tools, everything from ClustalW and HMMER to TopHat and Cufflinks, the service also lets customers create their own labs. That means customers can take the analytical pipelines currently running in their own labs and recreate those same pipelines on Scale Genomics' cloud for more compute power.
Open-source efforts also aim to further the ease-of-use concept for the cloud. Last June, researchers at the Fred Hutchinson Cancer Research Center released CRdata, a cloud-based resource for running R and the Bioconductor software suite. A key feature of CRdata allows users to launch their own private Amazon Elastic Compute Cloud instances and Amazon storage service, all through a point-and-click menu interface. In September, a group of software developers at the University of Maryland School of Medicine's Institute for Genome Sciences, led by Fricke, released the Cloud Virtual Resource, or CLoVR, a desktop application for automated sequence analysis on the cloud.
"The main goal of the project is to make sequence analysis as easy as possible for researchers who don't have a bioinformatics background. We focused on providing a full analysis pipeline consisting of multiple tools in an automated pipeline," Fricke says. "We do that by pre-installing and configuring this software in a virtual machine so the user doesn't have to do any other installation. The virtualization also allows us to use a cloud computing service, such as Amazon Web Services, so users don't have to install complicated software, and they can do large-scale processing using the cloud."
CLoVR is currently bundled with push-button pipelines for microbial genome analysis, including 16S rRNA sequence analysis, metagenomic sequencing projects, and single-genome projects, as well as prokaryotic and eukaryotic RNA-sequencing, and viral genomics software. But no matter how many tools Fricke and his team add to CLoVR, the emphasis has to be on making the cloud transparent to the researcher. Despite all of these new cloud-computing vendors competing to offer tools that make executing genomics analysis tasks as close to point-and-click as possible for the average bench biologist, no one has really hit the mark yet.
Not for novices
Although cloud computing efforts are paying attention to user-friendliness, some researchers working with the cloud say seamlessness is still a long way off. Even if a biologist is using the latest click-and-drag cloud service or interface, it is no territory for an IT novice.
"One big downside of cloud computing is that it pushes the systems IT burden into the user space, and sometimes people just want to get their work done. They don't want to worry about why an Amazon Elastic Block Store volume is unmounting, or 'What's this strange thing popping up every time I reboot?'" says Angel Pizarro, director of the bioinformatics facility at the University of Pennsylvania's Institute for Translational Medicine and Therapeutics. "There's years of systems administration that is just assumed you know when you go into the cloud environment that most people don't pay attention to, so they don't manage their resource as efficiently as possible."
Pizarro has been heading up efforts at the Penn Genome Frontiers Institute, or PGFI, that mimic the commercial bioinformatics cloud computing offerings. Academic efforts like this aim to remove the IT burden from the larger community of researchers on the campus. "That's where PGFI is filling in the gap between something like a cloud vendor and local resources. When you need to compute, we'll manage that for you and remove the IT burden," he says.
While PGFI was originally equipped with a local HPC cluster comprising roughly 1,000 compute nodes and some 600 terabytes of storage, that was simply not enough to accommodate the rate of next-generation sequencing data generation. So Pizarro — who has been experimenting with Amazon Web Services' cloud for the last three years — decided to take the interface and tools that were available on the local campus resource and move them over to AWS, without the user being any the wiser.
Using the cloud in this way requires establishing a hybrid compute resource that uses the cloud for computation and local hardware for storage. For some groups — like the CLoVR initiative, which is focused on microbial genomes — network latency may not be such a factor when uploading and downloading datasets. However, when the work involves moving multiple human genome datasets in and out of the cloud on a regular basis, bandwidth limitations can be costly from a time-management perspective if research is stalled during an upload or download. This is essentially the argument for genomics data storage on the cloud: if the data is kept close to the compute resource up on the cloud, bandwidth is no longer an issue. Unless, of course, results or additional data need to be downloaded or uploaded, which is often the case.
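The scale of that bandwidth problem is easy to estimate. As a back-of-envelope sketch (the dataset sizes, link speed, and utilization factor here are illustrative assumptions, not figures from any of the groups quoted):

```python
# Rough estimate of how long it takes to move sequencing data to the cloud.
# All numbers below are illustrative assumptions, not measurements.

def transfer_hours(dataset_gb: float, link_mbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move dataset_gb over a link_mbps connection,
    assuming a given effective utilization (protocol overhead, contention)."""
    bits = dataset_gb * 8e9                       # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600

# A hypothetical 100 GB batch of raw human sequencing reads
# over a 100 Mbps campus link:
print(f"{transfer_hours(100, 100):.1f} hours")    # prints "3.2 hours"

# Ten such datasets over the same link take more than a full day:
print(f"{transfer_hours(1000, 100):.1f} hours")   # prints "31.7 hours"
```

At these rates, a lab shuttling many large datasets back and forth would spend a substantial fraction of its time waiting on the network, which is why keeping data next to the compute resource is attractive.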
The concerns associated with time management, in addition to privacy issues, are why Vas Vasiliadis, director of products, communication, and development at the University of Chicago Computation Institute, is working with researchers on strategies to create ideal hybrid computing environments using common workflows like the Galaxy next-generation sequence analysis pipeline — where all the concerns about ease of use, scalable compute resources, privacy issues, and network latency are much more manageable. "At some point, having all the data up there is probably the way things will go. As to when that happens, it's anybody's guess," Vasiliadis says. "It's about overcoming some of the issues, so from a practical standpoint, most people will have a hybrid environment and that will be driven by the challenges of getting data to and from the location of the compute cluster, so it's going to be a while."
A lower hurdle
While the barriers to entry into the cloud world for biologists are getting lower thanks to the efforts of these new startups and open-source academic groups, there still seems to be some debate about how easy using the cloud is really getting.
"There is still a small percentage of users that are IT savvy enough so that they can navigate their way through the various cloud services and bring in the right piece," Vasiliadis says. "There are providers that are wrapping their stuff in a Web point-and-click in Amazon and other providers. But there's still a pretty good gap between having someone totally new to the cloud show up in the lab and say 'Go run this pipeline in Amazon' — making that happen, we still have some work to do there."