This article has been updated to change the previously reported 'phase algorithm' to 'RF-ACE algorithm' and to clarify that Apache and Spring Framework are two separate programs.
As Google prepares to roll out a new cloud computing infrastructure called Google Compute Engine, it is working with a bioinformatics team at the Institute for Systems Biology to evaluate the new infrastructure's ability to handle life science computing requirements.
Researchers from Google have adapted ISB's Regulome Explorer software for Google Compute Engine and are using it to highlight certain aspects of the infrastructure-as-a-service platform, which is still in a "limited preview" phase.
Google showcased the implementation at the Google I/O developer conference in late June, where it introduced the upcoming cloud offering.
Regulome Explorer is a web-based tool developed to analyze data from the Cancer Genome Atlas. It lets users explore associations between DNA, RNA, epigenetic, and clinical cancer data using statistical approaches like random forest regression, Ilya Shmulevich, a professor at ISB and a principal investigator in the TCGA project, told BioInform.
The ISB team uses the tool to analyze different kinds of mutations such as silent, non-silent, and frameshift mutations; information on structural variation; and amplifications or deletions in the genome. It is also being used to analyze gene expression data from RNA sequencing and microRNA expression experiments, Shmulevich said.
The genetic data and associations from these different analyses are made available through a secure web service that users can query through a web application or visualize in circular or linear browsers or in network visualization tools like Cytoscape, Hector Rovira, a software architect at ISB and one of Regulome Explorer’s developers, explained to BioInform.
A version of Regulome Explorer that contains already-published information is freely available for broad use, while a second version contains datasets that are available only within the TCGA community, Rovira said.
Working with Google
Rovira told BioInform that although the Shmulevich lab focuses on computational biology, “we also have a strong software engineering component and so we’ve been working for a long time on exploring different technologies for enabling the work that the scientists are doing.”
In addition to tools like Apache and the Spring Framework, the ISB team looked at different technologies offered by Google “because they have been very open, they make available a lot of public APIs [and] a lot of tools and components,” such as a programming language called Go and Google App Engine, which the team has used for two years, Rovira said.
“We [have] this strategy [for] how we build software,” Rovira explained. “The idea is to build very adaptable software that we can reuse and mix and match as needed in different projects.”
With this approach in mind, the team constructed Regulome Explorer’s architecture in a manner that allows them “to bring in different components together quickly” and create customized versions of the tool as needed to support researchers’ projects.
This approach has resulted in “a lot of different variations of Regulome Explorer that [were created] to respond to the needs of different scientists and the project itself has also been ported to other projects … that are not TCGA so we’ve been able to translate these tools,” he said.
Google adopted a similar approach in its effort to run the Regulome Explorer analysis pipeline. Rovira explained that researchers from Google became interested in the ISB team’s work and selected and combined different parts of Regulome Explorer with some of their own internally developed applications to build a customized version of the software that highlighted “certain aspects of their technology.”
For example, “they took our RF-ACE algorithm … and they deployed it on their platform but they have their own scheduling framework over [it],” he explained.
Google also customized Regulome Explorer’s visualization tools with the ISB team’s help and then applied the pipeline to cancer datasets that were provided by the institute and for which “we already had benchmarks,” including a colorectal cancer dataset, Rovira said.
The Google researchers then ran their version of the pipeline using the data from ISB on the Compute Engine.
In an e-mail to BioInform, a Google spokesperson explained that with Google Compute Engine, “users specify jobs composed of tasks, [and] the technology then queues the tasks until capacity is available, then executes them. Tasks are executed as Linux processes within a security jail.”
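The queue-then-execute model the spokesperson describes can be illustrated with a minimal Python sketch. It is not Google's scheduler: the names `run_job` and `make_task` are hypothetical, a thread pool stands in for the jailed Linux processes, and `capacity` is an assumed stand-in for available compute slots.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(tasks, capacity):
    """Run every task in a job, at most `capacity` at a time.

    Tasks beyond the available capacity wait in the executor's internal
    queue until a worker frees up. Results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=capacity) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def make_task(n):
    # Hypothetical stand-in for one unit of analysis work.
    return lambda: n * n


job = [make_task(n) for n in range(5)]
print(run_job(job, capacity=2))  # [0, 1, 4, 9, 16]
```

With a capacity of two, the remaining three tasks wait in the queue until one of the running tasks finishes.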
According to a Google white paper that describes its implementation of Regulome Explorer, researchers were able to analyze a cancer dataset in two hours compared to 15 hours on ISB's cluster.
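As a quick check on those figures, the implied speed-up can be worked out directly. The sketch below uses only the two run times quoted from the white paper. The function name is illustrative, and the calculation says nothing about how the two systems compare in size or hardware.

```python
def speedup(baseline_hours, new_hours):
    """Return how many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


# Run times quoted from the Google white paper.
isb_cluster_hours = 15
compute_engine_hours = 2

print(f"{speedup(isb_cluster_hours, compute_engine_hours):.1f}x faster")  # 7.5x faster
```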
Google versus Amazon?
Google's cloud computing offering places it in direct competition with Amazon's Elastic Compute Cloud, which is the most commonly used cloud infrastructure in the life science market.
The ISB researchers have used Amazon’s cloud “extensively” and both Shmulevich and Rovira said the right cloud platform will vary from case to case.
“Using different cloud technologies is not a straightforward decision,” Rovira said. “You have to make a lot of decisions based on the size of data that you are dealing with, how the data is being generated, where it's being produced, moving the data to the cloud and back, and then the amount of processing that you are going to do with it.” In ISB’s case, “the size of the databases that we prepare for Regulome Explorer is also another key consideration.”
In cases where the team has used Amazon’s infrastructure, “we have had very good success … it’s a great way to manage your infrastructure, especially when you are starting up and you don’t want to invest a lot of money up front on a large infrastructure,” Rovira said.
Both Amazon and Google have “some overlap in terms of the type of service that they are providing [and] seem to be close competitors,” Rovira said, but “until Google Compute Engine really gets a lot of people hitting it and exposing different use cases, I don’t know that we will be able to make a clear determination between the two.”
Meanwhile, there are some discernible differences, he said.
Amazon has services that Google does not provide. For example, companies like Complete Genomics can ship large quantities of data on disks directly to Amazon, which then uploads it to the cloud and makes it available for groups like ISB to use, Rovira said. Google, on the other hand, appears to have “streamlined” the process of “ramping up a large number of cores for these types of very parallelizable tasks and … they’ve thought a lot about their network traffic and [improved] the transfer of information within their network,” he said.
Furthermore, Google has integrated Compute Engine with other products in its stack, such as Google App Engine. This provides users with infrastructure for deploying web applications on the front end with the power to run the computations on the back end, Rovira explained. “I think [having an] integrated environment … where you don’t have to do any of the heavy lifting is very good.”
Conversely, with Amazon’s infrastructure, users have more flexibility and can configure their compute environment to suit their needs, which is good for research labs that would like more control over the hardware instances they are provisioning, he said.
It isn’t clear at this point whether Google intends to aggressively pursue clients in the life sciences.
The Google spokesperson said that the initial focus for the platform is on large-scale data processing workloads, including video transcoding, Hadoop jobs, and running grid applications, so scientific applications in general “are a great use case for this technology due to the volume of workloads.”
Google offers four compute configurations — of one, two, four, and eight virtual cores — that are charged on a per-hour basis. Compute power cost is based on the number of virtual machines required; storage costs are based on the amount of data; and network cost is calculated “based on how often VMs communicate with each other and the Internet,” the Google spokesperson told BioInform.
Pricing for the smallest Google Compute Engine configuration is $0.145 per hour for one virtual core with 3.75 gigabytes of memory, while the largest configuration with eight virtual cores and 30 gigabytes of memory is priced at $1.16 per hour. The company has a number of options for network use costs that are based on the destination and size of the dataset.
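The per-hour compute pricing above lends itself to a back-of-the-envelope estimate. The sketch below assumes only the two rates quoted in the article. The function name and the example workload of 10 eight-core VMs for two hours are hypothetical, and storage and network charges are left out.

```python
# Hourly rates in US dollars quoted in the article, keyed by virtual cores.
# Rates for the two- and four-core configurations are not given, so they
# are omitted here.
HOURLY_RATE_USD = {1: 0.145, 8: 1.16}


def compute_cost(cores, vm_count, hours):
    """Estimate the compute charge only (no storage or network costs)."""
    return HOURLY_RATE_USD[cores] * vm_count * hours


# Hypothetical workload: 10 eight-core VMs running for two hours.
print(round(compute_cost(cores=8, vm_count=10, hours=2), 2))  # 23.2

# The quoted rates work out to the same price per core at both ends.
print(round(HOURLY_RATE_USD[8] / 8, 3))  # 0.145
```

At these rates, the largest configuration costs the same per core as the smallest, so the choice between the two comes down to memory per machine and how the workload is divided rather than price.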
Pricing for storage starts at $0.12 per gigabyte for up to one terabyte of data and drops as the amount of storage required increases.
Amazon's pricing, meanwhile, ranges from $0.08 per hour to $3.58 per hour, depending on the particular instance required.
Amazon declined to comment on the new Google cloud, citing a longstanding policy of not discussing other companies’ activities.