This article has been updated to correct the previously reported aim of one of the grants and to correct the name of a university.
Three life-science projects are among 13 teams that will have free access to Microsoft's Azure cloud-computing platform for two years as part of an agreement between Microsoft and the National Science Foundation.
The life-science projects, led by researchers at Virginia Tech; the University of North Carolina, Charlotte; and the J. Craig Venter Institute, were awarded a total of $1.2 million in grants under the program, which kicked off in 2010 (BI 02/10/2010). The awardees were announced last week.
In addition to providing access to the cloud, Microsoft will provide a support team, tools, applications, and data collections to help the scientists integrate cloud technology into their research.
An NSF review board considered the "appropriateness" of each proposal to the Azure platform's capabilities, Reed Beaman, a program director at the agency, told BioInform.
For example, he said, the reviewers considered the fact that the platform is very strong in so-called "embarrassingly parallel" computations and in its ability to deploy web services.
He observed that in addition to providing Microsoft an opportunity to test the limits of its cloud computing platform, the partnership saves research dollars that would otherwise have been spent on hardware.
A Focus on Sequencing
A team led by Wu Feng, an associate professor of computer science at VT, will use a $370,000 NSF grant to develop "a new generation of efficient data management and analysis software for large-scale, data-intensive scientific applications in the cloud."
These applications will be for pairwise multiple sequence search and alignment, short-read mapping, and next-generation sequence data analysis, Feng told BioInform.
In addition, the project team plans to deliver "reliable computing over volatile computing resources" (compute resources that are shared and could be lost at any point during a computation) and to put in place infrastructure that addresses the potential for hardware failure.
Feng explained that his team has developed software that can use in-house hardware to "complement" the Azure cloud. Since the computers are used in conjunction with the dedicated resources of the cloud, users will be able to continue their computations without interruption even if they lose their data center's CPUs.
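The general idea of pairing volatile machines with dedicated ones can be illustrated with a small simulation. The sketch below is not the VT team's software; the function name, the node names, and the failure schedule are all hypothetical, and it assumes only that a task running on a lost node can simply be re-queued and picked up by another node.

```python
from collections import deque


def run_with_fallback(tasks, volatile_nodes, dedicated_nodes, lost_at):
    """Simulate task scheduling over volatile and dedicated nodes.

    lost_at maps a volatile node's name to the step during which it
    disappears (a hypothetical failure schedule for this simulation).
    Returns a dict mapping each task to the node that completed it.
    """
    pending = deque(tasks)
    finished = {}
    step = 0
    while pending:
        # A volatile node can be given work up to and including the
        # step in which it is lost; after that it is gone for good.
        alive = [n for n in volatile_nodes
                 if lost_at.get(n, float("inf")) >= step]
        workers = alive + list(dedicated_nodes)
        if not workers:
            raise RuntimeError("no compute nodes left")
        for node in workers:
            if not pending:
                break
            task = pending.popleft()
            if lost_at.get(node) == step:
                # The node vanished mid-task: put the task back in the queue.
                pending.append(task)
            else:
                finished[task] = node
        step += 1
    return finished


tasks = ["t0", "t1", "t2", "t3"]
result = run_with_fallback(
    tasks,
    volatile_nodes=["lab1", "lab2"],
    dedicated_nodes=["azure1"],
    lost_at={"lab1": 0},  # lab1 is lost during the first step
)
print(result)
```

In this toy run, the task that was on the lost in-house node is retried and finishes on the dedicated node, so every task still completes.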
In addition to some internal data, the team will have access to sequence data from publicly available repositories.
The project will also explore methods of moving large volumes of input and output data in and out of the cloud quickly. Feng said his team will extend and optimize a semantics-based data management framework they developed in 2007, called parallel metadata environment for distributed I/O and computing, or ParaMEDIC.
Feng described ParaMEDIC as "Wonkavision for scientific data," because it allows the team to shrink a petabyte of data so that it can be transferred to a remote site and then re-expanded. Using the tool, his team was able to "teleport" hundreds of terabytes of data from the US to Tokyo in minutes rather than years (BI 08/15/2008).
Ultimately, the team wants to provide "commoditized" data-intensive biocomputing solutions for genomics, Feng said. He and his colleagues are currently involved in other efforts to develop software for personalized desktops and traditional data centers that will be easy for researchers to use.
The team has lent its expertise to find missing gene annotations that identify mobilomes — mobile genetic elements in a genome — and is involved in molecular modeling and dynamics as well as aeroinformatics projects.
Separately, Feng is also involved in an effort spearheaded by the Nvidia Foundation to develop a genome analysis platform that will make it easier for researchers to identify cancer mutations (See related story this issue).
Predicting Binding Sites
In another study, worth $425,000, researchers at UNC Charlotte led by Zhengchang Su will use the Azure cloud to scale up an internally developed algorithm so that it can predict transcription factor binding sites in thousands of bacterial genomes simultaneously.
In the first year of the grant, the researchers will familiarize themselves with the new resource and then make modifications to their algorithm, dubbed GLobal Ensemble CLUsters of Binding Sites, or GLECLUBS.
The modifications will enable GLECLUBS to work on the cloud, Su, an assistant professor of computer science, told BioInform. The researchers will apply the tool to genomic data in the following year.
The first version of GLECLUBS, which was able to predict binding sites in a single genome, was published in Nucleic Acids Research in 2009. Following improvements to the tool that enabled it to predict sites in groups of genomes, the team released eGLECLUBS in a follow-up paper published in BMC Bioinformatics earlier this year.
The tool is based on a comparative genomics approach and it compares target genomes to groups of related reference genomes.
The NAR paper states that, as a first step, candidate TF binding sites are identified in both a target genome and its reference genomes. Next, the sites are ranked, and a similarity graph is constructed that is then cut into smaller sub-graphs using a Markov clustering algorithm. Through a series of steps, GLECLUBS "iteratively constructs and clusters [these] graphs to gradually filter out the spurious motifs in the motif similarity graph."
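For readers unfamiliar with Markov clustering, the following is a minimal, generic sketch of the technique on a toy graph. It is not the GLECLUBS code; the six-node graph, its edge weights, and the parameter values are invented for illustration. The method alternates an expansion step (squaring a column-normalized matrix) with an inflation step (raising entries to a power and re-normalizing), which strengthens dense regions and starves weak links.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix (simulates two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power, prune=1e-12):
    """Inflation: raise entries to a power, drop tiny ones, re-normalize."""
    raised = [[x ** power if x > prune else 0.0 for x in row] for row in m]
    return normalize_columns(raised)


def markov_cluster(adjacency, inflation=2.0, iterations=30):
    n = len(adjacency)
    # Add self-loops, as is conventional for this method, then normalize.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep mass are "attractors"; the columns they cover
    # form one cluster. Identical clusters are merged by the set.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy similarity graph: two tight triangles joined by one weak edge (2-3).
W = 0.1  # hypothetical weak similarity
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, W, 0, 0],
    [0, 0, W, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(graph)
print(clusters)
```

On this toy input the weak bridge is cut and the two triangles come out as separate clusters, which is the kind of sub-graph partitioning the paper describes.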
Su's lab developed the algorithm because in spite of the copious quantities of data, currently there are no good resources to predict where TF binding sites occur in the genome, he said.
In the NAR paper, the researchers wrote that "although great advances have been made in identifying the coding sequences in prokaryotic genomes using computational methods alone, it remains an unsolved task for both the experimental and computational biology communities to efficiently and accurately identify all the [binding sites] in a genome."
This, they argued, "has hindered our understanding of many important biological processes such as development, differentiation, evolution, disease, and specialized biological functions of many organisms."
A newcomer to cloud infrastructure, Su said that the widespread adoption and cost-effectiveness of next-generation sequencing have made hundreds of bacterial genomes available for research and, as such, his lab needs lots of CPUs to run its algorithms as well as storage space for all that data.
Currently, GLECLUBS can predict sites in up to 50 genomes on his department's 500-CPU cluster, Su said. With the aid of message passing interfaces, the algorithm is able to make its predictions in two to three days; without parallelization, however, predictions can take up to a week, Su said.
Once GLECLUBS is up and running on Azure, Su anticipates that the algorithm can be scaled up to predict sites in tens of thousands of bacterial genomes at the same time. He added that the researchers will also apply it to other prokaryotes.
A third life science project funded under the Microsoft/NSF initiative aims to improve protein-protein docking simulations by addressing issues of scalability and interactivity.
The team, led by JCVI researcher Andrey Tovchigrechko, will use a $421,000 grant to build a client cloud application for protein-protein docking, a process that tries to predict the three-dimensional structure of protein complexes based on the coordinates of the individual proteins as well as their interactions.
The computationally heavy process involves multiple steps in which protein complexes are formed, analyzed, in some cases modified, and then re-docked. These steps require clusters comprising a few hundred CPUs, Tovchigrechko told BioInform.
However, these heavy compute requirements punctuate stretches of inactivity, he said. For instance, a researcher may study histone proteins for months and only require a few days of large-scale compute power for complex predictions.
With the cloud, the researcher could "fire up a client cloud application" as needed and then "my computational backend with compute nodes riding on the cloud will scale up when I actually need to run the prediction stage and then scale down when I am working interactively looking at the structure."
To increase the interactivity of the protein-protein docking process, the researchers plan to use PyMOL, a 3D graphical interface, as the client component and provide it with a computational backend, built on a combination of the team's own codes and other academic tools, which will run in the cloud. This will include code that will enable Azure to scale the number of nodes available for computations up or down based on the user's needs.
Prior to this project, Tovchigrechko had some experience with Amazon's cloud and participated in an online workshop run by Microsoft while he worked on his proposal and developed a prototype of his project. His cloud experiences, he said, highlighted some specific challenges with both systems.
Both Microsoft and Amazon charge users by the hour for compute time. However, one of the requirements for protein-protein docking simulation is that the number of compute nodes in the cloud has to grow and shrink based on the activities of the users, and these simulations take only minutes.
"You can't really economically scale up and scale down in one-minute increments because you are still going to be billed for a full hour," he said. "It's understandable because the cloud provider has to launch basically a full operating system instance, [which] takes a certain amount of time and expense and so they cannot allow users to shut down and start instances every second ... but it kind of complicates the dynamic scaling."
One solution to this, he said, would be to set up a system that allows users to share the resource in a way that reduces the amount of idle time per hour.
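The arithmetic behind this concern can be sketched briefly. The example below assumes, as Tovchigrechko described, that every instance start is billed as at least one full hour; the function names and the workload figures (24 bursts of five minutes each) are hypothetical, and the shared case further assumes the bursts can be queued back to back on one long-running instance.

```python
import math


def billed_hours_separate(minutes_per_burst, bursts):
    """Hours billed when an instance is started and stopped for each
    burst, with every start rounded up to a whole hour."""
    return bursts * math.ceil(minutes_per_burst / 60)


def billed_hours_shared(minutes_per_burst, bursts):
    """Hours billed when the same bursts are packed back to back
    onto one shared, long-running instance."""
    return math.ceil(minutes_per_burst * bursts / 60)


separate = billed_hours_separate(5, 24)
shared = billed_hours_shared(5, 24)
print(separate, shared)
```

Under these assumptions, 24 five-minute bursts cost 24 billed hours when each one launches its own instance but only two billed hours when they share one, which is why pooling users to cut idle time per hour is attractive.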
Additionally, most existing bioinformatics packages are Linux-based, which makes porting software to the Microsoft platform "a little challenging," Tovchigrechko said, adding that there are also considerations about the lifecycle of the application and how it will be maintained once the project ends.