NEW YORK (GenomeWeb) – Two research groups from the computer science department at Rice University will use a three-year, $1.1 million grant from the National Science Foundation to develop cloud-based statistical software for analyzing evolutionary patterns.
Specifically, Christopher Jermaine and Luay Nakhleh, who are both associate professors of computer science at Rice, will use the NSF funds to create open-source cloud software that uses Bayesian inference techniques to track how genes and genomes evolve across species, and to make the software broadly available to the research community.
In practice, being able to run analyses in parallel and to access thousands of computers quickly in the cloud will help shorten the time to results significantly, according to the developers. "We're talking about potentially taking a years- or decades-long computation and making it feasible by changing the underlying algorithm and making it amenable to distributed computing," Jermaine said in a statement. Moreover, it would provide a potentially cost-effective alternative to purchasing and running large local clusters, they said. It could even appeal, they believe, to researchers who have mainframes in-house because of the potential for parallelized analysis.
An otherwise powerful technique for estimating evolutionary history in phylogenetics studies, Bayesian inference is computationally impractical for large datasets, according to Nakhleh. "Analyzing data sets with 10 or 20 gene sequences can easily take hundreds of hours," he said in a statement. "But the tree of life has millions of sequences and is built from millions of species. There's no way traditional Bayesian techniques are even going to get close to handling that." It's currently infeasible, for example, to use these techniques to build trees composed of thousands of taxa or species, Nakhleh told BioInform.
Parallel and distributed computing infrastructure offers a solution to the intensive computational needs of phylogenetics researchers; however, very little research has explored the potential of this kind of infrastructure for these kinds of studies, Jermaine said. "There's a reasonably large amount of work on cloud-based Bayesian learning, but it's almost all for data analytics, not for biological applications," he told BioInform. For example, he and his colleagues have developed a system that lets users "write and execute codes for large-scale Bayesian models," he explained, adding, however, that on the whole "there are not many papers describing cloud-based phylogenetics tools, and I think it's safe to say that [nothing] has been targeted to Bayesian phylogenetics in particular."
The NSF grant will enable the Rice researchers to expand existing Bayesian methods and make them more amenable to parallel and distributed computing systems like the cloud. Over the next three years, they'll work on mathematical modeling and algorithm development, implementing and running the software on distributed systems, refining it to remove bottlenecks, and finally publishing the software.
"We want to deliver something that’s very easy to use," Jermaine said, so "that somebody can just boot up a machine instance on Amazon [for example]" and then with "a couple of keystrokes, fire up a cluster under that machine's control and then run whatever they want to run."