Almost 175 years have passed since Charles Darwin sketched the first "tree of life" in his notebook, but thanks to the genomics revolution, biologists have begun to move closer than ever before to having a truly detailed history of life and relatedness. While the major challenge for phylogenetics might appear to lie primarily in the development of robust visualization tools for the creation of increasingly detailed trees, the phylogenetics community is also in need of a crash course in high-performance computing. Typical desktops are just not cut out to meet the memory and computational requirements for the kind of tree building that is now possible with the growing store of genomics data.
Unfortunately, most phylogeneticists are not utilizing grid resources or local clusters, says Jim Wilgenbusch, a research associate in the department of scientific computing at Florida State University. "One of the challenges is dealing with the life sciences community that has typically attempted to do this sort of work on their desktop or laptop machine. Getting that community used to an environment where they would be using something like TeraGrid or their local campus HPC resource can be a challenge by itself," Wilgenbusch says. "Life scientists aren't equipped with the skills to use these distributed resources — that's a cultural challenge. And then the second part is that some of the software that is popular with practicing systematists and phylogeneticists is not very well configured to run in a distributed environment; however, both of these things are changing."
The need for HPC stems from the fact that researchers are dealing with what computer scientists call an NP-hard optimization problem, a class of problems for which no known shortcut leads to a simple solution. "This means that given a data set with 50 organisms and a scoring function that tells us how good the data fit a specific tree, it is impossible to score all trees with that function, because there simply exist too many," says Alexandros Stamatakis, a junior research group leader at the Technical University of Munich. "Thus, even given all the computing power available in the entire world, we would have to wait for too long to analyze all those trees in order to find the best one. … A tree with 50 organisms is a small tree by today's standards — currently, the largest published phylogenies comprise between 13,000 and 74,000 organisms."
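The combinatorics behind that claim are easy to check: the number of distinct unrooted binary trees on n organisms is the double factorial (2n - 5)!!, which grows faster than exponentially. A quick sketch in Python (the function name is ours, not from any phylogenetics package):

```python
from math import prod

def unrooted_tree_count(n):
    """Number of distinct unrooted binary trees on n labeled organisms.

    This is the classic (2n - 5)!! double factorial: 1 * 3 * 5 * ... * (2n - 5).
    """
    if n < 3:
        return 1
    return prod(range(1, 2 * n - 4, 2))

print(unrooted_tree_count(4))   # -> 3: only three possible trees for 4 organisms
print(unrooted_tree_count(10))  # -> 2027025: already over two million trees
print(unrooted_tree_count(50))  # a 75-digit number, far too many to ever score one by one
```

Exhaustive scoring is hopeless past a dozen or so organisms, which is why practical tree-search programs rely on heuristics, and why even those heuristics need serious hardware at today's tree sizes.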
Stamatakis focuses on developing open-source tools for the reconstruction of very large trees under statistical models of evolution, but he has also recently begun to explore ideal processor architectures to better meet the computational requirements of phylogenetic inference. These challenges are further compounded by ever-growing data sets and researchers' desire to build trees from whole genomes rather than from a handful of genes. "This means that a phylogenetic analysis, in particular based upon statistical methods such as maximum likelihood or Bayesian inference, will require enormous amounts of memory for computing the scoring function on trees," Stamatakis says. "A data set comprising protein and genome data for 35 mammals already requires 190 gigabytes of main memory just to compute the likelihood score on a single tree. … We are facing capacity challenges with respect to the two key resources of any computer — that is, memory and CPU cycles may not be sufficient to reconstruct whole-genome phylogenies for hundreds or even thousands of organisms."
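Figures like that 190 gigabytes come from the conditional likelihood vectors a maximum-likelihood scorer must hold for the tree's internal nodes. A back-of-the-envelope sketch shows how quickly this adds up; the parameter values below are illustrative only and do not reproduce the exact mammal data set Stamatakis describes:

```python
def likelihood_memory_bytes(taxa, sites, states=4, rate_categories=4, value_bytes=8):
    """Rough size of the conditional likelihood vectors kept in memory while
    scoring one tree: one (sites x states x rate_categories) array of doubles
    for each of the tree's n - 2 inner nodes."""
    inner_nodes = taxa - 2
    return inner_nodes * sites * states * rate_categories * value_bytes

# Illustrative genome-scale input: 35 organisms, 20 million alignment columns of DNA.
gb = likelihood_memory_bytes(35, 20_000_000) / 1e9
print(f"roughly {gb:.0f} GB for a single tree")
# Protein data (20 states instead of 4), partitioned models, and implementation
# overhead all inflate this estimate further.
```

Even this simplified accounting lands in the tens of gigabytes for a modest number of organisms, well beyond what a typical desktop of the day could hold.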
There are some efforts to explore other types of hardware, including graphics processing units (GPUs). The BEAGLE — Broad-platform Evolutionary Analysis General Likelihood Evaluator — project is an open-source tool for phylogenetic likelihood computation that runs on GPUs. BEAGLE, which was published last year in Bioinformatics, has slowly gained traction and interest in the community and has already been integrated within other software packages such as BEAST, a cross-platform program for Bayesian Markov chain Monte Carlo analysis of molecular sequences. But distributed computing approaches offer a range of solutions that may be easier for average biologists with no specialized programming experience. "We actually distribute bootstrap analysis over heterogeneous architectures. These can be anything from a combination of idle desktops to parts of idle machines," FSU's Wilgenbusch says. "I have an application called RepMaker where you essentially take a bootstrap analysis, break it up into individual parts and distribute it over a network of idle computers."
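Bootstrap replicates are statistically independent of one another, which is what makes Wilgenbusch's divide-and-distribute approach work. The sketch below is not RepMaker itself, just a minimal illustration of the same idea using Python's multiprocessing module, with the actual tree search stubbed out:

```python
import random
from multiprocessing import Pool

ALIGNMENT_SITES = 1000  # illustrative alignment length; a real one comes from the data

def one_replicate(seed):
    """A single bootstrap replicate: resample alignment columns with
    replacement, then infer a tree from the resampled matrix. The tree
    search is stubbed out here; a real pipeline would invoke an inference
    program on each resampled data set."""
    rng = random.Random(seed)
    resampled = [rng.randrange(ALIGNMENT_SITES) for _ in range(ALIGNMENT_SITES)]
    return len(set(resampled))  # stand-in result: how many distinct columns were drawn

if __name__ == "__main__":
    # Replicates are independent, so they can be farmed out to idle cores --
    # or, with a job scheduler in place of Pool, to idle machines on a network.
    with Pool() as pool:
        results = pool.map(one_replicate, range(100))  # 100 replicates in parallel
    print(len(results))  # -> 100
```

Because each replicate is self-contained, the same decomposition scales from a multicore desktop to a heterogeneous pool of idle campus machines, exactly the kind of resource Wilgenbusch describes.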
An easier interface
Because most biologists are not comfortable using GPUs or distributed computing tools, Web portals that host informatics tools and allow quick, easy access to large amounts of compute power promise to be the simplest way to deliver HPC to the phylogenetics community. iPlant, a cyberinfrastructure for plant biology funded by the National Science Foundation, aims to provide biologists with software tools that can scale up to large computing environments. iPlant's mission is not to create all of the bioinformatics tools the phylogenetics community needs, but rather to build upon existing software and make it easier to use, letting users convert their data as needed and then run that software on large machines.
In an attempt to help bring the whole community together, iPlant will soon release an application programming interface to help expand the platform by letting users integrate it with other software packages. "iPlant isn't trying to replace all existing bioinformatics efforts — and even if we wanted to, we wouldn't have the resources to build every tool any biologist might need. What we are building is a platform, a way to help make tools interact and to expose tools to new users, while reducing the total work for users and tool developers," says Dan Stanzione, deputy director of the Texas Advanced Computing Center and co-PI of iPlant. "Say, for instance, that you build a new tool to analyze trees, but you don't support all the input formats a user's tree data might be in and you have no way for a user to view the resulting tree. Rather than write more code, you integrate with the iPlant infrastructure, which provides your users data converters and visualization tools ... [and] what we get back is the same thing Apple gets from people building iPhone apps [in that] the app developers find users, but the iPhone platform is also more valuable to users because you do more things with it."
In an effort to help spread the gospel of high-end computing to the phylogenetics community, the National Institute for Mathematical and Biological Synthesis is hosting an HPC-for-phylogenetics tutorial this month. The tutorial, led by Stamatakis, Stanzione, and Wilgenbusch along with other HPC experts, is geared toward biologists who currently rely on desktop computers alone. It will focus on how to use TeraGrid and the CIPRES Portal, another phylogenetics cyberinfrastructure, as well as iPlant and other common HPC resources. Participants will also be offered the chance to take part in a Unix webinar to brush up on their command-line chops. "Unix skills are still the most ubiquitous and flexible way to access all HPC resources, and there are still hundreds of existing tools that are Unix-based," he says. "It's not a requirement, and we will try our best to make it easier for non-Unix people to access tools, but right now, it really helps [to know it]."