NEW YORK (GenomeWeb) – A University of Maryland-led team has published details of molecularevolution.org, a public computing platform for phylogenetic analysis and a web-based service called the Genetic Algorithm for Rapid Likelihood Inference (GARLI) program — a maximum likelihood-based solution for inferring genetic relationships between organisms.
Launched in 2010, molecularevolution.org is maintained by UMD and runs on a community-based grid system called The Lattice Project (TLP), developed and led by Adam Bazinet, a faculty research assistant and graduate student at UMD's Center for Bioinformatics and Computational Biology, and Michael Cummings, an associate professor of biology at UMD.
The researchers published details of the infrastructure in a paper that first appeared online in Systematic Biology in late April.
TLP runs submitted projects on thousands of processors from a combination of volunteer computers from the Berkeley Open Infrastructure for Network Computing (BOINC), grid computing resources such as Condor pools, and compute clusters at UMD.
The system, according to the paper, uses a number of mechanisms to make efficient use of the compute power at its disposal, including "a round-robin scheduling algorithm," which distributes the workload evenly among available resources; and a mechanism for prioritizing jobs so that faster resources receive jobs before slower resources. The system also makes use of a "predicted job runtime to ensure that long-running jobs are placed on resources where they are unlikely to be interrupted," and combines shorter jobs into larger ones "with an 'optimal' aggregate runtime to maximize system throughput."
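To make those scheduling ideas concrete, here is a minimal Python sketch of two of them: round-robin dispatch that visits faster resources first, and bundling of short jobs up to a target runtime. It is an illustration only, not TLP's scheduler; the function names, the `speed` field, and the target value are assumptions made up for this example.

```python
from itertools import cycle


def round_robin_assign(jobs, resources):
    """Hand out jobs to resources in turn, visiting faster resources first.

    `resources` is a list of dicts with hypothetical "name" and "speed" keys.
    """
    ordered = sorted(resources, key=lambda r: r["speed"], reverse=True)
    assignment = {r["name"]: [] for r in ordered}
    # cycle() repeats the ordered resource list for as long as jobs remain.
    for job, resource in zip(jobs, cycle(ordered)):
        assignment[resource["name"]].append(job)
    return assignment


def bundle_short_jobs(predicted_runtimes, target):
    """Greedily pack predicted runtimes into bundles of at most `target` total.

    A single job longer than `target` still gets a bundle of its own.
    """
    bundles, current, total = [], [], 0.0
    for runtime in predicted_runtimes:
        if current and total + runtime > target:
            bundles.append(current)
            current, total = [], 0.0
        current.append(runtime)
        total += runtime
    if current:
        bundles.append(current)
    return bundles
```

With two resources and three jobs, the faster resource receives the first and third job; with runtimes of 10, 20, 30, 50, and 5 minutes and a 60-minute target, the sketch produces two bundles. A production scheduler would also need the interruption-aware placement the paper describes, which this sketch does not attempt.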
GARLI was developed by Derrick Zwickl as a doctoral student at the University of Arizona. Zwickl, now a post-doctoral researcher in the University of Kansas' ecology and evolution department, is also a co-author on the current Sys. Bio paper, which focuses primarily on the features of the current version of GARLI and, to a lesser extent, on the underlying grid infrastructure and how it differs from similar infrastructure used for phylogenetic analysis.
Full details of how GARLI works have been published elsewhere, but basically it accepts as input genetic sequence data from multiple species stored in a matrix, Bazinet, who has worked extensively with the software, told BioInform this week. Each row of the matrix contains a different species, and each column contains information on a different position in the genetic sequence common across all the species. Once the data is in, users select the sort of evolutionary model they would like to apply (the system has several choices), and then GARLI's algorithm generates multiple candidate trees and, over several generations, gets rid of poorer trees and keeps the better ones. It ultimately combines the best aspects of these optimized trees into a single phylogenetic tree, the final output of the system.
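The general shape of that process (a species-by-position matrix as input, a population of candidates, and repeated rounds in which weaker candidates are discarded) can be sketched in a few lines of Python. This is a toy under loud assumptions, not GARLI's algorithm: the sequences are invented, the candidates are simple orderings of species rather than trees, and the `score` function is a placeholder mismatch count rather than a likelihood.

```python
import random

# Toy alignment (invented data): one row per species, one column per position.
ALIGNMENT = {
    "human": "ACGTAC",
    "chimp": "ACGTAT",
    "mouse": "ACCTGT",
    "fish": "TCCTGA",
}


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def score(order):
    """Placeholder fitness, NOT a likelihood: mismatches between neighbours.

    Lower is better. A real program would score a tree under a chosen
    evolutionary model instead.
    """
    return sum(hamming(ALIGNMENT[a], ALIGNMENT[b]) for a, b in zip(order, order[1:]))


def mutate(order, rng):
    """Return a copy of `order` with two randomly chosen species swapped."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return tuple(child)


def evolve(generations=30, population_size=8, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    taxa = list(ALIGNMENT)
    population = [tuple(rng.sample(taxa, len(taxa))) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=score)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return min(population, key=score)
```

Calling `evolve()` returns the best ordering found for the given seed. Because survivors are carried over unchanged, the best score in the population never gets worse from one generation to the next, which is the property the selection step described above relies on.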
About 97 percent of analyses submitted to GARLI are completed in less than 24 hours, Bazinet said. That figure accounts for factors such as system latency as well as the time needed to complete the GARLI analysis. The actual analysis runtime varies from a few minutes to several hours depending on the input data and the analysis parameters, he said.
The Sys. Bio paper also touches on the differences between TLP and similar systems such as the Cyberinfrastructure for Phylogenetic Research Gateway (CIPRES), which was developed and is maintained by researchers from the University of California, San Diego — it also runs the most current version of GARLI released in April 2011. One of the differences between the two platforms, according to the paper, is that TLP has a simple-to-use interface that essentially keeps the underlying compute infrastructure hidden from users. In contrast, "the CIPRES gateway requires the user to become familiar with their computing resources and to specify their analysis in such a way that it will complete on the allocated resource (usually only a small number of processors) within an allotted period of time."
What's also novel about TLP is its ability to leverage community resources through the BOINC project, Bazinet said. As noted in the paper, "volunteers simply download a lightweight client to their personal computer, thus enabling it to process GARLI workunits for [TLP]." As of April, over 16,000 people from 146 countries have volunteered time on their computers to the project.
Other features that set TLP apart from similar systems include its ability to support "up to 100 best tree or 2,000 bootstrap search replicates per submission [with] no resource or runtime limitations," the paper states. "This level of service [is] due to our implementation of stringent error checking, advanced scheduling mechanisms, and inclusion of novel resources such as our public computing pool of BOINC clients."
The system also performs relevant post-processing steps automatically including "computation of the best tree found or bootstrap majority rule consensus tree, and the calculation of various summary statistics and graphical representations," the researchers wrote. These steps, which include "graphical and quantitative characterizations of the set of trees inferred from multiple search replicates," are discussed in detail in the paper.
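As a rough illustration of the bootstrap majority rule consensus step, the sketch below counts how often each clade (a group of taxa) appears across replicate trees and keeps those found in more than half of them. It is not the service's post-processing code; representing each tree as a set of clades, and the name `majority_rule_clades`, are assumptions made for this example.

```python
from collections import Counter


def majority_rule_clades(replicate_trees, threshold=0.5):
    """Return clades found in more than `threshold` of the replicate trees.

    Each tree is given as an iterable of clades, and each clade is a
    frozenset of taxon names. The result maps clade -> support fraction.
    """
    n = len(replicate_trees)
    # set(tree) guards against counting a clade twice within one replicate.
    counts = Counter(clade for tree in replicate_trees for clade in set(tree))
    return {clade: count / n for clade, count in counts.items() if count / n > threshold}
```

For three replicates in which the clades {A, B} and {A, B, C} each appear twice, both are retained with support of two thirds, while clades seen only once are dropped. Assembling the retained clades into a drawn tree is a separate step not shown here.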
Overall, the GARLI service has been used in at least 50 phylogenetic studies, a number of which have leveraged the grid resources provided by TLP. According to numbers gathered in early April this year and reported in the Sys. Bio paper, more than 800 GARLI users have completed more than 4,000 analyses comprising over 2.3 million individual search replicates. One contributing factor to those numbers is a recent change in the molecularevolution.org user interface. In addition to a command line-based interface, the developers have added a web-based one that makes it easier for less computer-savvy individuals to upload data and submit analyses to GARLI.
Currently, GARLI is the main public service available on molecularevolution.org, but TLP's developers plan to incorporate additional tools. TLP has run some other applications in the past, but those are not publicly available. Exactly which tools will be added isn't clear yet. Bazinet said that there is a list of more than 20 potential services but the group hasn't yet made a decision on what to tackle next.
The developers have partnered in the past with research teams whose projects used software that required sizable amounts of computing power to run, and they are open to doing so again, provided there is good reason to use TLP's resources, for example significant demand for the tool in question, Bazinet said.