The National Evolutionary Synthesis Center, a National Science Foundation-funded center to promote multidisciplinary research in evolutionary biology, has enlisted the help of the bioinformatics open source software community to build a better set of phyloinformatics software tools.
This week, the center, known as NESCent, hosted a so-called hackathon with the goal of building the “glue” to link together disparate complex phylogenetic software programs into seamless workflows.
Around two dozen developers from around the world gathered at NESCent’s headquarters in Durham, NC, from Dec. 11 to Dec. 15. Participants included developers of phylogenetic software packages like HyPhy, CIPRES, and PAUP, as well as representatives of key bioinformatics open-source packages like BioPerl, BioJava, BioPython, and BioRuby.
The goal of the hackathon was to enhance the capabilities of the open-source toolkits to work with phylogenetic software packages. In particular, developers worked on improving support for Nexus, a file format used in many of the leading phylogenetic software packages, and on building data models for exchanging phylogenetic information.
Phyloinformatics faces considerable challenges compared to more established bioinformatics disciplines like genome informatics. There are hundreds of highly sophisticated phylogenetic software packages available — Joe Felsenstein of the University of Washington, the author of Phylip, has compiled a list of 282 phylogeny programs and 34 servers on his website — but little, if any, interoperability between those applications.
“There are many, many ways you can analyze your data, and therefore there is a huge opportunity for aligning these methods together into workflows or pipelines … to ask really sophisticated questions,” said Hilmar Lapp, assistant director of informatics at NESCent.
However, Lapp added, “there is a high barrier of entry for people who aren’t experts with those tools … because they use different formats for input and output, oftentimes their own idiosyncratic ones.” In addition, he said, “even the standards that do exist in the field are used in ways that make ready exchange of data very difficult or impossible.”
Lapp, who organized the hackathon, said the goal was to “step in and create essentially the glue code that languages like Perl have been classically good at, if you look at the genome projects.”
NESCent was founded in 2004 with a five-year, $15 million grant from the NSF as a collaboration between Duke University, North Carolina State University, and the University of North Carolina, Chapel Hill. The “synthesis” in the center’s name refers to its emphasis on synthetic science, which relies on pulling together existing data to gain understanding about a particular field, in NESCent’s case evolutionary biology.
The center has an informatics team of around six people that supports around two dozen researchers.
Todd Vision, associate director of informatics at NESCent, said that the center has a number of internal development projects underway, but noted that the phylogenomics hackathon “is probably the biggest effort we’ve made to date in this area.”
Vision said that NESCent wanted to bring together “the two communities (the developers of the software, who are very sophisticated about the methods involved but haven’t had much experience writing the open source bioinformatics toolkits of this sort, and developers from BioPerl, BioJava, BioPython, and BioRuby) to add support for those particular kinds of software and the toolkits so that they can be strung together.”
The results of the hackathon will reach end users in the form of more user-friendly analysis. “To some extent, a lot of what’s going to be done here will be sort of invisible to the user, and that’s sort of the goal,” Vision said. “To make it so that there’s not more that they’ll have to learn, but hopefully less that they’ll have to do in terms of understanding idiosyncratic formats and going out of one application to go into another to do analysis that requires more than one application.”
Lapp acknowledged that not all of the open source toolkits will provide full support for phyloinformatic workflows by the end of the week, “because they’re starting from vastly different levels.” BioPerl, for example, is much more mature than most of the other toolkits, but the hope is that “they’ll all significantly advance their level of integrating data models and data formats and interfaces to programs.”
Vision added that many projects begun this week will likely continue for some time. “There are a lot of things that have been identified as future needs,” he said, “and now we have the collaborations that have emerged from this to work on them in the future, so I’m sure that the hackathon in spirit will live on longer than this week.”
Further information on the hackathon is available on its wiki site.