From the syntactic details of the mythical Perl 6 to “beauty shots” of production-grade bioclusters, the O’Reilly Bioinformatics Technology Conference had plenty to offer its target audience. In short, it was “a geekfest,” as the University of British Columbia’s Francis Ouellette noted in his opening keynote.
With its line of bioinformatics books growing, O’Reilly has found a niche in the competitive conference circuit among the Perl-hacking bioinformatics masses. Talks at this year’s conference, held Feb. 3-6 in San Diego, offered a good mix of the disciplines that define the field’s practitioners — from pure biology to software engineering to infrastructure design — but at its core, the meeting tended to stick to a single key theme: tools and how to use them.
A number of talks addressed solutions for the bioinformatics problem that just won’t go away: data integration. John McNeil of Isis Pharmaceuticals shared the details of a system his company developed to solve one aspect of this problem. Isis wanted to be able to store not just its data, but the relationships between biological entities — a task that traditional object-relational databases are not equipped to handle, McNeil said. In response, Isis developed a graph-based “semantic network” called MetaGraph that represents biological objects and the relationships between them as nodes and edges. Two viewers, called MeshView and Pluggable Object Viewer (POV), enable Isis biologists “to get answers to their questions without writing code,” McNeil said. The company is making all of its MetaGraph tools available under an open source license through a new website, www.metagraph.org.
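McNeil didn’t present MetaGraph’s internals, but the general idea of a semantic network — entities as nodes, typed relationships as edges — can be sketched in a few lines. All class, relation, and gene names below are illustrative assumptions, not MetaGraph’s actual API:

```python
# Minimal sketch of a "semantic network": biological entities as nodes,
# typed relationships as directed edges. Names here are hypothetical,
# chosen only to illustrate the nodes-and-edges idea.
from collections import defaultdict

class SemanticNetwork:
    def __init__(self):
        # edges[source] holds a list of (relation, target) pairs
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        """Record a typed relationship between two entities."""
        self.edges[source].append((relation, target))

    def query(self, source, relation):
        """Answer questions like 'what does TP53 regulate?' given only
        a node name and a relation type -- no custom code required."""
        return [t for r, t in self.edges[source] if r == relation]

net = SemanticNetwork()
net.relate("TP53", "regulates", "CDKN1A")
net.relate("TP53", "binds", "MDM2")
print(net.query("TP53", "regulates"))  # ['CDKN1A']
```

A graphical viewer like MeshView or POV would sit on top of a structure like this, letting a biologist browse and query the graph interactively rather than writing queries by hand.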
Others are putting web services and grid-based technology to work on the integration problem. Robert Grossman of the Laboratory for Advanced Computing at the University of Illinois at Chicago discussed how he is integrating distributed bioinformatics data using a “data web,” a web-based model that is not bound by the security requirements of what is often referred to as a “data grid.” While grids are getting a lot of hype in bioinformatics, Grossman said, biologists don’t really need access to the extra compute cycles that grid technology is based on. Rather, he said, they need handy access to other people’s data. Grossman cited an emerging web services standard — the Data Web Transfer Protocol (DWTP) — as an efficient means to transfer data over the web.
Alan Robinson of the European Bioinformatics Institute expressed a similar view in his discussion of the UK’s MyGrid project. While most biogrids to date are centered around sharing compute resources, he said, “I would contend that most biologists have fairly modest requirements for compute power. What is most important to them is discovering resources, discovering tools, and being able to capture when, where, and why [they] did a [certain task].” MyGrid, which is being developed as a semantically rich layer to sit on top of a compute grid infrastructure, will eventually enable scientists to find and retrieve the data and applications they want through a web services architecture. Robinson said that two software packages that the EBI has developed for MyGrid are already available for download (SoapLab, at industry.ebi.ac.uk/soaplab, and Talisman, at sourceforge.net/projects/talisman/).
The I3C, which had an entire track to itself on the last day of the conference, is also pushing ahead with a web services-based approach to interoperability. Tim Clark, who provided an update of the consortium’s activities, noted that the LSID (life science identifier) specification is currently under review by the group’s scientific advisory board. The I3C will host a hackathon at its next meeting, May 5-9 in Boston, to add LSID identifiers to another web services integration project, BioMoby.
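An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision field. The snippet below is a sketch of how such an identifier might be pulled apart, based on that draft layout; the example identifier and the function itself are illustrative, not part of any I3C or BioMoby code:

```python
# Sketch of parsing a life science identifier (LSID). The layout
# urn:lsid:authority:namespace:object[:revision] follows the draft
# specification; this parser and the sample LSID are illustrative only.
def parse_lsid(lsid: str) -> dict:
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a valid LSID: {lsid!r}")
    return {
        "authority": parts[2],   # issuing organization, e.g. a domain name
        "namespace": parts[3],   # e.g. a database within that authority
        "object": parts[4],      # the record identifier itself
        "revision": parts[5] if len(parts) > 5 else None,  # optional
    }

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:U34074"))
```

The point of the scheme is that any data object, anywhere, gets one stable name — which is what makes an exercise like tagging BioMoby objects with LSIDs a plausible hackathon-sized task.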
Open Source for Pharma, Too
The O’Reilly brand is inextricably linked with open source software, so it wasn’t surprising to find a number of talks on the benefits of open source bioinformatics at the conference. But it wasn’t just the usual suspects this time around. R. Mark Adams, director of bioinformatics at Nuvelo (the new name for the merged Variagenics and Hyseq), discussed the advantages of using open source software in a 21 CFR Part 11-regulated environment, of all places. According to Adams, the FDA’s 21 CFR Part 11 ruling, which covers electronic signatures for the digital submission of drug applications, is “an unknown fruitful area for open source software.” Not only is open source software as good as its commercial counterparts, he said, but in some cases it may offer advantages.
For example, Adams noted, the grueling validation process for software in a regulated environment would benefit from the transparent development practices that most open source projects follow. Because bug tracking, coding standards, development history, and documentation are by definition openly available, the central mission of 21 CFR Part 11 — “proving your software does what it says it does” — becomes much easier, Adams said.
In Nuvelo’s case, cost savings weren’t even among the benefits of using open source software. In fact, Adams estimated that for one commercial clinical trial software package the company did use, consulting and validation fees came to nine times the cost of the initial software license. “It’s not about the money,” Adams said. “The demands of software validation favor access to the source code.”
While the company eventually went with the commercially supported Documentum to manage its records, Adams said that Nuvelo had considered using a modified version of the open source version control software CVS (Concurrent Versions System), but didn’t have the time or resources to create a validatable version of CVS on its own. Adams noted that such a project would be of great value to the biopharmaceutical industry.
The Means to the End
Despite the heavy focus on tools, toolkits, and software development at the conference, there were a few not-so-subtle reminders about the fundamental purpose behind bioinformatics: to further biological research. In a provocative keynote address, Cold Spring Harbor Laboratory’s Lincoln Stein predicted that the field of bioinformatics will be absorbed into the broader field of biology within 10 years, and that those interested in pursuing a career in the field should keep in mind that “it’s the biology, stupid.”
As if to prove Stein’s point, two of the meeting’s other keynote speakers known for their significant contributions to the bioinformatics toolkit — Jim Kent of the University of California, Santa Cruz, and Steven Brenner of the University of California, Berkeley — delivered talks that focused heavily on their biological research. Kent discussed the biological importance of having a complete and accurate list of all the genes in the human genome, and noted that current computational gene-finding techniques “are not enough to get us the genes, the whole genes, and nothing but the genes.” Ultimately, Kent said, wet lab methods will be the only way to arrive at the final human gene count.
Brenner described his own wet lab adventures — a first for his computational group, he noted — as he detailed the discovery process behind a potentially significant finding. What originally started as a purely computational project to predict protein domain organization using alternative splicing data led to the discovery of a set of splicing factors that mediates alternative splicing. His group is currently verifying its prediction that up to one-third of human genes may be regulated by these splicing factors. The finding would have been impossible without the use of bioinformatics techniques, but Brenner’s talk underscored the importance of wet lab verification for any hypotheses derived using computational methods, and emphasized that the goal of biological research — regardless of the methods used — is discovery.