COLD SPRING HARBOR, NY - Don't let the name of the meeting mislead you: Anyone who attended this year's Genome Informatics conference expecting five days of talks on genome sequence analysis got a lot more than they bargained for.
The annual meeting, co-organized by Cold Spring Harbor Laboratory and the Wellcome Trust, kicked off five years ago with a heavy emphasis on sequence alignment, genome annotation, and sequence analysis pipelines and tools. But this year's gathering underscored how broadly the field of bioinformatics has expanded during that time: Only one session was devoted to sequence alignment, assembly, and annotation, while new sessions were added to address epigenomics and image informatics, and two sessions were devoted to pathway and network informatics, up from one on the topic last year.
A number of attendees noted that the emphasis of the talks has also moved beyond software and database development to the application of those methods to gain biological understanding.
The conference, which began on Oct. 28 and ended Nov. 1, also presented a rare opportunity for several Halloween-themed bioinformatics moments, including a PowerPoint slide that placed a "ghoul" on a phylogenetic tree, and another that substituted a jack-o-lantern for the spliceosome. The conference organizers (Lincoln Stein of CSHL, Tim Hubbard of the Wellcome Trust Sanger Institute, and Suzi Lewis of the University of California, Berkeley) each donned a letter to make up the AUG codon for the evening session on Halloween night.
Untangling Protein-Protein Interaction Networks
Analysis of protein-protein interaction networks is gathering steam in the bioinformatics community, and six talks at the meeting addressed the challenges of gaining knowledge from growing stockpiles of protein interaction data. Francis Ouellette of the University of British Columbia Bioinformatics Center highlighted the difficulties that many researchers experience in integrating protein interaction networks from multiple resources.
Ouellette said that UBC's Atlas data warehouse includes data from DIP, MINT, BIND, HPRD, MIPS, and IntAct, but integration has proven to be tricky. To truly integrate these databases, he said, it's necessary to "define equivalence" between them. But using a somewhat "strict" definition of equivalence that required the same number of interactions, the same interactors, and a shared PubMed identifier resulted in only three proteins that were shared across five databases, a number that Ouellette described as "kind of sad."
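The "strict" equivalence test described above can be illustrated with a minimal sketch. It is not taken from Atlas: the record layout, the toy database contents, and the function names are hypothetical, and the sketch assumes only that each database maps a protein to a list of (interactor, PubMed ID) records.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: database name -> protein -> [(interactor, PubMed ID), ...]
Records = Dict[str, Dict[str, List[Tuple[str, str]]]]


def strictly_equivalent(protein: str, databases: Records) -> bool:
    """Return True if every database lists the protein with the same number
    of interactions, the same interactors, and at least one shared PubMed ID."""
    entries = []
    for db in databases.values():
        if protein not in db:
            return False  # missing from one database: cannot be equivalent
        entries.append(db[protein])
    if not entries:
        return False
    counts = {len(e) for e in entries}
    partner_sets = {frozenset(partner for partner, _ in e) for e in entries}
    shared_pmids = set.intersection(*[{pmid for _, pmid in e} for e in entries])
    return len(counts) == 1 and len(partner_sets) == 1 and bool(shared_pmids)


def shared_proteins(databases: Records) -> List[str]:
    """List the proteins that pass the strict test across all databases."""
    all_proteins = set().union(*(db.keys() for db in databases.values()))
    return sorted(p for p in all_proteins if strictly_equivalent(p, databases))


# Invented toy data: P1 agrees in both sources, P4 differs in interaction count.
toy = {
    "db_a": {"P1": [("P2", "111"), ("P3", "222")], "P4": [("P5", "333")]},
    "db_b": {"P1": [("P2", "111"), ("P3", "999")],
             "P4": [("P5", "444"), ("P6", "333")]},
}
print(shared_proteins(toy))  # ['P1']
```

Because all three conditions must hold in every database at once, a single missing or extra record is enough to disqualify a protein, which is one way to see why so strict a definition leaves so few survivors.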
Ouellette said that efforts such as the Proteomics Standards Initiative's PSI-MI 2.5 format and the recently launched IMEx consortium should help improve integration between interaction databases, but these projects are not moving fast enough for many researchers, especially as new protein-protein interaction data sets are coming online. "There's still no published plan for sharing interaction data," he said.
Others are making progress in aligning protein-interaction networks from different species. Roded Sharan of Tel Aviv University noted that the number of available protein interaction maps has increased from one species in 2000 to eight now, with more on the horizon, so bioinformaticists will need to develop better and faster tools to compare these networks across species. Sharan described a network alignment algorithm he is developing called QPath, which first assigns a confidence score to each interaction and then uses dynamic programming to match individual proteins and interactions across species.
In a comparison of the networks for yeast and fly, Sharan said that QPath identified "functionally enriched" pathways that were shared between the two organisms, indicating that pathway homology can be used to predict function across species.
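The general idea of combining interaction confidence scores with dynamic programming to match a pathway against another species' network can be sketched as follows. This is not QPath itself: it is a simplified illustration that allows no insertions or deletions, does not prevent a protein from being reused along the matched path, and simply adds raw scores together. All protein names, similarity scores, and confidence values are invented.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def best_path_match(
    query: List[str],
    similarity: Dict[Tuple[str, str], float],
    confidence: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Match a linear query pathway to a path in a target network.

    similarity[(q, v)] scores query protein q against target protein v;
    confidence[(u, v)] is the confidence of an undirected target interaction.
    Returns (total score, matched target path), or (-inf, []) if no match.
    """
    # Build a symmetric adjacency map from the undirected edge list.
    adj: Dict[str, Dict[str, float]] = {}
    for (u, v), c in confidence.items():
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    if not query or not adj:
        return NEG_INF, []
    targets = sorted(adj)

    # best[i][v]: best score for matching query[0..i] with query[i] -> v.
    best = [{v: similarity.get((query[0], v), NEG_INF) for v in targets}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(query)):
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for v in targets:
            row[v] = NEG_INF
            sim = similarity.get((query[i], v), NEG_INF)
            if sim == NEG_INF:
                continue
            for u, conf in adj[v].items():
                candidate = best[i - 1][u] + conf + sim
                if candidate > row[v]:
                    row[v], ptr[v] = candidate, u
        best.append(row)
        back.append(ptr)

    end = max(targets, key=lambda v: best[-1][v])
    score = best[-1][end]
    if score == NEG_INF:
        return NEG_INF, []
    # Follow the back-pointers from the last matched protein to the first.
    path = [end]
    for i in range(len(query) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return score, path[::-1]


# Invented example: b1 is slightly more similar to B, but the interactions
# through b2 carry higher confidence, so the path through b2 scores best.
query = ["A", "B", "C"]
similarity = {("A", "a1"): 1.0, ("B", "b1"): 0.9,
              ("B", "b2"): 0.8, ("C", "c1"): 1.0}
confidence = {("a1", "b1"): 0.2, ("a1", "b2"): 0.9,
              ("b1", "c1"): 0.3, ("b2", "c1"): 0.9}
score, path = best_path_match(query, similarity, confidence)
print(path, round(score, 2))
```

The toy example shows why scoring the interactions matters: a match chosen on sequence similarity alone would route through b1, while the confidence-weighted score prefers the better-supported path.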
Antal Novak of Stanford University described another network alignment method, called Nuke, which he said offers the same performance as other alignment tools, such as NetworkBlast (previously called PathBlast) from the University of California, San Diego, and MaWish from Purdue University, but with a faster running time.
Novak's method scores the nodes and the edges of the networks separately, with node scoring based on the "joint probability of evolutionary events" occurring between two proteins. Nuke borrows the idea of "seeded" alignments from Blast, Novak said, but extends it to the network by defining a seed as a cluster of nodes. The alignment begins by matching clusters of nodes with the closest node score.
Silpa Suthram of UCSD discussed work related to a paper that was published in this week's Nature on the protein-protein interaction network for Plasmodium falciparum [Nature 438, 108-112]. The UCSD team used NetworkBlast to align the network with those of Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori.
After extensive checking and cross-checking to rule out experimental noise or bias, Suthram said the team was surprised to conclude that Plasmodium has only three conserved complexes with yeast, and "almost nothing" in common with the other organisms. While the results are puzzling, and will require more experimental work, Suthram said these results indicate that "conservation of genes doesn't necessarily imply conservation of interactions."
Another area of rapid development is epigenomics, which Anne Ferguson-Smith of Cambridge University defined as "the relationship between whole-genome organization and epigenetic modification." Genomic sequence ultimately gives rise to molecular machinery that causes DNA methylation, histone modification, and other changes to DNA and chromatin that play an important role in biological function. The hard part is identifying the sequence motifs that result in those epigenetic changes. Another problem, Ferguson-Smith said, is that "there are many epigenomes," because epigenetic modifications vary widely among tissue types and cell states.
John Greally of the Albert Einstein College of Medicine described bioinformatics methods to identify DNA sequence motifs that influence epigenetic organization using ChIP-chip experiments for the CTCF transcription factor. Greally and his collaborators used a motif-finder called Sombrero to identify a set of motifs that was similar to a set of known CTCF binding sites, but the problem is that "it crashes our supercomputer."
The main challenge of epigenomic analysis, Greally said, is that it has very high dimensionality. "Just managing this data, let alone analyzing it, will be a challenge," he said. On the bright side, epigenomics shows signs of being "the equivalent of a public works project for bioinformaticists over the next several years," assuming the funding agencies are willing to support it.
Paul Flicek of the European Bioinformatics Institute discussed a "work in progress" to use a hidden Markov model to analyze ChIP-chip experiments for the ENCODE regions of the human genome. So far, he said, the approach appears to be "more sensitive" than other analysis methods in identifying the locations of histone modification, but there are still unresolved questions about the specificity of the method. Ultimately, he said, "we want to develop an analysis pipeline for ChIP-chip data in Ensembl."
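How a hidden Markov model can segment probe-level ChIP-chip signal into enriched and background regions can be illustrated with a minimal two-state sketch. This is not the method Flicek described: the two states, the Gaussian emission parameters, the transition probabilities, and the probe values below are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(
    signal: Sequence[float],
    means: Sequence[float] = (0.0, 2.0),  # assumed: background, enriched
    sds: Sequence[float] = (1.0, 1.0),    # assumed emission spread
    stay: float = 0.9,                    # assumed chance of keeping a state
    start: Sequence[float] = (0.9, 0.1),  # assumed initial state probabilities
) -> List[int]:
    """Most likely state path (0 = background, 1 = enriched) for the probes."""
    if not signal:
        return []
    states = range(2)
    log_trans = [[math.log(stay if i == j else 1.0 - stay) for j in states]
                 for i in states]
    score = [math.log(start[s]) + gaussian_logpdf(signal[0], means[s], sds[s])
             for s in states]
    back: List[List[int]] = []
    for x in signal[1:]:
        new_score, ptr = [], []
        for s in states:
            # Best previous state from which to enter state s at this probe.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s]
                             + gaussian_logpdf(x, means[s], sds[s]))
            ptr.append(prev)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Invented log-ratio values for ten consecutive probes.
log_ratios = [0.1, -0.2, 0.0, 2.1, 2.4, 1.9, 2.2, 0.1, -0.1, 0.2]
print(viterbi(log_ratios))  # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
```

The transition probabilities act as the trade-off between sensitivity and specificity in this toy model: a lower `stay` value lets the model call shorter or weaker enriched regions, at the cost of more spurious calls.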
Jason Swedlow of the University of Dundee opened up the image informatics session with the observation that he and his colleagues are either the "black sheep" of the informatics field, "or a sign of where things are going."
Describing a sea of image data that could easily swallow most traditional bioinformatics databases, Swedlow said that the field has "a lot of real problems that people are desperately trying to solve." As an example, he said that his lab currently has 30 terabytes of stored data, which is growing at a rate of around a terabyte every five weeks. In addition, he said, most data-acquisition systems for imaging instruments write in proprietary file formats, so a typical informatics analysis task involves first converting the files to TIFF images and then extracting the numbers into an Excel spreadsheet so they can be integrated with other data. Amazingly, this seems to work most of the time, Swedlow said, "but as we scale, we start to fail."
But there is progress in the field. Zhirong Bao of the University of Washington described an algorithm for tracing cell lineage during embryogenesis. Using a time-lapse imaging strategy that captures the process at high resolution, the algorithm can automatically trace the cells as they divide, and assign them cell identities based on a naming scheme. The method can trace the lineage of the cells up to the 250-cell stage in 30 minutes on a desktop computer, Bao said.
The method can also handle lineage mutants, Bao said, so it could be used to track the division of cells in an RNAi screen.
Bernadette Toner