For sequence analysis, one genome may be good, but two genomes are better, and with three or more you have a bioinformatics bonanza — at least judging by the talks at the Genome Informatics conference at Cold Spring Harbor Laboratory earlier this month. Not only did researchers show how they are applying existing bioinformatics tools to multiple genomes in order to gain more insight into biological function, but it was clear that comparative genomics is providing the impetus for a new generation of sequence analysis tools to speed the process.
Even the cover illustration of the meeting’s abstract book celebrated the new batch of genomes coming online: It depicts some familiar bioinformatics characters seated at café tables in “Chez Bioinformatique,” their bottles of “Nouveau Genome” — “a full-bodied wine with a fruity bouquet of Saccharomyces cerevisiae and a hint of pombe” — nearly empty. The meeting’s talks indicated that the bioinformaticists in attendance are quaffing new genome sequence data with equal gusto.
Cold Spring Harbor Lab’s Jack Chen provided a characteristic example of how new genomes are helping researchers learn more about biology. He described how he and his colleagues used the pre-publication sequence of the nematode Caenorhabditis briggsae to assess the number, organization, expression, and function of olfactory genes in its close relative, C. elegans. Chen and his colleagues found that the C. elegans genome has around 200 more olfactory genes than C. briggsae (718 compared to 496). In more than 400 cases across the C. elegans genome, multiple olfactory genes were found within the intron of another gene, Chen said, which led to a “working model” to describe the evolutionary process by which multiple duplications occur in clusters and move through the genome together.
Jinhua Wang, also of Cold Spring Harbor Laboratory, discussed how he and his colleagues compared alternative splicing across the human and mouse genomes to characterize conserved cis-acting elements used in the regulation of alternative splicing. Out of a total set of 3,400 alternatively spliced genes in human and 3,655 in mouse, Wang and his colleagues identified 428 conserved alternative splicing events between the two genomes. They plan to study the sequence fragments for motifs that may lead to a search tool for alternative splicing regulatory elements.
David Torrents of the European Molecular Biology Laboratory, meanwhile, explained how he and his colleagues compared syntenic regions of the mouse and human genomes to detect the “big headache” of gene prediction: pseudogenes. Torrents said that in intergenic regions alone, he and his colleagues found 14,000 pseudogenes in mouse and 20,000 in human.
Other bioinformaticists are focusing on creating new tools and techniques to extract as much information as possible from multiple genomes. For example, Mark Yandell of the Berkeley Drosophila Genome Project shared an approach to make genome annotations more useful in comparative studies. Currently, “there’s not a lot of cross-talk between alignments and annotation,” Yandell said — you can Blast two genomes against each other, but the valuable annotation information doesn’t come along for the ride. To overcome this problem, Yandell and his BDGP colleagues developed a generic database schema called Chado that they used in combination with the Sequence Ontology to create machine-readable XML documents from the annotations that could be used in a Blast search to gain information about the unannotated Drosophila pseudoobscura genome.
Ewan Birney of the European Bioinformatics Institute described two new algorithms for comparative analysis under development at the EBI. One, Promoterwise, is designed for aligning sequences that are not co-linear, which can improve the detection of promoter regions or other stretches of sequence that are inverted or translocated, he said. The second, Alignwise, is being developed to align three or more genomes to predict protein-coding genes.
Inna Dubchak and Michael Brudno of Lawrence Berkeley National Laboratory and Stanford University, respectively, presented new developments for the Lagan pairwise alignment tool developed at Stanford. One variation, M-Lagan, produced a three-way alignment of the human, mouse, and rat genomes in three days on a 24-CPU Linux cluster, Dubchak said. Brudno explained how the other variation, called Shuffle-Lagan, combines local and global alignment to detect overall sequence conservation while allowing for rearrangement events, an approach that Brudno said better accounts for the evolution of DNA.
The “gee-whiz” award easily went to a new 3D genome browser from Canada’s Michael Smith Genome Sciences Center. Named Sockeye, after a local variety of salmon, the Java-based viewer displays genomic features from the Ensembl database as multicolored 3D objects atop a chromosome view that stretches off into the horizon. The visualization tool is particularly useful for comparative genomics, according to one of its developers, Mikhail Bilenky, because sequence similarity scores or gene expression data for multiple genomes can be represented along the chromosome view as bar graphs that shoot up into the third dimension. A few minutes into Bilenky’s presentation, when he provided the URL for the software (http://www.bcgsc.bc.ca/gc/bomge/sockeye/), the 3D genome viewer began popping up on laptop screens around the lecture hall as attendees took it for a spin. Version 1.0 of Sockeye is due out by summer, Bilenky said.
The most valuable bioinformatics resource for comparative genomics, however, may not be a killer app, but a killer data set. Meeting co-organizer Lincoln Stein, who confessed that he’s “having a lot of fun with comparative genomics” himself, pointed out that the current state of the art involves comparing two genomes, “but soon people will be looking at six or seven simultaneously.” Indeed, several talks at the meeting relied on the so-called “zoo” data set from Eric Green’s lab at the NIH’s Intramural Sequencing Center, which contains genomic sequence from targeted genomic regions in multiple vertebrates. Green’s data set is designed specifically for the purposes of comparative analyses and is available at http://www.nisc.nih.gov/open_page.html?/projects/zooseq/pubmap/PubZooSeq_Targets.cgi.
The zoo data is proving to be an effective testbed for new comparative analysis tools, but some researchers are setting their sights on more specialized comparative resources. Gene Myers, for example, who recently left Celera Genomics for an academic post at the University of California, Berkeley, told BioInform that he is “lobbying for another 10-12 Drosophila genomes,” which would serve as “a really wonderful data set for the community.” Myers said that the National Drosophila Board met several weeks ago in Tucson, Ariz., to prepare a white paper to be submitted to the NHGRI requesting support for a project to sequence several closely related species. “We can do 10 Drosophila genomes for one-third the cost of another barnyard animal,” Myers said.
The resulting data set would prove valuable for bioinformatics, he explained, because the smaller genomes could be stored on a laptop, would require far less computing power to process, and could lead to the development of a new generation of comparative analysis tools well before multiple mammalian genomes come online. In addition, he said, the well-established fruitfly community would provide ample opportunity for experimental verification of in silico analyses.
Others are promoting a similar project for Anopheles, arguing that the recent evolutionary history of the mosquito — not to mention its dependence on humans and its involvement in the spread of malaria — makes it of particular interest.
Clearly, as the flood of genomic data continues to increase in both breadth and depth over the next few years, bioinformatics developers will have an even wider choice of tools and resources upon which to practice their craft. “We’ve sequenced the genome, but we haven’t decoded it,” said Myers. “Now we want to know what’s in it.”