The expanding universe of bioinformatics conferences offers everything from the hard-core algorithm development of RECOMB and ISMB to the vendor-driven trade-show atmosphere of commercially organized events. But users and developers seeking to share ideas about the real-world applications of bioinformatics are often ill served by these choices, according to the organizers of one of the newer stops on the bioinformatics circuit.
Now in its second year, Genome Informatics, a joint Cold Spring Harbor Laboratory/Wellcome Trust conference, is seeking to fill that niche. Over 250 researchers gathered in the bucolic setting of the Wellcome Trust Genome Campus in Hinxton, UK, September 4-8 to present new bioinformatics tools and approaches and discuss how they are putting them to use.
“The emphasis is on working software, architectures, and applications of software,” summed up co-chair Lincoln Stein of Cold Spring Harbor Lab, which will host next year's meeting in May.
Nestled in rolling farmland a half-hour outside Cambridge, the secluded Hinxton campus offered few distractions, and the conference schedule left ample time for networking. A number of speakers noted that their talks were inspired by impromptu discussions at last year's gathering.
Attendees were presented with a broad range of approaches. With sequencing projects ramping up for everything from microbes to primates, high-throughput workflow pipelines and comparative approaches stood out as the key topics, while the old standbys of assembly, annotation, and gene prediction tools played a secondary role.
Comparing Apples, Oranges, Mice, Yeast, Chimps...
Orly Alter from Stanford University presented a technique developed with Pat Brown and David Botstein to compare gene expression datasets from two different organisms — the first such method to perform this task, according to Alter. Using generalized singular value decomposition, a technique borrowed from the machine vision community, Alter reduced the whole-genome expression sets of yeast and human to smaller sets of “genelets” and “arraylets” that represent the significance of genes in each dataset relative to the other. The resulting model is useful for normalizing expression data as well as classifying genes from both organisms into similar functional groups, Alter said.
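The published formulation isn't reproduced here, but the core of the GSVD idea can be sketched in a few lines of Python. The sketch below is an illustration built on one standard route to the GSVD (via a generalized symmetric eigenproblem), not Alter's actual code: two expression matrices with the same arrays are factored as e_i = U_i diag(s_i) X^T, sharing the "genelet" basis X while each keeps its own orthonormal "arraylet" basis U_i.

```python
import numpy as np
from scipy.linalg import eigh

def gsvd(e1, e2):
    """Toy GSVD of two genes-x-arrays matrices sharing the same arrays.

    Returns U1, s1, U2, s2, X with e_i ~= U_i @ diag(s_i) @ X.T,
    U_i orthonormal and s1**2 + s2**2 == 1 columnwise.
    """
    # The generalized eigenvectors of (e1'e1, e2'e2) are the columns of
    # X^{-T} when e_i = U_i diag(s_i) X^T with orthonormal U_i.
    _, V = eigh(e1.T @ e1, e2.T @ e2)
    P1, P2 = e1 @ V, e2 @ V              # = U_i diag(s_i), up to scaling
    n1 = np.linalg.norm(P1, axis=0)
    n2 = np.linalg.norm(P2, axis=0)
    U1, U2 = P1 / n1, P2 / n2
    c = np.hypot(n1, n2)                 # normalize so s1^2 + s2^2 = 1
    s1, s2 = n1 / c, n2 / c
    X = np.linalg.inv(V).T * c           # rescale the shared basis to match
    return U1, s1, U2, s2, X
```

The ratio of paired singular values then plays the role described in the talk: a genelet with s1 much larger than s2 is significant mainly in the first dataset, and vice versa, which is what makes the decomposition useful for cross-organism normalization and classification.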
Several new tools for multiple sequence alignment from Serafim Batzoglou at Stanford can align regions of much greater length than was previously possible, he said. Lagan, a pairwise alignment tool, and M-Lagan, a multiple alignment system built on Lagan, use a combination of a local aligner, sparse dynamic programming, and the Needleman-Wunsch algorithm to rapidly align megabase-sized genomic regions of orthologous species. Batzoglou said that M-Lagan aligned 12 vertebrate regions of up to 1.8 Mbp in 4.5 hours using 400 MB of memory on a 2 GHz Pentium 4. The Lagan package is available at lagan.stanford.edu.
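The anchor-and-chain strategy behind this class of aligner can be illustrated with a toy sketch. The function below is a naive quadratic stand-in for the sparse dynamic programming Lagan uses (invented data, not Lagan's code): given local-alignment anchors scored in both sequences, it selects the highest-scoring subset that is colinear in both, which then restricts the region the full Needleman-Wunsch pass must explore.

```python
def chain_anchors(anchors):
    """Chain local-alignment anchors (x, y, score): return the
    highest-scoring subset with strictly increasing x AND y.
    Naive O(n^2) dynamic program, for illustration only."""
    if not anchors:
        return []
    anchors = sorted(anchors)                 # by x, then y
    best = [s for _, _, s in anchors]         # best chain score ending at i
    prev = [-1] * len(anchors)
    for i, (xi, yi, si) in enumerate(anchors):
        for j in range(i):
            xj, yj, _ = anchors[j]
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i], prev[i] = best[j] + si, j
    # Backtrack from the best chain end
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]
```

Real chainers replace the inner loop with sparse dynamic programming over sorted anchor endpoints, which is what makes megabase-scale inputs tractable.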
Doublescan, a program from the Sanger Institute’s Irmtraud Meyer, tackles two bioinformatics jobs at once: It predicts gene structures while aligning the sequences of two genomes. The method works only on evolutionarily related organisms, and was applied successfully to the comparative analysis of mouse and human as well as C. elegans and C. briggsae, Meyer said. A paper describing the method will appear in the October issue of Bioinformatics.
A group of researchers from the Max Planck Institute, Genoscope, and CNRS applied several comparative analysis techniques to their study of human chromosome 21. Using the completed sequences of several organisms, they were able to add 13 new genes to the 225 protein-coding genes predicted with Genscan, MZEF, and Grail in 2000. Exon structures, coding sequences, and coordinates are available at chr21.molgen.mpg.de.
In a talk and a poster, the Whitehead Institute shared some details about its Calhoun pipeline for automated whole-genome annotation, which it is currently using to annotate the human genome as well as several microbes and eight fungi, including Neurospora crassa. The system offers a flexible interface that can be configured for three core use cases: species-specific websites with annotation data, structured searches for biologists interested in a particular gene or pathway, and downloadable data sets with data mining options for users who want to perform their own analysis.
Jeff Nie of the Medical College of Wisconsin discussed the ASAP (A Systematic Annotation Package) pipeline developed for the rat genome. The pipeline contains modules for sequence extension, analysis, annotation, and visualization, and integrates a number of bioinformatics applications, including Phred/Phrap, RepeatMasker, MetaGene, Blast, and LocusLink. ASAP is available at asap.ahabs.wisc.edu/annotation/php/ASAP1.htm.
Japan’s Rice Genome Research Program has also developed an automated annotation pipeline called RiceGAAS (Rice Genome Automated Annotation System). The system performs a Blast-based homology search as well as gene prediction using several programs and integrates analytical results to establish coding regions. The system is available at ricegaas.dna.affrc.go.jp.
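RiceGAAS's actual integration logic is not spelled out in the talk summary, but the general idea of reconciling several predictors' output can be sketched as simple coverage voting over candidate coding intervals. The sketch below, with hypothetical half-open (start, end) intervals, keeps only regions supported by a minimum number of programs:

```python
def merge_predictions(predictions, min_votes=2):
    """Toy consensus over gene predictors.

    predictions: one list of half-open (start, end) coding intervals per
    program. Returns maximal regions covered by >= min_votes programs.
    (A real integrator like RiceGAAS weighs evidence far more carefully;
    this only illustrates combining predictor output.)"""
    events = []
    for intervals in predictions:
        for start, end in intervals:
            events.append((start, 1))     # coverage depth rises at start
            events.append((end, -1))      # ...and falls at end
    events.sort()
    merged, depth, open_at = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_votes and open_at is None:
            open_at = pos                 # enough support begins here
        elif depth < min_votes and open_at is not None:
            merged.append((open_at, pos)) # support dropped: close region
            open_at = None
    return merged
```

A base-level vote like this is the crudest possible consensus; it still shows why combining predictors tends to trim the spurious exons any single program calls on its own.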
The Institute of Molecular and Cell Biology in Singapore has developed a workflow pipeline for annotating the Fugu genome called BioPipe (www.biopipe.org). The system is based on the annotation pipeline developed for Ensembl but is oriented more toward comparative genomics, according to developer Elia Stupka, and is currently being used to annotate Ciona and rice.
Laughing All the Way to GenBank
“How many of you have a local copy of GenBank?” Robert Citek of Orion Genomics asked during his talk. More than half the audience raised their hands. “How many of you have it on your laptop?” he asked next. Aside from his own, only one other hand went up. Citek proudly shared his secret for shrinking the formidable resource down to laptop size.
His system, called MyGenBank, developed with Ian Korf of Washington University, consists of sequence data, a MySQL database, and Perl scripts for managing and querying the database. MyGenBank can be configured to optimize either speed or disk space, and is available at http://sapiens.wustl.edu/~ikorf/MyGenBank.html#intro.
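MyGenBank's schema and Perl internals aren't described here, but one standard way a local sequence store can trade speed for disk space is to pack DNA into two bits per base before writing it to the database, paying a small decode cost on every query. A hypothetical Python illustration of that trade-off (not taken from the MyGenBank code):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack an unambiguous DNA string into 2 bits per base,
    prefixed with a 4-byte big-endian length."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    nbytes = (2 * len(seq) + 7) // 8
    return len(seq).to_bytes(4, "big") + bits.to_bytes(nbytes, "big")

def unpack(blob):
    """Invert pack(): recover the original DNA string."""
    n = int.from_bytes(blob[:4], "big")
    bits = int.from_bytes(blob[4:], "big")
    return "".join(BASE[(bits >> (2 * (n - 1 - i))) & 3] for i in range(n))
```

Storing sequence this way roughly quarters the footprint of plain text; keeping it as text instead skips the decode step, which is the same speed-versus-space dial the talk described.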
Back to the Bench?
Genetics pioneer Sydney Brenner ruffled a few feathers in his keynote address, in which he pitched his latest pet project — creating a function-based cell map by 2020.
“I’m worried about what’s called bioinformatics,” Brenner said. “There are a lot of my contemporaries who believe it’s just people who don’t want to work in a lab and want to laze in front of a computer screen.”
An early proponent of computational approaches to biological research, Brenner said he doesn’t necessarily share his peers’ views, but disagrees with “people who think they can find everything that way.”
Never one to mince words, Brenner challenged the assembled bioinformaticists to “forget the genome.”
“The more you annotate the genome, the more you make it opaque,” he said. “We need to focus on our cells.”
Brenner questioned the ability of computational approaches to derive functional knowledge from genomic sequence alone. The future, he posited, requires going back to the bench. Old-fashioned data on the biochemistry of the cell would then be used to flesh out the cell map, which would serve as “a framework to think of genomes and their products.”
The 2020 completion date was chosen based on the time it took the Human Genome Project to bear fruit from the time of its conception, Brenner said, adding that the association with “good vision” played a bit of a role as well.
Working in a few zingers for the genomics community, Brenner at one point joked that he settled upon the term “instantiation problem” to describe a component of the cell map project because it would be difficult to tack on the ubiquitous “omics” suffix.
And his definition of data mining — “What’s my data is mine and what’s your data is also mine” — was met with a burst of cheers and applause.