RESTON, Va. — Everything old was new again at this year’s TIGR/Jackson Laboratory Computational Genomics conference, held here Oct. 21-24, where talks focused on genome assembly, annotation, gene prediction, and other old-school bioinformatics methods.
Coinciding with the publication on Oct. 22 of the “final” version of the human genome, which pegged the upper end of the human gene count at a seemingly low 25,000, the meeting underscored just how far first-generation bioinformatics methods like ab initio gene prediction still have to go.
Richard Durbin, head of the informatics division at the Wellcome Trust Sanger Institute, said that genome sequences for new species, such as mouse, were expected to significantly improve the characterization of the human genome, but this has not been the case. “After the mouse was published, we thought we’d be able to write a quick paper nailing down the human genes, but that’s still an open question,” he said. “It’s a desperate situation.”
Durbin said during his keynote address that the Doublescan algorithm that he and his colleagues at Sanger developed to simultaneously align the sequences of two species and predict their genes “didn’t give the performance gain [over single-genome methods] that we expected.”
Lior Pachter of the University of California, Berkeley, who co-developed a similar method called SLAM, echoed Durbin’s disappointment during a special Friday-evening gene-prediction workshop. “We’re thinking very hard about this, but it’s very difficult to do what seems obvious in a proper way,” he said. The problem, he explained, is that “we need the alignments to do good annotations, and we need good annotations to do the alignments.”
But the stewards of the genome are taking steps toward improving this situation. Durbin said that Sanger’s computational and manual curation teams are working with NCBI and the UCSC genome bioinformatics group to “converge” on a consensus set of human coding sequences that all the public resources agree on. That set currently stands at around 16,000 genes, he said.
In addition, NHGRI’s ENCODE (Encyclopedia of DNA Elements) project is planning on sponsoring an ab initio gene prediction assessment workshop modeled after the CASP (Critical Assessment of Structure Prediction) effort for protein structure prediction. Roderic Guigó of the Institut Municipal d’Investigacio Medica in Barcelona discussed the proposed workshop in his talk outlining gene prediction efforts underway as part of ENCODE.
Guigó said the VEGA (Vertebrate Genome Annotation) database compiled by Sanger’s HAVANA (Human and Vertebrate Analysis and Annotation) group has been updated to include manually curated data for 10 out of the 13 ENCODE regions. The next step, he said, will be experimental validation of those genes, along with additional computational predictions to identify regions that may have been overlooked in the manual step.
A gene prediction workshop would have two goals, Guigó said: to evaluate how well computational methods are able to reproduce manual and experimental methods, and to “assess the completeness of our current knowledge.”
Details of the assessment workshop are still being finalized, Guigó said, but the HAVANA/VEGA annotations should be complete by the end of the calendar year, at which point participants would submit predictions for the ENCODE regions whose HAVANA/VEGA annotations have not been publicly disclosed. Those annotations would then be released sometime in the spring, when the workshop would convene to discuss the methods and the results.
In another sign that interest in gene-prediction methods is actually gaining rather than diminishing in the bioinformatics community, organizers for a new website, www.genefinding.org, announced the new resource during the meeting. Bill Majoros, a TIGR researcher and co-founder of the website, told BioInform via e-mail that it is intended to be “institute-neutral and to be used both by gene-finding researchers as a way to collaborate with colleagues and as a way for others to get information about gene finding, including both informal descriptions of gene-finding approaches, as well as more concrete resources such as publicly available gene finders, source code, training data, etc.”
Majoros added that the standing-room-only gene prediction session at Computational Genomics “was a huge success, and several of the organizers of [the conference] have started talking about possibly making the workshop an annual event.”
But gene prediction wasn’t the only well-established bioinformatics method getting a thorough re-examination at the conference. A number of talks focused on genome assembly — a topic that many in the bioinformatics community might have considered solved several years ago. It turns out that substantial challenges remain, both for assembling the genomes of new species using capillary sequencing and for handling the reads generated by next-generation sequencing methods.
Two researchers from the Broad Institute — Manfred Grabherr and Jade Vinson — discussed difficulties associated with assembling the Canis familiaris (dog) and Ciona savignyi (sea squirt) genomes, respectively.
Grabherr noted that the assembly of the dog genome was expected to go more smoothly than it has in practice, because the breed chosen for sequencing — a boxer — was thought to have the lowest heterozygosity of the 120 breeds considered. It turned out, however, that 60 percent of the boxer’s genome is heterozygous, which causes problems because large numbers of polymorphisms look just like repeats to most assembly algorithms. Grabherr said that the Broad team had to develop a new scoring method for building contigs that could distinguish between repetitive sequences and haplotypes.
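The article does not describe the Broad team’s scoring method, but the underlying ambiguity is easy to illustrate: judged by sequence identity alone, reads drawn from two haplotypes that differ by scattered SNPs look just like reads drawn from two copies of a slightly diverged repeat. A minimal sketch in Python, using invented sequences:

```python
# Toy illustration (not the Broad method): compare two haplotype copies
# of a locus against two copies of a genuine repeat. The identity signal
# is the same, which is why assemblers confuse polymorphism with repeats.

def percent_identity(a: str, b: str) -> float:
    """Fraction of matching bases between two equal-length sequences."""
    assert len(a) == len(b)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical reads from the two haplotypes of one heterozygous locus:
hap_a = "ACGTACGTTGCAACGTGGCTAACGT"
hap_b = "ACGTACGATGCAACGTGGCTCACGT"  # two SNPs relative to hap_a

# Hypothetical reads from two copies of a true repeat elsewhere:
rep_1 = "TTGCAGGCATTACCGGATTAGGCAT"
rep_2 = "TTGCAGGCTTTACCGGATTCGGCAT"  # two substitutions between copies

print(percent_identity(hap_a, hap_b))  # 0.92
print(percent_identity(rep_1, rep_2))  # 0.92 -- indistinguishable by identity alone
```

Distinguishing the two cases requires extra evidence beyond pairwise identity, such as read depth or mate-pair consistency, which is the kind of signal a contig-scoring method can exploit.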
Vinson discussed a similar problem with Ciona, and suggested that the “spectacular success” that the genomics community experienced assembling Drosophila, human, and mouse — which all exhibit very low polymorphism — may have been due to the luck of the draw. It’s possible, he said, “that most of the other genomes out there are going to be highly heterozygous.”
In the case of Ciona, he said, an initial assembly using the Arachne algorithm produced a genome twice the expected length, with half the expected coverage. It turned out that the organism’s two haplotypes were so different from each other that the algorithm assembled each haplotype separately, representing most of the genome twice, rather than collapsing the two haplotypes into a single consensus sequence. Indeed, the heterozygosity rate between haplotypes in Ciona runs as high as 4.6 percent — nearly as large as the difference between human and baboon.
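The symptom Vinson described follows directly from the arithmetic of coverage: if an assembler keeps the two haplotypes apart, the assembled length doubles and the apparent coverage halves. A quick sanity check, with numbers invented purely for illustration:

```python
# Hypothetical figures, for illustration only (not from the Ciona project):
reads_total_bp = 1_400_000_000   # assumed total sequenced bases
true_genome_bp = 175_000_000     # assumed haploid genome size

expected_coverage = reads_total_bp / true_genome_bp   # what the project planned for
apparent_len = 2 * true_genome_bp                     # haplotypes assembled separately
apparent_coverage = reads_total_bp / apparent_len     # what the assembly reported

print(expected_coverage)   # 8.0
print(apparent_coverage)   # 4.0 -- coverage appears halved, length doubled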
Vinson said that the Broad researchers devised a method of piecing together the individual haplotypes separately based on a “splitting rule” step that removes overlaps so that Arachne doesn’t interpret them as repeats. So far, he said, PCR validation indicates that this extra step leads to a more accurate assembly.
Others, meanwhile, are paving the way for assembling the sequence fragments generated by next-generation sequencing methods. Susan Reslewic of the University of Wisconsin, Madison, discussed the use of optical mapping for haplotype analysis. Her team is collaborating with Michael Waterman at the University of Southern California to develop new computational methods to align the DNA fragments generated by optical mapping onto a reference genome for further analysis.
Christian Haudenschild of Lynx Therapeutics discussed another approach to genome sequencing. Although the company’s MPSS (Massively Parallel Signature Sequencing) technology has largely been used for gene-expression analysis, Haudenschild said that the company, soon to merge with Solexa, wants to start applying it to genome sequencing.
Lynx is developing a new version of MPSS in which DNA is cloned directly onto glass slides rather than beads, which will increase the throughput of the approach by a factor of 10. Lynx has also begun to develop methods for assembling the very short reads of around 20-25 base pairs that MPSS generates. The problem, Haudenschild said, is that the assembly process remains very computationally intensive.
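The article does not detail Lynx’s assembly methods, but one reason reads this short are hard to assemble is simple to demonstrate: any sequence repeated at or beyond the read length produces identical reads from different loci, which cannot be placed uniquely. A toy sketch with an invented mini-genome containing one internal repeat:

```python
# Illustrative only: count how many distinct length-k "reads" could have
# come from more than one genomic position. Shorter reads hit more repeats,
# so more of them are ambiguous -- one source of the computational burden.
from collections import Counter

def kmer_placements(genome: str, k: int) -> Counter:
    """Map each length-k substring to the number of positions it occurs at."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

# Invented mini-genome; "ACGTACGTAC" occurs twice (an internal repeat):
genome = "TTGGACGTACGTACCCTTAAACGTACGTACGGAT"

for k in (5, 10):
    counts = kmer_placements(genome, k)
    ambiguous = sum(1 for c in counts.values() if c > 1)
    print(k, ambiguous)  # k=5 -> 4 ambiguous k-mers; k=10 -> 1 (the repeat itself)
```

In a real mammalian genome, where repeats vastly outnumber this toy example, resolving such ambiguity across billions of 20-25 bp reads is what drives the computational cost.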
Lynx is also collaborating with IBM Research to develop statistical methods to eliminate noise from the experimental signal. A joint paper on the method has been accepted to PNAS, and the software — called JuMPSStart — will be freely available from IBM upon publication.