By Karen Hopkin
It’s the night before the latest Ensembl annotated genome data release and Ewan Birney is home at his keyboard. “I can’t talk now,” he says calmly. “We’ve got a problem running at speed, so I have to get back online and mess with it.”
Crashes, glitches, bottlenecks, bugs, temporary downtime for “unscheduled maintenance”: all pretty standard setbacks for computer programmers. Even with such technical snags, things are much calmer now than they have been. “Last year was a nightmare,” says Birney, who heads the team at the European Bioinformatics Institute that is responsible for Ensembl. When the human genome sequence rolled in, the sheer volume of data — and the urgency to analyze it for consumption by the scientific community — made for some marathon coding sessions.
With the genome published and the Ensembl release running smoothly, Birney is anxious to clear the next hurdle: designing an algorithm to predict alternative splicing, an important problem made more so by the need to explain how a mere 30,000 genes can generate an organism as complex as a human.
Ensembl currently predicts alternative splicing, says Birney, “but we don’t do it well.” Part of the problem is that users may want different things. Some want to look at a gene and see all the transcripts that are theoretically possible. Others want to see only those transcripts that are known to be produced in the cell. “To what extent do we let the algorithm dream things up, and to what extent will it be constrained by experimental evidence?” Birney asks.
He votes for sticking close to the data, so his first challenge is how to mine the EST database with the greatest efficiency. To determine which parts of a gene are transcribed in cells, one searches the EST database for sequences that correspond to the gene of interest. The ESTs are then laid on top of the gene sequence on a computer screen, and the locations of introns and exons are predicted based on the tags’ positions.
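The overlay step Birney describes — projecting EST matches onto a gene and reading off exons from their positions — can be sketched in a few lines. This is an illustrative simplification, not Ensembl’s actual code: each EST match is reduced to a (start, end) interval on the gene, overlapping intervals are merged into candidate exons, and the uncovered gaps between them fall out as putative introns.

```python
def candidate_exons(est_alignments):
    """Merge overlapping EST-to-gene alignment blocks into candidate exons.

    est_alignments: list of (start, end) intervals on the gene sequence,
    one per aligned EST block. Regions with no EST coverage between the
    merged blocks are the putative introns.
    """
    blocks = sorted(est_alignments)
    exons = []
    for start, end in blocks:
        if exons and start <= exons[-1][1]:  # overlaps the previous block
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons

# Three ESTs covering two exonic regions of a hypothetical gene:
print(candidate_exons([(100, 250), (200, 300), (700, 900)]))
# -> [(100, 300), (700, 900)]
```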
Trouble is, the EST database is riddled with junk — contaminants, experimental errors, even sequences that were entered incorrectly by informaticists. So Birney would like to design an automated program to separate the wheat from the chaff. Ensembl currently uses a crude filter that tosses some of the good sequence out with the junk.
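A crude filter of the kind Ensembl uses might look like the sketch below. The thresholds are illustrative assumptions, not Ensembl’s real cutoffs: drop ESTs that are short, align with many mismatches, or align over only part of their length — all common signatures of contamination and sequencing error. The inherent trade-off is visible in the numbers chosen: tighten them and good sequence goes out with the junk.

```python
def keep_est(length, percent_identity, aligned_fraction,
             min_length=100, min_identity=95.0, min_coverage=0.9):
    """Crude quality filter for an EST and its best genome alignment.

    length: EST length in bases.
    percent_identity: identity of its best alignment (0-100).
    aligned_fraction: fraction of the EST covered by that alignment.
    """
    return (length >= min_length
            and percent_identity >= min_identity
            and aligned_fraction >= min_coverage)

print(keep_est(450, 98.5, 0.97))   # clean EST -> True
print(keep_est(450, 82.0, 0.55))   # likely contaminant -> False
```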
With clean data in hand, the algorithm would then match the ESTs to the genome. Many of the programs used in the past were either fast and sloppy or accurate but “dog slow.” That problem may have been solved by Exonerate, a new algorithm written by Birney’s colleague Guy Slater. The program is so fast and flexible, the Ensembl team has started using it to align mouse shotgun sequence with the assembled human genome.
Once the ESTs are in place, another algorithm is needed to analyze all the possible intron-exon combinations, finally deriving a list of the transcripts that best accommodate the data. With any luck, this problem will be solved by GenomeWise, an algorithm that Birney has written largely in airports and on the train as he commutes between Cambridge and his home in London.
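The combinatorial step — turning EST-supported intron-exon arrangements into a list of transcripts — amounts to enumerating paths through a splice graph. The following is a minimal sketch of that idea, not GenomeWise itself: exons are nodes, EST-supported splice junctions are edges, and every path from a first exon to a last exon is a candidate transcript, so only combinations with experimental support at each junction are produced.

```python
def transcripts(junctions, first_exons, last_exons):
    """Enumerate candidate transcripts as paths through a splice graph.

    junctions: dict mapping an exon to the list of exons it has been
    observed spliced to (edges supported by EST evidence).
    first_exons / last_exons: sets of exons seen starting or ending ESTs.
    """
    paths = []

    def walk(exon, path):
        path = path + [exon]
        if exon in last_exons:
            paths.append(path)
        for nxt in junctions.get(exon, []):
            walk(nxt, path)

    for exon in sorted(first_exons):
        walk(exon, [])
    return paths

# Exon 2 is alternatively skipped: ESTs support the junctions
# 1->2, 2->3, and 1->3, yielding two transcripts.
print(transcripts({1: [2, 3], 2: [3]}, first_exons={1}, last_exons={3}))
# -> [[1, 2, 3], [1, 3]]
```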
So far GenomeWise can sort out splicing solutions for a test case of about 100 kb of DNA covered by 30 to 50 ESTs. “But we haven’t taken it for a test drive on the whole genome,” says Birney. Before doing that, the researchers will run the program on chromosome 1, which is, at 250 megabases, a good proving ground. “If anything is going to go outstandingly wrong,” says Birney, “it’ll go wrong on 1.”
Meantime, other researchers are designing similar programs — even Birney’s former Sanger mentor Richard Durbin. “Keeps us honest,” Birney notes of the friendly rivalry. “At the end of the day, of course we’ll use the program that works best,” he says. “It doesn’t matter who wrote it.”