When Webb Miller set his sights on comparative genomics more than 10 years ago, he hoped to be able to compare regions of the human and mouse genomes by the time he retired.
Now, still five years shy of his target date, Miller is a coauthor of a paper recently published in Nature that reports one of the largest comparative genomics studies of targeted regions in 13 vertebrate genomes, including human, chimp, cat, dog, mouse, and zebrafish.
A mathematician and computer scientist at Pennsylvania State University, Miller runs a lab that produces algorithms to facilitate the alignment of multiple genomes. In fact, it was his interest that prompted the NHGRI-led project behind the Nature paper: Miller remembers heading down to the genome institute some three or four years ago, at the behest of Francis Collins, to brainstorm the future of the genome project.
“My most enthusiastic suggestion was that they talk somebody into doing this kind of a project — targeted sequencing of a region of high interest in a bunch of vertebrates to serve as a platform for all sorts of studies,” he says. Among the possibilities he envisioned were developing better bioinformatics tools for sequence analysis, improving sequencing strategies, and understanding the results of different levels of finishing.
Afterward, Eric Green’s NIH Intramural Sequencing Center kicked off the sequencing part of the project, and Miller’s crew, among others, launched efforts to build new software to handle the spate of data. In close connection with the NHGRI project, Miller started a public website known as MultiPipMaker, which offered a new and improved system for comparing genomes.
One of the main problems to date has been that alignment results vary with the choice of reference genome, Miller says. “The old ways of doing these alignments required that you first specify the reference genome, and then create your alignment. What you got was dependent on … the reference genome that you picked.” With Miller’s new program, results aren’t biased toward one species or another. “It presents you with consistent views of the matches,” whether you start with mouse, fugu, or anything else, he says.
But there are still plenty of bioinformatics challenges in comparative genomics, Miller points out. “So far we’ve only dealt successfully with automatic multiple sequence analysis under the assumption that the matching regions occur in the same order and orientation in all of the species.” It’s an obvious flaw: large-scale rearrangements, inversions, deletions — to name just a few of the results of the genome’s evolutionary hopscotch — can completely stymie automatic analysis algorithms.
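To see why the same-order, same-orientation assumption matters, consider a minimal Python sketch. It is an illustration only, not Miller's software or anything taken from MultiPipMaker: the `Match` record, the `collinear_chain` function, and the coordinates are all hypothetical. Given local matches between two genomes, the sketch keeps the longest set of matches that appear in the same order and on the same strand in both, which is roughly what a method built on that assumption can use.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Match(NamedTuple):
    """A hypothetical local match between genome A and genome B."""
    pos_a: int   # start position in genome A
    pos_b: int   # start position in genome B
    strand: str  # "+" for same orientation, "-" for reversed


def collinear_chain(matches: List[Match]) -> List[Match]:
    """Return a longest set of same-strand matches ordered identically in A and B."""
    # Reversed-strand matches violate the orientation assumption outright.
    forward = sorted(
        (m for m in matches if m.strand == "+"),
        key=lambda m: (m.pos_a, -m.pos_b),
    )
    # Longest strictly increasing subsequence of pos_b (patience sorting).
    tail_vals: List[int] = []  # smallest final pos_b for each chain length
    tail_idx: List[int] = []   # index in `forward` of that final match
    prev = [-1] * len(forward)  # back-pointers for rebuilding the chain
    for i, m in enumerate(forward):
        k = bisect_left(tail_vals, m.pos_b)
        if k == len(tail_vals):
            tail_vals.append(m.pos_b)
            tail_idx.append(i)
        else:
            tail_vals[k] = m.pos_b
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    chain: List[Match] = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        chain.append(forward[i])
        i = prev[i]
    return chain[::-1]


# Made-up coordinates: two matches sit in an inverted block, one has moved.
matches = [
    Match(100, 120, "+"),
    Match(200, 230, "+"),
    Match(300, 520, "-"),  # inside an inversion
    Match(400, 410, "-"),  # inside an inversion
    Match(500, 90, "+"),   # relocated upstream in genome B
    Match(600, 640, "+"),
]
kept = collinear_chain(matches)
dropped = [m for m in matches if m not in kept]
```

In this toy case only three of the six matches survive. The inverted and relocated matches are silently discarded, which is the kind of loss that rearrangements cause for order-dependent alignment methods.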
And even that’s just the start of the problem, Miller says. Once the data can be compared, how will it be presented? A browser is one option, he says, but “frequently you need other ways of data mining, like a query language that would let you investigate genome-wide questions.” With so much work cut out for him, one can only hope Miller isn’t completely set on retiring in 2008.