It's hard to imagine a systems biology field expanding more rapidly than comparative genomics. A research area that was barely a glimmer three or four years ago now commands entire tracks, and packed audiences, at many of the top-notch scientific conferences. "There's an explosion" of this field, says Evan Eichler, associate professor at the University of Washington, pointing out that every large-scale sequencing center has started a comparative genomics effort. Armed with a few finished genome sequences as well as dozens of draft genomes, ever-growing legions of researchers are trying to capture the power of these scientific advances by using comparative genomics to study function, variation, and genetic conservation among and between species.
For one thing, it's the next obvious step. "You get two genomes and what do you do with them — you compare them," Eichler says. The use of comparisons has a rich history among biologists, who are comfortable with and adept at better understanding new organisms by comparing them with ones that are better known.
What may have begun as an academic exercise to put two things together and see where they differed has become an essential tool in trying to elucidate gene function — especially in the human genome, where function is unknown for some 50 percent of genes and predicted genes. "Comparative genomics has emerged as the first order analysis to try to infer function," says Ross Hardison, director of the Center for Comparative Genomics and Bioinformatics at Pennsylvania State University.
Dan Rokhsar, who oversees comparative genomics research at the Joint Genome Institute, says cis-regulatory enhancers are a prime example of this kind of work. "In trying to look at the functional parts of the genome that we don't know much about," he says, previously mysterious or unknown mechanisms such as the cis-regulatory enhancers are now seen "in relief based on comparisons with other genomes."
Of course, function is just one spoke of the comparative genomics wheel. Inter- and intra-species variation are also coming into focus thanks to scientists' ability to line up genome sequences and see which elements match or don't match. David Haussler's team captured the attention of researchers with its unexpected and almost unbelievable discovery last year of more than 400 long sequences that are perfectly conserved across human, mouse, and rat. Researchers like Hardison are still scratching their heads about it: "What biological structure requires that level of conservation at every position?" he wonders.
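To make the idea of perfect conservation concrete, here is a minimal sketch of how one might scan a set of already-aligned sequences for stretches where every species carries the same base. It is not the method Haussler's team used; the function name, the toy sequences, and the minimum-length threshold are all illustrative assumptions.

```python
def conserved_runs(aligned, min_len):
    """Return half-open (start, end) intervals where every aligned sequence
    has the same base, with no gaps, for at least min_len columns."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    runs = []
    start = None  # start column of the run currently being extended
    for i in range(length):
        column = {seq[i] for seq in aligned}
        identical = len(column) == 1 and "-" not in column
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_len:
        runs.append((start, length))
    return runs


# Toy alignment: hypothetical sequences, not real genomic data.
human = "ACGTTGCA-TGGACCTA"
mouse = "ACGTTGCAATGGACGTA"
rat = "ACGTTGCA-TGGACCTA"

print(conserved_runs([human, mouse, rat], min_len=5))  # [(0, 8), (9, 14)]
```

The sketch assumes the hard part, the alignment itself, has already been done; it only illustrates what "conservation at every position" means once the sequences are lined up.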
With findings like Haussler's, it's clear that the young fields of comparative genomics and evolutionary genomics are growing up in lockstep. While some scientists focus fairly strictly on one or the other of those avenues, the results of each are pertinent to their peers in both fields. Bruce Roe, director of the genome center at the University of Oklahoma, oversees comparative work involving cow, rat, chimp, baboon, and zebrafish. That work will provide insight into genetic function, he says, but will also give a picture of how evolution occurred as these organisms split off from what would become the human lineage, from as recently as chimp, 5 million years ago, to as long ago as zebrafish, 400 million years ago.
But the potential of comparative and evolutionary genomics is much greater than figuring out our ancestry or even understanding which genes turn on what. In the strictly human-centric view, results from these fields have the potential to inform medicine. For one thing, comparative genomics will likely highlight better model organisms that can be used for things like drug development studies. It might even help us predict virulent outbreaks, points out Ward Wheeler, curator of invertebrate zoology at the American Museum of Natural History. Take cholera, for example: "What were the events that led to pathogenicity in this creature?" he asks. "Are there certain syndromes of the origins of pathogenicity that would help us to design strategies to deal with that, or potentially to predict future areas of danger?"
None of this would be possible without the technologies driving the field today. While sequencing and bioinformatics remain mainstays in this community, research using microarrays is rapidly catching up, and work with RNAi and other new techniques is just starting to take hold. "It's an exciting time," says Eichler, and a good time to check into the field and take a closer look at the enabling technologies.
Even in a field as nouveau chic as comparative genomics, the stodgy sequencing technology that has been around for a dog's age has remarkable merit and is largely considered the foundation of the discipline. What would scientists compare, after all, without genome sequences?
"Sequencing is where it's at, for at least the short term," says Eichler, who has spent much of the past year studying the chimp genome and attempting to understand how it compares to the human sequence. His peers, who are still generating megabases of sequence, aren't arguing. Dan Rokhsar at JGI, who is involved in the National Science Foundation's Tree of Life program, says his team is "trying to get sample genomes from all the different phyla of animals near the bottom of the evolutionary tree." With sequence data from such a range of organisms, he hopes to gain a better understanding of "how genes are born, how they die, how novelties emerge within genomes," he says.
Wheeler at the American Museum of Natural History uses sequencing to "generate data from specific loci" and compare DNA from, for example, 1,000 taxa at a time. This type of work has proven so critical to its researchers that the museum has opened its own comparative genomics institute and is adding to the horsepower in its sequencing labs.
In Oklahoma, Roe's lab also focuses on specific regions, but more broadly: he and his team sequence the regions orthologous to human chromosome 22 in as many organisms as they can get their hands on. "The whole idea is that what we're looking at is regions that are conserved throughout evolution," he says.
In similar work with a very different goal, the folks at Eric Green's NHGRI lab have spent a lot of time using sequences to try to understand exactly how useful comparative genomics will be. A question that they address often, he says, relates to "the pattern of diminishing returns" — that is, at what point does adding another genome sequence to your comparison no longer add significantly to your findings? In other work, his team is, like many of its counterparts, scrutinizing the genome to find clues to function. "We have vanishingly little by way of tools for identifying the functional part of the genome that doesn't code for proteins," Green says. "One of the leading ones that's available for us is comparative sequence analysis."
It may be a leading tool, but it's not the only one, notes Stan Rose, CEO of custom array services provider NimbleGen. The sequence is just a start. "Getting the human genome sequenced was very exciting," he says, "but it was very much a beginning, not an end."
All About Arrays
And just as microarrays proved to be the next step for early analysis of sequence data, so too have they emerged as a way to elicit comparative information from genomes. Scientists compare individuals of the same species — much the way Perlegen uses its genome wafer to find SNPs in humans — or spot down genes from one species and then run the assay to test sequence from a different species and see how they compare.
Lixin Zhang, assistant research scientist at the University of Michigan, has been working on a project that's a perfect example of this. His lab's focus is on the genetic diversity of bacteria, and he had trouble early on finding a good platform to compare many different kinds of bacteria at once. Conventionally, he says, scientists would put a sequenced genome on a chip and then run a test genome across it to find which genes are missing. But because of "the genetic diversity of many bacteria, any single one sequenced genome does not represent the gene repertoire of the whole species," he says. His solution was to create a new technology: "Instead of putting sequenced genes on a chip, we put the total [genes] of hundreds or thousands of isolates of bacteria on an array." Then Zhang and his team can choose any kind of probe and screen thousands of bacteria at a time, gaining far more comparative data than they would have with the traditional single-genome array.
At the University of Illinois, Urbana-Champaign, Gene Robinson employs arrays to study the social habits of bees, which he compares to the better annotated and understood Drosophila. "We rely heavily on cDNA microarrays for generating new candidate genes based upon their expression pattern," he says. He has found, for instance, that a gene known to have a role in foraging in the fruit fly has a similar purpose but functions in a completely different, more complicated way in the honeybee. "The larger take-home message there is the idea that complex behavior can be seen to be built on simpler behavior modules," he notes. In this way, comparative genomics is critical to his work. Robinson says that with the honeybee sequence due to be completed by the end of last year, he expects to have a whole-genome bee array in the works soon to continue to advance this type of research.
Meanwhile, Jonathan Freedman, who heads up the toxicology core at Duke University's Center for Environmental Genomics, says his team uses arrays to provide insight into the role of environmental toxins in yeast, C. elegans, zebrafish, and mouse. "We use information we get in one organism to direct what we do in higher organisms," he explains. In a collaborative comparative genomics project with MIT and the Fred Hutchinson Cancer Research Center, Freedman has worked with array vendor Paradigm (now Icoria) to design arrays for C. elegans and zebrafish, which he uses to study expression levels as the organism is exposed to certain doses of a toxin. Moving forward, he has plans to complement the array studies with RNAi research on the same theme, hoping to use gene knockouts to find which genes are involved in defense and repair mechanisms.
Bring on RNAi
And just as it has crept into Freedman's array lab, RNAi has begun to gain footing in the comparative genomics space. Oklahoma's Roe, for one, is using RNAi probes to examine the function of genes in zebrafish. In addition to the usual antisense experiments, Roe says, "just for the heck of it we've made sense probes" and tested those out in zebrafish. The surprising result: "About 10 percent of the genes that we looked at actually do have an antisense that's measurable," he says. The finding was so unexpected, he adds, that "our collaborators kept saying, 'Oh, you guys screwed up the experiments.'" But to Roe, the fact that genes carry their own antisense messages is understandable, particularly for ones that are involved in embryonic development — the genome, he explains, has to be able to turn off certain one-time-only genes once they've performed their duty.
At Washington University, Senior Scientist Makedonka Mitreva is using the successful RNAi work done in C. elegans to inform research on some 30 parasitic nematodes she and her team are studying. "We always do an RNAi comparison for each of our species," she says.
Clearly, none of the bench technologies would mean a thing without the bioinformatics supporting them. Ross Hardison at Penn State credits computational biologist Webb Miller with being the raison d'être for his comparative genomics center.
Like many computational biologists in this field, Miller has put serious work into improved genome alignment algorithms, without which comparative genomics would be impossible. Cross-species alignments are particularly challenging because of insertions, deletions, gene copy number changes, and other evolutionary mechanisms. "When you come to a genomic region and you find two copies of it in one species and four or one in the other," says Gill Bejerano, a member of David Haussler's Santa Cruz lab who performed the research on ultraconserved regions of the genome, "it's a big challenge to decipher what happened there. Did something get lost, did something get added?"
It could be a case of too many cooks in the kitchen, Hardison says. "We almost have too many alignment methods. … There's a whole cottage industry of trying to come up with the best way of doing it." Until some standards are defined for pattern recognition algorithms, a development that researchers hope will come out of NHGRI's ENCODE project, the sheer number of programs available could actually be a stumbling block.
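The insertions and deletions that make cross-species alignment hard can be seen in miniature with a textbook global alignment (the classic Needleman-Wunsch dynamic programming approach). This is a minimal sketch for illustration only, not any of the tools built by Miller or the Santa Cruz group; the function name and the match, mismatch, and gap scores are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).
    Scoring values are illustrative assumptions, not tuned parameters."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b (deletion)
                score[i][j - 1] + gap,            # gap in a (insertion)
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy example: the second sequence is missing one base.
best, top, bottom = global_align("ACGT", "ACT")
print(best)
print(top)
print(bottom)
```

Even in this toy case the algorithm only reports that a gap exists; it cannot say whether one lineage lost a base or the other gained one, which is the question Bejerano raises. Duplicated regions and copy number changes are harder still, since a simple pairwise method like this assumes each position has a single counterpart.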
Another bioinformatics issue is simply being able to compare research across species, notes Robinson. He and his team have put together a project called BeeSpace, aimed at honing bioinformatics tools for the comparative genomics field. A major push is for new text-mining approaches that would be able to find similar concepts or genes with different names in different organisms or scientific arenas, he says.
The Road Ahead
As comparative genomics researchers push their field forward, new technologies and approaches will continue to pop up. Scientists point to innovations like array CGH, or comparative genomic hybridization, and representational oligonucleotide microarray analysis or ROMA from Mike Wigler's lab at Cold Spring Harbor, used to detect copy number variations, as two of the latest technology advances in the field. Both of these are just starting to get serious attention at scientific meetings and promise to become major players in the field.
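Array-based copy number methods generally work by comparing the hybridization signal from a test sample against a reference sample at each probe. As a rough illustration of that principle, and not a description of how array CGH or ROMA data are actually processed in any particular lab, the sketch below converts paired intensities to log2 ratios and labels each probe. The function name, the intensity values, and the 0.3 threshold are all assumptions chosen for the example.

```python
import math


def call_copy_number(test, reference, threshold=0.3):
    """Label each probe as gain, loss, or neutral from paired intensities.
    The threshold is an arbitrary illustrative cutoff on the log2 ratio."""
    if len(test) != len(reference):
        raise ValueError("test and reference must have the same number of probes")
    calls = []
    for t, r in zip(test, reference):
        if t <= 0 or r <= 0:
            raise ValueError("intensities must be positive")
        ratio = math.log2(t / r)
        if ratio > threshold:
            state = "gain"
        elif ratio < -threshold:
            state = "loss"
        else:
            state = "neutral"
        calls.append((ratio, state))
    return calls


# Hypothetical intensities for three probes (made-up numbers).
test_signal = [1500.0, 1000.0, 500.0]
reference_signal = [1000.0, 1000.0, 1000.0]

for ratio, state in call_copy_number(test_signal, reference_signal):
    print(f"{ratio:+.2f} {state}")
```

In a diploid genome, an extra copy would ideally give a ratio near log2(3/2) and a single-copy loss a ratio near log2(1/2); real data are noisier and typically call for normalization and smoothing across neighboring probes, which this sketch omits.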
Through studies of variation, function, and evolution, scientists like Bruce Roe are working toward what they see as the ultimate goal of comparative genomics. "We need to have this sort of dictionary of what genes are present and what they look like in other animals," he says.
That's asking a lot, but they know that. "These are big questions, but the data have never been so rich," says Evan Eichler.
Scientists agree that while comparative and evolutionary genomics are advancing by leaps and bounds, there are still a number of hurdles facing the field. Here are some of the key challenges they pointed out:
Sequencing still costs too much. Dan Rokhsar at the Joint Genome Institute says, "Anything that makes genome sequencing faster and cheaper is good, because the amount of information you get from comparing three genomes is much greater than the amount of information you get from comparing two genomes."
Increases needed in sequencing throughput. Costs aside (and when can costs be put aside?), sequencing technology needs a major breakthrough in volume to handle the demand for comparative genomics, says Ward Wheeler at the American Museum of Natural History. Modern sequencing technology simply can't handle the load of what he'd like to do: full genome comparisons for, say, 1,000 organisms at a time.
Superior front-end automation. For Wheeler's thousand-taxa studies, he says, basic lab automation presents a major sticking point. "We have thousands of different creatures that have thousands of different biochemistries," he says. Not only does he need better automation for extracting DNA, but he has to have highly automated PCR to make these experiments go as quickly as possible.
Elusive endpoint. Unlike the clear goal in sequencing the human genome, there's no obvious way to define progress in comparative genomics, points out Eric Green at NHGRI. "Have we uncovered 50 percent, 10 percent, five percent, one percent? I'm not even sure we'll know ... when we're there," he says.
Genome alignment. "In general aligning multiple genomes is not a completely solved problem," points out Gill Bejerano at the University of California, Santa Cruz. "It's a big challenge" to match genomes properly and then figure out what happened at sites where they differ, he adds.
Beyond single bases. Comparing genomes base by base provides great information, but is only the tip of the iceberg. Large-scale variation — and how to track it — will be a major factor going forward, says Evan Eichler. "Getting a handle on that is going to require different technology."