Though the field of comparative genomics is still young, it’s maturing quickly — perhaps by necessity, as improvements in high-throughput sequencing generate reams of data with each new genome studied.
“There are a lot of different kinds of ’omics being touted,” says David Haussler at the University of California, Santa Cruz, “but genomics is still king. DNA … is going to continue to be a huge source of information.”
As comparative genomics evolves, the next step after finding sequence regions of interest has become building assays to determine the function of those areas. Haussler’s group, like many others in the field, is already busy taking computational predictions into the wet lab to get answers.
Comparative genomics was built on the search for functional elements, or regions of DNA that have been conserved over time among distantly related species, an approach known as phylogenetic footprinting. However, with the sequencing of the chimpanzee genome in 2005, phylogenetic footprinting has given way to searches for changes among closely related species.
Katie Pollard, a former postdoc in Haussler’s lab who is now at the University of California, Davis, last year discovered a slew of regions in the human genome that appear to have undergone accelerated change since the evolutionary split from chimps. Her work compared sequence that was highly conserved across species up to and including chimp against the same sequence in human, looking for regions that diverged widely between the two. That clever comparison opened the door to tracking these so-called human accelerated regions (most of which are found in non-coding parts of the genome) that might offer insight into what makes us, well, human. Annotating those regions has become a main focus for Pollard and an interest for the rest of the community as new lab techniques are brought to bear to elucidate function.
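The logic of that comparison can be illustrated with a toy filter. What follows is only a minimal sketch, not Pollard’s published method, which relies on statistical models of substitution rates across whole-genome alignments. The function names, the thresholds, and the 20-base sequences are all invented for illustration.

```python
def count_differences(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def is_candidate_accelerated(human, chimp, outgroups,
                             max_outgroup_diffs=1, min_human_diffs=3):
    """Flag a region that is conserved between chimp and every outgroup
    species but carries several changes between chimp and human.
    Both thresholds are arbitrary, hypothetical cutoffs."""
    conserved = all(
        count_differences(chimp, seq) <= max_outgroup_diffs for seq in outgroups
    )
    diverged = count_differences(human, chimp) >= min_human_diffs
    return conserved and diverged


# Made-up aligned sequences, not real genomic data.
chimp = "ACGTACGTACGTACGTACGT"
mouse = "ACGTACGTACGTACGTACGA"    # one difference from chimp
chicken = "ACGTACGAACGTACGTACGT"  # one difference from chimp
human = "ACGTGCGTACCTACGTTCGT"    # three differences from chimp

flagged = is_candidate_accelerated(human, chimp, [mouse, chicken])
print("candidate accelerated region:", flagged)
```

A raw difference count like this ignores alignment gaps, local mutation rates, and the relative lengths of the evolutionary branches, which is why real analyses use formal statistical tests instead.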
“People are starting to combine [phylogenetic footprinting] with powerful methods: ChIP-on-chip for detecting transcription factor binding sites, and other studies that are leading to some excitement,” Haussler says. Determining function for these non-coding regions, however, largely depends on proving that predicted changes in sequence actually cause some functional difference, and this is work that needs to be performed in the lab. “There’s no way that the whole story of evolutionary change could ever be told by bioinformatics,” he says. “It’s going to be a lot of wet lab work.”
OK, But What Does It Do?
“The prime example of how we operate is this HAR1 element,” Haussler says, referring to the first human accelerated region Pollard and his team studied. “HAR1 is a classic example where it led to the discovery of an entirely new gene that now requires functional characterization.”
HAR1, the focus of Pollard’s research that was published in Nature last August, is one region of non-coding DNA that showed particularly rapid change in humans. During the 400 million years since chicken and chimp diverged from their common ancestor, only two substitutions occurred in this region; during the 5 million years or so since humans and chimps diverged, there were 18.
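A back-of-the-envelope calculation shows how stark that contrast is. The sketch below uses only the figures quoted above and a hypothetical helper function; it is not the statistical test from the Nature paper, and it treats each period as a single stretch of time rather than modeling the separate lineages.

```python
def substitutions_per_million_years(substitutions: int, million_years: float) -> float:
    """Average number of substitutions per million years over a period."""
    if million_years <= 0:
        raise ValueError("the period must be longer than zero")
    return substitutions / million_years


# Figures as quoted in the article.
ancestral_rate = substitutions_per_million_years(2, 400)  # chicken versus chimp
human_rate = substitutions_per_million_years(18, 5)       # chimp versus human

fold_change = human_rate / ancestral_rate
print(f"ancestral rate: {ancestral_rate:.3f} substitutions per million years")
print(f"human rate:     {human_rate:.1f} substitutions per million years")
print(f"apparent acceleration: about {fold_change:.0f}-fold")
```

By this crude measure the region changed several hundred times faster on the way to humans than it did over the preceding hundreds of millions of years.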
Pollard then switched to wet lab work to see if she could find out what this particular piece of DNA code was doing. As it turned out, HAR1 is an RNA gene expressed in specific cells of the developing human neocortex from the seventh to the 19th week of gestation. That is a crucial period for cortical development in humans, and the gene may be part of what sets us apart from chimps, whose cortex is much smaller.
“We had an idea that it was functional because it was nearly identical between the chicken and the chimp, but we didn’t know what it did and we didn’t know what the effect of human-specific changes might be,” Pollard says. “So then there was a lot of wet lab work.” Functional experimentation after finding the gene is exciting, Pollard says: “To me that’s when it became a lot more interesting.”
Pollard is continuing her work at UC Davis, where she leads a team trying to find what functional effect the human accelerated regions have on phenotype. In a recent paper published in PLoS Genetics, Pollard reports on an expanded list of 202 HARs. Out of these 202 regions, all but five have turned out to be uncharacterized non-coding sequences. She hypothesizes that most of these are regulatory regions, and is currently working on several candidate regions that contain motifs known to bind transcription factors.
“I think a lot of people believe that coupled experimental and computational work is the direction of the field,” she says. “It’s pretty easy to just do some sort of data analysis and publish the results, but it’s a lot more interesting if it’s accompanied by some experimental work.”
Beyond HAR1
University of Chicago evolutionary biologist Marty Kreitman, whose work focuses on a well-characterized transcription factor binding site in Drosophila, agrees that functionally validating algorithmic data is the next step. “It’s definitely an important next chapter,” he says of combining sequence conservation data and bioinformatic predictions.
Kreitman’s work focuses on evolutionary turnover of transcription factor binding sites. Along with Michael Ludwig at the University of Chicago, he has published numerous papers on the evolution of regulatory DNA in Drosophila. Their focus is on a transcription factor binding site commonly found in eukaryotes and organized into what are called cis-regulatory modules, or CRMs. These CRMs can bind multiple transcription factors in variable sites along this stretch of DNA; critical portions of the binding site can come and go, arising de novo due to mutations, or turning over and disappearing. Kreitman and Ludwig’s work has shown, however, that the binding site sequence can undergo changes without an observed functional change.
“The module’s job is simply to turn on transcription at the right place and the right time in development, and keep it off at all other times and places,” Kreitman says. “There may be more than one evolutionary solution to the structure of the CRM that can do that job. So it kind of wanders through evolutionary sequence space and just creates alternative binding site architecture, or logic, each of which does the same job equally well.”
Not many of these sites have been well characterized, and that’s where comparative genomics is changing how these sites are located and studied. Back in the day, CRMs had to be found empirically. Now, that tedious and time-consuming process has been made simpler by whole genome comparisons and predictive computation.
“Using genome comparisons — what is sometimes called phylogenetic shadowing — one can see these stretches of conserved DNA in non-coding regions, and those conserved regions might be good candidates for harboring some of these CRMs,” Kreitman says.
“You use the intersection of those two features — the evolutionary conservation, to the extent that it exists, and the binding site prediction, to the extent that it works — and the two together definitely increase the ability to bioinformatically predict where these modules might exist,” he adds.
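The intersection Kreitman describes can be pictured as overlapping two sets of genomic intervals. The following is a minimal sketch under stated assumptions: the function name and the coordinates are hypothetical, and each input is assumed to be a sorted list of non-overlapping, half-open (start, end) intervals on the same chromosome.

```python
def intersect_intervals(conserved, predicted):
    """Return the stretches present in both interval lists.

    Each list must be sorted and non-overlapping, with half-open
    (start, end) coordinates. Runs in a single pass over both lists.
    """
    overlaps = []
    i = j = 0
    while i < len(conserved) and j < len(predicted):
        start = max(conserved[i][0], predicted[j][0])
        end = min(conserved[i][1], predicted[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever interval finishes first.
        if conserved[i][1] < predicted[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Invented coordinates, not real annotations.
conserved_blocks = [(100, 200), (500, 650)]
predicted_sites = [(150, 160), (190, 210), (600, 700)]

candidates = intersect_intervals(conserved_blocks, predicted_sites)
print(candidates)
```

Regions that survive both filters are only candidates; as the researchers stress, each one still has to be tested at the bench.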
While bioinformatics can make predictions as to where these CRMs are, “someone has to go and functionally test whether or not they’re real,” Kreitman says.
Still Evolving
Pollard’s success with human-chimp comparisons is indicative, she believes, of the direction comparative genomics will continue to take in the future. Comparing the human genome with those of close relatives, such as the chimp and the upcoming Neanderthal genome, will shed light on our particular human origins and have important implications for disease and medications.
The common thread is translation: translating our increasing and increasingly complex amounts of data into testable hypotheses. “Bioinformatics gives you the hook, it gives you the clue that starts to unravel the story,” Haussler says. “But when you pull on that thread, there’s an enormous ball of yarn out there that needs to be untangled.”