A proposal for a large-scale Human Epigenome Project could spawn a new field of bioinformatics focused on analyzing and visualizing complex chemical modifications of DNA that vary greatly across tissue types, individuals, and time.
In the Dec. 15 issue of Cancer Research, Peter Jones of the University of Southern California, Robert Martienssen of Cold Spring Harbor Laboratory, and colleagues argue that "the time is now ripe" for the scientific community to "join in a coordinated effort" in support of a Human Epigenome Project, which would "identify all the chemical changes and relationships between chromatin constituents that provide function to the genetic code."
The paper summarized the findings of a workshop on the subject that the American Association for Cancer Research sponsored this summer, but it was also a call to arms. High-throughput technologies — particularly microarrays — for analyzing DNA cytosine methylation patterns and histone modifications have progressed to the point where it is now possible to "genomicize" a number of studies that are currently being conducted on a "piecemeal" basis, the authors note.
But not all of the technology supporting such an effort is ready for prime time. Unlike sequence data, which is linear in nature, the Human Epigenome Project "would require the analysis of data that are orthogonal in nature, owing to the fact that there is not just one epigenome: epigenetic states vary between tissues, among individuals, and in healthy versus disease states."
The bioinformatics challenges of the project are "immense," according to John Greally of Albert Einstein College of Medicine, and include "the dull, boring issues of semantic definitions that allow distributed data to maintain coherence." In response to an e-mail query, Greally said that there will "need to be some sort of MAGE-ML/MIAME equivalent for epigenomics, first and foremost."
Analytical tools, he said, will be "a luxury after we've done a lot of gruntwork to get the epigenomics data into shape for sharing and analysis" — a task that "dwarfs any genome project in terms of complexity."
But much as the Human Genome Project spurred development of new bioinformatics tools, Jones, Martienssen, and colleagues believe that "the establishment of a formal HEP would drive the further development of technology."
Martienssen told BioInform that "it's certainly true that there are big [informatics] challenges, but computational power has been growing exponentially, thankfully, and I think that the mathematical treatment of these questions is getting much more sophisticated than it used to be."
Martienssen cited work by Rebecca Doerge, an expert in statistical genomics at Purdue University, as a step in the right direction. "She brings statistical methodology to all of these areas in a way that statisticians don't normally think about, but I think they're getting very interested in them now, so we're hoping to have quite an influx of people from the theoretical sides of mathematics, especially statistics, coming into the field, because of these challenges," he said.
Martienssen added that Paul Flicek, a researcher at the European Bioinformatics Institute, has made a great deal of progress in analyzing tiling array data, which promises to be the technological workhorse for the Human Epigenome Project.
Flicek told BioInform that epigenomics actually presents several layers of challenges for bioinformatics. The first hurdle, he said, is the analysis of the raw data coming off the arrays. Flicek is using an approach based on hidden Markov models that takes advantage of "the context information" that is unique to tiling arrays. "The tiles are next to each other, and they don't respond completely independently, and that makes tile path data subtly different than a gene expression array, for example," he said.
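The paper doesn't spell out Flicek's model beyond the hidden-Markov approach, but the idea of exploiting tile adjacency can be sketched with a minimal two-state HMM — "background" versus "enriched" — decoded by the Viterbi algorithm. All parameters here (the sticky transition probabilities and the Gaussian emission means) are invented for illustration; a real model would be fit to the array data.

```python
import math

# Illustrative two-state HMM for tiling-array enrichment values.
# "Sticky" transitions encode the observation that neighboring tiles
# are correlated and rarely switch state independently.
STATES = ("background", "enriched")
LOG_START = {"background": math.log(0.5), "enriched": math.log(0.5)}
LOG_TRANS = {
    "background": {"background": math.log(0.95), "enriched": math.log(0.05)},
    "enriched":   {"background": math.log(0.05), "enriched": math.log(0.95)},
}
# Hypothetical Gaussian emissions: (mean, standard deviation).
EMISSION = {"background": (0.0, 1.0), "enriched": (2.0, 1.0)}

def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(values):
    """Return the most likely state path for a run of tile enrichment values."""
    v = {s: LOG_START[s] + log_gauss(values[0], *EMISSION[s]) for s in STATES}
    backpointers = []
    for x in values[1:]:
        ptr, nv = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[p] + LOG_TRANS[p][s])
            ptr[s] = prev
            nv[s] = v[prev] + LOG_TRANS[prev][s] + log_gauss(x, *EMISSION[s])
        backpointers.append(ptr)
        v = nv
    # Trace back from the best final state.
    state = max(STATES, key=v.get)
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]

signal = [0.1, -0.3, 0.2, 2.1, 1.8, 2.4, 0.0, -0.1]
print(viterbi(signal))
```

Because switching states carries a transition penalty, a single noisy tile won't flip the call, while a run of consecutive high tiles will — which is exactly what makes tile-path data "subtly different" from independent gene-expression probes.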
In addition, he said, it's likely that the epigenomic data that researchers are dealing with now will not be the same data that biologists will see in a few years. Comparing the field to genome sequencing, he noted that "nobody wants the sequence trace files except people who really, really care about it, and it's probably going to be true that nobody's going to want … the actual enrichment values off an array that tell you where the histone methylation is. But right now, because the raw values off the array are so new, that's what we're playing with, and that's what I've got my hands in."
Eventually, he said, "We'll go from these huge piles of data to much smaller pieces — things that will be easier to store, if actually harder to display."
That's where the next difficulty arises, Flicek said, because current genome browsers like UCSC and Ensembl will likely fall short when it comes to displaying epigenomic data. "If there are 50 cell types and 30 tracks for each cell type, you sort of lose the real advantage of the genome browser, which is that nice straight line that you can look down the sequence coordinate and see all these things line up," he said.
But it's still unclear how a next-generation genome browser might capture this information effectively. "It would be spectacularly cool to have this 3D genome browser that does all these fancy things," Flicek said, "but if it's just a pretty picture and doesn't have the power to help you intuit something new from the data, it's probably not that useful."
Nevertheless, Flicek said that the Ensembl group at EBI is currently using data from the ENCODE project as a test case, and hopes to release methylation data and other information from the ENCODE regions "at least in DAS tracks" by the summer.
In addition, he said that an EU-funded project called HEROIC (High-throughput Epigenetic Regulatory Organization in Chromatin) is expected to start generating whole-genome tiling array data for the mouse in 2006, which will provide another useful test case for the bioinformatics tools under development at EBI.
Flicek noted that the proposal in the Cancer Research paper "is still in the future, but the data that encompasses the bigger world of epigenomics — be it methylation, or histone marks, other things like that — that data is starting to arrive, and it's very close to starting to arrive in whole-genome-sized chunks. And as it does, we want to be ready for it, and we want to be able to try to use it in the most sensible way that we can."
Martienssen said that even if a large-scale Human Epigenome Project doesn't gain the support of the funding agencies, "it is going to happen anyway, but without a really recognized, central effort, it will be very piecemeal, it will be inefficient, and it will be slow."
He added that in addition to HEROIC, there are at least two other EU-funded projects that are studying "various aspects" of epigenomics, but "they're all pretty small-scale projects, and they're never going to get everything that we need, and if things continue at that pace, it would take 50 years to do the whole human genome."
Martienssen estimated that the HEP would require a budget "in the hundreds of millions of dollars" over the next five to 10 years "to get it accomplished in a way that's systematic enough and, very importantly, with strong bioinformatics support and development that allows people to not only make sense of the data, but also to make it accessible in a way that anybody working in the biomedical area can immediately realize the significance of anything they find in that context."
He added that he's had very favorable feedback on the proposal. "Judging by the response, [I] think we will see huge interest in this area in the next few months," he said.
"I think the time has come for NCI to sit down and really formulate a plan by which this could be funded."
— Bernadette Toner ([email protected])
The 'Other' Cancer Genome Project
The Cancer Research paper proposing the Human Epigenome Project was published the same week that NCI and NHGRI kicked off a $100 million pilot project to determine the feasibility of mapping the genomic changes involved in all types of human cancer — timing that Martienssen described as not necessarily intentional, "but certainly good."
Both efforts promise to provide molecular-scale insight into the mechanisms of human cancers, but there are a few differences, Martienssen said.
"There's no question that both types of information are going to be key. I think the important realization is that you do need both," he said. "For example, the tissues and cell lines and so on — all those things are as much of an issue for the cancer genome project as they are for the epigenome project, and coordinating those samples would be wonderful."
However, he noted, "the epigenetics community is more of an array-based, somewhat lower resolution — nucleosome-resolution — sort of technology … whereas the cancer genome project obviously wants to do nucleotide resolution, and actually sequence every base in each of these genomes."
The projects are "complementary and different," he said, "and I think they both have merit and should be taken together."