Mark Gerstein, the Albert L. Williams Professor of Biomedical Informatics at Yale University, is a member of the international Encyclopedia of DNA Elements, or ENCODE, project consortium, which earlier this month published nearly 30 papers providing insight into the human genome.
Gerstein is a co-author on the project’s primary paper, published in Nature, as well as a handful of other papers the consortium published in Genome Research. The research was based on the five-year pilot phase of the study, which set out to catalog all the functional elements in 1 percent of the human genome.
This week, BioInform spoke to Gerstein in his office about ENCODE and some of the other projects underway within his group. What follows is an edited version of the interview.
What is the history of [your role in] the ENCODE project?
They have five main branches to the project and they give them abbreviations — GT [genes and transcripts], TR [transcriptional regulatory elements], MSA [multi-species sequence analysis], VAR [variation], and ECR [chromatin and replication]. I’ve spent a lot of time analyzing pseudogenes as part of the GT group. And you know, what happened is we kind of came together and put this massive manuscript together that recently came out in Nature.
We also put together many other companion manuscripts that actually filled an entire issue of Genome Research. And Yale, particularly my lab, participated in quite a number of those companion papers; I think it was like five or six of them. It’s fair to say we were a major participant in ENCODE and were quite happy with it.
Now, going forward, we’ve applied for the second round of the ENCODE funding. We will see what happens and if we are included in that. …
[The National Human Genome Research Institute] has also announced a new program called modENCODE, or Model Organism ENCODE, which is a kind of ENCODE-type approach organized around model organisms [that] we are participating in too. So we are quite keen on ENCODE, and … I wouldn’t say it’s shown us the exact function of every single nucleotide, or even of that 1 percent of the genome [that is being studied in the project], but it’s really had a lot to say about the degree to which the genome is transcribed, the degree to which a given nucleotide position participates in regulation or is bound by various factors, and the degree to which it is conserved in other organisms. I think ENCODE has [provided] a lot of information on that.
What has been your most startling discovery [in ENCODE] to date?
I would say the most interesting thing we’ve done in ENCODE has been the analysis of pseudogenes. My lab has had a long, historical interest in pseudogenes — pseudogenes being protein fossils. Prior to ENCODE, we’d actually done a draft annotation of all the pseudogenes in the human genome, and found that there were at least as many as there were genes. We had been interested in classifying them into different groups and understanding how they were involved in aspects of genome function, and so on and so forth. One of the interesting things we did with ENCODE was to redo this annotation very carefully on the 1 percent of the human genome, in collaboration with the other parts of ENCODE that were interested in pseudogene annotation — those other parts being the HAVANA [Human and Vertebrate Analysis and Annotation] group at the [Wellcome Trust] Sanger [Institute], and the groups at [the University of California, Santa Cruz] and [the Genome Institute of Singapore].
From that we found two or three interesting things about pseudogenes. One thing we found was that there were about 200 pseudogenes, and in about one fifth of those, 38 of 201 to be exact, we found very strong evidence that they were transcribed. That’s actually very surprising, because people normally felt that if something was a pseudogene, it was a dead gene, a fossil, just literally a pattern in the genome that resembled a gene but didn’t have any function. Here we found that a good fifth of the pseudogenes were transcribed, had some form of activity or life in them, and that was actually very interesting.
So why is it that what we thought were dead are alive, or appear to have some aspects that are alive? That was, I think, a truly interesting finding…
[Another] finding was that … for each of the pseudogenes, if we find them in other organisms — so now imagine we have our 200 pseudogenes, and let’s just say for simplicity’s sake that we can find roughly 90-odd percent of the 200 in the chimpanzee. So for each of those, we can take the human one and the chimpanzee one and we can align them. And then we can look at the conservation not of the whole pseudogene, but of the individual nucleotides in the pseudogene. We can see how the sequence of the pseudogene is varying. How do the pseudogene’s sequences vary in comparison to the gene sequences? One of the things we were able to show is that most of the pseudogenes’ sequences were varying in what is called a neutral fashion. They appear to be varying in a very non-constrained way, the way that general bases of the DNA tend to vary, as opposed to the constrained way that genes vary. Genes tend to be under negative selection — they tend to be very conserved.
However, we were also able to show that a fraction of the pseudogenes are under selection and appear to be more constrained than people would expect. You might have thought that the pseudogenes that were under selection might be the same as the pseudogenes that are transcribed, but it turns out that is not the case. Some of them overlap, but it’s not like the active pseudogenes tend to be the conserved ones.
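The comparison described above can be caricatured in a few lines. This is only a toy sketch: it counts the fraction of aligned positions that differ between two sequences, which is the raw quantity you would compare against a neutral rate; the actual analysis uses proper evolutionary models, and the sequences below are made up.

```python
def divergence(seq_a, seq_b):
    """Fraction of aligned, ungapped positions that differ between two
    equal-length aligned sequences. A pseudogene diverging at roughly
    the neutral rate looks unconstrained; one diverging much more
    slowly looks like it is under selection. Toy illustration only."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical human/chimp alignment of a short stretch: 1 of 8 sites differs.
print(divergence("ACGTACGT", "ACGAACGT"))  # → 0.125
```

Comparing this per-pseudogene number against the gene’s own divergence, and against the genome-wide neutral background, is the spirit of the constrained-versus-neutral distinction made in the interview.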
Our lab [also] started to develop approaches to intergenic annotation, thinking about how we might annotate the intergenic space; so, obviously, annotating pseudogenes is a major activity there. But another activity was trying to group these binding sites, these regulatory regions, into bigger structures. People had previously found that here’s a site for a transcription factor, and here’s another site for a transcription factor, and in the ENCODE region many of these binding sites were experimentally mapped out. We found ways of clumping sites together into what we call forests, very dense regions of factors, and deserts, regions without many factors, to create a higher kind of grouping or cluster of binding sites, a higher level of annotation.
In addition, the ENCODE project also worked on trying to find which regions of the genome were transcribed, whether or not they occurred in genes. [That was] one of the interesting findings people had made previous to the ENCODE project, and we participated in this kind of thing. Now that people had made that finding, the next step was to find all those transcriptionally active regions and then try to make an annotation, to start to link them together and cluster them together into loci, into bigger objects.
We developed some procedures for clustering them together into loci. In particular, we found overall in the ENCODE region about 7,000 novel transcribed regions, regions that appear to be transcribed but weren’t in genes, and we could take about 1,300 of those regions and link them together into about 200 novel transcribed clusters. So we created annotations for about 200. And again, that 200 is a fairly large number considering that there are only about 400-some-odd genes completely contained in the ENCODE region.
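The kind of clustering described here can be sketched very simply: sort the transcribed regions by coordinate and merge any that fall within some gap of one another. This is a minimal stand-in, assuming a single distance threshold; the actual ENCODE procedure’s criteria are not described in the interview, and the coordinates below are invented.

```python
def cluster_regions(regions, max_gap=1000):
    """Group genomic intervals (start, end) into clusters, or 'loci':
    a region joins the current cluster if it starts within max_gap
    bases of the cluster's end. Toy sketch, not the real pipeline."""
    clusters = []
    for start, end in sorted(regions):
        if clusters and start - clusters[-1][1] <= max_gap:
            # Close enough: extend the current cluster to cover this region.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # Too far away: start a new cluster.
            clusters.append([start, end])
    return [tuple(c) for c in clusters]

# Three nearby transcribed regions collapse into one locus; a distant one stands alone.
print(cluster_regions([(100, 300), (900, 1200), (1500, 1600), (50000, 50400)]))
# → [(100, 1600), (50000, 50400)]
```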
How are bioinformatics tools aiding in your research, in particular ARC [Active Region Comparer] in the DART [Database of Active Regions and Tools] database [which was described in one of the Genome Research ENCODE papers]?
We build a lot of tools. We think tools are very important in bioinformatics, and this tool was built to try to help people analyze the intergenic region annotations. … The Active Region Comparer allows you to compare groups of active regions to see if they overlap or don’t overlap. It might seem trivial to do that, but it’s actually hard to grapple with all the genomes and the coordinates and the regions of those things. So this tool lets you take two groups of ... regions and see if they overlap, the degree to which they overlap, the bases they share, and so on and so forth.
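The core bookkeeping behind this kind of comparison, counting the bases two groups of regions share, can be sketched in a few lines. This is an illustration of the idea only, not ARC’s actual implementation, and it assumes the intervals within each group do not overlap one another (otherwise shared bases would be counted twice).

```python
def shared_bases(set_a, set_b):
    """Count the bases shared by two groups of half-open intervals
    [start, end) on the same sequence. Naive all-pairs comparison;
    a toy version of the overlap tallying a tool like ARC reports."""
    total = 0
    for a_start, a_end in set_a:
        for b_start, b_end in set_b:
            # Overlap of two intervals: right edge of the intersection
            # minus its left edge, when positive.
            overlap = min(a_end, b_end) - max(a_start, b_start)
            if overlap > 0:
                total += overlap
    return total

# (50, 250) shares 50 bases with (0, 100) and 50 with (200, 300).
print(shared_bases([(0, 100), (200, 300)], [(50, 250)]))  # → 100
```

A real tool would sort both sets and sweep through them linearly rather than compare all pairs, but the quantity computed is the same.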
One of the things that drove the need for DART is … that DART is this classification of transcribed regions. There’s also this pure database aspect [where] people were doing these experiments and finding these regions of the genome that were transcribed. There is purely this issue of how do we store this on the computer, how do we represent it on the computer and give someone all these transcribed regions? And you know, at a totally simple-minded level, you could just give them coordinate positions: this is from a nucleotide here to a nucleotide there. But I don’t think that’s really sufficient. People want a more flexible system for grouping them into sets, for visualizing them, for intersecting them, and so on and so forth.
What are some of the other key computational tools you’ve developed at Yale?
Well, we’ve developed a number of tools that we’re very proud of. My lab focuses on bioinformatics; that’s the overall focus of the lab. We have three to four subsidiary focuses. One is intergenic analysis of the human genome, and that’s what we’ve been talking about in relation to ENCODE. Another is, after you’ve determined the genes in the human genome, how do they work together as a system? So that is sort of systems biology or network biology.
The third is, you can take each of the nodes of the network from each of the genes and drill down and try to understand it as a molecule: What does it look like as a chemical entity? And here we get a little more into what we call structural genomics or computational biophysics.
For tiling arrays, we have the tiling web site and what we call Tilescope and something called ExpressYourself for processing tiling array data, for … designing tiling arrays and so forth. We have developed DART for taking the results from all the tiling arrays. And then in the network biology realm we’ve developed some systems for analyzing networks of genes and proteins: looking at their topology, finding hubs and bottlenecks, looking at the paths of the network, looking at motifs …
There are two interlinked systems for that. The original system is called TopNet and this was developed a couple years ago by a grad student. And the second version we called tYNA. That is supposed to be a cute acronym for TopNet-like Yale Network Analyzer. It also looks like tRNA.
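One of the simplest topology measures such tools compute is node degree: a node connected to many others is a hub. A toy sketch, with a made-up edge list (this is not TopNet or tYNA code):

```python
from collections import defaultdict

def degree_hubs(edges, top=2):
    """Rank the nodes of an undirected network by degree (number of
    interaction partners); the highest-degree nodes are the 'hubs'.
    Toy illustration of one measure a network analyzer reports."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree, key=degree.get, reverse=True)[:top]

# Hypothetical protein-interaction edges: A partners with B, C, and D.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
print(degree_hubs(edges, top=1))  # → ['A']
```

Bottlenecks, by contrast, are usually defined by betweenness (how many shortest paths run through a node) rather than degree, which takes a bit more machinery to compute.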
And that is a system you can get to on the web, and it is also associated with another module called PubNet. We also developed this thing called the SIN, which stands for Structural Interaction Network. We published it in Science last year; we took the proteome network in yeast and tried to instantiate it in terms of protein structures. We found that some hubs tended to be composed of simultaneously possible interactions and others of mutually exclusive ones.
Within the realm of structures, we developed a number of tools there. In particular we developed a tool called MolMovDB for analyzing macromolecular motions. Given one structure, it predicts how it might move … or it can predict the hinges in a particular structure, or given two structures it will generate a kind of interpolation or animation or morph between the two structures.