Gene Regulation Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Q1: What's the best way to measure the expression of non-coding RNAs and microRNAs?
Q2: How do you measure the effect of these RNAs on gene regulation?
Q3: Is it more useful to measure DNA methylation or histone modification in order to determine what effect epigenetic factors have on the regulation of gene expression? Why?
Q4: What method do you use to get around the problems inherent in identifying DNA motifs in large ChIP-seq/ChIP-chip datasets?
Q5: What's your primary method to reduce false positives caused by DNA contamination and fragmentation in mapping protein-DNA interactions, while also avoiding false negatives from too stringent data filtration?
Q6: What are your preferred pipelines or computational tools for ChIP-seq/ChIP-chip data analysis?
List of Resources

Letter from the Editor

Knowing what's in the genome is one thing — learning how it all works together is quite another. As researchers are learning more and more, gene function is regulated in a variety of ways — through DNA methylation and histone modification, by microRNAs and non-coding RNAs. And a precise study of epigenetics is essential if researchers are to get a clear understanding of how and why genes act the way they do.

Various groups have developed a number of tools to aid in this research — tools like RNA-seq, ChIP-seq, or ChIP-chip as well as a number of newer methods and data analysis pipelines that are suited to different purposes. We asked our pool of experts about which approaches they use to measure miRNA expression, how they determine the effect of these RNAs on gene regulation, how they get around the problems inherent in large ChIP-seq or ChIP-chip datasets, and much more. They told us about how they handle challenges, which software they like to use, and about how they apply the available tools and tricks to their own research. If you need still more information, check out the resources list at the end of this guide for studies and Web addresses where you can find and download some of the tools our experts have discussed.

— Christie Rizk

Index of Experts

Many thanks to our experts for taking the time to contribute to this technical guide, which would not be possible without them.

Marc Facciotti
University of California, Davis

Kun Huang
Ohio State University

Jason Lieb
University of North Carolina, Chapel Hill

Xiaole Shirley Liu
Dana-Farber Cancer Institute

Jun Song
University of California, San Francisco

Kevin White
University of Chicago

Q1: What's the best way to measure the expression of non-coding RNAs and microRNAs?

Since I work primarily with microbes, I have answered these questions with microbial studies in mind.

I think that the answer to this question depends tremendously on the specific scenario in which you are trying to identify and/or quantify non-coding RNAs and microRNAs, and on whether you are trying to quantify the abundance of a known RNA species or to discover new ncRNAs or microRNAs in a genome or metagenome. For discovery, particularly if no reference genome is available or you are looking for expression of an RNA in an organism hiding in a community, RNA-seq would seem to be appropriate. Alternatively, if you are studying a specific microbial isolate and have a genome sequence available, a high-density tiled microarray can also be very effective, despite the popular notion that microarrays are passé. qPCR is always an option for targeted inquiry.

— Marc Facciotti

For screening and discovery purposes, RNA-seq or smRNA-seq are ideal. However, investigators should pay special attention to the specific protocols for extracting RNA (e.g., polyA-fishing versus other protocols such as using the NuGen kit), library preparation and sequencing methods. For measuring known microRNAs, quantitative methods such as NanoString are a good choice.

— Kun Huang

To measure RNA level, RNA-seq or smRNA-seq are the way to go. With increased sequencing throughput and multiplexing, they are becoming cheaper than microarrays and give better quality data. To measure transcription rate, new techniques such as GRO-seq (developed by John Lis' group) or NET-seq (developed by Jonathan Weissman's group) have been developed lately. The latter two techniques are a little bit trickier than RNA-seq, but the inventors and their collaborators have been working on simplifying and optimizing the protocol. Many groups are starting to adopt these techniques. They are very informative for studying transient or dynamic transcriptional changes, e.g. upon transcription factor activation.

— Xiaole Shirley Liu

Next-generation sequencing and microarrays provide complementary methods for measuring the expression level of non-coding RNAs and small RNAs. Currently, one can sequence six or more microRNA samples simultaneously via multiplex sequencing. One can also design custom tiling microarrays that cover the genomic locations of non-coding RNAs.

— Jun Song

RNA-seq.

— Kevin White

Q2: How do you measure the effect of these RNAs on gene regulation?

Most of the time, we are interested in seeing whether transcription factor or chromatin regulator binding and histone mark enrichment patterns affect gene regulation, as measured by RNA level or transcription rate. For transcription factors, we often observe that the more numerous and stronger the binding sites, and the closer the binding is to the gene start, the stronger the transcriptional effect. Epigenetically, the expression of a gene seems to be a quantitative balance between the active and repressive marks around it.

siRNAs and miRNAs can regulate genes post-transcriptionally, and a body of scientific work on this has built up over the last 10 years. Recently many non-coding RNAs have been found to regulate gene expression at the transcriptional or epigenetic level, such as piRNA on heterochromatin formation, Xist on X chromosome inactivation, HOTAIR on chromatin enzyme EZH2, and eRNA on nearby gene expression. This area is still very young with many unknowns, and we see new discoveries in this area all the time.

— Xiaole Shirley Liu

Studying the function of non-coding RNAs remains a major challenge. One can try performing RNA-seq before and after knocking down a specific non-coding RNA with siRNA. Comparing the global expression patterns in the two resulting datasets will reveal both direct and indirect effects of that non-coding RNA. Similarly, one can also perform ChIP-seq for epigenetic marks to detect how chromatin is regulated by non-coding RNAs. But, a biological system is replete with feedback loops and inter-connections, so these high-throughput methods will detect many secondary effects.

— Jun Song
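
As a rough illustration of the before-and-after knockdown comparison described above, the sketch below computes per-gene log2 fold changes from two expression tables and lists the genes that change most. The function names, the pseudocount, and the cutoff are assumptions made for illustration. A real analysis would use replicates and a proper statistical test, and, as noted above, the resulting list will mix direct and secondary effects.

```python
import math


def log2_fold_changes(control, knockdown, pseudocount=1.0):
    """Per-gene log2(knockdown / control) for genes present in both tables.

    Both inputs map gene name -> normalized expression value. The
    pseudocount (an illustrative choice) avoids division by zero.
    """
    shared = sorted(set(control) & set(knockdown))
    return {
        gene: math.log2((knockdown[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in shared
    }


def changed_genes(fold_changes, min_abs_lfc=1.0):
    """Genes whose absolute log2 fold change meets the cutoff, largest first."""
    hits = [g for g, v in fold_changes.items() if abs(v) >= min_abs_lfc]
    return sorted(hits, key=lambda g: -abs(fold_changes[g]))


if __name__ == "__main__":
    control = {"geneA": 99, "geneB": 49, "geneC": 9}
    knockdown = {"geneA": 24, "geneB": 49, "geneC": 39}
    lfc = log2_fold_changes(control, knockdown)
    print(changed_genes(lfc))
```

Genes flagged this way are only candidates; separating direct targets from downstream effects needs the follow-up experiments the experts describe.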

Ultimately, the only way to accurately measure the effect of these RNAs on gene regulation is through detailed biochemical and molecular genetics experiments. However, methods such as ectopically expressing a microRNA or generating a mutant for a microRNA gene followed by expression profiling can be a powerful way of generating candidate targets. Such an approach can be further refined by combining with computational predictive methods or with biochemical methods such as RIP-seq for identifying target sites in mRNA transcripts.

— Kevin White

Q3: Is it more useful to measure DNA methylation or histone modification in order to determine what effect epigenetic factors have on the regulation of gene expression? Why?

This type of inquiry seems, naturally, more prevalent in eukaryotic circles. That said, there should be more focus on these mechanisms or analogous ones in bacteria and archaea. My basic sense is that both of these regulatory mechanisms can be critically important in certain circumstances. Ultimately, understanding the roles of both DNA methylation and histone modification is critical. I'll be non-committal and simply say that putting a focus on improving our ability to gather as much quality information regarding all mechanisms of gene regulation will ultimately be key.

— Marc Facciotti

For a small number of samples, particularly for cell line studies, I prefer histone modifications, which can be studied using ChIP-seq for a few key marks (e.g., H3K4me2 and H3K27me3), as their effects on gene expression are more evident. DNA methylation can also provide evidence of epigenetic events, but its specific effects require further in-depth analysis to work out. For larger numbers of samples, however, and particularly for clinical studies with tens or even hundreds of samples, genome-wide DNA methylation studies using MBDCap-seq are ideal given its low cost.

— Kun Huang

This depends on your definition of "epigenetic". In my view, it is more useful to know histone modification status because relationships between DNA methylation and transcription can be complex. In most cases, the relationships between histone modification and transcriptional status are more straightforward.

— Jason Lieb

Both DNA methylation and histone modification are useful measures of epigenetic status, and they each have advantages and disadvantages. DNA methylation is a more stable mark, and the profiles can be generated from smaller amounts of starting material so [they] could be used for tumor/biopsy profiling. There are many different approaches for DNA methylation profiling. Different groups need to pick the protocol best fitted to their needs, depending on the budget, number of samples, level of coverage (whole genome versus CpG-rich regions), and quantitativeness (from sequencing depth). Histone marks are less stable (especially acetylation) and need a large amount of fresh cells (as compared to fixed tissues) for genome-wide profiling, so they are harder to profile in tissues or tumors. The histone mark ChIP-seq protocol is very well established, and gives excellent data at reasonable cost. Compared to DNA methylation, histone marks probably provide more mechanistic insights, since enhancer histone marks (such as H3K4me1 and many acetylation marks) are very sharp and can shed light on transcription factor regulation, and promoter marks (such as H3K4me3) and gene body marks (such as H3K36me3) are easier to assign to genes. At the same time, there are many histone marks with very different effects, and the characteristics and functions of many histone marks are still very poorly understood. Good computational methods and analyses are important to make sense out of both DNA methylation and histone modification data. There are many challenges in analyzing and understanding epigenetic regulation, but they also create very exciting opportunities. That's why epigenetics is so interesting!

— Xiaole Shirley Liu

I believe that we currently do not understand the full consequences of DNA methylation and histone modification. The genomic location and the precise nature of these epigenetic marks both play a role in determining their function, and our current understanding of the relevant biological rules remains rudimentary. Thus, both cell type-specific DNA methylation and histone modification would be useful to measure.

— Jun Song

Certain histone modifications show the highest correlations with gene expression status (e.g., H3K4me3 at transcriptional start sites). In some cases, such as with integrated cancer genome analysis, DNA methylation can help resolve different molecular classes of disease that correspond to transcriptional profile status. However, the most powerful way of identifying regulatory elements is a combination of methods such as DNase hypersensitivity sequencing, monitoring histone modification status, and mapping specific regulatory factors such as p300 or site-specific transcription factors. The ENCODE and modENCODE projects, and a growing number of individual labs, have been quite successful at mapping regulatory elements using such integrative approaches.

— Kevin White

Q4: What method do you use to get around the problems inherent in identifying DNA motifs in large ChIP-seq/ChIP-chip datasets?

There are numerous issues that lead to difficulty finding DNA sequence motifs in large-scale ChIP-seq/ChIP-chip data. First, it is possible that the protein of interest may not bind to the DNA at all, but rather binds indirectly through protein-protein associations with other DNA-binding proteins. In such cases it may not be reasonable to expect to find a motif. Second, the ChIP data may be filled with false positive peaks that arise through experimental or peak detection artifacts. This calls for careful analysis of the data and perhaps refinement of the experimental design. Finally, motifs may be present, but may be highly degenerate, making detection by standard methods difficult. The degeneracy may itself arise from several factors, which may or may not have real biological function. So, it is ideal if sub-families of a parent motif can be detected. To overcome the latter two cases (high false positive rates and degenerate motifs) we've developed a new software tool that extends the utility of existing motif finders by (a) increasing their sensitivity and (b) allowing multiple motifs to be reported, including, of course, degenerate versions of the "same" motif. (A manuscript describing this tool is currently under review so I hesitate to say much more.)

— Marc Facciotti

This is a difficult problem. We typically use multiple approaches, including matching known motifs to the regions, identifying enriched known motifs, and de novo motif discovery with tools such as ChIP-Motif.

— Kun Huang

We typically look in the 100 to 200 base pair window surrounding the summit of called peaks. An important step in reliably identifying motifs is using a proper background sequence for tests of enrichment. We usually use the flanking DNA sequence immediately surrounding the 100 to 200 base pair region tested; this ensures that if your TF binds only at promoters, you use promoter sequence as background rather than random genomic sequence and reduce false positives. We use a variety of software, including CisFinder, HOMER, MEME, and BioProspector.

— Jason Lieb
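
The windowing scheme described above can be sketched as follows: take a fixed window around each peak summit as the test region, and use the DNA immediately flanking that window as the local background. This is only an illustration under stated assumptions; the function names, the 0-based half-open coordinates, and the default widths of 100 base pairs are choices made here, not part of any of the tools named above.

```python
def summit_windows(chrom_length, summit, half_width=100, flank=100):
    """Return (test_window, flanks) around a peak summit.

    Coordinates are 0-based, half-open (start, end). The test window spans
    half_width bases on either side of the summit; the flanks are the
    regions of length `flank` immediately outside it. Windows are clipped
    to the chromosome, and empty flanks are dropped.
    """
    start = max(0, summit - half_width)
    end = min(chrom_length, summit + half_width)
    left = (max(0, start - flank), start)
    right = (end, min(chrom_length, end + flank))
    flanks = [iv for iv in (left, right) if iv[1] > iv[0]]
    return (start, end), flanks


def extract(sequence, interval):
    """Slice a chromosome sequence string by a (start, end) interval."""
    start, end = interval
    return sequence[start:end]


if __name__ == "__main__":
    test, background = summit_windows(chrom_length=1000, summit=500)
    print(test, background)
```

The test and flanking sequences extracted this way can then be handed to a motif finder as foreground and background sets, so that promoter-bound factors are compared against promoter sequence rather than random genomic sequence.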

Actually, compared to motif finding from promoters of co-regulated genes, motif finding from ChIP-chip/seq data is easier because the data quality is better. The only challenge is the large data volume. Several tools are very good for this analysis, including the SeqPos function in Cistrome, the motif discovery function in CisGenome, and MEME-ChIP.

— Xiaole Shirley Liu

I try to pool together as much information as possible to reduce false positives. For example, evolutionary conservation, nucleosome positioning, open chromatin, and significant recurrence of motif combinations all provide useful information which, together, can often point to functional DNA motifs.

— Jun Song

Finding motifs present in large ChIP-seq datasets is actually not that problematic. We have used a wide variety of methods including de novo motif identification. However, due to the ever growing availability of position weight matrices, PWMs, for many different classes of transcription factors, perhaps the fastest and simplest method is to score the enrichment of all available PWMs in a given dataset. Most of the time we find the top expected motif by this method, and, oftentimes, we identify motifs that correspond to other factors that bind coordinately with the TF being assayed. Such an approach also has the benefit of systematically mapping where each motif resides within each peak of binding throughout the genome.

— Kevin White
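
To make the PWM-scoring idea above concrete, here is a minimal sketch that converts a count matrix into log-odds scores, finds the best-scoring site in each sequence on either strand, and compares the fraction of peak sequences with a hit to the fraction of background sequences with a hit. The function names, the uniform background frequency, the pseudocount, and the score threshold are all assumptions for illustration; published tools use more careful background models and statistics.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores.

    `counts` is a list with one dict per motif position, mapping base to
    observed count. A uniform background is assumed for simplicity.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def best_score(sequence, matrix):
    """Best motif score over both strands, or None if no window is scorable."""
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    width = len(matrix)
    best = None
    for strand in (forward, reverse):
        for i in range(len(strand) - width + 1):
            window = strand[i:i + width]
            if any(ch not in BASES for ch in window):
                continue  # skip windows containing N or other symbols
            score = sum(matrix[j][ch] for j, ch in enumerate(window))
            if best is None or score > best:
                best = score
    return best


def hit_fraction(sequences, matrix, threshold):
    """Fraction of sequences whose best site scores at or above threshold."""
    if not sequences:
        return 0.0
    hits = 0
    for seq in sequences:
        score = best_score(seq, matrix)
        if score is not None and score >= threshold:
            hits += 1
    return hits / len(sequences)


def enrichment(peak_seqs, background_seqs, matrix, threshold):
    """Return (fraction of peaks with a hit, fraction of background with a hit)."""
    return (hit_fraction(peak_seqs, matrix, threshold),
            hit_fraction(background_seqs, matrix, threshold))
```

Running the same comparison for every available matrix and ranking by the difference between the two fractions gives a simple version of the screen described above. Recording the position of each best-scoring window would also give the location of the motif within each peak.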

Q5: What's your primary method to reduce false positives caused by DNA contamination and fragmentation in mapping protein-DNA interactions, while also avoiding false negatives from too stringent data filtration?

First, from a microbial standpoint, we have chosen to conduct ChIP experiments on natively expressed transcription factors. This, we think, minimizes the likelihood of false positives associated with plasmid-borne, overexpressed transcription factors. Second, we don't put complete faith in any automated peak detection algorithm. Our strategy has been to build our own automated peak detection tool that minimizes false negatives at the expense of some false positive peaks being reported. We then manually curate peak lists to eliminate the clear false positives by hand. This may sound time consuming, but for small microbial genomes it really isn't, even for numerous datasets. The time spent up front seems to be well worth it compared with the headaches and wasted time associated with chasing down and verifying false leads later.

— Marc Facciotti

From my experience, the key is in the sample/library preparation step, and we work very closely with the biologists to understand the QC process. One big issue, based on our observations in many experiments, is that the cross-linking and sonication steps are critical, but don't always receive the attention they deserve. Unlike targeted approaches such as ChIP-PCR or ChIP-chip, where contaminating or "bad" segments are not amplified or measured, ChIP-seq will be heavily affected by these steps. Thus the segment size needs to be carefully controlled and monitored using a Bioanalyzer before sequencing.

For potential contamination in a library that has already been sequenced, the source of contamination (e.g., a virus) often has a small genome, and its sequences tend to be highly amplified. It is thus important to remove excessively repeated sequences from the raw data. In addition, these sequences sometimes cannot be mapped to the reference genome and thus significantly reduce the mapping rate. In some cases, by running BLAST on the highly repeated (but unmapped) sequences, we can detect the source of contamination.

— Kun Huang
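
A minimal sketch of the duplicate screen described above: count exact-duplicate read sequences, list the most over-represented ones as candidates for a BLAST search, and cap the number of copies kept. The function names and thresholds are illustrative assumptions. Duplicates are defined here by identical sequence because the text concerns unmapped reads; for mapped reads, duplicates are usually defined by mapping position instead.

```python
from collections import Counter


def overrepresented(reads, min_count=3):
    """Return (sequence, count) pairs seen at least min_count times, most common first."""
    counts = Counter(reads)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


def cap_duplicates(reads, max_copies=1):
    """Keep at most max_copies of each identical read, preserving input order."""
    seen = Counter()
    kept = []
    for seq in reads:
        if seen[seq] < max_copies:
            kept.append(seq)
        seen[seq] += 1
    return kept


if __name__ == "__main__":
    unmapped = ["AAA", "AAA", "AAA", "CCC", "GGG", "CCC"]
    print(overrepresented(unmapped))   # candidates to check with BLAST
    print(cap_duplicates(unmapped))
```

The sequences returned by `overrepresented` are the ones worth submitting to BLAST to look for a contaminating source.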

Immunoprecipitations should be performed only with validated high-fidelity antibodies in the presence of blocking reagents such as BSA. A new technique recently published by Frank Pugh's lab, ChIP-exo, will certainly be helpful in increasing signal to noise.

— Jason Lieb

For studying protein-DNA interactions, ChIP-seq is the best method. DNA contamination (from other species or cells) is not as big a problem, but antibody quality greatly influences the final data quality and noise level. Fragmentation conditions, such as sonication, also play a role. Good quality control on the antibody, a consistent protocol, and biological replicates are the best ways to ensure data quality, and the ENCODE and Epigenome projects have been pioneering in this effort to provide excellent-quality data to the community. Analysis method is also important in estimating data noise and the false discovery rate of the ChIP-seq peak calls. Peak callers such as MACS provide fold-change, p-value, and FDR for each called ChIP-seq peak, and methods such as IDR help estimate a good cutoff for peak calling based on replicate agreement.

— Xiaole Shirley Liu
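
To show how the three per-peak statistics mentioned above relate to one another, the sketch below scores candidate regions with a simplified Poisson model: fold-change is the observed count over the expected background count, the p-value is the Poisson upper tail, and the FDR-style q-value comes from a Benjamini-Hochberg adjustment. This is not the implementation used by MACS or any other tool; the function names are hypothetical, and the expected counts are assumed to have been estimated from a control sample beforehand.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail.

    Adequate for a sketch; loses precision for extremely small p-values.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def score_peaks(chip_counts, expected_counts):
    """Fold-change, p-value, and q-value per region (expected counts must be > 0)."""
    pvals = [poisson_sf(c, lam) for c, lam in zip(chip_counts, expected_counts)]
    qvals = benjamini_hochberg(pvals)
    return [
        {"fold": c / lam, "p": p, "q": q}
        for c, lam, p, q in zip(chip_counts, expected_counts, pvals, qvals)
    ]
```

A region with 10 reads against an expected 2 would score a fold-change of 5 with a very small p-value, while a region with 2 reads against an expected 2 would not stand out.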

We use the data obtained from sequencing control DNA in order to model cell type-specific background noise and biases. We also perform multi-sample normalization by separating genomic regions that contain biological signal from background regions that mostly contain noise. We use various regression models and ideas from stochastic processes to filter out biases and artifacts.

— Jun Song

We either use an IgG mock or input DNA as a control. There are a variety of peak calling algorithms that can be run, under a variety of parameter settings that are often data set specific, to assign binding sites and associated levels of statistical significance. ENCODE and modENCODE use a method called IDR (irreproducible discovery rate) to try to assess the quality of each dataset.

— Kevin White

Q6: What are your preferred pipelines or computational tools for ChIP-seq/ChIP-chip data analysis?

I can't say that we have a favorite. We've built our own in-house peak-detection software that deals with some of the oddities of the genomes we work with. We analyze the resulting peak lists with custom R and Python scripts and any other existing tool that makes sense for a particular question.

— Marc Facciotti

We found that commercial software such as Partek can be quite useful when dealing with routine analysis for transcription factors. However, we typically use two or three other peak calling software packages (e.g., MACS, HOMER, SISSR) to check for consistency of peak detection. For long regions of enrichment for prevalent proteins such as RNAPII and histone marks, we use an algorithm developed in-house.

We apply multiple approaches for motif finding, including identifying enriched known motifs and using de novo motif discovery with tools such as ChIP-Motif. The Partek motif analysis tool is also quite useful.

— Kun Huang
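
One simple way to check the consistency of peak detection across callers, as described above, is to ask what fraction of one caller's peaks overlap any peak from another. The sketch below assumes each peak list has already been parsed into (chromosome, start, end) tuples with half-open coordinates; the function names are hypothetical, and the all-against-all comparison within a chromosome is chosen for clarity, not speed.

```python
from collections import defaultdict


def supported_peaks(peaks_a, peaks_b):
    """Peaks from caller A that overlap at least one peak from caller B."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    supported = []
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < e and s < end for s, e in by_chrom.get(chrom, ())):
            supported.append((chrom, start, end))
    return supported


def fraction_supported(peaks_a, peaks_b):
    """Fraction of caller A's peaks that caller B also reports."""
    if not peaks_a:
        return 0.0
    return len(supported_peaks(peaks_a, peaks_b)) / len(peaks_a)


if __name__ == "__main__":
    caller_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    caller_b = [("chr1", 150, 250), ("chr2", 300, 400)]
    print(fraction_supported(caller_a, caller_b))
```

A low fraction in either direction suggests that the callers disagree and that parameter settings or data quality deserve a closer look.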

We typically use ZINBA for ChIP-seq data and MA2C for any legacy ChIP-chip data.

— Jason Lieb

There are several pipelines developed by the community. We like Cistrome and CisGenome the best.

— Xiaole Shirley Liu

I believe that every dataset is unique and requires innovative methods to discover biologically important phenomena. Pipelines are good for the initial processing of data, but it is important to keep in mind that major discoveries will depend on what kinds of questions one asks and how one answers those questions; I find that pipelines are usually inadequate to replace an inquisitive mind.

— Jun Song

This continues to be a fast-moving area. What we are using today may not be what we are using tomorrow as improvements continue to be made on the methods for peak calling. However, at this moment, with Mike Snyder's group at Stanford, we are re-analyzing all the human, fly, worm, and mouse ChIP-seq data from the modENCODE projects to provide a standardized dataset for the community using BWA and SAMtools for aligning and quality control, MACS2 for peak calling, and IDR for reproducibility analysis.

— Kevin White

List of Resources

If our experts haven't answered all your questions, these additional sources may help.

Publications
Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, Ruan Y, Bickel PJ, Myers RM, Wold BJ, White KP, Lieb JD, Liu XS. (2012). Systematic evaluation of factors influencing ChIP-seq fidelity. Nature Methods. Epub: doi 10.1038/nmeth.1985.

Churchman LS, Weissman JS. (2011). Nascent transcript sequencing visualizes transcription at nucleotide resolution. Nature. 469, 368–373.

Core LJ, Waterfall JJ, Lis JT. (2008). Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 322(5909):1845-8.

Diaz A, Park K, Lim DA, Song JS. (2012). Normalization, bias correction, and peak calling for ChIP-seq. Statistical Applications in Genetics and Molecular Biology. 11(3):Article 9.

Ferguson JP, Cho JH, Zhao H. (2012). A new approach for the joint analysis of multiple ChIP-seq libraries with application to histone modification. Statistical Applications in Genetics and Molecular Biology. 11(3):Article 1.

Han Z, Tian L, Pécot T, Huang T, Machiraju R, Huang K. (2012). A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data. BMC Bioinformatics. 13 Suppl 2:S2.

Rashid NU, Giresi PG, Ibrahim JG, Sun W, Lieb JD. (2011). ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biology. 12(7):R67.

Rhee HS, Pugh BF. (2011). Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 147(6):1408-19.

Song JS, Johnson WE, Zhu X, Zhang X, Li W, Manrai AK, Liu JS, Chen R, Liu XS. (2007). Model-based analysis of two-color arrays (MA2C). Genome Biology. 8(8):R178.

Spiro S. (2012). Genome-wide mapping of the binding sites of proteins that interact with DNA. Methods in Molecular Biology. 881:137-56.

Tran V, Gan Q, Chen X. (2012). Chromatin immunoprecipitation (ChIP) using Drosophila tissue. Journal of Visualized Experiments. (61). pii: 3745.

Wang C, Tian R, Zhao Q, Xu H, Meyer CA, Li C, Zhang Y, Liu XS. (2012). Computational inference of mRNA stability from histone modification and transcriptome profiles. Nucleic Acids Research. Epub: doi 10.1093/nar/gks304.

Xu J, Zhang Y. (2012). A generalized linear model for peak calling in ChIP-seq data. Journal of Computational Biology. Epub: doi 10.1089/cmb.2012.0023.

Zhu JY, Sun Y, Wang ZY. (2012). Genome-wide identification of transcription factor-binding sites in plants using chromatin immunoprecipitation followed by microarray (ChIP-chip) or sequencing (ChIP-seq). Methods in Molecular Biology. 876:173-88.

Zuo T, Liu TM, Lan X, Weng YI, Shen R, Gu F, Huang YW, Liyanarachchi S, Deatherage DE, Hsu PY, Taslim C, Ramaswamy B, Shapiro CL, Lin HJ, Cheng AS, Jin VX, Huang TH. (2011). Epigenetic silencing mediated through activated PI3K/AKT signaling in breast cancer. Cancer Research. 71(5):1752-62.

Web Tools

BioProspector
http://robotics.stanford.edu/~xsliu/BioProspector/

BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BWA
http://bio-bwa.sourceforge.net/

CisFinder
http://lgsun.grc.nia.nih.gov/CisFinder/download.html

CisGenome
http://www.biostat.jhsph.edu/~hji/cisgenome/

Cistrome
http://cistrome.org/Cistrome/Cistrome_Project.html

HOMER
http://biowhat.ucsd.edu/homer/motif/index.html

MACS
http://liulab.dfci.harvard.edu/MACS/

MA2C
http://liulab.dfci.harvard.edu/MA2C/MA2C.htm

MEME
http://meme.sdsc.edu/meme/intro.html

SAM
http://samtools.sourceforge.net/

SISSR
http://sissrs.rajajothi.com/

ZINBA
http://code.google.com/p/zinba