Bioinformatics people have seen quite varied types of experimental data over the past decade. In earlier days, bioinformatics was almost synonymous with sequence analysis; we spent much of our time assembling, aligning, and interpreting gene and protein sequences. Once biology gained a more comprehensive understanding of the transcriptome, it seemed as if everyone wanted to quantify expression, and many of us moved on to microarray data, processing matrices of numbers linked to probes representing specific genes. Now we are still working on sequences and numbers, but increasingly our quantitative data is linked to genome position. We spend a lot of our day doing math on genome coordinates, which is requiring whole new methods and analysis tools.
We have always loved genome browsers because, by combining different data types along a chromosome, they can help us to visually identify relationships and patterns. Traditionally we used them for comparing gene structures, identifying potential regulatory elements, checking out genome conservation, and other sequence-based analyses. Now we find that our projects are more likely to include examining tracks of quantitative data such as transcription factor binding, histone modifications, and genome-wide transcription. Our experimental colleagues, who did not use to have much to do with genome browsers, are now uploading outputs from their own ChIP-seq and RNA-seq experiments and often have more numbers than they know what to do with. Linking numbers to genes is no longer enough; we are more aware of how much is going on between genes and between exons, and we need to link our data to genome coordinates. Easy-to-use short-read aligners and peak-calling tools are a big help, but the downstream genome math can still have its bottlenecks.
With gene-based measurements, any sort of matrix that we could view in a spreadsheet worked quite well. Now that we have to share our coordinate-based measurements among ourselves and with software like genome browsers, new data formats have come along to make this more straightforward, and we actually have a reason to talk about wigs and beds in the same sentence. We really like the BED (Browser Extensible Data) format as it is tab-delimited text and can be easily manipulated in a spreadsheet or with our favorite Unix tools. It can also link genome regions to values and is less verbose than GFF files. WIG (wiggle) format is even more concise, but not as friendly, and there are a bunch of other formats that genome browsers are expanding to understand. But converting between these formats is not always trivial, and we look forward to an easy all-to-all conversion tool.
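To give a feel for what one of these conversions involves, here is a minimal Python sketch that turns a scored BED line into a bedGraph line. It assumes the BED file has at least five columns and that column 5 holds the value of interest (the BED specification describes that column as a 0 to 1000 score, though in practice it often carries other numbers). The function name bed_line_to_bedgraph is ours, made up for illustration, and the sketch does not check for overlapping intervals, which some bedGraph consumers may reject.

```python
def bed_line_to_bedgraph(line):
    """Convert one tab-delimited BED line (>= 5 columns) to a bedGraph line.

    Both formats use 0-based, half-open coordinates, so start and end
    are copied unchanged; only the column layout differs.
    Hypothetical helper for illustration, not part of any library.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-delimited BED columns")
    chrom, start, end, _name, score = fields[:5]
    # int() and float() double as basic validation of the numeric columns.
    return "\t".join([chrom, str(int(start)), str(int(end)), str(float(score))])


if __name__ == "__main__":
    print(bed_line_to_bedgraph("chr1\t100\t200\tpeak1\t37.5\t+\n"))
```

Going the other way, or to and from WIG, takes more bookkeeping, which is part of why a general converter would be welcome.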
These coordinate-based formats usually work fine for us, except for a couple of reasons: we have trouble remembering when we should start counting at zero and when at one (and can get off-by-one bugs), and some people name the first chromosome as "chr1," though others call it "1" (and your browser may not understand both). In addition, newcomers to this system would never guess that for genes on the negative strand, the field called "start" actually identifies a gene's or exon's end. Finally, documentation is key for all coordinate-based files, as they are virtually worthless if we don't know which assembly they represent. We include this information in the filename and, if possible, in the body of the files, so that years from now we can be sure about which genome version was profiled by our assay without having to do any detective work.
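The coordinate pitfalls above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: BED intervals are 0-based and half-open, GFF-style intervals are 1-based and closed, and all function names here are hypothetical helpers of our own. The chromosome renaming only adds or strips a "chr" prefix and does not handle other naming differences (for example, mitochondrial names such as "chrM" versus "MT").

```python
def bed_to_one_based(start, end):
    """0-based, half-open (BED) -> 1-based, closed (GFF-style)."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based, closed (GFF-style) -> 0-based, half-open (BED)."""
    return start - 1, end


def normalize_chrom(name, use_prefix=True):
    """Return the chromosome name with or without a leading 'chr'.

    Assumption: the only difference between naming schemes is the prefix.
    """
    bare = name[3:] if name.startswith("chr") else name
    return "chr" + bare if use_prefix else bare


def five_prime_end(start, end, strand):
    """1-based position of a feature's 5' end, given BED coordinates.

    On the negative strand the 5' end is the BED 'end' column,
    because 'start' is always the smaller coordinate.
    """
    return end if strand == "-" else start + 1


if __name__ == "__main__":
    print(bed_to_one_based(0, 100))          # first 100 bases of a chromosome
    print(normalize_chrom("1"))
    print(five_prime_end(999, 2000, "-"))
```

Keeping conversions like these in one small, tested module is one way to avoid redoing the off-by-one reasoning in every script.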
After our experiment has produced a coordinate-linked data set, we probably want to compare it to other experiments, either from our lab or from other labs. Converting coordinates across assemblies, or even between species, is usually a snap with a tool like UCSC Bioinformatics' liftOver. Their genome browser, or nibFrag command-line tool, also makes it easy to extract selected regions of genome sequence.
Somewhat trickier is the task of selecting subsets of our genome-wide data. What if we want ChIP enrichment values for a set of our favorite promoters? Or how about comparing ChIP peaks from two different experiments? We had written a bunch of scripts to do this ourselves, but we stopped doing so when BEDTools was created. This helpful suite of applications can get us the intersection, union, or difference from a pair of BED files, including any associated quantitative data. Penn State's Galaxy can also do this sort of interval math, along with a lot of other data manipulation that used to require a lot of programming.
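For readers curious about what this interval math looks like underneath, here is a minimal Python sketch of an intersection between two sets of BED-like records. It is not how BEDTools or Galaxy implement it; it is a simple pairwise comparison within each chromosome, assuming 0-based, half-open coordinates and records given as (chrom, start, end, value) tuples. The function name and the example data are invented for illustration.

```python
from collections import defaultdict


def intersect(a, b):
    """Return the overlapping pieces between two lists of intervals.

    Each record is (chrom, start, end, value) with 0-based, half-open
    coordinates. The output keeps the values from both inputs so that
    quantitative data travel with the overlap.
    Hypothetical helper; quadratic per chromosome, fine for small sets.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, value in b:
        by_chrom[chrom].append((start, end, value))

    overlaps = []
    for chrom, a_start, a_end, a_value in a:
        for b_start, b_end, b_value in by_chrom.get(chrom, []):
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if start < end:  # half-open: touching intervals do not overlap
                overlaps.append((chrom, start, end, a_value, b_value))
    return overlaps


if __name__ == "__main__":
    # Made-up example records, not real data.
    promoters = [("chr1", 100, 200, "geneA"), ("chr2", 0, 50, "geneB")]
    peaks = [("chr1", 150, 250, 8.5), ("chr1", 300, 400, 3.0), ("chr2", 50, 80, 2.0)]
    for row in intersect(promoters, peaks):
        print(row)
```

For genome-wide data sets a sorted sweep or an interval tree would scale much better, which is one reason to reach for a dedicated tool instead.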
Intersecting two sets of genome regions also gives us an easy way to annotate our transcription factor binding sites or other regulatory elements. We can intersect our novel regions with known genome features such as all exons or promoters and very quickly get potential targets of regulation. BEDTools also has a "closestBed" tool that can help us easily make a histogram of distances of our histone marks, for example, to gene starts. How about if we want to get a ChIP profile for our favorite histone marks across thousands of transcribed genes? It still takes some work to turn these selected regions into an enrichment heat map. We are hoping a software developer will design a solution that makes creating this sort of figure as easy as generating an expression heat map. It would also be great to have an easy way to summarize values over these regions, which can be especially tricky for scores that apply to regions of variable width. This would open the door to more informative statistics — instead of comparing peak overlaps, we could more easily compare the enrichment values behind these peaks. Genome statistics are hard enough as is, with people not always agreeing on the best methods, but simply getting the numbers we want to compare should not be the headache that it can be.
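Two of the calculations mentioned above can be sketched in Python: the distance from a mark to the nearest gene start, and a summary of scores over a region when the scored intervals have variable widths. Both functions are hypothetical illustrations, not the closestBed algorithm. They assume a single chromosome, 0-based half-open coordinates, a non-empty sorted list of gene starts, and non-overlapping scored intervals. The summary is a mean weighted by the number of overlapping bases, and bases with no score are ignored, which is only one of several defensible choices.

```python
import bisect


def distance_to_nearest(position, sorted_starts):
    """Unstranded distance from a position to the nearest gene start.

    sorted_starts must be a non-empty, ascending list for one chromosome.
    """
    i = bisect.bisect_left(sorted_starts, position)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = sorted_starts[max(i - 1, 0):i + 1]
    return min(abs(position - s) for s in candidates)


def weighted_mean_score(region_start, region_end, scored_intervals):
    """Width-weighted mean score over the covered part of a region.

    scored_intervals: iterable of (start, end, score), non-overlapping.
    Returns None if no scored base falls inside the region.
    """
    covered = 0
    total = 0.0
    for start, end, score in scored_intervals:
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0:
            covered += overlap
            total += overlap * score
    return total / covered if covered else None


if __name__ == "__main__":
    # Made-up example values, not real data.
    gene_starts = [100, 500, 900]
    print([distance_to_nearest(p, gene_starts) for p in (50, 450, 1000)])
    print(weighted_mean_score(100, 200, [(50, 150, 2.0), (150, 180, 4.0)]))
```

The list of distances can go straight into any histogram function, and the weighted mean gives one enrichment value per region that can then be compared across experiments.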
What can we do if we want to visually compare a bunch of different genome-wide data sets? You can quickly find that your local or remote browser has some technical difficulties. We've tried several different ways of dealing with this, such as converting our measurements into less dense ones — every 100 nucleotides, instead of every 10, for example — dropping low-level values, or reducing file sizes by changing to a more concise, or even binary, browser track format. This gets much trickier if you still need granular data like the methylation fraction of each CpG; summarizing over an interval can obscure important information. In that case, it might be time to install a local database-driven browser (like GBrowse) — loading lots of data takes a while, but you only need to do it once.
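Reducing the density of a track, as described above, can be sketched as a simple binning step in Python. This is a minimal illustration assuming per-position measurements on a single chromosome with 0-based positions; the function name rebin and the choice of the mean as the summary are ours. As noted, averaging like this is a poor fit for data where each position matters, such as per-CpG methylation.

```python
from collections import defaultdict


def rebin(points, bin_size=100):
    """Average (position, value) points into fixed-width bins.

    Returns a sorted list of (bin_start, bin_end, mean_value) using
    0-based, half-open bins. Bins with no data are omitted.
    Hypothetical helper for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for position, value in points:
        bin_start = (position // bin_size) * bin_size
        sums[bin_start] += value
        counts[bin_start] += 1
    return [(s, s + bin_size, sums[s] / counts[s]) for s in sorted(sums)]


if __name__ == "__main__":
    # Made-up example values, not real data.
    dense = [(0, 1.0), (10, 3.0), (90, 2.0), (100, 4.0), (250, 6.0)]
    for row in rebin(dense, bin_size=100):
        print(row)
```

The output rows are already in chrom-less bedGraph order, so prefixing each with a chromosome name gives something a browser can load.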
Genomics is more about numbers than ever before, and bioinformaticians and bench researchers need the tools and skills to juggle genome coordinates with their associated data. Genome-wide experiments are a lot more attractive if the experimenters can analyze the data without needing a team of programmers to turn their ideas into code. In addition to developing complex analysis algorithms, bioinformaticians can still find lots of basic tools to create to help their experimental colleagues.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.