We are usually so excited about our projects that as soon as we get data from a new experiment, we can't wait to analyze it. We want to find out as soon as possible whether it supports our hypothesis or which genes play a role in our favorite process. It's not until the answer is unexpected or a "maybe" that we generally start to question the quality of our analytical inputs. But as we get a bit more experienced and more systematic, we're trying to make quality control a preliminary step of the analysis rather than just part of the post-mortem examination. Here, we'll discuss some of our experiences with data quality and ask many more questions than we can answer.
When we were doing traditional molecular biology, experimental controls were as important as they are now — we had to make sure that our observations could not be explained by uninteresting technical reasons. Quality control was also important, but it often meant simply that our gel looked nice enough to satisfy us, our most exacting labmate, and our PI. Now when we do genome-wide assays, like with microarray analysis, our real raw data are images — which soon get converted into a lot of numbers — so quality control becomes more challenging. Similarly, using high-throughput sequencing, the primary data is DNA sequence, but our most common experiments quickly turn these sequences into counts, so numbers seem to be everywhere. Looking in particular at projects dominated by arrays or sequencing, how can we know if our primary data is any good? If each sample looks fine, how can we be sure about every measurement in that sample?
We think of quality control as having two requirements: first, we need some methods to assay quality (in whichever dimension we're interested), and second, we need some reasonable thresholds to produce a multiple-choice conclusion of pass, fail, and perhaps "questionable" if we are near the threshold. In practice, the methods are readily automated, but the conclusions, at least in our experience, require human interpretation, partly because so many things can go wrong, and what looks wrong in one study may not be unexpected in another. On the other hand, strange results that conflict with an experimental expectation may be a good reason to look at the raw data in more detail, even if the assay quality seems to be fine.
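As a minimal sketch of the second requirement, the three-way call can be written as a small function. The metric, the sample names, and both cutoffs below are hypothetical values chosen only for illustration; in practice the thresholds depend on the assay and the study, and the "questionable" samples are the ones that still need a human to look at them.

```python
def classify(value, fail_below, pass_at_or_above):
    """Return 'pass', 'fail', or 'questionable' for a higher-is-better metric."""
    if value >= pass_at_or_above:
        return "pass"
    if value < fail_below:
        return "fail"
    # Between the two cutoffs: leave the decision to a person.
    return "questionable"


# Hypothetical per-sample quality scores (made-up numbers, not real data).
samples = {"sample_A": 34.1, "sample_B": 27.5, "sample_C": 18.9}

calls = {
    name: classify(score, fail_below=20.0, pass_at_or_above=30.0)
    for name, score in samples.items()
}

for name, call in calls.items():
    print(name, call)
```

Using two cutoffs rather than one keeps the borderline samples visible instead of silently forcing them into pass or fail.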
The tricky cases are always those in which our sample is not clearly good or bad, and we want to balance the desire for measurements of impeccable quality with the desire for completeness and high statistical power. A related issue, especially for us bioinformatics people, is that of data transformation: can we use mathematical or statistical modifications — including what might be called "normalization" — to turn a questionable data set into a good one, or are we just fooling ourselves? Instead, or in addition, can we react to a questionable data set by removing the problem points? This might seem like a straightforward thing to do, but on the other hand, we'll need an effective way to tag those bad points, and the devil may be in the details. Sometimes we see an investigator simply remove outlier samples because, like the Sesame Street song says, "one of these things is not like the others." We generally cringe at this method, and always wonder if the so-called outliers are showing something different for a potentially informative reason. Given some of these general concepts, how can we apply them to microarray and high-throughput sequencing experiments?
Applied to arrays
Microarrays are great for assays like measuring RNA abundance of many genes in a biological sample as with expression profiling, but quality control can be quite similar for other applications of the technology as well. We try to pay attention to quality on several different dimensions: the whole array, gene-level summaries, probes, and spots.
Probe annotations invariably have a wide variety of problems, largely because of evolving transcript definitions. For Affymetrix arrays, rather than re-annotating the array design ourselves, we prefer using a resource like Michigan's BrainArray custom CDF files, which redefine high-quality probe sets based on current gene annotations. For other types of arrays, it can be challenging to re-map probes to the genome or transcriptome to determine which probes unambiguously represent one specific gene, especially when we don't know the effect of mismatches on hybridization efficiency. Where some genes are represented by multiple probes, gene-level summaries are often performed to simplify further analysis. Whichever method we choose to summarize the RNA level of a gene (there appears to be no obvious best way), starting with valid probe measurements is a big help.
The other dimensions of quality control — entire array and spot-level — are specific to a hybridization, both requiring that signal intensities be optimal across the array. In addition to quality control assays within the manufacturer's scanning and quantification software, R/Bioconductor has several good packages — such as EBI's arrayQualityMetrics — to assess microarray quality. These packages, which often include recommended quality thresholds, generate lots of figures that can help us identify questionable arrays before we do any processing. Even if the array as a whole looks great, some scanning applications can flag spots of questionable quality, sometimes consistent with a funky image from the scanner. As with entire arrays, we may want to drop, or reduce the weight of, these potentially problematic features.
And to sequencing
High-throughput sequencing can also be used to perform analyses of RNA abundance and other genome-scale assays. As with arrays, some issues are technology-specific, but sequencing methods have enough in common that we can still come up with a general checklist. We may want to pay attention to several different dimensions here, too, like the qualities of the sequencing runs as a whole — as well as each individual read and position — and potential inclusion of adaptors, linkers, or other technical sources of contamination.
Unlike arrays, there is no single image that obviously summarizes a sequencing run, but quality control software tools can quickly generate collections of figures and lists — often by sampling from a large Fastq file — that effectively reflect the quality of the data set. We use at least one of these tools before doing any genome mapping. Our favorite is Babraham Bioinformatics' FastQC, a stand-alone application that performs a series of quality analyses, each with typical pass/fail thresholds. It provides summary profiles of quality scores, which can indicate error-prone runs and help us determine mismatch thresholds when it comes time to map our reads. Nucleotide prevalence by read position can turn up any biases that we may want to be aware of. If sequence duplication rates are higher than expected, we may not have as complex a library as we would like. Even if our data set looks fine overall, we may have a subset of reads with generally low-quality scores. We find a list of most-prevalent reads is very helpful, especially if we have processing artifacts. Given sample-wide summaries, even if a data set is of fine quality, we may want to apply filters to remove some sequences (of overall poor quality or matching contaminants) or trim others (to remove primer bits or perhaps low-quality ends).
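To make a few of these summaries concrete, here is a rough sketch of how mean quality by read position, a duplication fraction, and a list of most-prevalent reads could be computed from Fastq records. It is not how FastQC itself works: the function names are our own, the three reads are made up, we assume Phred+33 quality encoding and four lines per record, and the duplication measure is a naive one (the fraction of reads that repeat an earlier sequence).

```python
from collections import Counter
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) from a 4-line-per-record Fastq handle."""
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def summarize(records, phred_offset=33):
    """Return (mean quality per position, duplicate fraction, top reads)."""
    sums, counts = [], []
    reads = Counter()
    for seq, qual in records:
        reads[seq] += 1
        for i, ch in enumerate(qual):
            if i == len(sums):  # first read that reaches this position
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - phred_offset
            counts[i] += 1
    mean_q = [s / c for s, c in zip(sums, counts)]
    total = sum(reads.values())
    dup_fraction = 1 - len(reads) / total if total else 0.0
    return mean_q, dup_fraction, reads.most_common(3)


# Three made-up reads; quality drops toward the 3' end in two of them.
example = "\n".join([
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "II5#",
    "@r3", "TTGA", "+", "I5##",
]) + "\n"

mean_q, dup_fraction, top_reads = summarize(read_fastq(io.StringIO(example)))
print([round(q, 2) for q in mean_q])
print(round(dup_fraction, 3), top_reads)
```

On a real file we would pass an open file handle instead of the `io.StringIO` object, and probably sample reads rather than count every sequence.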
A tool like the FASTX-Toolkit from Cold Spring Harbor Laboratory can do a lot of filtering and trimming, which should help increase the quality of our data. Sometimes bad reads may drop out when we try to map them, but the last thing we'd want is questionable or bad quality data to influence our final results.
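The basic idea of trimming and filtering can be sketched in a few lines. This is only an illustration under our own assumptions, not the FASTX-Toolkit's algorithm: the function names, the quality cutoff of 20, the minimum length of 3, and the reads themselves are all hypothetical, and qualities are assumed to be Phred+33 encoded.

```python
def trim_3prime(seq, qual, min_q=20, phred_offset=33):
    """Cut bases off the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean(records, min_q=20, min_len=3):
    """Trim each read, then keep only reads that are still long enough."""
    kept = []
    for seq, qual in records:
        trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_qual))
    return kept


# Made-up reads: 'I' is quality 40 and '#' is quality 2 in Phred+33.
reads = [
    ("ACGTAC", "IIII##"),  # loses two low-quality bases, then is kept
    ("TTGACA", "I#####"),  # trimmed to one base, so it is dropped
    ("GGCATT", "IIIIII"),  # untouched
]

cleaned = clean(reads)
for seq, qual in cleaned:
    print(seq, qual)
```

Counting how many reads are dropped or shortened at this step is itself a useful quality indicator for the run.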
Regardless of our experimental details, we are finding that thorough assays of experimental quality are an important part of our analysis, best performed sooner rather than later. Besides identifying which parts of our data sets are good or bad, we can also get lots of hints about how best to process our data. We always want to choose statistical methods with adequate power, but if questionable data is an issue, we may want to use methods that are especially robust and tolerant of outliers. Knowing the quality of our experiments will also help us determine how confident we can be about our exciting results.
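As one small illustration of what "robust" can mean, the sketch below compares the mean with the median and uses the median absolute deviation (MAD) to flag, rather than remove, a suspicious measurement. The measurements are made up, and the 0.6745 scaling with a cutoff of 3.5 is one commonly used convention that we assume here, not a recommendation for any particular data set.

```python
import statistics


def flag_outliers(values, cutoff=3.5):
    """Return the values whose MAD-based robust z-score exceeds the cutoff."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to scale by, so nothing can be flagged this way
    return [v for v in values if abs(0.6745 * (v - med) / mad) > cutoff]


# Made-up replicate measurements with one value far from the rest.
values = [10.1, 9.8, 10.3, 9.9, 42.0]

mean = statistics.mean(values)      # pulled strongly toward 42.0
median = statistics.median(values)  # barely affected by it
flagged = flag_outliers(values)

print(round(mean, 2), median, flagged)
```

A flagged point is a prompt to go back to the raw data, since the so-called outlier may be different for an informative reason.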
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.