NEW YORK (GenomeWeb) — Sometimes qPCR data is simply missing. It is standard practice in data processing to replace "non-detects," or reactions failing to produce a minimum amount of signal, with a cycle threshold value of 40. A study published last week in Bioinformatics, however, showed that this practice likely introduces bias and spurious interpretations, and proposed a new method.
Biostatisticians at the University of Rochester Medical Center measured the extent of the bias using three data sets containing non-detects. They then tried to get at the reasons behind the non-detects, and developed an alternative method that treats non-detects as missing data resulting from amplification failure. The algorithm they created to handle non-detects in qPCR data is now available as open source software on the website Bioconductor.
Matthew McCall, a postdoc in the Department of Biostatistics and Computational Biology and author on the recent study, noticed the problem in previous work analyzing a separate dataset. He saw a pattern in which samples that had non-detects also seemed to be consistent outliers, he told PCR Insider in an interview.
McCall said he then discovered that most software replaces each non-detect value with a Ct value of 40. "That gets incorporated into analysis of delta Ct and delta delta Ct values," he said. "Our consistent pattern of outliers was explained by the fact that non-detects went into those values. So we went back and thought about what [the software was] doing."
McCall dug deep into white papers and technical manuals to find the methods used by other software packages.
For example, according to the study, Applied Biosystems' DataAssist sets non-detects equal to the number of PCR cycles performed, typically 40; users can optionally set a lower maximum allowable Ct value, to which any greater value is clamped, or exclude these values from subsequent calculations.
Integromics' RealTime StatMiner distinguishes between two types of non-detects — undetermined values that do not exceed the Ct threshold, and absent values, for which no reaction occurred. This software handles non-detects by setting undetermined values to a maximum Ct, such as 40, and absent values to the median of the detected replicates.
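The fixed-value substitution rules described above can be sketched as a small function. This is an illustration of the general behavior the article attributes to such software, not the actual code of DataAssist or StatMiner; the function name and the use of `None` to mark a non-detect are this sketch's own conventions.

```python
def replace_nondetects(ct_values, max_cycles=40.0, cap=None):
    """Illustrative sketch of fixed-value non-detect handling.

    `None` marks a non-detect. Non-detects are set to the number of
    cycles performed (typically 40); if `cap` is given, it acts as a
    lower maximum allowable Ct, and any greater value is clamped to it.
    """
    fill = cap if cap is not None else max_cycles
    out = []
    for ct in ct_values:
        if ct is None:                       # non-detect: substitute the fixed value
            out.append(fill)
        elif cap is not None and ct > cap:   # detected but above the cap: clamp
            out.append(cap)
        else:
            out.append(ct)
    return out
```

It is exactly this substitution of a fixed number for a missing value that the study argues biases downstream delta Ct calculations.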
These methods assume either that there is no amplifiable target in the sample, or that it could have been amplified with a few more cycles. "The impression that a lot of people have is that if a value is missing, then the true [Ct] value is either, in essence, infinity, because the transcript isn't there, or it's something larger than 40," McCall said.
However, McCall said, "It doesn't make a lot of sense for most downstream analyses to take a missing value and replace it with a fixed integer."
To improve on the existing method, the group first had to figure out the nature of the "missingness" in qPCR data. This, they reasoned, would point to mathematical ways to accommodate it.
The authors computed the proportion of non-detects and average Ct value across their three qPCR datasets. This work suggested genes with lower average expression are more likely to be non-detects, which led them to conclude non-detects do not occur at random. "A lot of missing values show up when their replicates have expression values in the mid 30s," McCall said. "There seems to be a gap between where the observed Ct value distribution trails off and 40, so it looks to us like as you get into the mid 30s in Ct values you start to have a substantial probability of being a non-detect."
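The diagnostic the authors describe, comparing each gene's average detected Ct with its fraction of non-detects, can be sketched as follows. This is a minimal illustration of that summary, not the paper's code; the input format (a dict of replicate Ct values with `None` marking non-detects) is this sketch's own assumption.

```python
def nondetect_rate_by_expression(genes):
    """For each gene, return (average Ct of detected replicates,
    fraction of replicates that are non-detects).

    `genes` maps gene name -> list of replicate Ct values, with None
    marking a non-detect. A trend of higher non-detect fractions among
    genes with higher average Ct (lower expression) is the pattern the
    authors cite as evidence that non-detects do not occur at random.
    """
    out = {}
    for gene, cts in genes.items():
        detected = [c for c in cts if c is not None]
        frac_missing = (len(cts) - len(detected)) / len(cts)
        avg_ct = sum(detected) / len(detected) if detected else None
        out[gene] = (avg_ct, frac_missing)
    return out
```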
Such non-random missingness in statistics is most classically explained in terms of responses to a survey. For example, missing data in a political science survey on the relationship between income and gender could result from subjects declining to answer for reasons somehow related to their income level or gender. In such a case, the probability of a value being missing depends on the unobserved value itself.
According to McCall, "Of the types of missing data you can have, it's the most difficult because it's not missing at random."
In the datasets analyzed in the paper, the group is fairly confident that there should have been signal for some of the non-detects. They used sets with many replicates, and "with very few exceptions, at least one or two of the replicate values is an observed value," McCall said.
Although it is generally considered impossible to prove from the data alone whether values are missing not at random rather than missing at random, McCall believes the current study "strongly suggests" non-detects in qPCR are not random.
Given this, McCall decided a less biased method would be to model missing data using an iterative procedure. This method uses what is called an expectation maximization algorithm. "In the simplest sense, you estimate what the missing data values are based on your current model, and then you use those to update your model, and repeat this process," McCall said.
The software McCall designed, which was part of the most recent Bioconductor release on April 14, amounts to a few lines of code in the programming language R.
It interfaces with another package in the Bioconductor project, called HT-qPCR, for normalization, data visualization, and processing.
"Our package inserts one function into their workflow that does handling of non-detects," McCall said. "It does require a little bit of data formatting early on to use the same types of classes that are used in [HT-qPCR], but once you've done that it's relatively easy, it's one line of code to do our handling of the non-detects, another line to normalize the data, and another to make a plot," he said.
"My hope is that by making it a seamless interface with other Biocondutor packages, that it will be fairly easy for people to try it out if it is something that is applicable to their data."
The group is also contemplating making the software more user friendly. "We are thinking about creating a web application that does this, so that people who don't have a background in biostatistics or bioinformatics, or working in R in general, could upload their data to a website, click through a couple of questions, and be able to use our software without ever having to type something in an R prompt."
McCall said he hopes the new algorithm will help researchers using qPCR get the most out of their samples. "A lot of times people know these non-detects are getting replaced by 40s, they're aware of the problems that might introduce, so they in essence throw away entire genes across their data set because they know that there's a certain amount of non-detects ... They end up restricting their analysis to things that are relatively highly expressed," he said.
"There's the potential to have a much richer data set if you're able to retain the information from those genes and incorporate that in your analysis."