Participants in the second phase of the MicroArray Quality Control Consortium are bumping up against some thorny statistical issues in an effort to identify the best methods for developing reproducible predictive biomarker signatures for use in clinical applications.
During MAQC’s seventh face-to-face project meeting, held last week on the campus of the SAS Institute in Cary, NC, consortium participants presented some initial results of the second phase of the project, called MAQC-II. While MAQC-I evaluated the reproducibility of microarray experiments across different labs and platforms, MAQC-II is focusing on the prediction of biological outcomes based on microarray data.
Four participating groups are independently analyzing several clinical and toxicogenomics data sets with the goal of identifying “best practices” for developing classifiers that are reliable enough for use in a clinical setting.
Much of the effort comes down to statistics, and one of the four working groups is devoted to that aspect of the project. The Regulatory Biostatistics Working Group’s goal is to “develop a standard operating procedure document about how to build and validate predictive models,” said Greg Campbell, a biostatistician at the US Food and Drug Administration’s Center for Devices and Radiological Health, and a coordinator of the RBWG.
Campbell stressed that while he and several other FDA staffers are involved in the working group, the documents it produces are not official FDA guidance documents or recommendations.
The SOP document, a 14-page overview of recommended analytical procedures, is meant to guide the MAQC analysis groups as they develop their own statistical analysis plans, or SAPs, which are detailed, step-by-step descriptions of each method used to develop a predictive model. The SAPs and classifiers are “frozen” and submitted to the RBWG, which then evaluates the models based on accuracy, sensitivity and specificity, and reproducibility, or “robustness.”
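The evaluation criteria the RBWG applies to a frozen classifier can be computed directly from a confusion matrix. The sketch below is illustrative only, not drawn from the MAQC SOP document; the labels and predictions are hypothetical.

```python
# Illustrative sketch: computing the metrics the RBWG uses to evaluate
# a frozen classifier -- accuracy, sensitivity, and specificity.

def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Hypothetical validation-set labels and frozen-model predictions.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]
print(evaluate(truth, preds))
```

Robustness, the fourth criterion, would require repeating this evaluation across resampled or perturbed data sets rather than a single split.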
Campbell stressed that the goal of the initiative is to come up with an analysis plan and stick with it — much as a developer of a diagnostic genomic signature would have to do when submitting a classifier to the FDA for approval.
“The whole point is to select a classifier and validate it,” he said, noting that the RBWG’s role is to encourage the microarray analysis community to “move away from pure, exploratory playing with the data.”
But according to the analysis groups that have been handling the MAQC-II data, it’s not quite that simple.
Russell Wolfinger, director of scientific discovery and genomics at SAS Institute, noted that there are numerous algorithms available for each of the many steps required to build a classifier: from initial preprocessing to data transformation, summarization, normalization, predictor reduction and standardization, predictive modeling, and cross-validation. Each one is a potential source of variability in the final classifier.
Wolfinger noted that if you were to multiply all the possible combinations of analytical processes together, there would be “millions of possibilities” to choose from — a fact that makes it difficult for many groups to decide on a single SAP and stick with it.
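The combinatorial explosion Wolfinger describes is easy to see by multiplying option counts across the pipeline steps. The counts below are hypothetical, chosen only to show the arithmetic; with realistic menus of methods at each step, the product quickly reaches the millions he cites.

```python
# Illustrative only: hypothetical numbers of available methods at each
# step of a classifier-building pipeline. Multiplying them gives the
# number of distinct end-to-end analysis plans.
from math import prod

steps = {
    "preprocessing": 5,
    "transformation": 4,
    "summarization": 6,
    "normalization": 8,
    "feature reduction": 10,
    "predictive model": 12,
    "cross-validation scheme": 5,
}

total = prod(steps.values())
print(f"{total:,} distinct analysis pipelines")
```

Even these modest per-step counts yield 576,000 pipelines, which is why committing to a single SAP up front is so difficult in practice.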
Kenneth Hess, an associate professor of biostatistics at MD Anderson Cancer Center, noted that there are other unanswered questions, including the minimum number of training samples required to build a reproducible classifier, how many genes a signature should include, and which genes to use.
Hess noted that microarray analysis is not yet a mature field, and that “we’ve only gone from the embryonic stage of genomics to its infancy.” Most microarray experiments comprise thousands of features but only a handful of samples, which is “not enough to build a robust classifier,” he said.
Hess was critical of the bioinformatics community’s common practice of developing a new method and confirming it on a single data set. “That’s a crime,” he said. “It drags the field down.”
In addition, he noted, the field spends too much time developing new methods “when so many fundamental questions remain unanswered.” As an example, he noted, “We still don’t have a way to determine k in k-fold cross-validation.”
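For readers unfamiliar with the procedure Hess refers to, k-fold cross-validation partitions the samples into k groups, holding each out in turn as a test set. A minimal sketch, with arbitrary sample and fold counts, is below; the open question he raises is how to choose k in a principled way.

```python
# Minimal sketch of k-fold cross-validation index generation.
# Sample count and k are arbitrary illustrative choices.
from random import Random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(20, 5)
for i, held_out in enumerate(folds):
    train = [j for f in folds if f is not held_out for j in f]
    # Here one would fit the frozen modeling pipeline on `train`
    # and score it on `held_out`, then average across the k folds.
    print(f"fold {i}: {len(train)} training samples, {len(held_out)} held out")
```

Common choices are k = 5 or k = 10, or k = n (leave-one-out), but as Hess notes, no accepted rule dictates the choice.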
Wendell Jones, director of statistics and bioinformatics at Expression Analysis and a coordinator of MAQC-II’s Clinical Working Group, suggested that there might be a little wiggle room in coming up with the best analytical pipeline. Using the analogy of baseball players, who all have a slightly different stance at the plate, he noted that “if one were optimal, everyone would use it.”
Jones added, however, that while there may be some flexibility in terms of developing robust classifiers, “we do need to know what not to do.”
Two MAQC working groups are collaborating with the RBWG on building classifiers: the Clinical Working Group, which is analyzing patient data from several large-scale clinical studies, and the Toxicogenomics Working Group, which is doing the same for a toxicogenomics experiment.
The Clinical Working Group is a bit behind schedule because of the legalities associated with confidentiality and transfer agreements for the clinical data sets, so the Toxicogenomics Working Group is serving as a “strawman” for the MAQC’s evaluation process, according to Weida Tong, a researcher at the FDA’s National Center for Toxicological Research and coordinator of the toxicogenomics data-analysis effort.
Eight analysis groups have had a crack at the so-called Hamner data set, a mouse lung carcinogenicity study performed over three years at the Hamner Institutes for Health Sciences. The experiment involved 13 chemicals and three controls, and the training data set includes 18 arrays from 2005 and 52 arrays from 2006. Another set of 70 arrays from this year is being withheld as the “confirmatory” set upon which the analysis groups will test their classifiers.
The goal of the study is to determine whether these gene-expression experiments, which expose the mice to chemicals for 13 weeks, can provide the same results as the current standard in carcinogenicity testing: the two-year rat bioassay, which currently costs between $2 million and $4 million per chemical, according to Rusty Thomas, a researcher at the Hamner.
The analysis groups discussed their initial results during the meeting, and the presentations underscored the wide range of methodologies available in developing predictive models. Some groups developed multiple classifiers and then chose the one that gave the best result in internal validation, some focused on the batch effect between the 2005 and 2006 data sets, some used a rigorous statistical approach to select a gene signature, while others used a somewhat arbitrary cutoff of a certain number of genes.
SAS’s Wolfinger noted that his group “spent days and days of CPU time cranking through hundreds of modeling combinations” only to find that “most of the methods did horribly” in internal validation.
Interestingly, he noted, none of the available normalization methods proved effective. “Even no normalization worked as well as the others,” he said.
The real evaluation of the methods won’t take place until the 2007 data set is released, however. NCTR’s Tong said that the MAQC has to resolve some procedural issues related to how the analysis groups interact with the RBWG before it releases the confirmatory data set, though it hopes to do so later this summer.