Array batch sizes, data-analysis methods, and the experience of the researchers conducting those analyses can all introduce bias into experiments that use microarray technology to predict disease, drug response, and drug toxicity, according to a number of recent studies.
In a dozen papers published last week in Nature Biotechnology and The Pharmacogenomics Journal, the members of the Microarray Quality Control consortium investigated sources of bias in array-based studies. The reports also provided recommendations for how best to analyze microarray data when predicting clinical outcomes.
"Gene-expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established," the consortium wrote in a summary report in Nature Biotechnology.
According to the paper, before researchers can settle questions about the technical aspects of gene-expression measurements, they must first develop accurate and reproducible multivariate gene-expression-based prediction models, also referred to as classifiers.
"For any given microarray data set, many computational approaches can be followed to develop predictive models and to estimate the future performance of these models," the MAQC authors wrote. "Understanding the strengths and limitations of these various approaches is critical to the formulation."
The MAQC, hosted by the US Food and Drug Administration, was created in February 2005. The first phase of the project, led by Leming Shi, a researcher at the FDA's National Center for Toxicological Research, evaluated the reproducibility of microarray experiments across different labs and platforms using two RNA reference samples.
Results from those studies were published in a special issue of Nature Biotechnology in September 2006 (BAN 9/12/2006). MAQC-I also helped produce a companion guidance document about submitting genomic data to the FDA.
The second phase of the project, MAQC-II, began later that year and included representatives from 60 groups (BAN 12/19/2006). In the project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying samples against each of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma, or neuroblastoma in humans, according to the main paper in Nature Biotechnology.
In total, the researchers built more than 30,000 models using many combinations of analytical methods. The MAQC-II teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training.
From these studies, the researchers found that model performance depended largely on the endpoint and team proficiency, and that different approaches generated models of similar performance.
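The discipline described above, fitting a model only on training data and judging it on samples withheld from training, can be sketched in miniature. The following toy nearest-centroid classifier runs on synthetic two-class "expression" data; the class names, effect sizes, and sample counts are all invented for illustration and do not reflect any MAQC-II team's actual pipeline or model family.

```python
import random

def train_nearest_centroid(samples, labels):
    """Fit a toy nearest-centroid classifier: one mean expression
    profile per class. A stand-in for the far richer model families
    the MAQC-II teams actually used."""
    centroids = {}
    for cls in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, sample):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(centroids[c], sample))

# Synthetic "expression" data: 5 genes, two classes separated by an
# invented effect size of 2.0 units per gene.
random.seed(0)
def make_sample(cls):
    return ([random.gauss(2.0 if cls == "toxic" else 0.0, 1.0)
             for _ in range(5)], cls)

train = [make_sample(c) for c in ["toxic", "control"] * 20]
# External validation samples are never touched during training,
# mimicking the blinded test data used in the project.
external = [make_sample(c) for c in ["toxic", "control"] * 10]

model = train_nearest_centroid([s for s, _ in train], [l for _, l in train])
accuracy = sum(predict(model, s) == l for s, l in external) / len(external)
print(round(accuracy, 2))
```

The key point is structural rather than statistical: the model sees the external samples only at scoring time, so the reported accuracy estimates future performance rather than memorization.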
According to the main MAQC paper, the "conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene-expression analysis." However, the consortium cautioned that the resulting guidelines are "general and should not be construed as specific recommendations by the FDA for regulatory submissions."
One finding showed that batch size, or the number of samples processed together, can affect the results of a study. Half of the papers published in The Pharmacogenomics Journal, for instance, reported that differences in microarray batch size or batch composition (the ratio of samples from cases to controls) introduced discrepancies in results.
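The batch-composition effect can be illustrated with a small simulation. In the sketch below, a single gene carries an invented true case-control difference of 1.0 units and each batch adds a constant technical shift; all effect sizes and sample counts are made up for illustration. When cases cluster in one batch and controls in the other, the batch shift is absorbed into the apparent disease effect, whereas a balanced design largely cancels it.

```python
import random

random.seed(1)

def expression(is_case, batch_shift):
    # Measured value = true disease effect (+1.0 for cases, an
    # invented number) + additive batch shift + measurement noise.
    return (1.0 if is_case else 0.0) + batch_shift + random.gauss(0, 0.3)

# Confounded design: batch A (shift +0.8) holds 18 of 20 cases,
# batch B (shift 0.0) holds 18 of 20 controls.
cases = ([expression(True, 0.8) for _ in range(18)] +
         [expression(True, 0.0) for _ in range(2)])
controls = ([expression(False, 0.8) for _ in range(2)] +
            [expression(False, 0.0) for _ in range(18)])
confounded_effect = sum(cases) / 20 - sum(controls) / 20

# Balanced design: each batch holds 10 cases and 10 controls,
# so the batch shift contributes equally to both groups.
bal_cases = [expression(True, s) for s in [0.8] * 10 + [0.0] * 10]
bal_controls = [expression(False, s) for s in [0.8] * 10 + [0.0] * 10]
balanced_effect = sum(bal_cases) / 20 - sum(bal_controls) / 20

print(round(confounded_effect, 2), round(balanced_effect, 2))
```

With these numbers the confounded estimate should land well above the true effect of 1.0 while the balanced estimate stays near it, which is the kind of bias the consortium attributes to batch composition.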
Other issues discussed in the papers include the clinical benefits of genomic classifiers, the impact of different modeling factors on prediction performance, the assessment of microarray cross-platform prediction, cross-tissue prediction, one-color versus two-color prediction comparison, functional analysis of gene signatures, and variability in genotype calling due to experimental or algorithmic factors.
One MAQC-II recommendation was for researchers to document the process of building classifiers. Specifically, the consortium said it was almost "impossible to retrospectively retrieve and document decisions that were made at every step during the feature selection and model development stage" during the project.
"This lack of complete description of the model-building process is likely to be a common reason for the inability of different data analysis teams to fully reproduce each other's results," MAQC noted.
Going forward, the consortium recommends that all genomic publications include supplementary materials describing the model-building and evaluation process in an electronic format.
MAQC-II is also making available six data sets with 13 endpoints that it said could be used in the future as a benchmark to verify that software used to implement new approaches performs as expected.
"Subjecting new software to benchmarks against these data sets could reassure potential users that the software is mature enough to be used for the development of predictive models in new data sets," the authors wrote.
MAQC-II also determined that different clinical endpoints represent different levels of classification difficulty. "For some endpoints the currently available data are sufficient to generate robust models, whereas for other endpoints currently available data do not seem to be sufficient to yield highly predictive models," Shi and colleagues wrote.
Finally, the group found that the accuracy of the clinical sample annotation information may also contribute to the relative difficulty of obtaining accurate prediction results on validation samples. "For example, some samples were misclassified by almost all models," the consortium reported.
As researchers continue to make sense of and apply the recommendations of the MAQC-II study, the third phase of the project, called the Sequencing Quality Control project, or SEQC, is already underway. The study aims to assess the technical performance of different next-generation sequencing technologies for DNA and RNA analyses, and to evaluate the pros and cons of various data-analysis methods.