The second phase of the Microarray Quality Control project reached several milestones during a meeting in Rockville, Md., last month, at which organizers agreed on a timeline to complete the project, reached a consensus on how they will study the resulting data set, and proposed ways to integrate SNP and copy number variation data into the project.
Leming Shi, a researcher at the US Food and Drug Administration’s National Center for Toxicological Research and coordinator of the FDA-hosted MAQC initiative, said that phase II aims to establish good practices for developing and validating microarray-based predictive models.
The first phase of the project, which concluded with the publication of a special issue of Nature Biotechnology in September 2006, evaluated the reproducibility of microarray experiments across different labs and platforms. Phase II instead seeks to understand which factors matter most in determining the internal and external validation performance of a predictive model, and why some data-analysis protocols succeed or fail in external prediction.
Shi said he believes that this can be accomplished by creating and mining a large data set of many predictive models developed by MAQC II’s data-analysis teams. At the meeting in Rockville, more than 15,000 models were submitted for use in the project. However, some teams submitted more models than others, creating “an extreme multiplicity of models” in Shi’s words.
“We are addressing a very challenging problem: overfitting in model development,” Shi told BioArray News this week. “In the MAQC‑II we have over 30 teams. Some teams submitted only one model and it looks like they have explored only one model. But if you look at their protocols, then you’ll find out that each team has in fact explored a large number of models.”
Shi said that MAQC II has decided to address the issue by recommending that its teams report every model they explored for each endpoint. That way, project participants will know which models each team studied, the combinations of modeling factors it tried, the estimated performance of those models, and eventually their actual performance in predicting the validation data.
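The multiplicity concern Shi describes can be illustrated with a toy simulation. The sketch below is not drawn from MAQC II data or protocols; the function names, sample sizes, and the use of coin-flip "models" are all assumptions made purely for illustration. It shows that when many candidate models are tried and only the best-scoring one is reported, the internal estimate can look strong even though nothing real has been learned, while performance on an external set stays near chance.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_of_k(k, n_internal=40, n_external=1000, seed=0):
    """Try k random 'models' on pure-noise labels and keep the best one.

    Hypothetical setup: labels and predictions are independent coin flips,
    so no model has any real predictive power.
    Returns (internal_accuracy, external_accuracy) of the selected model.
    """
    rng = random.Random(seed)
    internal_labels = [rng.randint(0, 1) for _ in range(n_internal)]
    external_labels = [rng.randint(0, 1) for _ in range(n_external)]

    best = None
    for _ in range(k):
        # Each "model" is just a fixed set of coin-flip predictions.
        internal_pred = [rng.randint(0, 1) for _ in range(n_internal)]
        external_pred = [rng.randint(0, 1) for _ in range(n_external)]
        score = accuracy(internal_pred, internal_labels)
        # Selection uses the internal score only, as a team would
        # before the validation data are unblinded.
        if best is None or score > best[0]:
            best = (score, accuracy(external_pred, external_labels))
    return best


if __name__ == "__main__":
    for k in (1, 10, 500):
        internal, external = best_of_k(k)
        print(f"models tried={k:4d}  internal={internal:.2f}  external={external:.2f}")
```

Reporting only the winning model hides how many were tried, which is why knowing the full list of explored models matters when interpreting an internal performance estimate.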
Essentially, Shi said, MAQC II aims to “know why some data-analysis protocols fail, and, more importantly, why some succeed.” He said that by reasonably constructing the data set of predictive models using the recommendations agreed upon at the meeting, a lot of “meaningful statistical analysis can be done on that data set for us to come to some conclusions.”
During the meeting, MAQC II set a schedule for publishing the results of the project. According to the timeline, MAQC II should have manuscripts ready for peer review by September. Shi called the schedule “aggressive,” but said that the project is leaving room for flexibility. According to the summary of last month’s meeting, 25 different manuscript topics were proposed, with titles such as “batch effects” and “normalization methods.”
Shi said that the project will look to publish a main manuscript that will detail what to do and what not to do when using microarray data in predictive models. “I think the problem we are trying to address in Phase II is more profound than in Phase I,” he explained. “We want to make sure this technology can make accurate and reliable predictions based on an individual’s microarray profile,” he said. “Just imagine this technology and the way we analyze the data — if this cannot be useful in predicting patient outcome, it won’t be helpful for patients, whom we are concerned with the most.”
Shi also said that MAQC II would like to make its results available to the public in a manner similar to how it made the results of Phase I available. However, he said it will take a while to see how many of the draft manuscripts proposed at the Rockville meeting will make it to publication. “If we can eventually come up with 10 or so manuscripts that we all feel comfortable presenting to the scientific community, we will contact some publishers to see whether they are interested,” he said.
SNPs and CNVs
MAQC II comprises five working groups, one of which is the Genome-Wide Association Working Group, which seeks to determine best practices for predictive models that rely on genotyping data, as opposed to gene expression data. The GWAWG was formed at MAQC II’s previous face-to-face meeting, held in May last year (see BAN 6/5/2007).
According to Shi, the GWAWG “fits under the MAQC umbrella,” and the project is emphasizing “not only identifying SNP or CNV lists differentiating two populations, but also combining them so that we can build more predictive models.”
At the Rockville meeting, the GWAWG decided to add CNV data to its purview by forming the Genome-Wide Copy Number Variation Data Analysis Team, co-chaired by Golden Helix CEO Christophe Lambert and Francesca Demichelis from the Institute for Computational Biomedicine at Weill Cornell Medical College.
Lambert told BioArray News this week in an e-mail that the new team has been charged with identifying the “sources of variation in results of analysis of copy number data on whole-genome microarrays” with respect to the goals of detecting and characterizing regions of copy number variation in germline DNA and in cancer DNA; finding associations between copy number variations and phenotypes; and building and validating predictive models of phenotype with copy number variation data.
“In the future, diagnostics and prognostics will require the incorporation of many forms of patient assay information,” Lambert said. “The inclusion of copy number variation is a logical progression to build on the information content of gene expression and genotype information.
“We anticipate that best practices will emerge to assist companies in bringing more effective diagnostics to market,” he said. “It will also help the FDA in making better decisions as more in vitro diagnostic multivariate index assays are submitted for regulatory approval.”
Ideally, the CNV Team is aiming to produce “one or more publications that highlight the sources of variability in analyzing CNV data, and making recommendations for best practices under various circumstances,” Lambert said. He said that Golden Helix presented some “early promising work building accurate predictive models on Type I diabetes using CNV data” at last month’s meeting that may serve as a jumping-off point for the project.
One question is how the development of good practices for the use of array data in predictive models might affect companies like Golden Helix that sell array-analysis software. Lambert said that a “number of software vendors are participating – mainly to stay abreast of the field.”
He added that it “is reasonable to assume that software developers will react to the findings. How they react will depend on what these findings are — and that we won’t know until we get there.”
Beyond software companies, the results of MAQC II could also affect the way companies developing array-based diagnostics use internal predictive models and how the FDA reviews those models when evaluating voluntary exploratory data submissions connected with, for example, an application for 510(k) clearance.
Shi said that he hopes firms developing array-based diagnostics are “doing the right things.” He also said that he doesn’t think the results of MAQC II will come to be seen as the “only way to do business.” Instead, he said that the likely outcome “will be that for a certain task there will likely be equivalent combinations to do the same job.” He added that MAQC II can help the array-based diagnostic community by “identifying those practices that should be avoided.”
Because MAQC II lacks a regulatory mandate, it is unclear what impact it will have on how the FDA views submissions. However, Shi said that the “scientific outcome of the project could help the FDA in the future to review these kinds of submissions.” He noted that following the publication of the results of MAQC Phase I, the FDA released a companion guidance for pharmacogenomic data submissions.