The second phase of the US Food and Drug Administration-hosted Microarray Quality Control project is on schedule to wrap up by the end of 2009, when the results of the effort are likely to be published in a major journal, such as Nature Biotechnology, according to MAQC II coordinator Leming Shi.
Meantime, the third phase of the project, called the Sequencing Quality Control project, or SEQC, is already underway following a meeting last month at an FDA campus in Silver Spring, Md.; organizers intend to submit manuscripts for peer review by the end of this year.
Shi, a researcher at the National Center for Toxicological Research in Jefferson, Ark., told BioArray News this week that the major data-analysis projects within MAQC II have been concluded, and that project leaders are now drafting around 10 manuscripts related to it. Half of them may be published together, he said.
“The MAQC II manuscripts will be submitted for peer review within the next couple of months,” Shi said. “However, it is hard to predict the outcome of the peer-review process. Hopefully, half of them will be published together.”
The first phase of the MAQC project published its results in a special issue of Nature Biotechnology in September 2006 (see BAN 9/12/2006). The second phase of the project, which includes representatives from 60 groups, began later that year (see BAN 12/19/2006).
While the first phase of the project evaluated the reproducibility of microarray experiments across different labs and platforms using two RNA reference samples, phase II seeks to understand which factors matter most in determining the internal and external validation performance of predictive models, and why some data-analysis protocols succeed or fail in predicting biological outcomes from microarray data.
In April, Shi told BioArray News that phase II “aims to establish good practices for developing and validating microarray-based predictive models.” To achieve that, the project analyzed six gene expression and genotyping data sets with 13 endpoints to generate a data set of predictive models “useful for personalized medicine” (see BAN 4/15/2008).
According to the summary from a meeting MAQC organizers held in Rockville, Md., in September, more than 16 potential phase II manuscripts were discussed. The main paper in development will discuss the “most important findings of the MAQC II participants with the main objective of reaching consensus on the ‘best practices’ of developing and validating microarray-based predictive models.”
These best practices will subsequently be implemented as the MAQC’s “Data Analysis Protocol,” which organizers hope will be applicable to future data sets outside of the project, said Shi.
Discussing the main paper, Shi said that one aspect of the “best practices” identified by the consortium is to “decide on an option at each of the modeling steps, including batch-effect handling, normalization, feature selection, and classification algorithm.” The other part is to “ensure that the process of estimating a model’s prediction performance is unbiased and the performance estimate is realistic.”
He added that there are other aspects of best practices, “including study design, tissue sample collection and handling, and microarray data generation,” that will be addressed in associated MAQC II papers.
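The two “best practices” Shi describes — choosing an option at each modeling step, and keeping the performance estimate unbiased — can be illustrated with a small sketch. Everything below is hypothetical: the simulated expression matrix, the mean-difference gene filter, and the nearest-centroid classifier are stand-ins, not the consortium’s protocol. The one point the sketch is meant to make is that feature selection must happen inside each cross-validation fold, so the held-out sample never influences which gene is chosen.

```python
# Hypothetical sketch: unbiased performance estimation for a microarray
# classifier. Feature selection runs INSIDE the leave-one-out loop.
import random

random.seed(0)

# Simulated data: 20 samples x 50 genes, binary endpoint.
# Gene 0 carries a class-dependent shift; the rest are pure noise.
labels = [i % 2 for i in range(20)]
data = [[(1.5 if (g == 0 and y == 1) else 0.0) + random.gauss(0, 1)
         for g in range(50)] for y in labels]

def select_top_gene(train_rows, train_labels):
    """Pick the gene with the largest mean difference between classes."""
    def score(g):
        a = [r[g] for r, y in zip(train_rows, train_labels) if y == 1]
        b = [r[g] for r, y in zip(train_rows, train_labels) if y == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return max(range(len(train_rows[0])), key=score)

correct = 0
for i in range(len(data)):                       # leave-one-out CV
    train = [r for j, r in enumerate(data) if j != i]
    train_y = [y for j, y in enumerate(labels) if j != i]
    g = select_top_gene(train, train_y)          # selection inside the fold
    # Nearest-centroid call on the single selected gene.
    c1 = sum(r[g] for r, y in zip(train, train_y) if y == 1) / train_y.count(1)
    c0 = sum(r[g] for r, y in zip(train, train_y) if y == 0) / train_y.count(0)
    pred = 1 if abs(data[i][g] - c1) < abs(data[i][g] - c0) else 0
    correct += (pred == labels[i])

accuracy = correct / len(data)
```

Selecting the gene on the full data set first, and only then cross-validating the classifier, would leak information from the held-out sample into the model and inflate the accuracy estimate — the kind of optimistic bias the MAQC II best practices are meant to rule out.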
‘Not a Good Light Bulb’
A second MAQC II manuscript under consideration, for example, concerns efforts to remove batch effects from microarray data. Such effects are a type of noise in gene-expression data that can result from systematic differences in measurements due to chip types, sites, hybridization dates, and sample-preparation and scanning procedures performed by individual technicians.
“This manuscript is trying to evaluate all available algorithms [for batch effect removal] through multiple data sets covering different problems and generated using different platforms,” said John Zhang, CEO of Systems Analytics, a Waltham, Mass.-based bioinformatics firm participating in the MAQC.
Specifically, the paper develops quantitative measures of batch effects and evaluates the performance of commonly used batch effect-removal procedures with the objective of improving cross-batch predictions, Zhang said. The paper argues that “proper batch-effect removal can effectively improve the cross-batch prediction performance of the predictive models if the batch effects are significant.”
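The paper’s actual metrics are not spelled out here, but one simple, hypothetical way to quantify a batch effect for a single gene is an F-statistic-like ratio of between-batch to within-batch variance — large values suggest that batch membership, not biology, dominates the measurements:

```python
# Hypothetical batch-effect score (not the MAQC paper's metric):
# between-batch variance divided by within-batch variance for one gene.

def batch_effect_score(values, batches):
    """values: one gene's expression per sample; batches: batch label per sample."""
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    grand = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                  for g in groups.values()) / (len(groups) - 1)
    within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                 for g in groups.values()) / (len(values) - len(groups))
    return between / within

# Two hypothetical batches measuring the same gene; batch B is shifted upward.
batch_a = [1.0, 1.2, 0.9, 1.1]
batch_b = [3.0, 3.1, 2.9, 3.2]
score = batch_effect_score(batch_a + batch_b, ["A"] * 4 + ["B"] * 4)
```

For these made-up numbers the between-batch shift dwarfs the within-batch scatter, so the score is large; for two batches drawn from the same distribution it would hover near 1.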
According to Zhang, “We tried to test these batch effect problems with feature selection methods using prediction performance as a measure. We are trying to come up with recommendations that will help researchers make decisions about what methods they choose to use under certain conditions.
“That way, when people have a data set and a batch effect, they can take a look and decide which method to use,” he added.
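As one concrete illustration of the kind of method such recommendations would cover, the sketch below applies per-batch mean-centering — a deliberately simplified stand-in for richer corrections such as empirical-Bayes adjustment — to a single gene measured at two hypothetical sites:

```python
# Hypothetical sketch of one simple batch correction: per-batch
# mean-centering. Each batch's values are shifted so every batch has
# the same (zero) mean for the gene. Caveat: if batch membership is
# confounded with the phenotype, this can remove biology along with noise.

def mean_center_by_batch(values, batches):
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    means = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    return [v - means[b] for v, b in zip(values, batches)]

batch_a = [1.0, 1.2, 0.9, 1.1]   # site 1
batch_b = [3.0, 3.1, 2.9, 3.2]   # site 2, systematically higher
corrected = mean_center_by_batch(batch_a + batch_b, ["A"] * 4 + ["B"] * 4)
```

After correction, both sites’ values sit on a common zero-mean scale, so a classifier trained on one batch is no longer thrown off by the other batch’s systematic offset — the cross-batch prediction scenario the manuscript evaluates.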
Zhang told BioArray News this week that the resulting paper will be useful for Systems Analytics because many of its customers are looking to exclude batch effects from data analyses. He added that the second phase of the MAQC project could help raise the quality of future microarray experiment results.
“A light bulb shining only five minutes or only a few times is not a good light bulb,” Zhang said. “Our view is that a good algorithm should be applicable to the majority of data sets,” he added. “There are too many publications where the authors invent a new algorithm and test it on one or two data sets. That is not very robust.”
According to the summary of the MAQC’s September meeting, other papers in development will: assess how normalization methods could predict microarray data-analysis performance; evaluate cross-platform consistency and transferability of microarray-based molecular signatures; evaluate the cross-tissue predictability of microarray genomic markers; compare the prediction performance of models developed on the same set of patients but with different generations of Affymetrix microarrays; and compare one-color and two-color microarray platforms for their ability to classify neuroblastoma based on gene expression profiles.
Manuscripts are also being developed by the MAQC’s Genome-Wide Association Working Group, which seeks to determine best practices for predictive models that rely on genotyping data, as opposed to gene-expression data.
Last year the GWAWG, formed at MAQC II’s meeting in May 2007 (see BAN 6/5/2007), decided to add CNV data to the project and formed the Genome-Wide Copy Number Variation Data Analysis Team, co-chaired by Golden Helix CEO Christophe Lambert and Francesca Demichelis, a researcher from the Institute for Computational Biomedicine at Weill Cornell Medical College in New York.
Lambert told BioArray News at the time that the team aimed to identify the “sources of variation in results of analysis of copy number data on whole-genome microarrays,” and to publish “one or more publications that highlight the sources of variability in analyzing CNV data, and making recommendations for best practices under various circumstances” (see BAN 4/15/2008).
Shi said this week that the GWAWG has been focusing on assessing the impact of various data-analysis approaches on genotype calls and the list of differentiating SNPs. The technical performance of genotyping platforms is also being assessed by different laboratories using the same set of DNA samples, he said.
Meantime, according to Shi, the CNV data-analysis team has “continued to identify sources of variation in results of analyzing CNV data with respect to the goals of detecting and characterizing regions of copy number variation, finding associations between CNVs and phenotypes, and building and validating predictive models of phenotypes with CNV data.”
According to Shi, “several manuscripts are under preparation,” and after the MAQC II results are published, the “best practices” for developing and validating predictive models will be either incorporated in existing FDA guidance documents on pharmacogenomics data submission or turned into a separate guidance document.
As phase II of the MAQC nears its end, the core participants of the project have begun work on its third phase, the Sequencing Quality Control project.
Last week, the FDA announced that it is seeking volunteers from the public to participate in this new project with the aim of providing objective assessments of DNA and RNA analysis technologies and the software designed to manage and analyze the “massive new data sets” that second-generation sequencing tools create.
Requests to participate in the SEQC project at the NCTR should be submitted by Jan. 9.
Shi said this week that the SEQC aims to assess the technical performance of different second-generation sequencing platforms by generating large benchmark data sets on the platforms. The project will then evaluate the advantages and limitations of various bioinformatics strategies for handling and analyzing the massive sequence data sets; compare data from second-generation sequencers to those from other technologies such as microarrays and qPCR; and evaluate how second-gen sequencing tools can be used to assess the safety and toxicity of FDA-regulated products.
This week, Shi told BioArray News sister publication In Sequence that the “SEQC is a natural extension of the MAQC project” because the SEQC will “need the community’s active participation to be successful.” Also, a “huge amount of expression data has already been collected on the two MAQC reference RNA samples, making them a natural choice for benchmarking RNA sequencing data.”
Shi added that “all major sequencing players have been using the two RNA samples internally for quality control and protocol optimization purposes.”
According to Shi, the SEQC hopes to finalize its study design by next month, and will begin collecting RNA sequencing data on the MAQC reference samples during the spring. Between May and October, project organizers said they hope to analyze data and prepare manuscripts that they plan to submit for peer review in December and publish sometime in 2010.
Shi told In Sequence that the FDA views the SEQC as useful because second-generation sequencing “will be, if it has not already been, used by sponsors as an alternative tool for DNA and RNA analysis and that the resulting data will be used to support their medical product development for diagnostics, prognostics, and treatment selection.”
He said that research scientists and reviewers from multiple FDA centers are participating in the SEQC effort and that he anticipates that “SEQC will help prepare the FDA for the next wave of submission of genomic data generated from the next-generation sequencing technologies.”