A team at the US National Cancer Institute has developed a "happy compromise" for researchers who want to run array-based genome-wide association studies on high-density chips but cannot afford to genotype the cohorts of thousands of samples required to reach statistically significant findings.
Rather than running whole-genome genotyping arrays containing common variants on a large number of samples and then imputing the results using publicly available datasets, the NCI research team recommends genotyping all of the study samples using these inexpensive arrays, and then analyzing a subset of the samples with one of two approaches: more expensive higher-density arrays containing rare variants or next-generation sequencing.
The team then calls for imputing the missing genotypes for those participants not genotyped on the denser platform and performing the association study on the augmented dataset. "Instead of depending only on a public dataset, the imputation reference set now includes a genotyped subset of the study population," they noted in a paper outlining the approach, published this month in Genetic Epidemiology.
Lead author Joshua Sampson told BioArray News that as genotyping approaches improve, investigators face a number of questions. "If the study has already genotyped a cohort on an older platform, they have to decide whether it's worth re-genotyping their population using the improved technology," Sampson said. On the other hand, "if a study is about to genotype a cohort, they have to decide whether using the bigger, better, and more inclusive genotyping platform is worth the additional cost," he said.
"Our recent paper attempts to show that the choices need not be so black and white," he added. "Imputation allows a happy compromise in our two-platform design."
Sampson is a biostatistician in the Division of Cancer Epidemiology and Genetics at NCI in Rockville, Md. Other authors on the paper include Kevin Jacobs, Zhaoming Wang, Meredith Yeager, Stephen Chanock, and Nilanjan Chatterjee. Chanock, who is chief of the Laboratory of Translational Genomics at NCI, discussed association studies in an interview with BioArray News last year (BAN 3/1/2011).
Affymetrix and Illumina continue to develop high-density chips containing rare-variant content for association studies. For instance, Illumina last year launched the Omni5, which contains nearly 5 million markers. Such arrays are largely designed using rare-variant content from the 1000 Genomes Project and other sources to detect uncommon susceptibility SNPs with minor allele frequencies between 1 percent and 10 percent. Sampson and his colleagues set out to reduce the cost of such studies by avoiding genotyping a large number of participants with expensive technologies.
In the paper, the NCI team argued that using high-density arrays in combination with next-generation sequencing in large association studies is "prohibitively expensive" for most researchers. The "more economical alternative" is to use less-dense arrays to genotype the study samples and then rely on an imputation procedure trained on a publicly available database to estimate the missing genotypes. But, as the authors warn in the paper, "if the ancestry of the study population is not adequately represented in the database, the imputation accuracy for uncommon SNPs can be less than ideal and confound study results."
Their compromise method calls for using a standard genotyping array to genotype all of the study samples, and then supplementing that data by genotyping only a small proportion of the participants on a platform that has higher coverage for uncommon SNPs. This subset of the study population is then included as part of the imputation reference set.
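The workflow described above can be sketched in miniature. The toy code below (an illustration only, not the paper's actual statistical method, which uses haplotype-based imputation) simulates a cohort genotyped on a standard array, re-genotypes a small reference subset on a denser platform, and then fills in each remaining participant's dense-only genotypes by copying them from the reference individual whose common-SNP genotypes match best. All sizes and the nearest-neighbor rule are assumptions chosen for brevity.

```python
import random

random.seed(0)

N, N_REF = 50, 10          # cohort size; subset genotyped on the denser array
N_COMMON, N_RARE = 30, 5   # SNPs on the standard array; extra SNPs on the dense array

# Simulate genotypes (0/1/2 allele counts) for the whole cohort.
truth_common = {i: [random.randint(0, 2) for _ in range(N_COMMON)] for i in range(N)}
truth_rare = {i: [random.randint(0, 1) for _ in range(N_RARE)] for i in range(N)}

ref_ids = list(range(N_REF))  # only these individuals are run on the denser platform

def impute(subject):
    """Fill in the dense-only SNPs for `subject` by borrowing them from the
    reference individual whose common-SNP genotypes are closest (toy rule)."""
    best = min(ref_ids, key=lambda r: sum(
        abs(a - b) for a, b in zip(truth_common[subject], truth_common[r])))
    return truth_rare[best]

# Reference individuals keep their observed dense genotypes; everyone else is imputed.
imputed = {i: (truth_rare[i] if i in ref_ids else impute(i)) for i in range(N)}
```

The key structural point it illustrates is the one the authors make: the reference set used for imputation includes genotyped members of the study population itself, rather than only an external panel.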
In the paper, the team evaluated the potential efficiency of the two-platform approach using a dataset containing 756 individuals genotyped on both the Illumina HumanOmniExpress and Omni2.5-Quad, which contain roughly 900,000 and 2.5 million markers, respectively.
While the authors acknowledged that genotyping all individuals on a denser array "would be ideal," they found that genotyping only 100 individuals on the array, in combination with imputation, leads to "only a modest loss of power for detecting associations."
More specifically, they argue that it could be possible to observe more than 80 percent of the detectable associations with as few as 100 subjects genotyped on the higher-density chip, an increase of between 5 percent and 10 percent over the percentage possible when basing imputation only on a public reference set.
At the same time, they noted that if the relative risks for rare variants are significantly larger than those previously observed for common variants, then the proportion detected would likely be lower, concluding that "this same evidence cautions against depending on imputation if rare variants are found to have large relative risks."
According to Sampson, one could genotype "only a small fraction, perhaps just 1 percent of a cohort, on the bigger platform" as part of the two-platform approach. Then the remainder of the cohort could be genotyped on the lower-density platform with imputation used to fill in the difference.
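As a rough illustration of the cost arithmetic, the sketch below compares a single-platform design against the two-platform design. The per-sample prices are hypothetical placeholders, not figures from the paper or either vendor.

```python
# Hypothetical per-sample prices (assumptions, not from the paper):
PRICE_STANDARD = 200.0   # e.g., an OmniExpress-class array
PRICE_DENSE = 600.0      # e.g., an Omni2.5/Omni5-class array

def design_cost(n_total, n_dense):
    """Two-platform design: every participant on the standard array,
    plus n_dense of them also genotyped on the denser array."""
    return n_total * PRICE_STANDARD + n_dense * PRICE_DENSE

cohort = 10_000
all_dense = cohort * PRICE_DENSE        # everyone on the denser platform
two_platform = design_cost(cohort, 100) # the paper's suggested subset size
print(two_platform, all_dense)
```

Under these placeholder prices, genotyping only 100 of 10,000 participants on the denser platform costs a fraction of the single-platform design, which is the tradeoff the authors weigh against the modest loss of power.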
"The key point is that the small fraction of the cohort genotyped on the larger platform allows the imputation model to be trained on one's own cohort," Sampson said. "This guarantees that the training set includes adequate representation for the desired population," he said. "By genotyping just a hundred individuals on that larger platform, study power can be increased by [between] 5 [percent] and 10 percent, as compared to when only a public reference dataset is available."
The two-platform design is appropriate "whenever two different genotyping methods are available with one method being more inclusive, but more expensive," the authors wrote in the paper. They also noted that while the analysis was presented on the OmniExpress and Omni2.5, the results could be "generalized to other genotyping platforms and eventually next-generation sequencing studies once the quality of calling algorithms has stabilized."