Genotype Imputation From Array Data Can Approximate Sequencing in Some Cases, Study Finds

NEW YORK – How well array-based genotyping followed by imputation can mimic whole-genome sequencing results depends on factors like reference panel, genotype array, sample ancestry, and where in the genome the variant is located, according to a new study.

While whole-genome sequencing has become cheaper and cheaper, it can still be too expensive to rely on for large studies, leading investigators to combine array-based genotyping with imputation. By comparing sample haplotypes to a reference panel of sequenced haplotypes, they can infer what variants a sample might have, even if they are not included on the genotyping array.

Researchers from the University of Michigan and elsewhere have now examined how well this approach reflects deep whole-genome sequencing. To do so, they compared imputation results with whole-genome sequencing data from four studies representing African American, Hispanic or Latino, and European American ancestry populations in the US and people of Finnish ancestry in Finland. As they reported in the American Journal of Human Genetics on Wednesday, the researchers found that in some situations, array-based genotyping followed by imputation can approximate whole-genome sequencing, but they cautioned the results should not be applied clinically.

"While array genotyping and imputation cannot fully replace deep WGS, we found that it can approximate WGS for variants down to specific [minor allele frequency] thresholds depending on genotype array and reference panel choices as well as sample ancestry," co-senior author Christian Fuchsberger from the Institute for Biomedicine in Italy and colleagues wrote in their paper.

For their analysis, the researchers used whole-genome sequencing data from the BioMe, InPSYght, METSIM, and MLOF cohorts. For each individual, they determined what their genotype would have been if generated by the Illumina Core, OmniExpress, or Omni 2.5M arrays, then carried out genotype imputation using a 1,000 Genomes Project panel, a panel from the Haplotype Reference Consortium, and a modified Trans-Omics for Precision Medicine (TOPMed) panel.

For all ancestries and imputation reference panels, the quality of imputation increased with larger array size, the researchers found. Overall, the densest array, Omni 2.5M, had the highest mean observed imputation quality, as well as both the highest number and proportion of well-imputed variants.
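As an illustration of how imputation quality can be scored against sequencing, one widely used measure is the squared correlation between a variant's imputed dosages and the genotypes observed by sequencing in the same individuals. The sketch below assumes that measure; the function name and toy inputs are hypothetical, and the study's own pipeline is not reproduced here.

```python
from statistics import mean


def observed_r2(sequenced_genotypes, imputed_dosages):
    """Squared Pearson correlation between sequenced genotypes (0, 1, or 2
    copies of the alternate allele) and imputed dosages (0.0 to 2.0).

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(sequenced_genotypes) != len(imputed_dosages):
        raise ValueError("inputs must cover the same individuals")

    mean_g = mean(sequenced_genotypes)
    mean_d = mean(imputed_dosages)
    cov = sum((g - mean_g) * (d - mean_d)
              for g, d in zip(sequenced_genotypes, imputed_dosages))
    var_g = sum((g - mean_g) ** 2 for g in sequenced_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)

    # The correlation is undefined when either input does not vary.
    if var_g == 0 or var_d == 0:
        return float("nan")
    return (cov * cov) / (var_g * var_d)


# Toy example: imputation that tracks sequencing closely but not perfectly.
sequenced = [0, 1, 2, 0, 1, 0]
imputed = [0.1, 0.9, 1.8, 0.0, 1.2, 0.2]
print(round(observed_r2(sequenced, imputed), 3))
```

A value near 1 means the imputed dosages closely track the sequenced genotypes. For rare variants, a handful of mis-imputed carriers can pull the value down sharply.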

TOPMed-based imputation from the Omni 2.5M array most closely resembled that of whole-genome sequencing for variants of lower minor allele frequencies across all ancestral backgrounds. In particular, the researchers found the approach could approximate sequencing at a population level for variants with minor allele frequencies equal to or greater than 0.14 percent in the African ancestry cohort, 0.11 percent in the Hispanic/Latino ancestry cohort, 0.35 percent in the European ancestry cohort, and 0.84 percent in the Finnish ancestry cohort.
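To make those thresholds concrete, the following minimal sketch checks whether a variant's minor allele frequency clears the cohort-specific cutoff reported above for TOPMed-based imputation from the Omni 2.5M array. The function names and the example counts are hypothetical, and the cutoffs describe population-level approximation in the study's cohorts, not a guarantee for any individual variant.

```python
# Reported minor allele frequency cutoffs, in percent, by ancestry cohort
# (TOPMed panel, Omni 2.5M array), as quoted in the article.
MAF_THRESHOLD_PERCENT = {
    "African": 0.14,
    "Hispanic/Latino": 0.11,
    "European": 0.35,
    "Finnish": 0.84,
}


def minor_allele_frequency(alt_allele_count, n_individuals):
    """Minor allele frequency for a biallelic variant in diploid samples."""
    alt_freq = alt_allele_count / (2 * n_individuals)
    return min(alt_freq, 1 - alt_freq)


def clears_reported_threshold(maf, cohort):
    """True if the frequency (a fraction, not a percent) is at or above
    the reported cutoff for the given cohort. Hypothetical helper."""
    return maf * 100 >= MAF_THRESHOLD_PERCENT[cohort]


# Hypothetical variant: 20 alternate alleles among 5,000 people (MAF 0.2%).
maf = minor_allele_frequency(20, 5000)
for cohort in MAF_THRESHOLD_PERCENT:
    print(cohort, clears_reported_threshold(maf, cohort))
```

In this toy case, a 0.2 percent variant clears the cutoff for the African and Hispanic/Latino ancestry cohorts but falls below it for the European and Finnish ancestry cohorts.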

Looking at finer-scale measures of genetic ancestry, the researchers additionally noted that individuals with higher levels of African genetic ancestry tended to have higher genotype concordance rates, possibly because there may be more rare variants in non-African populations that have undergone population bottlenecks.

Other factors also influenced imputation accuracy and quality. For instance, regions with higher recombination rates, low GC content, increased structural variants, and segmental duplications were linked to lower imputation quality.

To account for this, the researchers developed a software tool dubbed RsqBrowser to help investigators estimate imputation quality for specific variants or genomic regions by ancestry, and to guide their choice between array genotyping followed by imputation and whole-genome sequencing. RsqBrowser is publicly available at the Michigan Imputation Server.

The researchers additionally examined whether protein-coding variants, which are more likely to be of clinical significance, are well imputed. They found that the concordance rates for rare and low-frequency variants differed widely among individuals by ancestry group and that concordance rates were further associated with finer-scale ancestry.

Due to this variability, the researchers concluded that whole-genome sequencing "cannot currently be reliably approximated in clinical settings with array genotyping and imputation with the reference panels studied here."