Researchers using commercial SNP genotyping arrays should be wary of using their results to make predictions about an individual's genetic risk of developing certain diseases, according to a new study.
The study, from scientists at Stanford University, found that risk predictions derived from next-generation sequencing differ substantially from those obtained from genotyping arrays for several non-monogenic diseases. At the same time, high-density genotyping arrays may give results identical to sequencing results for many diseases.
Stanford's Alex Morgan told BioArray News this week that the aim of the study was to "give some insight into which technology might be considered in a disease-dependent way." Based on the study's findings, researchers could decide whether arrays or sequencing would be better suited to investigate a particular disease of interest, he said.
More specifically, Morgan said that arrays might be better suited for diseases that are caused by "only a handful of variants," such as sickle cell anemia or cystic fibrosis. Cancer studies, though, may require not only multiple sequencing runs but also other kinds of information, such as methylation patterns.
"A tumor doesn't even have a single genotype, and a number of studies have shown that if you look at different regions in the tumor, there is a lot of heterogeneity," said Morgan. "We can't easily design a chip to measure these."
Morgan is a postdoc in the laboratory of Ron Davis at the Stanford Genome Technology Center. His PhD advisor Atul Butte also contributed to the project, as did Rong Chen, a bioinformaticist in the lab. Morgan discussed the study in March at the American Medical Informatics Association's Summit on Bioinformatics, held in San Francisco. He expects a paper on the work to be published in the Journal of the American Medical Informatics Association in coming weeks.
The impetus to compare arrays and sequencing came out of prior work in Butte's lab, in particular an approach Morgan and others developed called the Risk-o-Gram that relies on an individual's genome along with all the previous studies on variants associated with disease to create a genomic health report.
Personalis, a startup that specializes in human genome interpretation and is led by former Solexa CEO John West, has since integrated the approach into its offering. Butte is also a cofounder of Personalis.
As the Risk-o-Gram does not require the use of sequence data, and "could just as easily use genotyping array data," Morgan and his fellow researchers decided to develop a method to compare the amount of medically relevant information the two different technology platforms might provide. Using sequence data from the 1000 Genomes Project, the Stanford researchers compared Risk-o-Gram profiles for 187 individuals with the predicted results from genotyping arrays for the same individuals. They focused on 55 common diseases with reported gene-disease associations in the scientific literature.
The researchers used two different risk models, one based on the product of likelihood ratios and another based on the allelic variant with the maximum associated disease risk. They also constructed risk profiles based on the SNPs that would be measured by two common genotyping array platforms: the Affymetrix GeneChip Human Mapping 500K Array Set, which measures 500,000 SNPs; and Illumina's Human Omni1-Quad BeadChip, which measures over a million SNPs.
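The two risk models described above can be illustrated with a minimal Python sketch. This is not the Stanford group's code or the Risk-o-Gram implementation, neither of which is described in detail here. The variant names, the likelihood ratio values, the function names, and the 20 percent pre-test probability are all hypothetical, and the conversion from a likelihood ratio to a post-test probability uses the standard odds form of Bayes' rule, which is an assumption about how such a ratio would be applied.

```python
from math import prod

# Hypothetical per-variant likelihood ratios (LRs) for one person and one
# disease. These values are illustrative only and come from no real study.
example_lrs = {"variant_a": 1.5, "variant_b": 0.75, "variant_c": 2.0}


def product_of_lrs(lrs):
    """Product model: multiply the LRs of every observed variant."""
    return prod(lrs.values(), start=1.0)


def max_risk_lr(lrs):
    """Maximum model: keep only the variant with the largest LR."""
    return max(lrs.values(), default=1.0)


def post_test_probability(pre_test_probability, lr):
    """Apply an LR to a pre-test probability via the odds form of Bayes' rule."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


print("product model LR:", product_of_lrs(example_lrs))
print("maximum model LR:", max_risk_lr(example_lrs))
print("post-test probability (product model):",
      post_test_probability(0.2, product_of_lrs(example_lrs)))
```

In this toy case the product model lets a protective variant (a ratio below 1) pull the combined estimate down, while the maximum model ignores everything except the single highest-risk variant.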
"This gave us an amount of variability for a whole range of diseases," said Morgan. "We could then look to see if the different technologies produced the same risk profile, or a different one by disease."
Overall, the researchers determined that genotyping arrays "are leaving out some very important disease-associated variants for many diseases, and importantly that these are variants that are likely present in a reasonable number of people," as they showed up in the group's test population.
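To see how variants missing from an array can change a risk estimate, consider the following minimal Python sketch. It assumes a product-of-likelihood-ratios model and compares the combined ratio computed from every variant found by sequencing with the ratio computed only from the variants an array measures. The variant names, likelihood ratios, array content, and function names are hypothetical and are not drawn from the study.

```python
from math import prod

# Hypothetical LRs for variants found by sequencing one individual.
sequence_lrs = {"var1": 2.0, "var2": 4.0, "var3": 0.5, "var4": 2.5}

# Hypothetical subset of those variants that an array happens to measure.
array_variants = {"var1", "var3"}


def combined_lr(lrs, measured=None):
    """Product of LRs, restricted to `measured` variants when it is given."""
    if measured is None:
        return prod(lrs.values(), start=1.0)
    return prod((lr for name, lr in lrs.items() if name in measured), start=1.0)


def fold_difference(a, b):
    """How many times larger the bigger estimate is than the smaller one."""
    return max(a, b) / min(a, b)


sequencing_estimate = combined_lr(sequence_lrs)
array_estimate = combined_lr(sequence_lrs, array_variants)

print("sequencing LR:", sequencing_estimate)
print("array LR:", array_estimate)
print("fold difference:", fold_difference(sequencing_estimate, array_estimate))
```

With these made-up numbers, the two variants the array does not measure carry most of the risk signal, so the array-based estimate is far lower than the sequence-based one.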
Even with a "very liberal allowance" for what was potentially imputable from the genotyping arrays, the clinical risk assessments under both the product of likelihood ratios model and the maximum likelihood ratio model "differed substantially" between the arrays and the 1000 Genomes sequence data.
"This suggests that a clinical interpretation of the genome done only using an 'off-the-shelf' genotyping array is likely to be lacking in important information relevant to a patient's health," said Morgan. He added that for diseases such as Alzheimer's and type 2 diabetes, which have many associated variants not covered by the genotyping arrays, the overall likelihood ratios can "vary dramatically," by as much as a factor of 20.
Morgan acknowledged that newer, higher-density genotyping arrays are available (Illumina's Omni5 BeadChip covers nearly 5 million SNPs, for example), but he said that even if the researchers had compared these higher-density chips with the 1000 Genomes sequence information, the "specifics would vary, but not the general approach."
"Even if arrays expand their coverage, many disease-related variants are rare, so are often not covered by even a very high-density array, unless it was specifically designed to probe rare disease variants," said Morgan.
'Hammers and Screwdrivers'
While the results of the Stanford study demonstrated that "sequencing does provide lots of important health-related variants that are not covered by genotyping arrays," Morgan said that the Butte lab is still "very much in favor" of using arrays, not only for genotyping, but also for expression profiling.
"A lot of my current work is about integrating microarray data, both genotyping array data from large association studies, but also [from] many gene expression studies, to help figure out which regions we want to sequence in depth," Morgan said. "I really feel that [I] want to make the best use of all available technologies."
Indeed, Morgan was coauthor on a recent expression-based genome-wide association study that linked the receptor CD44 in adipose tissue with type 2 diabetes. A paper describing the study was published last month.
"Expression arrays can be very helpful in letting you zoom in on some high-priority functional regions," said Morgan. He also said that genotyping arrays have an "important role to play" in the discovery phase of genome-wide disease association studies, as the "potential hypothesis space for examining associations is incredibly large using a full genome sequence."
Morgan likened the choice between arrays and sequencing to one between different tools.
"It's like a screwdriver or a hammer," Morgan said. "They both do something roughly similar, but some jobs are better suited to one than another. You just have to know when to use one or the other, and that's what this paper is about, trying to help give a bit of insight into why to choose one over another and when."
Morgan also said that while sequencing may provide more disease-relevant information, it comes with its own set of challenges.
"If you sequence someone, it is very unclear what you do with all the intergenic regions," Morgan said. "If there are variants there, how do they affect function? So even if you look at the papers where people were fully sequenced … people ignore the intergenic variants and look only in the protein-coding regions to try to figure out what causes disease."
According to Morgan, though, most of the disease-associated variants identified by array-based GWAS have been found in the intergenic space.
"You know the story of the man looking under the lamp-post for his keys," Morgan said. "When asked, he says he actually dropped his keys down the street, but here under the lamp-post the light is better. Perhaps we are like that man … looking at the protein-coding regions because the light is better, not because that is where the disease variants are."