The Undiscovered Variants


Scattered throughout the human genome are clues as to why some people become afflicted with a particular common disease, while others do not. And genome-wide association studies have been hot on the trail of those clues.

While GWAS have identified a number of variants linked to diseases or traits — one estimate says more than 1,000 common disease- and trait-associated loci have been identified so far — they do not explain the majority of variation seen in populations. Height, for example, is thought to have a heritability of about 80 percent, though the common variants associated with height only explain about 5 percent of height variation seen in populations. Rare variants, gene-gene and gene-environment interactions, or epigenetics, among others, could help fill this gap, as could common variants that have yet to be found.

In addition, rare variants could already be influencing the common variants identified through long-range synthetic associations. Cheaper whole-genome sequencing could help solve this mystery, as could larger GWAS sample sizes, better statistical techniques, and functional analysis approaches.

GWAS have matured since their introduction in the early 2000s. In the beginning, studies were smaller: the first few human GWAS had about a thousand cases and controls. Studies have since grown, with a few thousand cases and controls. Indeed, a recent GWAS of migraine studied 2,731 cases and 10,747 population-matched controls and was later replicated in 3,202 cases and 40,062 controls.

"I think we've learned a lot about how to do GWAS studies properly," says Jeffrey Barrett at the Wellcome Trust Sanger Institute. In addition, he notes that GWAS have seen technological and statistical improvements. SNP chips can now include nearly a million SNPs as well as probes for copy-number variants. Genotype imputation, which uses a reference data set such as the HapMap or 1000 Genomes to statistically predict genotypes that were not directly assayed, has also helped move the field along. "Those kinds of things have really also contributed to increasing the power of GWAS approaches," Barrett adds.

But even as the technique matures, GWAS are based on the hypothesis that common variants cause common diseases, and that these studies can pluck those signals out of the noise. "In my opinion, the common disease, common variant hypothesis has been hugely borne out and reinforced by the bulk of GWAS results," Barrett says, though he adds, "I think it's definitely going to vary a bit depending on what kind of disease and trait you are talking about."

Baylor College of Medicine's Suzanne Leal says that common variants "are definitely involved in common diseases, although they don't seem to play a substantial role in the heritability."

Not there

Common variants, those with a minor allele frequency of greater than 5 percent, haven't quite lived up to their promise as far as explaining the root of common, complex diseases. Alone, they don't fully explain disease or trait heritability, which relies on a calculation of how the phenotype is affected by both the genotype and the environment. In a broad sense, heritability is the ratio of the variance of genotypic values to the variance of phenotypic values. It is often estimated through twin or sibling studies, so that environmental effects are discernible. "For a locus where you detected an association, you can figure out how much of the heritability is due to that particular locus," Leal says, adding that "it's not always a gene for some of these common variants."
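
To make that ratio concrete, here is a minimal Python sketch. It assumes a standard textbook simplification that the article does not spell out: a purely additive, biallelic locus in Hardy-Weinberg equilibrium contributes a variance of 2p(1-p)β², where p is the minor allele frequency and β is the per-allele effect. The function names and all numbers are illustrative assumptions, not values from the studies discussed here.

```python
def broad_sense_heritability(var_genetic, var_phenotypic):
    """Broad-sense heritability: Var(G) / Var(P)."""
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_genetic / var_phenotypic


def locus_variance_explained(maf, beta, var_phenotypic):
    """Fraction of phenotypic variance due to one additive locus.

    Assumes Hardy-Weinberg equilibrium and a purely additive effect,
    so the locus contributes 2 * p * (1 - p) * beta**2.
    """
    if not 0 < maf <= 0.5:
        raise ValueError("minor allele frequency must be in (0, 0.5]")
    return 2 * maf * (1 - maf) * beta ** 2 / var_phenotypic


# Hypothetical standardized trait (phenotypic variance of 1.0).
h2 = broad_sense_heritability(var_genetic=0.8, var_phenotypic=1.0)
one_locus = locus_variance_explained(maf=0.3, beta=0.05, var_phenotypic=1.0)
print(f"heritability: {h2:.2f}")
print(f"one locus explains {one_locus:.5f} of the variance")
```

Under these assumed numbers a single common variant with a small effect accounts for roughly a tenth of a percent of the variance, which illustrates why many such loci can be found while most of the heritability remains unexplained.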


While many common variants have been found to contribute to common diseases, like age-related macular degeneration, most only partially explain the heritability of the disease — about 5 percent — or the variance seen in a population. For example, in type 2 diabetes, researchers have identified nearly 20 variants, but they only explain about 6 percent of the disease risk. There are seven schizophrenia-associated SNPs, but they only account for a small portion of the disease risk as well, even though schizophrenia is highly heritable.

And the common variant conundrum is not limited to diseases. In the classic height example, the common, associated loci only explain about 5 percent of the variance seen. Something is missing.

This so-called missing heritability might be found in a number of places. The phenotypic variation of a population could be due to rare variants, structural variants, gene-gene interactions, gene-environment interactions, epigenetics, as-yet undiscovered common variants, or even various combinations of these.

"I think we have to tackle this problem from different angles. We can't just put everything into one single approach. I think it would be a little foolhardy," Leal says. "We would have to tackle this problem using a multitude of approaches."

As the Queensland Institute of Medical Research's Naomi Wray points out, despite all the talk about common variants not contributing much to common disease, they still could contribute more than they're given credit for. For height, she says, about 50 percent of the variation has actually been detected from common SNPs, though many of those were not stringently associated. "What that estimate is saying is that there are many more common, associated variants out there, it's just that their effect sizes are so small that we are not able to pick them up with the stringent levels that we impose," she says. "As sample sizes increase, we'll be able to pick up more."


Other researchers say that rare variants — those with a minor allele frequency that is less than 1 percent — could be behind much of the missing heritability. Indeed, in a Cell essay, the University of Washington's Mary-Claire King and her colleague Jon McClellan write that "if common alleles influenced common diseases, many would have been found by now." They go on to suggest that the common disease, common variant hypothesis has been tested and found lacking. Rare variants may hold a clue to disease, they add.

"I think we have to wait to figure out how much of the heritability they are actually explaining," Leal says of rare variants. "But I do believe that they will probably play a substantial role." Much of Leal's research focuses on the role of rare variants in disease and she is developing methods to better uncover them. She has developed statistical approaches, using a combined multivariate and collapsing method or kernel-based adaptive clusters, to detect rare variants from sequencing data.

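The general idea behind collapsing can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Leal's published method: it pools the rare variants in a region into one carrier flag per person and compares carrier frequency between cases and controls with an odds ratio. The function names, the 1 percent threshold as a default, and the toy genotypes are all assumptions made for illustration.

```python
def collapse_carriers(genotypes, mafs, maf_threshold=0.01):
    """Return 1 for each person carrying at least one rare allele, else 0.

    genotypes: one list per person of minor-allele counts (0, 1, or 2),
               in the same variant order as mafs.
    mafs:      minor allele frequency of each variant.
    """
    rare = [j for j, freq in enumerate(mafs) if freq < maf_threshold]
    return [int(any(person[j] > 0 for j in rare)) for person in genotypes]


def carrier_odds_ratio(carriers, is_case):
    """Odds ratio of carrier status in cases versus controls.

    Adds 0.5 to every cell of the 2x2 table so that empty cells
    do not cause a division by zero.
    """
    a = b = c = d = 0
    for carrier, case in zip(carriers, is_case):
        if case and carrier:
            a += 1  # case, carrier
        elif case:
            b += 1  # case, non-carrier
        elif carrier:
            c += 1  # control, carrier
        else:
            d += 1  # control, non-carrier
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Toy data: three variants, of which the first and third are rare.
mafs = [0.004, 0.2, 0.008]
genotypes = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 2, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
is_case = [True, True, True, False, False, False]

carriers = collapse_carriers(genotypes, mafs)
print(carriers)
print(round(carrier_odds_ratio(carriers, is_case), 3))
```

A real analysis would use far larger samples, a formal significance test, and covariates; the point of the sketch is only that individually uninformative rare variants can be tested as a group.
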
Synthetic associations

At Duke University, Director of the Center for Human Genome Variation David Goldstein and his colleagues suspect that rare variants are even behind some of the disease associations that are currently chalked up to common variants. In a much-talked-about paper that appeared in PLoS Biology last year, the Goldstein team report multiple rare variants across a region of the genome that could account for GWAS signals that appear to stem from more common variants. These synthetic associations, then, could make up a fraction of the common variants already reported to be associated with disease. Goldstein has since said that not all GWAS signals are due to rare variants — just a portion might be — and researchers should no longer assume that a GWAS signal is due to a common variant, as is commonly thought.


However, it was not always interpreted that way. Queensland's Wray says that some in the field thought the paper meant that when common variants were identified, they were only reflecting rare variants. "Although that might be the case some of the time, it's not going to be the case most of the time," she says.

As part of a recent series of opinion articles that ran in PLoS Biology in January, Wray and her colleagues argued that the importance of synthetic associations has been overblown. "The whole debate was really about the perception of the relative importance of synthetic associations explaining results from GWAS to date," she says.

Wellcome Trust's Barrett also took part in that series — he and his colleagues say that while synthetic associations are possible, it's unlikely that they occur very frequently. Crohn's disease has a known synthetic association with NOD2; there are three rare variants which are not included on SNP arrays that confer a substantial disease risk. While synthetic associations are possible, Barrett says that there are "many different lines of evidence [to] suggest that there aren't too many examples of that."

The point of difference then, is on just how often synthetic associations may occur. "I believe it does occur sometimes, but I don't think it explains the majority of the associations that we're observing with common variants," Baylor's Leal says.

In his reply to the Wray et al. and Barrett et al. critiques, Goldstein suggested that the importance of synthetic associations will be determined empirically, and as for whether they are behind many, some, or few GWAS signals — and he leans toward "some, perhaps many" — he says that only "time will tell."



It may be that neither common nor rare variants account for the missing heritability. At the November 2010 American Society of Human Genetics meeting in Washington, DC, the Broad Institute's Eric Lander suggested that much of the missing heritability may be due to epistatic effects. Lander said that as more common variants are found, the percent of heritability explained seems to increase. He pointed to Crohn's disease and type 1 diabetes to illustrate, and said that population genetics suggests that rare variants will explain less of the heritability than common ones will. "I'm not sure it will fill that out," Lander said, referring to rare variants.

Heritability, he added, is estimated from twin or sibling studies; the additive variation seen in GWAS is much less than what those studies suggest the heritability should be. "There's a big hole in that argument: epistasis," Lander said. "If there is any genetic interaction, the population estimate of the 'additive variants' is not [necessarily] the additive variants. All it is is the estimate of the non-dominant terms." He went on to describe a model developed by one of his postdocs that looked at how a GWAS experiment would pick up a disease that could be caused by loci in three different pathways. While the postdoc found all the loci, they only appeared to explain 33 percent of the heritability. "The rest of it is due to epistasis," Lander said, adding that "the pair-wise power to detect epistasis [is] virtually nil."

Marylyn Ritchie at Vanderbilt University is also searching through GWAS data to look for hints of gene-gene and gene-environment interactions. "For many diseases, we're finding common variants explain some proportion of disease risk. But I think that for a lot of others, we're not seeing the common variants have a lot of explanation," she says. "It's likely that there's some heritability that's explained by rare variants, but I think that a larger part of it is going to be explained by gene-gene and gene-environment interactions."


Ritchie is developing filters for GWAS data based on what is known biologically about the disease being studied. "Our approach is filtering that search space by biology that we know, that different genes are related to each other," she says. The data for her filters come from publicly available sources like Reactome and the Gene Ontology database, and she has been applying them to GWAS data that are generated in collaboration with other groups. "We've applied this approach to four different sets and have found evidence of interactions that replicate, and so we think it's a valid approach. We'll see if it stands the test of time in genomics," Ritchie says.
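
A biology-driven filter of this kind might look like the following Python sketch, which keeps only SNP pairs whose genes share at least one annotated pathway. It is an illustration of the general idea rather than Ritchie's actual software: the function name, SNP identifiers, gene names, and pathway labels are all hypothetical, and a real pipeline would load annotations from a source such as Reactome or the Gene Ontology database.

```python
from itertools import combinations


def candidate_snp_pairs(snp_to_gene, gene_to_pathways):
    """List SNP pairs in different genes that share an annotated pathway.

    snp_to_gene:      maps each SNP identifier to its gene.
    gene_to_pathways: maps each gene to a set of pathway labels.
    Returns tuples of (snp_a, snp_b, sorted shared pathways).
    """
    pairs = []
    for snp_a, snp_b in combinations(sorted(snp_to_gene), 2):
        gene_a, gene_b = snp_to_gene[snp_a], snp_to_gene[snp_b]
        if gene_a == gene_b:
            continue  # only gene-gene pairs are of interest here
        shared = (gene_to_pathways.get(gene_a, set())
                  & gene_to_pathways.get(gene_b, set()))
        if shared:
            pairs.append((snp_a, snp_b, sorted(shared)))
    return pairs


# Hypothetical annotations.
snp_to_gene = {
    "snp1": "GENE_A",
    "snp2": "GENE_B",
    "snp3": "GENE_C",
    "snp4": "GENE_A",
}
gene_to_pathways = {
    "GENE_A": {"pathway_x"},
    "GENE_B": {"pathway_x", "pathway_y"},
    "GENE_C": {"pathway_z"},
}

for pair in candidate_snp_pairs(snp_to_gene, gene_to_pathways):
    print(pair)
```

In this toy case the filter keeps two of the six possible pairs. On a chip with hundreds of thousands of SNPs, the same kind of restriction reduces the number of interaction tests, and with it the multiple-testing burden, at the cost of missing interactions between genes with no known relationship.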

She adds that a similar filtering approach could be used to search for gene-environment interactions. "For dietary environmental factors, you might look at different vitamin metabolism pathway genes or transport genes. Or for things related to different toxins, you might look at cytochrome P450-related pathways," she says. "I think you can really guide your search through the genome based on the knowledge we have about the environmental factors as well. The limitation there is that we don't know everything in biology yet."

The future

Genome-wide association studies are coming to a crossroads: should researchers soldier on with them, perhaps recruiting more and more participants to boost sample sizes, or should they turn to whole-genome sequencing?

Queensland's Wray says the best thing to do would be to sequence lots of people. "I think the gold standard would be to sequence very large samples [to get] the best of both worlds, but we are not there yet," she says. "And so when you've got the choice, I would fall in the camp of 'We'd better spend our money on collecting larger resources and do genome-wide association studies on them.'" When sequencing common, complex diseases, she adds, researchers may find it difficult to "sort the wheat from the chaff" at present.


Another way to harness the power of larger sample sizes without recruiting more and more people is to take advantage of meta-analyses. "Putting different GWAS of the same disease together so you get a combined sample size in the tens of thousands rather than [just] the thousands has clearly demonstrated the pattern that lots and lots of samples are the key to having good statistical power to detect the small effects," Barrett says. He adds that it can take a lot of work to sort through the different controls or technologies that researchers use, but that those researchers have gotten good at avoiding the majority of meta-analysis pitfalls. Indeed, he says, they've been quite successful. "The initial GWAS of whatever disease identified one or two genes and the first meta-analysis identified 10 or 12 in many cases," he says.
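
One common way to pool studies is an inverse-variance weighted, fixed-effect meta-analysis, sketched below in Python. The article does not say which method the consortia Barrett describes actually used, so treat this as an assumed, generic example; the function name and the per-study effect estimates are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Combine per-study effect estimates for one SNP.

    Each study is weighted by 1 / SE**2, so more precise studies
    count for more. Returns (pooled beta, pooled SE, z-score).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Three hypothetical studies of the same SNP, each modest on its own.
betas = [0.10, 0.12, 0.08]
standard_errors = [0.05, 0.05, 0.05]

beta, se, z = fixed_effect_meta(betas, standard_errors)
print(f"pooled beta={beta:.3f}, SE={se:.4f}, z={z:.2f}")
```

With these assumed inputs, no single study has a z-score above 2.4, while the pooled estimate has a smaller standard error and a z-score of about 3.5. That is the mechanism behind the gain in power from combining studies, although a real meta-analysis must also check for heterogeneity between cohorts.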

Whole-genome sequencing, however, probably isn't too far off. "If you want to ask what's the best possible experiment to do to understand any genetic question, you want complete genome sequences for thousands of people, really," Barrett says. "Sequencing is getting cheaper and faster and faster, and I think in the not too distant future we will be moving towards analyses that aren't miles away from being like GWAS, but are based on the complete sequence rather than a SNP chip."

Barrett says his work is just beginning in Crohn's disease, and other groups are starting similar projects with type 2 diabetes and autism. He also says that it will take time to sequence large numbers of samples to get to that gold standard. "I think we'll see the beginnings of those results soon, but much like GWAS, the real power will come when you have thousands of samples and that'll probably take a couple of years to ramp up," he says.

However, sequencing likely isn't the end-all and be-all. "I think sequencing will get us part of the way," Vanderbilt's Ritchie adds. "I think for the remainder, we get back to the lab and do some functional characterization of the variants to understand potential mechanisms or which variants actually cause a phenotype or demonstrate a phenotype."
