NEW YORK (GenomeWeb News) – A Wellcome Trust Sanger Institute and Yale University-led team has sifted through data from three 1000 Genomes Project pilot efforts to find a set of authentic loss-of-function variants in the human genome.
The work offers a peek at the types of genes typically affected by these changes and clues for distinguishing disease-related variants from those that are more benign.
As the researchers reported online today in Science, not quite half of the nearly 3,000 loss-of-function variants initially considered remained following steps to remove false-positive variants associated with sequence and annotation errors. Based on this preliminary loss-of-function variant catalog, the team estimates that each person, on average, carries roughly 100 loss-of-function variants and has around 20 genes that have been rendered inactive by loss-of-function mutations that affect both copies of the gene.
"Each of us can be walking around with at least 20 genes basically inactivated," the study's first author, Daniel MacArthur, told GenomeWeb Daily News. "So that's pretty impressive." MacArthur, who was formerly based at the Wellcome Trust Sanger Institute, will begin heading a genetics group at Massachusetts General Hospital next month.
Given the role that loss-of-function changes to protein-coding genes play in diseases such as cystic fibrosis and muscular dystrophy, MacArthur explained, most geneticists stumbling upon this sort of mutation in an individual's genome would suspect it of causing disease or having some other deleterious effect.
But as more and more apparently healthy individuals have their genomes and exomes sequenced, he added, investigators have unearthed a raft of apparent loss-of-function variants that are both intriguing and puzzling.
To figure out how many of the suspected loss-of-function variants are real, and to begin understanding their biological roles and repercussions, MacArthur and his colleagues analyzed genome and exome data generated for 185 individuals from Europe, West Africa, and East Asia during the three pilot phases of the 1000 Genomes Project.
To sift out false-positive variants caused by sequencing or annotation errors, the group started by doing extensive experimental genotyping on three different custom Illumina arrays. They also did assays based on custom Sequenom MassArrays and gleaned additional information from the genome of an anonymous European woman sequenced not only for the 1000 Genomes Project but also by several other groups.
In their efforts to filter out false-positive loss-of-function variants, the researchers also helped improve the Gencode annotation that they were using as a reference, MacArthur noted, since they submitted corrections to Gencode whenever they discovered incorrect annotations or gene models.
"The Gencode annotation set that we used for this project has now been improved substantially as a result of the errors we found in this project as well as the ongoing manual annotation work," he said. "As the next round of clinical sequencing studies get done, those errors aren't going to pop up again."
Of the 2,951 possible loss-of-function variants assessed, 1,285 remained following the false-positive filtering steps.
When they looked more closely at 253 genes containing verified loss-of-function variants, researchers found many genes with low conservation between species or with roles in processes such as olfactory reception.
As a whole, the loss-of-function variant-affected set contained genes from across the genome with a wide range of functions. Perhaps not surprisingly, though, the team did see a dearth of genes involved in crucial processes such as anatomical development or transcription.
Overall, each verified loss-of-function variant was very rare, suggesting many are mildly or severely deleterious and weeded out of the genome via natural selection, MacArthur explained. Consistent with that notion, the team saw more than two dozen loss-of-function variants in known disease genes and another 21 variants suspected of contributing to disease, suggesting many study participants were heterozygous carriers of recessive disease-related mutations.
By comparing the features of variants that are known to contribute, or suspected of contributing, to disease risk with loss-of-function variants that are more benign, researchers have started finding clues for differentiating between recessive disease genes and non-essential genes containing loss-of-function changes.
For instance, they reported that disease-associated genes tend to be much more evolutionarily conserved, to exhibit more extensive protein-protein interactions, and to be less functionally redundant than non-essential genes affected by loss-of-function mutations.
A model that represented such differences "would be useful in a case where you had sequenced a rare disease patient and found maybe half a dozen mutations that could account for that person's disease," MacArthur explained.
"What you need in those cases are ways to prioritize the affected genes that are most likely to be disease-causing," he said, noting that researchers in co-author Matt Hurles' Sanger group are already developing an algorithm to predict whether genes are likely to be involved in recessive disease risk based on loss-of-function data.
Getting a handle on the loss-of-function variants that are present in human populations may also help in uncovering rare gene-inactivating mutations that offer some protective benefit against disease in some contexts, MacArthur noted.
"Those are really attractive drug targets for pharmaceutical companies," he explained, "because you already have sort of biological proof of principle [in affected individuals] that knocking out those genes is both safe and also effective."
Beyond some of the potential clinical applications of the research, those involved in the loss-of-function annotation effort are also keen to get a handle on the functional roles of these variants. To that end, researchers plan to build up their catalog of loss-of-function variants found in human populations using sequencing data.
From there, MacArthur said, they hope to put together a custom genotyping array that covers as many of the loss-of-function variants as possible so that they can test large numbers of individuals for whom disease or phenotypic trait information is available.
The researchers have already started to assess loss-of-function variant patterns in phase I data from the main 1000 Genomes Project. They are also gearing up to look at a collection of around 30,000 human exomes sequenced in the US, representing roughly 15,000 individuals enrolled through disease studies and about as many unaffected controls.
"With those numbers of samples we should be able to get a really good catalog, digging down into the 0.1 percent range of the frequency spectrum," MacArthur said.