NEW YORK (GenomeWeb) – Using a recently established public database of short structural variants, researchers led by neuroscientist Allen Roses have discovered a simple sequence repeat that may cause Lou Gehrig's disease, obtaining their result far more quickly than previously possible.
Although the specific location of the variant, associated with a high risk of amyotrophic lateral sclerosis (ALS), is under embargo until the American Society of Human Genetics' annual meeting in Vancouver next month, principal investigator Roses discussed with GenomeWeb the methodology his team used to pinpoint the marker, which is part of his research to elucidate the role of short structural variants (less than 50 base pairs in size) in neurodegenerative diseases.
Because short structural variants are understudied in human genetics, Roses, a neurobiology professor at Duke University School of Medicine, along with colleagues from the university and from Alameda, California-based Polymorphic DNA Technologies, has created a publicly available database of such variants. They also developed a bioinformatics tool that researchers can use to prioritize, for further study, markers that may cause or increase the risk for complex conditions.
Using this database, Roses, Duke University bioinformatician Michael Lutz and others identified a simple polyT sequence repeat that occurs in five different lengths (T14, T15, T16, T17, and T18) near a well-known ALS-linked gene. They looked for the structural variant by genotyping a cohort of 191 ALS patients and 526 controls.
Homozygous T18/T18 genotypes were observed in only three ALS patients and no controls. The T17/T18 genotype represented about a third of the cases, but was observed in less than 4 percent of controls, providing a high degree of statistical significance. The surprising finding, according to Roses, was that three of the ALS cases had previously been diagnosed as having mutations in other ALS-associated genes, TARDBP and C9ORF72.
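To give a sense of how such case-control counts translate into a significance estimate, here is a minimal sketch of a one-sided Fisher's exact test applied to the only genotype for which exact counts are reported above: T18/T18 in 3 of 191 ALS patients and 0 of 526 controls. The article does not say which statistical test the team used, so the choice of test and the function name `fisher_one_sided` are assumptions made for illustration only.

```python
from math import comb


def fisher_one_sided(case_hits, case_total, control_hits, control_total):
    """One-sided Fisher's exact p-value for enrichment of carriers among cases.

    Returns the hypergeometric probability of seeing at least `case_hits`
    carriers among the cases when the total number of carriers is fixed.
    """
    total = case_total + control_total
    carriers = case_hits + control_hits
    upper = min(carriers, case_total)
    favourable = sum(
        comb(case_total, k) * comb(control_total, carriers - k)
        for k in range(case_hits, upper + 1)
    )
    return favourable / comb(total, carriers)


# Counts reported in the article for the homozygous T18/T18 genotype.
p_t18_t18 = fisher_one_sided(case_hits=3, case_total=191,
                             control_hits=0, control_total=526)
print(f"one-sided p = {p_t18_t18:.4f}")
```

With only three carriers the p-value comes out at roughly 0.02. That result is modest on its own, and it is separate from the T17/T18 comparison that the article describes as highly significant, for which exact counts are not given.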
Researchers replicated the initial findings on the structural variant in a second cohort, which included larger proportions of ALS patients diagnosed with mutations in a number of genes associated with familial ALS, such as SOD1, FUS, TARDBP and C9ORF72, as well as patients diagnosed as having sporadic ALS. The researchers demonstrated that the structural variant identified 28 of 29 ALS patients with a specific SOD1-A4V mutation.
Data from a large genetic pedigree cohort built by neurologist Teepu Siddique at Northwestern University has shown that ALS patients with the SOD1-A4V variant have an aggressive disease course. "In most cases of rapid death from ALS, it usually starts with early bulbar-related symptoms, which affect things like swallowing and breathing," Roses told GenomeWeb. "SOD1-A4V patients start with foot drop, yet rapidly progress, [with symptoms] climbing up the body." While the 28 ALS patients with the structural variant had the same rapid clinical course seen in patients with SOD1-A4V, it wasn't associated with other singular SOD1 mutations thought to be involved in causing the disease, he explained.
Moving beyond GWAS
There are more than 7 million structural variants in the human genome, which can show up as deletions, insertions, simple sequence repeats, copy number variations, block substitutions, and inversions. Among short structural variants, Roses is particularly concerned with the role of more than 2 million simple sequence repeats or short tandem repeats that tend to have numerous alleles at a single locus and vary by a single nucleotide.
These short structural variants (SSVs) have historically been understudied because next-generation sequencing has limited ability to accurately detect the length of repeat sequences. Other approaches, such as phased sequencing or cloning techniques with Sanger sequencing, are expensive at the whole-genome level. Moreover, the genetics field has been primarily focused on locating the genetic regions associated with diseases using standard SNP platforms and via genome-wide association studies (GWAS).
"This widely used whole genome screening method makes use of SNPs that, in many examples, are described as identifying a gene but may be many kilo bases from the gene with other genes between them," Roses said.
In an Alzheimer's & Dementia paper published in June, Roses, Lutz, and others noted how GWAS in Alzheimer's have routinely excluded the APOE region, which is strongly associated with risk for the disease and other complex conditions, and in doing so, these studies have failed to understand the role of another gene, TOMM40, which is very close to and in linkage disequilibrium with APOE.
"New genes associated with [Alzheimer's] are proposed frequently based on SNPs associated with odds ratio [less than] 1.2," the authors wrote. "Most of these SNPs are not located within the associated gene exons or introns but are located variable distances away. Often pathologic hypotheses for these genes are presented, with little or no experimental support."
Although Roses has performed GWAS on several diseases while beta-testing new commercial platforms, he is not a fan of the approach in his current work and has criticized the genetics field's continued preoccupation with SNP data, particularly if the aim is to make medicine more precise.
"SNPs only gets you in the ballpark if you know where the SNPs are," he said. The experiments that he has done to identify short structural variants in late-onset Alzheimer's, Lewy body Alzheimer's, Parkinson's disease, and now ALS have yielded "accurate genetic information, with multiple variant markers at specific genetic loci," he asserted.
For example, his team developed an algorithm that combines multiple polyT length alleles at the rs10524523 locus within intron 6 of TOMM40 with APOE genotypes and a person's age to estimate the risk that cognitively normal people will develop mild cognitive impairment due to Alzheimer's between the ages of 65 and 83 years.
In Alzheimer's & Dementia last year, Roses and colleagues also described how they cloned and sequenced an intronic region of the SNCA gene that was prone to repeats and structural variants, and identified a haplotype that "acts as an enhancer element" and is associated with increased risk of Lewy body Alzheimer's.
Earlier, in 2009, Ornit Chiba-Falek from Duke University, a member of Roses' team, had published data on a microsatellite repeat, Rep1, which she concluded regulates transcription of the SNCA gene in the human brain, elevates levels of SNCA mRNA, and increases the risk of Parkinson's.
Roses has now turned his attention to ALS, a disease in which motor neuron degeneration and atrophy cause patients to progressively lose all ability to control their muscles. Around 90 percent of ALS cases are currently thought to occur sporadically, while 10 percent are familial and caused by mutations in C9ORF72, SOD1, TARDBP, and FUS.
Researchers have previously identified many mutations in these genes and concluded that they cause familial ALS and may contribute to sporadic forms of the disease. In fact, in 1991, Roses' lab first published the association of a SNP location on chromosome 21 with familial ALS, which led to the identification of SOD1 as the possible disease risk locus.
Since then, around 160 SOD1 SNPs have been identified and "christened" as genetic causes of the disease, Roses said, but he noted that this has been done without any clear Mendelian clinical genetic analyses. He has always suspected many of these markers were really nearby SNP associations and not the primary mechanism underlying ALS that could be the focus of targeted drug discovery programs.
His latest ALS experiment seems to support this theory. "An additional surprising finding was that the genotypes containing the structural variant, previously uncharacterized, also defined clinical ALS that was [thought to be due to] inherited SNP mutations in three other 'ALS gene regions' located on three separate chromosomes," Roses said over email.
A gift to advance the field
Roses' view isn't without support. Last year, researchers led by Peter Sudmant from the University of Washington reported on an integrated structural variation map of more than 2,500 genomes, and wrote in Nature that "structural variants are enriched on haplotypes identified by genome-wide association studies."
Roses' approach in the ALS project and other studies is part of a broader hypothesis that slight changes in the expression of proteins in the brain can cause complex neurological illnesses. Studies have increasingly shown that short structural variants, specifically short tandem repeats, control and contribute to the variation in gene expression.
In earlier efforts in Alzheimer's, his team used phylogenetic analysis and phased sequencing data to home in on the structural variants of interest. That more labor-intensive and expensive research inspired Roses to do something about the dearth of information on short structural variants in public databases. Most public repositories, such as dbVar and ENSEMBL, provide extensive information on structural variants larger than 50 base pairs, while smaller variants are generally submitted to dbSNP.
To build a database that is specific for short structural variants, Roses and colleagues from Zinfandel Pharmaceuticals, Duke, and Polymorphic DNA Technologies mined dbSNP for those variants and scanned the Human Reference Sequence for all simple sequence repeats between 1 and 50 base pairs in size. They collated some 4 million short structural variants and 2 million simple sequence repeats into a public database that has been available since a June publication in Human Mutation (see here; username: user1, password: rna389).
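As a rough illustration of what scanning a reference sequence for simple repeats can involve, the sketch below finds single-base runs such as polyT and polyA tracts in a DNA string. It is not the team's pipeline, which the article does not describe. The function name `find_homopolymer_runs`, the 10-base minimum length, and the toy sequence are all assumptions, and the sketch covers only homopolymer runs rather than every class of simple sequence repeat.

```python
import re


def find_homopolymer_runs(sequence, bases="AT", min_len=10, max_len=50):
    """Return (start, end, base, length) for each single-base run of interest.

    Coordinates are 0-based and half-open. Only runs of the bases listed in
    `bases` whose length falls within [min_len, max_len] are reported.
    """
    seq = sequence.upper()
    runs = []
    # ([ACGT])\1* matches a maximal stretch of one repeated nucleotide.
    for match in re.finditer(r"([ACGT])\1*", seq):
        base = match.group(1)
        length = match.end() - match.start()
        if base in bases and min_len <= length <= max_len:
            runs.append((match.start(), match.end(), base, length))
    return runs


# Toy sequence (hypothetical): a T16 tract and an A12 tract with flanking bases.
toy = "GCGC" + "T" * 16 + "GAGC" + "A" * 12 + "CCGT"
for start, end, base, length in find_homopolymer_runs(toy):
    print(f"poly{base} run of {length} at [{start}, {end})")
```

A genome-scale version would stream each chromosome from a FASTA file rather than hold a toy string in memory, and would need additional patterns for di-, tri-, and longer repeat units.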
The database includes information on these variants, such as their location, transcription factors, and microRNA binding sites. Users can input data on regions and signals gleaned from GWAS, and a scoring algorithm will identify and prioritize potentially causal variants. Users can then take that report and conduct genotyping and other studies to more definitely explore the function of the identified short structural variants.
With this database, "we do not need to do the phylogenetic mapping first, like we had to do to discover TOMM40'523," Roses said. That mapping cost around $500,000 and it took several months to develop accurate assays for the different polyT lengths using Sanger sequencing. Comparatively, for the ALS study, one researcher was able to query the database for polyT and polyA repeats in genomic regions of interest, and produce an initial list of short structural variation sites for each gene in an hour.
"If you want to find out if there are structural variants near your gene of interest, it takes you 10 minutes," he said. "We've taken the hard work out of it."
Roses' group also used this evaluation system when studying the genetic underpinnings of Lewy body pathology in Alzheimer's, and the scoring algorithm gave high scores to a region of the SNCA gene. Based on this, researchers cloned and sequenced the gene region using samples from patients with and without Alzheimer's, and identified a haplotype that appeared to increase SNCA mRNA levels in human brain tissue and contribute to the formation of Lewy body pathology. Roses' team is now using genome editing and induced pluripotent stem cells to validate the functional effect of this short structural variant.
Researchers have also retrospectively tested whether their evaluation system could accurately parse out promising structural variants for further study and eliminate ones that are unlikely to be functionally relevant. The algorithm gave high scores to both the Rep1 microsatellite repeat they previously identified as increasing risk for Parkinson's and the TOMM40'523 polyT they found to be associated with risk of late-onset Alzheimer's and age of onset. Meanwhile, the algorithm gave low scores to several short structural variants in SNCA that Roses' team had previously deemed to have no functional effect on the risk of Lewy body pathology in Alzheimer's.
Roses is hoping that this database and evaluation system will help move genetic research on common diseases beyond GWAS, which, with a few exceptions, has identified SNPs that contribute minimally to complex illnesses and have not resulted in treatments. Based on his research to date, he is betting that structural variants can explain key disease mechanisms, inform the development of new drugs with blockbuster potential, and explain disease risks for a larger swath of the population.
That's what he's trying to demonstrate with TOMM40'523. Zinfandel Pharmaceuticals and Takeda Pharmaceuticals are validating an algorithm that factors in TOMM40'523 polyT lengths, a person's age, and APOE genotype to identify individuals at the highest risk of losing memory and thinking skills due to Alzheimer's before age 80. In the same trial, researchers are assessing whether a very low dose of a drug called pioglitazone can delay the onset of memory and thinking impairments in high-risk patients. If this study is successful, it will result in a repurposed drug (pioglitazone is the active ingredient in the diabetes drug Actos) and a diagnostic that can assess the age of onset for Alzheimer's-related cognitive decline for the majority of the population.
This work, and Roses' research throughout his career, has been challenged by others in the field. His discovery of the APOE4 allele as a risk factor for Alzheimer's was controversial more than 20 years ago, as are his more recent efforts on TOMM40'523, despite multiple groups having replicated the contribution of the variant to Alzheimer's age of onset and specific cognitive impairments.
Despite his detractors, at 73, Roses continues to make new genetic discoveries, publish papers, and launch wine-themed start-ups. The short structural variant database, Roses said, was a gift to the medical field in the hopes that others can continue to explore the scientific ideas and questions he has worked on throughout his life.
"I don't buy green bananas anymore," Roses quipped. "In the remaining one to 10 years that I might live, when people still don't believe this and still don't believe that, hopefully, there will be lots of people generating similar data for other diseases that I couldn't even begin to collect."