Repeat Expansion Disease Detection From Short Reads Shows Potential for Clinical Use


SAN FRANCISCO (GenomeWeb) – Repeat expansions cause more than 30 inherited disorders, yet diagnosing them correctly can be challenging. Overlapping phenotypes make diagnosis by the gold standard methods — Southern blot or PCR — time consuming, while whole-genome and exome sequencing with short reads often miss the expansions.

Now, researchers from the Walter and Eliza Hall Institute of Medical Research in Melbourne have developed an algorithm, exSTRa (expanded short tandem repeat algorithm), that can detect repeat expansions from either whole-genome or exome sequencing data generated by Illumina or other short-read sequencing technologies.

In a study published in the American Journal of Human Genetics this month, the team described the algorithm, demonstrated its performance on four cohorts representing 11 known repeat expansion disorders, and compared it against three other algorithms for detecting repeat expansions from whole-genome sequencing data.

Melanie Bahlo, senior author of the study and a professor at the Walter and Eliza Hall Institute of Medical Research, leads a lab that focuses on understanding the genetic causes of neurological disorders. She said that her team plans to use the method in the Epi25 study, an international project that aims to sequence the exomes of up to 25,000 individuals with epilepsy, as well as in other cohorts of patients with neurological disease. In addition, she is interested in working with researchers on the clinical side to develop it into a diagnostic test. The algorithm is available as an open source tool.

"We are very keen to see it used in the diagnostic space and we feel that it is already mature enough to make use of it — of course coupled with the gold standard validation," she said.

Bahlo noted that while other algorithms have been developed to detect repeat expansions, what makes exSTRa unique is that it is amenable to both exome and whole-genome sequencing data. Also, unlike other algorithms, it does not require a large set of normal controls.

In addition, Rick Tankard, lead author of the study and now a postdoctoral researcher at Murdoch University, said that the design of the algorithm is fundamentally different from others in that it is based on analyzing the sequence of the reads themselves, as opposed to calling repeats based on alignment.

The algorithm "looks at each sequence read and assesses its repeat content: how much repetitive sequence is in each read," Tankard said. Large numbers of controls are not needed, he added, because the method relies on an outlier detection test. The vast majority of individuals will be normal at any given repeat expansion locus, and even within a cohort of individuals with genetic disorders it is unlikely that everyone carries the same repeat expansion, so an outlier detection test can identify those who do have one. exSTRa is also designed to look at a set of known repeat expansions (the researchers analyzed 21 of these in the study), but Tankard said that it could easily be tweaked to broaden its focus.
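The idea Tankard describes can be illustrated with a minimal Python sketch. It is not exSTRa's actual scoring or statistical test, which the article does not detail. The function names, the per-sample median summary, the robust z-score, and the 3.5 cutoff are all assumptions chosen for illustration.

```python
from statistics import median

def repeat_content(read: str, motif: str) -> float:
    """Fraction of bases in a read covered by exact copies of the motif.

    Any rotation of the motif counts (CAG, AGC, GCA), so the score does
    not depend on where in the repeat unit the read happens to start.
    """
    k = len(motif)
    rotations = {motif[i:] + motif[:i] for i in range(k)}
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in rotations:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(read) if read else 0.0

def sample_score(reads, motif):
    """Summarize one sample at one locus (hypothetical choice: the median)."""
    return median(repeat_content(r, motif) for r in reads)

def flag_outliers(scores, threshold=3.5):
    """Flag samples whose score is unusually high relative to the cohort.

    Uses a robust z-score built from the median and the median absolute
    deviation (MAD), so the cohort serves as its own reference and no
    separate control set is needed. The threshold is an assumed value.
    """
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most samples are identical; anything above the median stands out.
        return [name for name, v in scores.items() if v > med]
    return [name for name, v in scores.items()
            if 0.6745 * (v - med) / mad > threshold]

# Toy cohort: five typical samples and one with far more repeat sequence.
cohort = {"s1": 0.10, "s2": 0.12, "s3": 0.11,
          "s4": 0.09, "s5": 0.10, "s6": 0.85}
print(flag_outliers(cohort))
```

Because the cohort itself defines what "normal" looks like, this kind of test works only when expanded samples are a small minority at each locus, which is the assumption Tankard states.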

For their study, the researchers first assessed the method in a simulation study and then tested it in four cohorts of already-diagnosed individuals with 11 different known repeat expansion disorders. They also compared exSTRa with three other methods: the Illumina-developed ExpansionHunter, STRetch, and Tredparse.

In total, the researchers analyzed more than 200 individuals, all of whom had undergone some type of sequencing — either exome sequencing with Agilent capture, whole-genome sequencing with the TruSeq Nano protocol that includes a PCR step, or PCR-free WGS.

The researchers found that the performance of exSTRa and the three other algorithms varied depending on the sequencing method. For instance, unsurprisingly, exSTRa performed best on exome sequencing data, since the other algorithms had been designed to work on whole-genome data. Nonetheless, exSTRa still missed some repeat expansions because the relevant loci were not covered by the exome capture.

Overall, exSTRa had a sensitivity between 67 percent and 100 percent, depending on the sequencing method, with a specificity of over 97 percent in all cohorts. The authors noted, however, that in two of the cohorts, there were only three and four affected cases, respectively, which resulted in the wide range of sensitivity and also affected the other methods.
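The effect of small case counts on the sensitivity range can be shown with a short calculation. This Python sketch assumes only the standard definition of sensitivity (detected cases divided by affected cases). The function name is hypothetical, and the article does not say which cohort produced the 67 percent figure.

```python
def achievable_sensitivities(affected: int) -> list[int]:
    """Every sensitivity (in whole percent) possible with this many affected cases."""
    return [round(100 * detected / affected) for detected in range(affected + 1)]

for affected in (3, 4):
    print(affected, "affected cases:", achievable_sensitivities(affected))
```

With three affected cases, a single missed expansion drops sensitivity from 100 percent to 67 percent, so a wide range across cohorts is expected even when the methods perform similarly.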

In total, exSTRa called 79 out of 101 known expansions, ExpansionHunter called 75, STRetch 77, and Tredparse 71. exSTRa was able to call at least one case of all 11 known repeat expansions. All methods performed poorly in calling the FMR1 repeat expansion, which causes Fragile X syndrome, and STRetch was unable to call any of the SCA6 expansions, which cause spinocerebellar ataxia. All four methods accurately called all 13 of the Huntington's disease repeat expansions, two cases of another ataxia-associated expansion, and one case of a spinal muscular atrophy.

Going forward, Bahlo and Tankard plan to use the exSTRa method in different ways. While Bahlo's group will continue studying patient cohorts with neurological disorders, Tankard plans to use it to look for repeat expansions in longevity studies of healthy elderly populations.

Bahlo anticipates that in the future, exSTRa could be incorporated into diagnostic exome or whole-genome tests to boost their diagnostic rates. The benefit of exSTRa over some of the other algorithms is that it can work in conjunction with exome sequencing tests, which are currently more common than whole-genome sequencing for rare disease diagnosis, she said.

In the meantime, other researchers are looking to move to long-read sequencing for diagnosing disorders caused by repeat expansions, using platforms from Pacific Biosciences and Oxford Nanopore Technologies, where reads are long enough to span an expansion.

For instance, researchers at the Parkinson's Institute and Clinical Center in Sunnyvale, California have been working on a CRISPR/Cas9-based capture enrichment technique in combination with sequencing on Pacific Biosciences' Sequel system to identify pathogenic repeat expansions that cause Parkinson's disease.

Bahlo said that both the short-read technique her group developed and other techniques that make use of long-read sequencing could play important roles in diagnostics and in discovering novel repeat expansions.

Currently, "the cost to do a whole genome [on PacBio and Oxford Nanopore Technologies] is too high to be used in routine clinical diagnostic settings or even in large cohort studies," she said. As such, she predicted that the first iteration of tests using long reads to detect repeat expansions would be targeted panels. Long-read sequencing could also serve as a faster and more economical method for validating repeat expansions identified from short-read exomes or genomes, she said.

At the moment, such repeat expansions must be validated by either Southern blot or PCR, the gold standard methods, but such validation can be very time consuming since it typically requires sending samples out for testing. Also, as new repeat expansions are being discovered, labs may not have the assay on hand, so getting validation can take a very long time.

In addition, Bahlo noted, even when clinical tests based on targeted long-read sequencing are developed, there will still be a role for algorithms like exSTRa in analyzing exome and whole-genome data that has already been generated. Such algorithms could be used to analyze cases where a diagnosis was not found or where a misdiagnosis is suspected, she said. Databases like dbGaP, for example, are "full of whole genomes and exomes that won't be repeated with long-read sequencing, and analysis with our method and others will still be highly relevant," Bahlo said.