BALTIMORE – Researchers from the Genomic Answers for Kids (GA4K) project at Children's Mercy Research Institute in Kansas City, Missouri, have demonstrated the clinical utility of long-read sequencing and machine learning for improving diagnoses for rare pediatric diseases.
In a study published last month in Genetics in Medicine, GA4K investigators described their early efforts to analyze the genomes of over 1,000 pediatric patients with suspected rare genetic disorders and their families. The analysis used combinations of short-read exome sequencing, short-read genome sequencing, and PacBio HiFi genome sequencing and resulted in a diagnostic rate of 11 percent for patients with prior negative genetic testing and almost 35 percent for patients with no previous genetic testing.
As part of the initiative, the team also built an open-access database containing rare variants, de-identified pedigrees, and coded phenotypes to propel further research into the toughest rare diseases, which currently remain unsolved by clinical sequencing.
“What we want to do is systematically go through the blind spots of clinical sequencing,” said Tomi Pastinen, director of the Children's Mercy Genomic Medicine Center and the lead investigator for GA4K, alluding to the fact that over half of rare disease patients still struggle to find a proper diagnosis even after comprehensive genetic analysis by the current standard.
According to Pastinen, the GA4K project is a “foundational effort” for Children’s Mercy Research Institute, which opened last year. While the recent paper highlighted the project’s first-year achievement, GA4K is already in its third year and has sequenced the genomes of more than 3,000 patients to date.
There are multiple aspects to the GA4K project, Pastinen said. For one, it aims to tackle the “community-wide dilemma” of what to do once clinical exome sequencing comes back negative. Initially, GA4K extended the genetic testing workflow to short-read whole-genome sequencing, which Pastinen said is still relatively rare in the clinical space, for those who received negative exome-sequencing results. It then extended the effort to long-read whole-genome sequencing for patients whose cases remained unsolved after short-read whole-genome sequencing.
For the published study, the researchers analyzed 1,080 patients with conditions ranging from congenital anomalies to neurological and neurobehavioral clinical presentations later in childhood. Most of the patients had not received any genetic diagnosis. According to Pastinen, all samples analyzed for the paper had gone through Illumina exome sequencing, and almost all also underwent whole-genome sequencing using either Illumina or MGI sequencing technology. About 550 of the samples were further investigated using Pacific Biosciences HiFi long-read sequencing.
Another aspect of the project, Pastinen said, is to build a genetic database for undiagnosed diseases, easily accessible to other researchers, to facilitate future studies, especially those using long-read sequencing. “A big part of our approach is sharing the data with the rest of the community in order to crowdsource negative [genetic test results] genomes,” he said.
Researchers currently face challenges in sharing genomic data, he noted. For instance, the GeneMatcher service represents an accessible tool for investigators to match unpublished variants with associated phenotypes, he said, but the process can lead to false-positive variants and overmatching, given the narrow scope of data exchange during the matching process. On the other end of the spectrum, Pastinen said, though depositing the entire sequencing and phenotype data to NIH’s database of Genotypes and Phenotypes (dbGaP) offers researchers granular genomic data, it requires “significant bioinformatics resources,” given the hundreds of terabytes of data.
To address these challenges, he said, part of GA4K’s mission is to construct a “low barrier,” completely open-access database that stores the most likely pathogenic variants with de-identified phenotypes and pedigrees, without the burden of uploading whole-genome sequences. He said the database is easier to browse than conventional gene-matching services and offers a larger scale of evidence to avoid false-positive matching while still conserving computational resources.
To achieve some consistency in variant prioritization, the team explored using open-source machine learning algorithms to help interpret the variants. “What we wanted to generate is a uniform system,” said Pastinen, adding that he hopes the machine learning approaches will remove at least some part of inconsistency from manual variant analysis, which he said “can be very subjective.”
Additionally, Pastinen said machine learning algorithms can help reduce analysis costs and turnaround time. One of the biggest costs for sequencing-based genetic diagnosis is the analyst’s overhead, he said, and while machine learning will not completely replace manual interpretation, it will likely be able to automate a significant percentage of genomic analysis, freeing trained geneticists to “focus their effort on the final sign-out of the diagnosis, rather than looking for the needle in the haystack.”
With more than 500 samples sequenced with PacBio HiFi long reads for this paper, and double that number for the entire GA4K project so far, Pastinen said another important aspect of the data sharing is to help establish a framework to advance long-read sequencing in the clinical space. He said one “initial barrier” for the team to apply PacBio long-read sequencing for rare disease testing is the lack of a long-read reference database. “If you don't have any reference data, you can’t make the call whether this is a normal event, or whether it is a potential disease-causing rare event,” he said.
The paper also demonstrated the strength of using long-read sequencing to improve rare disease diagnoses. Specifically, it showed that PacBio HiFi long-read sequencing increased the discovery rate of rare coding structural variants by more than fourfold compared with short-read sequencing. In addition, Pastinen pointed out that long-read sequencing was better than short-read sequencing or clinical microarray at detecting repeat expansions and smaller structural variants, from 50 base pairs to a few thousand base pairs in length.
These findings largely mirror other studies in the increasingly booming long-read sequencing research space. For instance, also using PacBio HiFi sequencing, Susan Hiatt, senior scientist at the HudsonAlpha Institute for Biotechnology, has generated long-read sequencing data for six proband-parent trios with neurodevelopmental disorders who previously had negative genome sequencing.
The results have shown that “long-read data does allow more accurate alignment and variant calling,” said Hiatt last week at the American College of Medical Genetics and Genomics annual meeting, adding that in her research, long-read sequencing found more biologically relevant counts of de novo single nucleotide variants and more de novo Alu insertions, and showed better mappability in low-complexity and low-mappability regions of the genome compared with short-read sequencing.
Meanwhile, using Oxford Nanopore Technologies long-read sequencing, Danny Miller, a pediatrics and medical genetics resident at Seattle Children's Hospital and the University of Washington, presented pilot data at the ACMG conference demonstrating the clinical utility of long-read nanopore sequencing for resolving variants in segmental duplications.
“Segmental duplications are challenging to sequence with short reads because they exist in multiple locations in the genome, and short reads are going to align with some similarity to those regions,” Miller pointed out. “And this is a good example of how long reads can help clarify questions when you have a variant in those regions.”
Despite the promise of long-read sequencing, many barriers remain before it can be widely adopted clinically. Specifically, when it comes to sequencing cost, long-read sequencing is still “significantly more” expensive than short-read sequencing, Emily Farrow of Children's Mercy Hospital, a coauthor of the Children's Mercy paper, said during the ACMG meeting. Then there are speed and throughput, she said, adding that “they are not equivalent to short reads; there's just no way around it right now.”
“For us to get 60 genomes, and we have a fleet of multiple [PacBio] instruments, it'll take several weeks,” said Pastinen, agreeing that throughput can be a bottleneck for long-read sequencing.
Moreover, he said long-read sequencing also demands higher DNA quality, which could be another hurdle to the technology’s adoption. Although the blood DNA isolation method used in his team's study is the same as the one used for short-read sequencing and microarray analysis, Pastinen said, “if you go into sources of DNA that are non-blood and potentially older DNA samples, you will be in more trouble with long-read sequencing because it does require high-quality DNA.” He also noted that for direct-to-consumer testing where buccal swabs are collected, the DNA quality may be too low for long-read sequencing.
Data analysis is another potential bottleneck for the adoption of long-read sequencing. When it comes to Oxford Nanopore sequencing, Miller said, the pipelines to analyze the sequencing data can be “complex, computationally expensive, and frequently changing.”
Moving forward, Pastinen said GA4K plans to continue developing rare disease testing methods using long-read sequencing. “What we're envisioning and actually currently pursuing is expanding [long-read sequencing] into looking at not only the DNA sequences alone but also RNA sequences from these patients,” he said.
Ultimately, Pastinen is hopeful that his team can turn long-read sequencing into a “first-line comprehensive genetic test” that can end rare disease patients’ diagnostic odyssey. “That's why we have invested a large fraction of our effort into long-read sequencing,” he said. “It has the potential to cover most of the clinical indications if we develop it further.”