BALTIMORE – Researchers at the National Institutes of Health and their collaborators are exploring the use of nanopore sequencing to analyze thousands of brain samples as part of a study of Alzheimer’s disease and related disorders.
The long-read sequencing project, enabled by NIH’s recently established Center for Alzheimer's and Related Dementias (CARD), aims to build a framework for the large-scale application of nanopore sequencing while filling the knowledge gaps about genomic variation in Alzheimer’s disease and other neurological disorders.
Founded about two years ago, CARD is a collaboration between the National Institute of Neurological Disorders and Stroke (NINDS) and the National Institute on Aging (NIA), said Cornelis Blauwendraat, an NIA investigator who is involved with the CARD long-read sequencing project.
“One of the goals of CARD is to generate resources,” said Blauwendraat, adding that the long-read sequencing project, a collaboration between NIH and academic scientists, intends to sequence around 4,000 brain samples to create a sequencing and bioinformatic resource for other researchers.
According to Benedict Paten, a professor at the University of California, Santa Cruz, who is part of CARD, the goal for the project is to create a nanopore sequencing counterpart to the Broad Institute’s Genome Analysis Toolkit (GATK), which offers tools for genome variant analysis that are mainly designed for Illumina data. While there are other long-read sequencing modalities, such as Pacific Biosciences’ HiFi sequencing, Paten said the team settled on nanopore sequencing for the project because of its cost advantage and scalability. “The [PacBio] HiFi is amazing technology, but right now, it is expensive and time-consuming to scale,” he said.
According to Blauwendraat, the roughly 4,000 samples included in the project will be frozen brain tissues, mostly obtained from biobanks in North America, that cover Alzheimer’s disease, Lewy body dementia, and other dementias.
While working with brain tissues may involve “a little bit more effort,” there are also added benefits to studying these samples, Blauwendraat said. For one, compared with other sample types, such as blood, the brain can offer direct insights into mosaic variation involved in neurological diseases. Additionally, nanopore sequencing can both read the DNA in these samples and detect methylation signals.
Blauwendraat emphasized that the project is making an effort to include subjects of diverse backgrounds. “We don't just want to sequence a couple of thousands [of samples of] European ancestry and call it a day,” he said.
While the long-read sequencing project is still optimizing its workflow, Blauwendraat said that the wet lab protocols typically start once the frozen brain tissues arrive from the biobanks on dry ice, at which point they are sectioned into small pieces for DNA isolation.
However, given that the brain is a “fairly fatty” organ, Blauwendraat said the team had to optimize the DNA extraction protocol for different brain regions in order to achieve appropriate DNA yields for nanopore sequencing while preserving the integrity of the DNA molecules. “It's a very delicate process,” he explained. “We really want to preserve the long reads.”
After QC, the DNA will undergo shearing to obtain fragment sizes between 30 kb and 35 kb, a range the team considers optimal for nanopore sequencing in this study. The sheared DNA will then be turned into libraries and sequenced using the Oxford Nanopore Technologies PromethION platform.
For this project, Blauwendraat said, the goal is to sequence one sample per flow cell to achieve 30X to 40X genome coverage and an N50 of about 30 kb. The team operates two Oxford Nanopore PromethION 48 sequencers and one PromethION 24 sequencer, he added.
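For readers less familiar with these two targets, the following is a minimal Python sketch of how read N50 and mean genome coverage are commonly computed from a list of read lengths. It is an illustration only, not the project's pipeline: the function names and the toy read lengths are hypothetical, and the human genome size of roughly 3.1 Gb is an assumption used for the back-of-the-envelope estimate in the comments.

```python
def n50(read_lengths):
    """Return the read N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    total = sum(read_lengths)
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_coverage(read_lengths, genome_size):
    """Return mean coverage: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


# Toy example with hypothetical read lengths (in bases).
reads = [10_000, 20_000, 30_000, 40_000]
print(n50(reads))                    # 30000
print(mean_coverage(reads, 50_000))  # 2.0

# Assumption: a human genome of about 3.1 Gb. Under that assumption,
# 30X coverage corresponds to roughly 93 Gb of sequence per flow cell.
ASSUMED_HUMAN_GENOME_BASES = 3_100_000_000
print(30 * ASSUMED_HUMAN_GENOME_BASES / 1e9)  # 93.0
```

An N50 of about 30 kb therefore means that half of the sequenced bases sit in reads of at least 30 kb, which is consistent with shearing the DNA to fragments of 30 kb to 35 kb.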
In addition to sequencing, the project also aims to establish a computational infrastructure to achieve phased assemblies of nanopore sequencing data at scale. “Until recently, there was really no end-to-end pipeline to have phased assembly for nanopore [sequencing],” said Mikhail Kolmogorov, a National Cancer Institute investigator who is also part of CARD.
To address that, Kolmogorov said, the team has developed pipelines to achieve large-scale de novo genome assembly using nanopore sequencing data only. “We have been spending a lot of time making sure that the assembly is very accurate,” he said. “We want to produce the best diploid assemblies possible.”
According to Kolmogorov, the final analysis output of the project will include a collection of phased small and structural variants. In the end, all of the data from the samples — including the raw sequencing data, the alignment and assembly data files, as well as the methylation data — will be made “broadly available to any qualified researchers” through AnVIL, an NIH-designated data repository platform, the group said.
Given the scale of this project, Blauwendraat said, the group had to overcome a few technical bottlenecks. One is carrying out the wet lab procedures at scale. To address that, the project has “invested heavily” in robotics, he said, and the team is currently testing various automation platforms for sample preparation and QC.
Still, Blauwendraat said, the brain cutting step, which cannot be easily automated, presents a challenge and will remain more labor-intensive.
Another bottleneck is the sheer volume of data, given that every sample will generate about one terabyte. To overcome that, Blauwendraat said, CARD has provided the project with fiber optics and high-speed internet to facilitate data transfer.
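To give a rough sense of that scale, the sketch below works through the arithmetic in Python using the figures quoted above (about 4,000 samples at about one terabyte each). The 10 Gbit/s link speed is a purely hypothetical value chosen for illustration, as are the function names; the article does not state the project's actual network capacity.

```python
def total_petabytes(samples, terabytes_per_sample):
    """Total data volume in petabytes (decimal units: 1 PB = 1,000 TB)."""
    return samples * terabytes_per_sample / 1000


def transfer_hours(terabytes, gigabits_per_second):
    """Idealized transfer time in hours, ignoring protocol overhead."""
    bits = terabytes * 1e12 * 8                 # terabytes -> bits
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


# Figures from the article: ~4,000 samples, ~1 TB per sample.
print(total_petabytes(4_000, 1.0))              # 4.0

# Assumption: a sustained 10 Gbit/s link (hypothetical).
print(round(transfer_hours(1.0, 10.0) * 60))    # about 13 minutes per sample
```

Under these assumptions the full cohort would amount to roughly 4 PB, so even an idealized link would need weeks of sustained transfer to move the entire dataset.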
Regarding nanopore sequencing’s error rate, Paten said that for detecting single-nucleotide variants genome wide, nanopore sequencing currently outperforms Illumina sequencing based on Genome in a Bottle (GIAB) statistics. “There is a small deficit in precision [for nanopore sequencing], but it's made up for by the fact that you miss fewer variants, and there is more of the genome that ends up getting covered as a result,” he said.
However, “the homopolymers are a problem still” for nanopore sequencing, Paten said, adding that he thinks that it's “just going to be a work in progress.”
So far, the group has sequenced about 250 samples. While the researchers want to make sure that everything works well before starting to scale up, they hope to release a preprint describing the methods used in the project toward the end of this year.
Eventually, the team hopes to shed light on genomic variation pertinent to neurological diseases that was previously inaccessible with short-read sequencing.
“I'm very interested in characterizing structural variants in the germline but also on the somatic and mosaic level,” said Fritz Sedlazeck, a researcher at Baylor College of Medicine and another academic collaborator on the CARD long-read sequencing project.
One reason for the missing heritability of neurological disorders is the lack of knowledge about complex repeats within the genome, he said. With the nanopore long-read data generated in this project, researchers can now dive into these parts of the genome and identify variants that are potentially associated with Alzheimer’s disease and related disorders.
“I don't think anyone can claim that they're going to solve [these brain disorders] with 4,000 samples, but we're going to put up a fight and discover some new cool stuff,” Sedlazeck said.