SAN FRANCISCO (GenomeWeb) – It's long been known that standard short-read next-generation sequencing cannot sequence through all regions of the human genome. Now, however, researchers from the Mayo Clinic have sought to characterize the extent of the problem and its potential impact on our understanding of human health and disease.
In a study published last month in Genome Biology, researchers estimated that approximately 37,000 regions within more than 6,000 gene bodies in pathways relevant to human health, development, and reproduction are "dark," meaning they cannot be sequenced by short-read sequencing technology. In addition, the team highlighted the ability of long-read technologies from 10x Genomics, Pacific Biosciences, and Oxford Nanopore Technologies to resolve those dark regions, and they developed an algorithm to rescue variants found in the dark regions of short-read sequencing datasets. When they applied that method to a known Alzheimer's disease gene, they identified a novel mutation in a dark region present only in individuals with Alzheimer's and not controls.
The problem of so-called dark or camouflaged genomic regions has been known for years, but "it wasn't clear how big the problem was," said Mark Ebbert, lead author of the study and an assistant professor at the Mayo Clinic, whose research focuses on neurodegenerative diseases. He said that the researchers sought to characterize the extent of the issue after coming across the problem in their own research into Alzheimer's disease. "We kept bumping into genes that we were surprised were dark," he said, since the genes were well known and thought to play a role in the disease, such as the gene CR1. That gene is one of the "top-five Alzheimer's genes and about 26 percent of it is camouflaged," Ebbert said.
In the study, the researchers first used Illumina sequencing to determine the extent of the problem, selecting 10 male individuals who had been sequenced as part of the Alzheimer's Disease Sequencing Project.
They considered a region dark if too few reads aligned to it, or if reads aligned to it but their mapping quality was too low to call variants there.
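The two criteria above can be illustrated with a minimal Python sketch. The article does not give the cutoffs the researchers used, so the thresholds below (`MIN_DEPTH`, `LOW_MAPQ`, `LOW_MAPQ_FRACTION`) and the function name `classify_position` are assumptions chosen for illustration, not the study's actual parameters or code.

```python
from typing import List

# Illustrative thresholds only: the article does not state the cutoffs used.
MIN_DEPTH = 5             # at or below this many reads, call the position dark by depth
LOW_MAPQ = 10             # a read with mapping quality below this counts as poorly mapped
LOW_MAPQ_FRACTION = 0.9   # share of poorly mapped reads that makes a position dark


def classify_position(mapqs: List[int]) -> str:
    """Classify one genomic position from the mapping qualities of the reads covering it."""
    depth = len(mapqs)
    if depth <= MIN_DEPTH:
        # Too few reads aligned here to support a variant call.
        return "dark_by_depth"
    low = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low / depth >= LOW_MAPQ_FRACTION:
        # Reads are present, but the aligner could not place them confidently.
        return "dark_by_mapping_quality"
    return "resolved"
```

Under these assumed cutoffs, a position covered by 20 reads that all have a mapping quality of zero would count as dark even though its coverage looks adequate, which is the "camouflaged" case Ebbert describes.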
Overall, the researchers identified 36,794 regions that they considered dark in 6,054 gene bodies. Dark regions spanned 2,855 protein-coding exons within 748 protein-coding genes.
Next, the team wanted to test the ability of long-read technologies to resolve those regions, so they analyzed whole-genome datasets that had been generated using 10x Genomics' linked-read technology, as well as PacBio's and Oxford Nanopore's technologies. In addition, they evaluated whole-genome data sequenced on Illumina but with 250-base paired-end reads.
They found that 10x Genomics' linked-read technology resolved just under half of the dark protein-coding exons, while PacBio sequencing resolved about two-thirds of them, and Oxford Nanopore resolved around 90 percent.
John Fryer, a senior author on the study whose Mayo Clinic lab focuses on the pathogenesis of Alzheimer's disease, said that "library size had a big impact on how much of these regions were resolved." He said that the team analyzed other genomes sequenced with long-read technologies as they were being put in the public domain. "The longer the library, the dramatically more resolved the dark regions," he said.
Looking further into the dark regions, the researchers wanted to characterize their impact. So, they looked at genes known to be related to human disease where at least 5 percent of protein-coding regions were considered dark, and found 76 genes associated with 326 diseases that fit the description. The genes included those related to autism spectrum disorder, schizophrenia, hearing loss, spinal muscular atrophy, and inflammatory bowel disease.
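The 5 percent filter described above amounts to a simple ratio test per gene. The following Python sketch shows the arithmetic only. The function name `genes_over_cutoff`, the input layout, and the placeholder gene names in the usage note are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

DARK_FRACTION_CUTOFF = 0.05  # "at least 5 percent" of protein-coding sequence, per the article


def genes_over_cutoff(
    genes: Dict[str, Tuple[int, int]],
    cutoff: float = DARK_FRACTION_CUTOFF,
) -> List[str]:
    """Return the genes whose dark share of coding bases meets the cutoff.

    `genes` maps a gene name to (dark_coding_bases, total_coding_bases).
    This input layout is an assumption made for the example.
    """
    selected = []
    for name, (dark_bases, coding_bases) in genes.items():
        # Skip entries with no coding bases to avoid dividing by zero.
        if coding_bases > 0 and dark_bases / coding_bases >= cutoff:
            selected.append(name)
    return sorted(selected)
```

With made-up counts, `genes_over_cutoff({"GENE_A": (260, 1000), "GENE_B": (10, 1000)})` keeps only `GENE_A`, whose coding sequence is 26 percent dark.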
Going forward, Fryer said that he thought the study made clear the need for long-read sequencing. "We're missing a lot of the genome" when we only use short-read sequencing technologies, he said. However, he acknowledged that cost is still a major concern for the long-read technologies, and researchers are looking to "get the most bang for the buck."
Ebbert added, however, that he anticipates the cost of these technologies will continue to go down, particularly as throughput is increased with recent introductions by both Oxford Nanopore and PacBio of higher-throughput instruments.
In addition, there is already a lot of data that has been generated with short-read sequencing, so rather than resequence those samples, Ebbert said that the team was interested in seeing if they could "rescue" variants from the dark regions. In the study, the researchers described a method in which they first focused on a specific region of interest that was dark due to poor mapping quality of the reads. They then extracted all the reads from that region, masked the highly similar regions in the reference genome, and realigned the reads. With the near-duplicate sequences masked, the aligner could place the reads, and variant calling could proceed normally.
In the study, the researchers tested this method on one specific region containing the CR1 gene related to Alzheimer's. They found that by masking two of three highly similar exons in the gene, they could align reads and call variants, although they noted that they could not pinpoint in which exon those variants were located.
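The masking step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' pipeline: the function name `mask_reference`, the 0-based half-open coordinates, and the short example sequence are all hypothetical, and the read extraction and realignment steps, which would need a real aligner, are left out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end); a convention assumed here


def mask_reference(seq: str, similar: List[Interval], keep: Interval) -> str:
    """Replace every near-duplicate region except `keep` with N.

    After masking, reads from any of the similar copies have only one
    place left to align, so an aligner can place them with confidence.
    """
    masked = list(seq)
    for start, end in similar:
        if (start, end) == keep:
            continue  # leave one copy unmasked as the alignment target
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy reference with three identical 8-base "exons" separated by spacers.
reference = "TT" + "ACGTACGT" + "GG" + "ACGTACGT" + "CC" + "ACGTACGT"
copies = [(2, 10), (12, 20), (22, 30)]
masked_reference = mask_reference(reference, copies, keep=(2, 10))
```

Because reads from all three copies now pile up on the single unmasked copy, a variant called there could have come from any of them, which is why the researchers could not say which exon carried a given variant.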
Using the method, they rescued 4,214 variants from the Alzheimer's Disease Sequencing Project genomes, including a frameshift mutation that they found in five cases and in no controls.
"It's a relatively straightforward method to rescue variants," Ebbert said, but added that it is not a perfect solution. "It's a Band-Aid, and ultimately, as long-read sequencing technologies improve, they will be the long-term fix, but this has value in the short-term."
Ebbert added that the researchers plan to follow up on the frameshift mutation they identified in the CR1 gene by working with collaborators to see how prevalent it is in Alzheimer's disease cases and to ensure that it is not present in controls. "If we continue to see this trend, we'd look to do a more targeted study in a larger cohort," he said. "And then we'd also plan to do some functional studies to see what effect this has."