Decode Genetics Demonstrates Nanopore Scalability With Large-Scale Human Structural Variation Study

NEW YORK – In the largest study published to date using long-read sequencing to characterize structural variants in human genomes, a team led by researchers at Amgen's Decode Genetics has analyzed the DNA of more than 1,800 individuals from Iceland using Oxford Nanopore Technologies' sequencing technology.

As they reported in a preprint posted on BioRxiv on Nov. 20, the researchers, led by corresponding author Bjarni Halldórsson and senior author and Decode CEO Kári Stefánsson, used Oxford Nanopore's PromethIon and GridIon sequencers to generate long-read data for 1,817 Icelanders. They identified about 23,000 structural variants (SVs) per person, mostly insertions and deletions, which is more than three times as many as is possible with short-read sequence data alone.

In addition, they found that large SVs, which are more likely to impact protein function, are rare. "[A]s a result, we believe that large-scale SV studies will be essential to understand their role in the genetics of disease," they wrote, adding that "we believe that this work sets a foundation for further large-scale studies of SVs, allowing investigation of their full frequency spectrum."

The study is one of the first outcomes of Amgen's £50 million ($64 million) investment in Oxford Nanopore last year. At the time, Stefánsson had said that the firm's nanopore sequencing technology would provide "a much better handle on structural variants that confer risk of a wide variety of diseases."

"This is an intriguing long-read study of common and rare germline structural variants pursued at a stunning scale," commented Jan Korbel, a senior scientist at the European Molecular Biology Laboratory, in an email.

Earlier this year, Korbel, along with Evan Eichler at the University of Washington, Charles Lee at the Jackson Laboratory for Genomic Medicine, and other members of the Human Genome Structural Variation Consortium, published a study in Nature Communications in which they used a variety of technologies to comprehensively characterize variants in the genomes of three parent-child trios, identifying almost 28,000 SVs per genome. That work involved Illumina short-read sequencing, PacBio long-read sequencing, Bionano Genomics optical mapping, 10x Genomics/Illumina synthetic long reads, Hi-C, and Strand-seq single-cell/single-strand technologies, but used nanopore sequencing only for validation purposes.

"I think [the Decode study] is a really important milestone," said Fritz Sedlazeck, an assistant professor at the Human Genome Sequencing Center at Baylor College of Medicine. "It showcases for the first time population-scale sequencing using long reads."

Sedlazeck, who published a tool called Sniffles to identify SVs from long-read data last year, said that other large-scale studies that use long-read data for SV detection are on their way, but none has been published yet.

The largest long-read human genome study to date, he said, was published in Cell earlier this year by Eichler and others, who sequenced 15 genomes with PacBio technology, identifying almost 100,000 common structural variants. "There are now maybe 50 or 100 or maybe 200 PacBio human genomes out, published by several groups, but we don’t have a harmonized set, like 1,000 human genomes, that are sequenced by one group, like this paper shows," he said.

Having access to the raw data from the Decode study would be "an awesome resource for genomics," Sedlazeck said, but according to Decode, that will not be possible due to restrictions from the Icelandic ethics and science review boards and Icelandic law, which prohibit the sharing of personally identifiable data.

For their project, the Decode team sequenced the genomes of 1,817 Icelandic individuals, including 369 trios, who were recruited as part of various studies the company is conducting, and for whom extensive phenotypic data is available. For all but 24 of the participants, Decode also had short-read sequencing data and chip-based genotyping data on file. Most of the DNA came from blood samples but the study also included 119 heart tissue samples.

The researchers mainly used the Oxford Nanopore PromethIon, on which they ran 2,232 flow cells, as well as the GridIon, on which they ran 127 flow cells, to sequence the samples to about 15x average coverage. The data were generated between May 2018 and June of this year.

Almost 90 percent of the reads aligned to the human reference genome. The median sequencing error rate per individual was 15.2 percent, consisting of 6.7 percent deletion errors, 4.8 percent substitution errors, and 3.8 percent insertion errors. Half the data were in reads larger than 15 kilobases.
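The statement that half the data were in reads larger than 15 kilobases corresponds to what is commonly called the read N50. As a rough illustration only, the Python sketch below computes that statistic from a list of read lengths. The function name and the toy lengths are hypothetical and are not taken from the Decode preprint.

```python
def read_n50(read_lengths):
    """Return the read N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases.

    Hypothetical helper for illustration; not from the Decode study.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    total_bases = sum(read_lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until
    # at least half of the total has been covered.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total_bases:
            return length


# Toy example with made-up read lengths (in bases).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = read_n50(toy_lengths)
```

With these toy lengths, the reads total 20 bases, and the two longest reads (6 and 5) already hold 11 of them, so the N50 is 5.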

Using the long-read data, genotyping data from long-read and short-read sequencing, and imputation, the team then constructed a set of about 48,000 high-confidence SVs across individuals, or about 23,000 per person.

According to Sedlazeck, the Decode team developed new methods for SV calling, building on his Sniffles algorithm, that reassess breakpoints and weed out false positives. However, these methods rely on short reads, he said, and for future large-scale projects, it would be important to develop methods that don't require a second data type.

About 30 percent of the SVs the researchers found are rare, occurring in less than 1 percent of the population, and 2 percent of these rare SVs overlap with a coding exon.
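The article does not say whether the 1 percent cutoff refers to allele frequency or to carrier frequency. As a minimal sketch that assumes allele frequency in a diploid cohort, the Python below shows how a variant could be labeled rare from per-person genotypes. The function names, the genotype encoding (0, 1, or 2 copies of the variant allele, None for a missing call), and the toy cohort are all assumptions made for illustration.

```python
def allele_frequency(genotypes):
    """Fraction of chromosomes carrying the variant allele.

    genotypes: one entry per person, 0/1/2 copies of the variant
    allele, or None where no genotype was called (assumed encoding).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Each diploid individual contributes two chromosomes.
    return sum(called) / (2 * len(called))


def is_rare(genotypes, threshold=0.01):
    """Label a variant rare when its allele frequency is below threshold."""
    return allele_frequency(genotypes) < threshold


# Toy cohort: 100 people, one heterozygous carrier.
toy_genotypes = [0] * 99 + [1]
toy_is_rare = is_rare(toy_genotypes)
```

In the toy cohort, one variant allele among 200 chromosomes gives a frequency of 0.5 percent, which falls under the 1 percent threshold.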

The scientists also looked at whether any of the SVs were correlated with phenotypes and disease, relying on data from previous genome-wide association studies, and found that 30 SVs that impact coding exons showed correlations with 82 GWAS catalog markers. These included, for example, known associations of SVs with psoriasis, diabetes, and age-related macular degeneration, as well as new associations of SVs with white blood cell count and systolic blood pressure. They also found a known deletion associated with hematuria, a known insertion related to alterations in meiotic recombination, and a known deletion associated with cystinosis. In addition, they discovered an association between a rare 14 kilobase deletion that affects PCSK9, a target of cholesterol-lowering drugs, and low LDL cholesterol levels.

Sedlazeck said that the Decode analysis mainly focused on insertions and deletions – the most common types of structural variants – but that rearrangements and duplications also play important roles in Mendelian and other diseases.

He noted, too, that the sequence coverage, while 15x on average, is much lower in certain regions, meaning that the haplotypes are not adequately covered by reads.

Korbel agreed that with 15x coverage, "it probably will not be possible to reliably analyze structural variation in a single human sample." However, he said, "the design chosen is great for a population-scale study as here convincingly demonstrated for the Icelandic population."

According to a Decode official, the company had already sequenced samples from about 50,000 Icelanders using Illumina sequencing and undertook the new study because long-read sequencing detects about three times as many SVs as short-read sequencing and because Oxford Nanopore sequencing can also detect DNA methylation.

He said Decode has been testing Oxford Nanopore's technology for the past three years and found that "it was able to consistently call SVs and methylation."

The study has been submitted to a journal for publication, he added, and a list of the variants will be published as part of the final version of the manuscript.

Going forward, the Decode researchers plan to correlate the SV and methylation data with disease and other phenotypes, and to improve the analysis algorithms and methodology. "We are also planning to use ONT in clinical sequencing studies," he said.

According to Korbel, a different approach would be needed to analyze individual patient genomes, for example, to find the underlying cause of rare diseases. "Approaches that integrate information from different genomic techniques have shown great promise in this regard, such as combining long-read sequencing with template-strand sequencing (Strand-seq), which can be used for phasing long reads into chromosome-length haplotypes," he said.