SAN FRANCISCO (GenomeWeb) – Two research groups have demonstrated the potential of Oxford Nanopore Technologies' MinIon device for whole-genome sequencing and de novo assembly of large genomes, using the human and the tomato genome as examples.
The two groups — a consortium of laboratories from the UK, the US and Canada as well as a group led by researchers at RWTH Aachen University in Germany — reported their work in publications on the BioRxiv server this month.
The consortium that published the de novo assembly of the human genome previously released a human genome dataset at an Oxford Nanopore-sponsored meeting last year. Since then, though, the group has adopted the newer R9.4 chemistry.
The team used 39 flow cells to generate 91.2 gigabases of sequence data, or around 30x coverage. In addition, Josh Quick, a doctoral researcher in Nick Loman's lab at the University of Birmingham, developed an ultra-long read protocol for the MinIon. Using that method, the group generated an additional 5x coverage of the genome, obtaining an N50 read length of 99.7 kilobases.
The researchers sequenced the genome of GM12878, a well-studied human cell line. Five labs participated in the sequencing, using the latest R9.4 chemistry and the 1D sequencing protocol. The initial sequencing generated more than 14 million reads with a read N50 of 10.6 kilobases. On average, they obtained 2.3 gigabases of sequence per flow cell.
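N50, the statistic cited throughout for both reads and contigs, is the length at which sequences of that length or longer account for at least half of all bases in the set. The following is a minimal sketch of that calculation. The helper name `n50` and the toy lengths are illustrative assumptions, not data or code from either study.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of all bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # Sequences of this length or longer hold at least half the bases.
            return length


# Toy lengths (hypothetical, in kilobases): the total is 30, so half is 15.
toy_lengths = [2, 3, 4, 5, 6, 10]
print(n50(toy_lengths))  # 10 + 6 = 16 >= 15, so this prints 6
```

Because the statistic is weighted by bases rather than by sequence count, a handful of very long reads can lift the N50 well above the typical read length.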
Basecalling was performed with Metrichor and assembly with Canu. The initial assembly comprised 2,886 contigs with a contig N50 of 3 megabases. When the researchers aligned the contigs to the GRCh38 reference genome, they found that the assembly had a consensus accuracy of 95.2 percent. Polishing with Illumina sequencing data improved consensus accuracy to 99.88 percent.
In addition, they reported that the assembly was comparable to previous GM12878 assemblies, identifying similar numbers of structural variations. However, the nanopore assembly did have higher numbers of deletions, due primarily to accuracy issues in homopolymer regions.
Next, the researchers added the ultra-long reads to the assembly. To generate those longer reads, they used Oxford Nanopore's Rapid Run kit, but saturated it with high molecular weight DNA. The ultra-long reads had a read N50 of 99.7 kilobases, with the longest read reaching 882 kilobases. Adding these reads increased the contig N50 to 6.4 megabases. In addition, it enabled the MHC region to be captured in a single contig.
Adam Phillippy, head of genome informatics at the National Human Genome Research Institute and an author of the study, said that it showed the potential of nanopore sequencing, in particular the ultra-long read protocol. The protocol yields molecules that are of "similar length as optical mapping [results]," he said, "but also gives base information, not just tag locations, that can let you get very continuous assemblies." Phillippy has worked with numerous sequencing technologies and recently collaborated with the NHGRI and the US Department of Agriculture team that assembled the goat genome de novo, using a combination of Pacific Biosciences technology, Hi-C sequencing, and Bionano Genomics' optical mapping technology.
The MinIon is "still limited by throughput," Phillippy said, but if it were possible to generate 30x coverage of the genome using the ultra-long read protocol, "you'd be able to have these really contiguous assemblies." In addition, he noted that the accuracy still lags behind that of PacBio. Corrected PacBio reads typically have greater than 99 percent accuracy, versus 92 percent for the nanopore reads in the BioRxiv paper, the authors wrote. PacBio assemblies also still achieve a larger contig N50. A group that assembled a de novo Korean reference genome achieved a contig N50 of 17.9 megabases using only PacBio sequence data, for instance.
"PacBio currently gives longer contigs and has higher accuracy, but there's a lot of headroom for growth on the nanopore side," Phillippy said. In particular, the "prospect of continuity is great with the ultra-long reads" on the MinIon, he added.
Phillippy noted that one particular challenge for the MinIon is homopolymers. As demonstrated in the BioRxiv study, the basecallers struggle to call homopolymers longer than five bases. The researchers tested three different basecallers (Metrichor, Nanonet, and Scrappie) on a subset of reads that mapped to chromosome 20. Scrappie, a newer basecaller, does much better at calling homopolymers, Phillippy noted, which was an encouraging sign.
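A homopolymer is a run of identical bases, and the runs described as troublesome here are those longer than five bases. As a rough illustration of how such runs can be located in a sequence, here is a minimal sketch. The function name `long_homopolymers`, the `min_len` parameter, and the toy sequence are hypothetical and are not taken from the study or from any basecaller.

```python
from itertools import groupby


def long_homopolymers(seq, min_len=6):
    """Yield (start, base, run_length) for runs of one base at least min_len long.

    min_len=6 corresponds to "longer than five bases" (an assumed reading
    of the threshold mentioned in the article).
    """
    pos = 0
    for base, group in groupby(seq.upper()):
        run = len(list(group))
        if run >= min_len:
            yield pos, base, run
        pos += run


# Hypothetical toy sequence: a 7-base T run, a 5-base A run, and a 6-base G run.
toy = "ACG" + "T" * 7 + "ACG" + "A" * 5 + "G" * 6 + "C"
print(list(long_homopolymers(toy)))  # [(3, 'T', 7), (18, 'G', 6)]
```

The five-base A run falls below the threshold and is not reported. Regions flagged this way are where deletions in a nanopore-only assembly would be expected to cluster.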
In a second BioRxiv paper, researchers from RWTH Aachen University demonstrated they could use the MinIon to assemble the tomato genome de novo. The genome is smaller than the human genome, at just over 1 gigabase. Similar to the group generating the human genome assembly, the researchers found that their assembly was "structurally highly similar to that of the reference" but that it had a "high error rate caused mostly by deletions in homopolymers." After polishing with Illumina data, they reduced that error rate and had a gene completeness of 96.53 percent, which "slightly surpassed" that of the reference genome.
The researchers sequenced the genome using 31 MinIon flow cells, generating around 111 gigabases of data that passed filter, representing around 100x coverage of the genome. The group had a wide range of output per flow cell, varying between 1.1 gigabases and 7.3 gigabases. Average read length also varied significantly, between 6.4 kilobases and 14.9 kilobases.
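The coverage figures quoted for both projects follow from dividing total sequenced bases by genome size. The sketch below reproduces that arithmetic. The genome sizes are assumptions: roughly 3.1 gigabases for human, and 1.1 gigabases for the tomato genome, which the article describes only as "just over 1 gigabase."

```python
GB = 1e9  # bases per gigabase


def fold_coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Assumed genome sizes (not stated exactly in the article).
HUMAN_GENOME = 3.1 * GB
TOMATO_GENOME = 1.1 * GB

human_cov = fold_coverage(91.2 * GB, HUMAN_GENOME)    # 39 flow cells
tomato_cov = fold_coverage(111 * GB, TOMATO_GENOME)   # 31 flow cells

# Mean yield per flow cell, in gigabases.
human_per_cell = 91.2 / 39
tomato_per_cell = 111 / 31

print(f"human:  {human_cov:.1f}x, {human_per_cell:.1f} Gb per flow cell")
print(f"tomato: {tomato_cov:.1f}x, {tomato_per_cell:.1f} Gb per flow cell")
```

Under these assumed sizes, the results come out near the reported 30x and 100x, and the human yield matches the reported average of 2.3 gigabases per flow cell.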
The team tested three assemblers: Canu, Miniasm, and SMARTdenovo. Miniasm generated the longest N50 and required the least compute time, but it also produced the highest error rate. The researchers also assessed the functional completeness of the assemblies with BUSCO, a tool that looks for conserved genes. BUSCO estimated gene completeness scores of 0.21 percent, 26.46 percent, and 26.74 percent for Miniasm, Canu, and SMARTdenovo, respectively. Moving forward, the researchers therefore used the Canu assembler to pre-correct the original reads and then assembled the corrected reads using SMARTdenovo. That generated an assembly consisting of 899 contigs with a contig N50 of 2.45 megabases. They then used Illumina sequencing for polishing and ultimately boosted the BUSCO gene completeness score to 96.53 percent.
Phillippy noted that one interesting aspect of basecalling on the MinIon is that the base accuracy of the final assembly seems to vary depending on the type of organism being sequenced. The final accuracy for the human assembly, he said, was slightly lower than what he has achieved for microbial genomes. He said this could be due to the types of data that the basecallers are trained on; he believes that Oxford Nanopore primarily trained its basecallers on Escherichia coli data. Since basecallers use machine learning, they tend to perform a bit better when analyzing data that are similar to their training sets. For instance, he said, different types of organisms tend to have different epigenetic profiles, so if a basecaller is used to mostly seeing unmethylated DNA, and then a highly methylated genome is sequenced, it could result in errors at those regions.
In general, though, he said the MinIon performs equally well on microbial, plant, and human genomes, and the basecalling differences seem to be small. Other variables can also impact basecalling accuracy, including DNA extraction, sample prep, and the overall quality of the DNA itself. However, he said, it would be interesting to look further into the idea of training basecallers on DNA from a diverse range of organisms.