NEW YORK (GenomeWeb) – Researchers from the University of Maryland and the National Biodefense Analysis and Countermeasures Center have developed a method that relies on a technique for comparing the similarity of webpages to assemble large genomes from Pacific Biosciences sequencing data.
The group described the method in Nature Biotechnology this week and said it could be applied to other types of long-read sequencing data, such as from Oxford Nanopore's MinION instrument.
The researchers decided to focus on an assembly method for long-read technology because PacBio has been increasing the throughput of its systems, making it feasible to de novo sequence not just bacterial genomes, but also eukaryotic genomes.
However, while the current assembly methods are suitable for small bacterial genomes, they do not scale well for larger genomes because they rely on comparing full sequences, adding significant time and computational cost, Sergey Koren, a co-lead author of the study and a bioinformatics scientist at NBACC, told GenomeWeb. For example, he said, assembling the Drosophila melanogaster genome took almost a month, "making mammalian genomes impractical to assemble."
In the recent study, Koren and his colleagues adapted MinHash, an approach for comparing webpages, for genome assembly. They coined the term MHAP, for MinHash Alignment Process. When measuring similarity between webpages, "comparing all words in a document is intractable," Koren said. Instead, MinHash estimates similarity within an error bound, an approach he thought would work well for increasingly long sequence reads.
Essentially, MinHash reduces text on webpages to a set of fingerprints called a sketch. When applied to sequence data, "rather than looking at every k-mer in a sequence, it only looks at a specifically designed subset," Koren said. The subset is chosen based on permutations of the k-mers, where a single k-mer is retained from each permutation. "The number of permutations, not the sequence length, dictates how accurate your estimate is," he added, with one permutation resulting in a large error "if you picked the only k-mer that is not shared by a pair of sequences."
However, the error rate drops as more permutations are done. "You only need on the order of a hundred permutations to identify similarity in sequences of tens of thousands of bases," he said.
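The sketching idea Koren describes can be illustrated with a few lines of Python. This is a minimal sketch of generic MinHash, not MHAP's actual implementation: the function names, the k-mer size of 16, and the use of a seeded BLAKE2 hash to stand in for each random permutation are all assumptions made for illustration.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_hash(kmer, seed):
    # Assumption: a seeded 64-bit hash stands in for one random
    # permutation of the k-mer space.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def sketch(seq, k=16, num_hashes=100):
    """Keep one value per permutation: the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(seeded_hash(km, seed) for km in ks)
            for seed in range(num_hashes)]

def estimate_similarity(sketch_a, sketch_b):
    """Fraction of permutations whose minimum matches in both sketches.

    This estimates the Jaccard similarity of the two k-mer sets.
    """
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hashes")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Comparing two reads then costs one pass over two lists of 100 integers, regardless of read length. For a true similarity J and n permutations, the standard error of the estimate is sqrt(J(1 - J)/n), which is at most 0.05 when n is 100, in line with Koren's point that the number of permutations, not the sequence length, sets the accuracy.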
MHAP's sensitivity is tunable, a function of the k-mer size, the number of permutations, and various thresholds used for determining similarity.
In the study, the researchers compared MHAP to BLASR and DALIGNER, tools that were designed for PacBio's technology. They evaluated performance by comparing detected overlaps to true overlaps, which were inferred from mapping reads to reference genomes. They also tested different algorithm parameters, including a "fast" mode and a "sensitive" mode, as well as different sequencing chemistries. The team compared the tools on sequence data from Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and a human hydatidiform mole cell line.
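As a rough illustration of that kind of evaluation, the Python below scores a set of detected overlaps against a set of true overlaps. It is a sketch under assumptions: overlaps are represented as unordered pairs of hypothetical read IDs, and the function names are invented here. The study's actual evaluation pipeline is not described in this article.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same overlap."""
    return {tuple(sorted(pair)) for pair in pairs}

def overlap_metrics(detected, true):
    """Return (sensitivity, precision) of detected overlaps.

    sensitivity = true overlaps that were detected / all true overlaps
    precision   = detected overlaps that are true / all detected overlaps
    """
    detected, true = normalize(detected), normalize(true)
    true_positives = len(detected & true)
    sensitivity = true_positives / len(true) if true else 0.0
    precision = true_positives / len(detected) if detected else 0.0
    return sensitivity, precision

# Hypothetical read IDs; "true" pairs would come from reference mapping.
true_pairs = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r1", "r4")]
found_pairs = [("r2", "r1"), ("r3", "r2"), ("r5", "r1")]
sensitivity, precision = overlap_metrics(found_pairs, true_pairs)
```

Here two of the four true overlaps are recovered, so sensitivity is 0.5, and two of the three reported overlaps are correct.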
The group found that BLASR "runtime and sensitivity was highly genome-dependent and affected by sequence complexity and uneven replicon coverage." The tool was originally developed for mapping to a reference so is not ideal for de novo assembly. By contrast, "MHAP considers all possible alignments; it was consistently accurate across all genomes tested and an order of magnitude faster than BLASR at all levels of sensitivity," the authors wrote.
DALIGNER is similar to MHAP in that it also uses k-mer matching to detect overlaps between long sequence reads. At the time, DALIGNER was still being developed and did not work on large genomes. So, to compare it with MHAP, the researchers used just 1 gb of subsampled data for both DALIGNER and MHAP.
One difference between the two tools is that DALIGNER takes into account all k-mers in all reads, rather than focusing on the subset "sketch" that MHAP does. As such, DALIGNER relies on filtering repetitive k-mers. On the subsampled data, DALIGNER was the fastest and had a similar sensitivity to MHAP run in the "sensitive" mode. However, as read lengths increased, DALIGNER's sensitivity dropped, by an average of 5 percent for reads longer than 10 kb. MHAP maintained sensitivity across all read lengths.
Koren noted that since submitting the paper, the researchers have compared MHAP to the Falcon assembler, which uses DALIGNER in its initial step, and have found that for larger genomes, MHAP is faster and produces a more contiguous assembly.
For all of the genomes sequenced in the study, the authors wrote that the "assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes."
Notably, the contig N50 of the human hydatidiform mole cell line (CHM1) was "an order of magnitude larger than both the Illumina CHM1 assembly and early BAC-based Sanger assemblies of the human genome," the authors wrote. The assembly also potentially resolves 51 out of 819 annotated gaps in the human reference genome, although further validation is needed. MHAP was also able to assemble 97 percent of the major histocompatibility complex, a historically difficult region to assemble, into two contigs, compared to an Illumina assembly, which broke it up into over 60 contigs.
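Contig N50, the contiguity measure cited above, is the contig length at which contigs of that length or longer account for at least half of the total assembled bases. A minimal Python sketch of the standard definition follows; the function name and the example lengths are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Walk the contigs from longest to shortest and return the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Illustrative lengths only.
example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In the example, the contigs sum to 54, and the three longest (10, 9, and 8) together reach 27, so the N50 is 8. A larger N50 means more of the assembly sits in long contigs.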
Going forward, Koren said that his group plans to collaborate with the US Department of Agriculture to assemble the goat genome using PacBio sequencing and MHAP. He said that MHAP's "probabilistic approach, which has not been used much in bioinformatics, can be used in assembly for long reads to generate de novo reference grade assemblies for eukaryotic genomes." And although sequencing with long reads will still be more expensive than short-read sequencing, MHAP should significantly reduce computational costs to assemble genomes, and will be cheaper than the BAC-based approaches used to generate reference genomes.
It could have a "large impact on human genomes and cancer because short reads miss a lot of structural variants, so de novo assembly will become increasingly important," Koren said.
He added that the approach could be applied to other types of long-read sequence data, such as data from Oxford Nanopore's MinION device. Since submitting the publication, Koren said, his group has used MHAP, without any modifications, to assemble publicly available E. coli data generated on the MinION into a single contig. "It is possible to improve MHAP sensitivity for new data types by tuning the number of permutations and thresholds, but it works well for current single-molecule technologies," he said.