NEW YORK (GenomeWeb) – In an effort to improve long-read metagenomic studies, two research groups have released a new metagenomics-specific long-read assembler and a number of long-read metagenomic datasets to test analytical methods on.
Earlier this month, researchers from the University of Birmingham in the UK, led by Nick Loman, published a paper in GigaScience describing long-read datasets from mock communities of microbes.
"There's not a clear view on the best way to analyze long-read metagenomic data, as it's such a new field," Loman said. "Our goal was to get some datasets out there for the community to start hacking on, with respect to methods, software, pipeline development, and validation work" for long-read metagenomics.
The DNA came from ZymoBiomics mock microbial communities, and the datasets were generated using both Oxford Nanopore Technologies' GridION and PromethION instruments.
Already, those data have been used in developing a metagenomics-specific genome assembler for long reads, called metaFlye. Two weeks ago, researchers led by Pavel Pevzner of the University of California, San Diego, released a preprint on bioRxiv describing metaFlye, a modified version of their Flye genome assembler.
Not only did metaFlye allow the UCSD researchers to assemble complex genomes, Pevzner said, it also allowed them to detect unculturable bacteria, new plasmids, and viruses. "It gives the possibility to explore this dark matter of microbial genomics that cannot be cultivated in the lab," he said. "You can see 16S rRNA, which very rarely happens in short-read assembly […] With long-read assembly, this is possible."
Loman's lab has also used metaFlye, "both for our mock community tests, although not in our paper, and also for some real microbiome samples, with great results," he said.
While Loman is bullish that more data tools would lead to clinical applications of long-read metagenomic sequencing, such as validating the composition of fecal transplants and linking antibiotic resistance genes to specific strains of bacteria, others suggested the field had more pressing concerns.
"For most applications, right now the issue is the technology itself, not the software tools used to analyze the data," Mihai Pop, a bioinformatician at the University of Maryland who co-led the data analysis working group for the Human Microbiome Project, said in an email. The biggest issue, he said, was the highly biased representation of the sample within the long reads. "It's quite clear that the same dataset sequenced with Illumina and PacBio or Nanopore will have a different representation (with overlap) across the different technologies. Add to this the substantially higher current costs of long-read technologies and their much higher error rates, and you get to a point where the use of long-read data for metagenomics is simply not economically feasible at this point in time," he said.
According to Pevzner, long reads from sequencing platforms like Pacific Biosciences and Oxford Nanopore have proven useful for bacterial genomics, especially in 16S rRNA gene analysis. These genes, while highly conserved in bacteria, often contain repeat sections impenetrable to analysis with short reads.
Long reads can span those sections and have been useful in the study of bacterial isolates. But metagenomics presents unique challenges that many long-read genome assembly algorithms haven't been designed to address.
"Existing long-read assemblers make the assumption that there will be a standard read depth," explained Ryan Wick, a bioinformatician at Australia's Monash University, who works in the lab of Kat Holt. "Bacterial isolates will probably have just a single chromosome and most of your genomes are all about the same depths. That assumption just doesn't hold with metagenomes. There are lots of chromosomes and wildly different read depths."
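Wick's point about depth assumptions can be made concrete with a toy comparison: an isolate's reads come from one genome at roughly one depth, while a metagenome mixes genomes at wildly different depths. This is a hypothetical sketch; the species and coverage numbers are invented for illustration:

```python
# Hypothetical per-genome sequencing depths (x coverage) -- invented numbers
isolate = {"chromosome": 60}
metagenome = {"E. coli": 250, "B. subtilis": 40, "rare anaerobe": 3}

def depth_spread(depths):
    """Ratio of deepest to shallowest genome in the sample."""
    return max(depths.values()) / min(depths.values())

print(depth_spread(isolate))     # 1.0 -- uniform, as isolate assemblers assume
print(depth_spread(metagenome))  # ~83 -- the spread a metagenome assembler must tolerate
```

An assembler that treats unusually deep or shallow regions as errors or repeats will misjudge genuine low-abundance community members.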
One of the mock communities sequenced by the Loman group was chosen to reflect this. Both of the ZymoBiomics microbial community standards contained eight bacterial and two yeast strains, but one had a logarithmic distribution, while the other had an even distribution.
"The abundances are staggered," Loman said, with 10-fold dilutions from 100 cells up to 10 million cells, which helps to investigate the ability of a method to detect microbes present at low concentrations. "In a microbiome sample, sometimes it's good to detect important low-abundance communities," he said.
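The staggered design Loman describes can be illustrated with a quick calculation over a 10-fold dilution series from 100 to 10 million cells (a sketch; the real standard's exact per-strain counts may differ):

```python
# Illustrative 10-fold dilution series: 1e2 up to 1e7 cells per member
cell_counts = [10 ** e for e in range(2, 8)]  # [100, 1000, ..., 10_000_000]

total = sum(cell_counts)
rel_abundance = [c / total for c in cell_counts]

for count, frac in zip(cell_counts, rel_abundance):
    print(f"{count:>10,} cells -> {frac:.6%} of community")
```

The rarest member contributes under 0.001 percent of all cells, so whether a pipeline recovers that strain is a direct test of its detection limit for low-abundance organisms.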
While other labs have already started chasing long-read metagenomes, Loman suggested they were operating somewhat blindly. "When folks have done these projects, they've used natural samples without a truth set, where you don't know exactly what's in the sample in the first place."
With these mock communities, however, the reads are now well defined and validated with Illumina short-read data.
"The point is to get a really nice sample set that anyone can use to do their own experiments on and encourage the bioinformatics community to develop methods that give robust results," he said. "When you write an assembler, aligner, or taxonomic classifier, you'll know what the answer is. So you'll know that your software is making sense."
Wick said he has "been enjoying Flye as an assembler for bacterial isolates," and while the microbial genomics lab where he works doesn't do a ton of metagenomics work, he has run studies on a number of long-read assemblers. "Long-read assemblers are getting pretty good and Flye was one of the better ones," he said.
As previously reported, Flye and metaFlye belong to a newer breed of assemblers that use de Bruijn graphs to increase speed.
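A minimal illustration of the de Bruijn graph idea the article refers to: nodes are (k-1)-mers, and each k-mer in the input contributes one edge, so repeated sequence collapses onto shared nodes instead of requiring all-vs-all read comparison. This is a toy with a tiny k and a clean sequence; production assemblers work on much larger k-mers and noisy reads:

```python
from collections import defaultdict

def de_bruijn_graph(seq: str, k: int) -> dict:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])  # edge from prefix to suffix
    return dict(graph)

g = de_bruijn_graph("ACGTACGA", k=3)
print(g)
# The repeated "ACG" shows up as a doubled edge AC -> CG rather than
# as a separate pairwise overlap computation.
```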
Canu, an established long-read assembler that works by finding overlaps between reads, has tweakable parameters that help with assembling metagenomes. However, Pevzner said the preprint shows metaFlye outperforms Canu.
"Canu was estimated to take 50,000 hours on a small supercomputer to assemble even a low-complexity Oxford Nanopore dataset," he said, while "metaFlye generates better assemblies and is 10 to 300 times faster than Canu on various datasets."
A request for comment from Adam Phillippy, the developer of the Canu algorithm, was not returned before press time. In their preprint, Pevzner's team presented data on metaFlye's performance on five mock datasets, as well as the performance of four other assemblers: Canu, wtdbg2, miniasm, and PacBio's Falcon (run only on the one PacBio dataset).
On the PacBio dataset, representing a mock human microbiome containing 22 species, the researchers wrote that "metaFlye, Canu, and miniasm assemblies resulted in high reference coverage (ranging from 99.6% for miniasm to 99.8% for metaFlye) and NGA50 (ranging from 1.48 Mb for miniasm to 1.82 Mb for Canu)." They added that metaFlye had fewer mis-assemblies compared with Canu, 67 and 122, respectively, and that the wtdbg2 and FALCON assemblies had reduced reference coverage and lower contiguity.
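For readers unfamiliar with the metric, NG50 is the contig length at which contigs of that size or longer cover half the reference genome; NGA50 is the same statistic computed on alignment blocks after contigs are broken at mis-assemblies. A minimal sketch of the NG50 calculation, with an invented five-contig assembly of a 10 Mb reference:

```python
def ng50(contig_lengths, genome_size):
    """NG50: length L such that contigs of length >= L together cover
    at least half the reference genome size."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= genome_size / 2:
            return length
    return 0  # assembly covers less than half the genome

# Toy example: a 10 Mb reference assembled into five contigs
print(ng50([4_000_000, 3_000_000, 2_000_000, 500_000, 300_000],
           genome_size=10_000_000))  # -> 3000000
```

Because NGA50 penalizes mis-joined contigs, it rewards assemblies that are both contiguous and structurally correct, which is why the preprint reports it alongside raw reference coverage.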
The dataset contained 14 known plasmids; metaFlye was the only algorithm to assemble all of them, the authors wrote. "Most of the missed plasmids were shorter than 5 kb and were fully covered by a single read, illustrating additional complications in reconstructing short plasmids. Overall, plasmid reconstruction using long reads showed substantial improvement over short-read metagenome assemblers," they wrote.
The results suggest that long-read assemblers require a new version for metagenomics, Pevzner said. "Traditional genomics tools will not work."