NEW YORK – Researchers from France's National Research Institute for Agriculture, Food, and Environment (INRAE) have published data on the performance of both short- and long-read sequencing platforms in analyzing highly complex metagenomic samples.
The three mock communities, featuring as many as 87 microbial strains spanning 29 bacterial and archaeal phyla, were created by researchers at Oak Ridge National Laboratory led by Mircea Podar and are likely some of the most complex communities that could be sequenced, according to senior author Mathieu Almeida, a research fellow at INRAE.
For the study, published in the journal Scientific Data, the researchers ran the samples between 2018 and 2019 on many of the most-used sequencers available at the time, including Illumina's HiSeq 3000, MGI Tech's DNBSEQ-G400 and T7, Thermo Fisher Scientific's Ion Torrent GeneStudio S5 and Ion Proton P1, Oxford Nanopore Technologies' MinION, and Pacific Biosciences' Sequel II.
"When we started, we were aware of the strong requirements for long-read sequencing, requesting long DNA fragments and high DNA purity, and expected low performance for quantitative metagenomic analysis compared to conventional short-read methods in such complex mock samples," Almeida said. "However, we were surprised to see strong performance, even at low-depth sequencing, for both PacBio and MinION. Furthermore, we were surprised by the DNBSEQ-T7 performance, providing ultra-deep sequencing in a single run with similar low error rate compared to the other technologies, making it at the time of our study one of the cheapest technologies for metagenomic sequencing."
Almeida suggested that the datasets associated with the paper provide resources to "challenge assembly, taxonomy profiling, and binning software," as the mocks combine a high number of species with closely related organisms, both of which are challenging for existing metagenomics analysis software.
"This is a nice comparison of the long- and short-read platforms, and shows overall comparability between methods," said Chris Mason, a sequencing expert at Weill Cornell Medicine who was not involved in the study. His lab has made headlines by sequencing microbial communities from urban environments, including subways. "It also shows that, at a relatively low read depth, 100,000 reads, minimal complexity samples can reach saturation quickly."
The collaboration between INRAE and Oak Ridge can be traced back to Almeida's postdoc years at the University of Maryland, when he discovered a mock community generated by Podar that was "at that time the most complex synthetic mock generated." After moving to INRAE, Almeida aimed to "go beyond [a] commercial mock community comparison," and reached out to Podar.
The labs began collecting data around 2018 and finished shortly before 2020.
In designing the mock samples, "our intent was not to mimic a community from a specific environment but to achieve a high degree of diversity at all phylogenetic levels and to capture a wide range of genomic sizes and composition," Podar said.
The authors submitted their results to Scientific Data to make them more easily accessible to the field, he added.
Access to data from complex metagenomic sequencing studies isn't often provided, "because these data are produced for the purpose of hypothesis testing," said first author Victoria Meslier, also of INRAE. "As a microbiome data analyst, I wanted to put forward these complex synthetic communities' datasets," she said, noting that this particular journal fit the bill for providing open access to them.
The authors noted that their data could be used to benchmark or improve metagenomic assemblers and taxonomic profiling software.
Mason said that he would like to see data from the latest sequencing platforms. In just the last year, the number of companies offering short-read sequencing platforms has doubled and PacBio has introduced new long-read instruments.
Almeida said he plans to compare the results with technologies that INRAE is potentially upgrading to in the next year, including the PacBio Revio, MGI's HotMPS chemistry, and the Illumina NovaSeq X series with XLeap-SBS chemistry.
One downside to using bespoke mock communities is that little of the material is left for future studies. "I've used up all the DNA for some of the microbes," Podar said. In addition, the high DNA input requirements of long-read platforms meant that only one of the communities made it onto a PacBio Sequel II in the study. He suggested that commercial standards providers, such as ATCC, could step in to provide more mock communities like the ones in the study.
The data from shallow sequencing suggest that shotgun metagenomics is becoming increasingly competitive with amplification-based approaches, such as 16S ribosomal RNA gene sequencing. Amplification-based methods are still cheaper, though the gap is narrowing, and shotgun metagenomics also enables functional analysis from the same data, Almeida said.
"It always depends on the goals of the project," Podar stressed. "If you're trying to find a rare needle in a haystack, obviously you need higher [sequencing] depth."
Almeida and Podar noted that while they did not provide data comparing sample preparation methods for the study, it's something they're keeping an eye on. "There has been a lot of improvement to extract DNA from many organisms," Almeida said, including bacteria, archaea, and fungi. "If you want to do new benchmark studies, you have to take this into account."