NEW YORK (GenomeWeb) – To examine bacterial diversity in the gut microbiome, researchers from Stanford University have turned to a next-generation sequencing method that uses synthetic long reads. The method allowed them to identify bacterial species that could not be seen using short-read sequencing.
The Stanford researchers, who published their results today in Nature Biotechnology, found previously uncharacterized bacterial strains and uncovered a greater-than-expected diversity.
"The bacteria are genetically much more heterogeneous than we thought," senior author Mike Snyder said in a statement.
Short-read sequencing is typically used to study bacteria because of its high throughput. But short reads often cannot accurately identify all the bacterial strains present in a sample or capture the diversity among them. Partly for that reason, researchers have turned over the past several years to metagenomic sequencing as an unbiased tool for studying microbial diversity directly from environmental samples. Because many bacteria cannot be cultured, metagenomic sequencing has proven to be a good method for identifying and characterizing previously unknown microbes.
In their study, the Stanford researchers used a long-read sequencing approach called TruSeq Synthetic Long-Read sequencing, originally developed by Moleculo, which Illumina has since acquired.
The technology generates long reads by first fragmenting DNA into 10-kilobase pieces and tagging those pieces with unique barcodes. The fragments are then broken up further and sequenced with standard short-read Illumina sequencing technology. After sequencing, the barcodes are used to stitch the 10-kb fragments back together.
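The barcode-driven workflow described above can be sketched in a few lines. This is a toy illustration, not the Moleculo/Illumina pipeline: it simply groups short reads by their barcode and then greedily merges reads whose ends overlap, mimicking how each original ~10-kb fragment is stitched back together.

```python
from collections import defaultdict

def group_by_barcode(tagged_reads):
    """Group short reads by their barcode tag.
    tagged_reads: iterable of (barcode, read_sequence) pairs."""
    groups = defaultdict(list)
    for barcode, read in tagged_reads:
        groups[barcode].append(read)
    return groups

def merge_overlapping(reads, min_overlap=3):
    """Greedy merge: repeatedly join the pair of reads with the
    longest suffix-prefix overlap until no pair overlaps enough."""
    reads = list(reads)
    while len(reads) > 1:
        best = None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                # longest k such that the end of a equals the start of b
                for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                    if a.endswith(b[:k]):
                        if best is None or k > best[0]:
                            best = (k, i, j)
                        break
        if best is None:
            break  # no remaining overlaps; stop
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads
```

In practice the real reassembly handles sequencing errors and repeats; a toy greedy overlap merge like this only works on clean, unambiguous fragments.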
Snyder's team had previously modified the protocol to work with RNA and demonstrated that it could be used for transcriptome sequencing. Other groups have also used the technique for metagenomic sequencing of sediment samples.
In this study, the Stanford group developed an algorithm called Lens that it used to reveal haplotype diversity, as well as a bioinformatics pipeline called Nanoscope.
The team used the long-read approach on two metagenomic data sets: a "mock metagenome" dataset from the Human Microbiome Project consisting of 20 organisms with known reference genomes that is widely used for validation, and a sample from a healthy male gut.
For the mock metagenome, the researchers generated three TruSeq Synthetic Long-Read libraries as well as 3.1 gigabase pairs of standard Illumina short-read data. The long-read libraries comprised 2.9 gigabase pairs of sequence data with an N50 of 9.2 kilobase pairs.
For the human gut metagenome, the team generated seven long-read libraries consisting of 8.3 gigabase pairs of sequence data with an N50 of 8.6 kilobase pairs, plus an additional 8.1 gigabase pairs of short-read sequence data.
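Both library descriptions cite N50, the standard length-weighted summary statistic: the read or contig length L such that sequences of length L or longer together contain at least half of all bases. A minimal implementation:

```python
def n50(lengths):
    """N50: the length L such that reads/contigs of length >= L
    account for at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
```

Note that N50 is length-weighted rather than a median of counts, which is why a few very long contigs can raise it sharply.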
The researchers used both short- and long-read sequence data to assess the benefits of the longer reads.
After sequencing, they mapped the long reads back to the known reference genomes of the mock community, finding that the method was highly accurate, with fewer than 0.5 percent of reads misassembled.
The long-read technology did seem to have some sequence bias. For instance, four organisms were covered at more than 98 percent with short reads but less than 75 percent with the long reads. However, six organisms had at least 10 percent of their genomes covered by long reads but were not covered at all by short reads. The two technologies also provided different abundance estimates, often differing by more than an order of magnitude, the authors wrote. Coverage differences have previously been linked to long reads' increased sensitivity to GC content during the PCR amplification step, so "both types of data should be used" for best results, they added.
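One quick way to check for the GC-related coverage bias described above is to bin contigs by GC fraction and compare average coverage across bins. The sketch below is illustrative only; the function names and the `(sequence, coverage)` contig representation are assumptions, not from the paper.

```python
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def coverage_by_gc(contigs, n_bins=5):
    """Bin contigs by GC fraction and average their coverage,
    a crude diagnostic for GC-related coverage bias.
    contigs: list of (sequence, coverage) pairs."""
    bins = defaultdict(list)
    for seq, cov in contigs:
        b = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        bins[b].append(cov)
    return {b: sum(c) / len(c) for b, c in sorted(bins.items())}
```

A strong downward trend in coverage at high or low GC bins would flag the PCR-amplification bias the authors mention.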
Next, the researchers assembled the long reads from the human gut sample using Nanoscope, which automates metagenomic assembly, species identification, and substrain analysis. It also estimates the abundance of species using both short- and long-read data.
Using Nanoscope, the researchers were able to assemble the gut metagenome into contigs that were longer and more complete than either the long-read assembly or short-read assembly alone. They created contigs for more than 650 megabase pairs of sequence data with an N50 length of 49 kilobase pairs. In addition, 22 contigs were longer than 1 megabase pair, with the longest contig at 3.9 megabase pairs.
From the assembly, the researchers demonstrated that they were able to identify complete bacterial operons — clusters of functionally related genes.
In addition, they used their Lens tool to phase SNVs and indels, finding "extensive intrastrain variation in almost every bacterial species in the human gut metagenome." The team used Lens to assemble more than 200,000 variants into 5,024 haplotypes.
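Lens itself is considerably more sophisticated, but the core reason long reads enable phasing can be shown with a toy example: variants that co-occur on one synthetic long read are observed in phase directly, so counting the distinct allele combinations carried by reads yields candidate haplotypes. Everything below is an illustrative assumption, not the Lens algorithm.

```python
from collections import Counter

def count_haplotypes(reads):
    """Each read is a dict {variant_position: allele}. A long read
    spanning several variants directly reports one (partial) haplotype;
    count how often each distinct allele combination is observed."""
    return Counter(tuple(sorted(r.items())) for r in reads)
```

With short reads, variants farther apart than the read length can never co-occur on one read, which is why phasing them requires the longer synthetic reads.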
Next, they sought to identify the bacterial species present in the human gut metagenome, finding 178 species at widely varying abundance. Some species made up as much as 5 percent of the metagenome, while others accounted for just 0.02 percent.
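Relative abundance estimates of this kind are typically derived by assigning reads (or mapped bases) to species and normalizing by the total. A minimal sketch, assuming simple per-species read counts rather than the paper's actual estimator:

```python
def relative_abundance(read_counts):
    """Convert per-species read counts into relative abundances.
    read_counts: dict mapping species name -> number of assigned reads."""
    total = sum(read_counts.values())
    return {species: count / total for species, count in read_counts.items()}
```

Real estimators additionally correct for genome size and coverage bias, since a large genome attracts more reads per cell than a small one.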
The authors noted that the long- and short-read sequencing approaches were better at recovering different types of species. For instance, the long-read approach uncovered 51 species, mostly present at low abundance, that the short-read approach missed completely. Conversely, the short reads helped recover two high-abundance bacteria.
The "human gut microbiome is more complex than previously thought, particularly in terms of subspecies diversity," the authors wrote.
The researchers also looked at how the synthetic long-read approach compared to Pacific Biosciences single-molecule sequencing technology, assembling the mock genome with the PacBio technology. The PacBio approach produced "long contigs with fewer misassemblies," the authors wrote, but the assembly had a higher indel rate and the researchers could only confirm 88 percent of the SNVs called in the contigs.