NEW YORK (GenomeWeb) – Long-read sequencing can reveal a vast amount of microbial diversity, researchers from the University of California, Berkeley found in a recent study that tested Illumina's TruSeq Synthetic Long Read technology for metagenomic sequencing of sediment samples.
The researchers published their results from using the synthetic long-read technology for metagenomics in Genome Research this week.
Illumina acquired the synthetic long-read sequencing technology along with Stanford University spinout Moleculo in 2013. It began offering the technology as a service later that year and as the TruSeq Synthetic Long Read kit in the second half of 2014.
In essence, the technology generates long reads by fragmenting genomic DNA into 10-kilobase pieces. It then tags those pieces with unique barcodes, breaks them up further, sequences them with short-read technology, and assembles the short reads that share a barcode into synthetic long reads.
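The barcode-then-assemble idea can be illustrated with a toy sketch. This is not Illumina's pipeline: the function names, the example sequences, and the greedy overlap merge are all assumptions chosen for brevity, and real data would also need error handling, reverse complements, and handling of reads contained inside other reads.

```python
from collections import defaultdict

def group_by_barcode(tagged_reads):
    """Collect short reads that share a barcode (same original long fragment)."""
    groups = defaultdict(list)
    for barcode, read in tagged_reads:
        groups[barcode].append(read)
    return dict(groups)

def overlap(a, b, min_len=3):
    """Longest suffix of a that equals a prefix of b; 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Toy assembler: repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Hypothetical short reads from two barcoded fragments, interleaved as a sequencer would emit them.
tagged_reads = [
    ("BC1", "ACGTTGCA"), ("BC2", "GGATCCAT"), ("BC1", "TGCAAGGC"),
    ("BC2", "CCATTGAC"), ("BC1", "AGGCTTA"),
]
synthetic_long_reads = {
    barcode: greedy_assemble(reads)
    for barcode, reads in group_by_barcode(tagged_reads).items()
}
print(synthetic_long_reads)
```

Because each barcode group is assembled on its own, reads from different fragments never compete with each other, which is what lets a short-read instrument reconstruct a long molecule.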
Researchers have published on the technology's ability to phase human genomes and resolve repetitive transposable elements, and have also described its use for sequencing fish and insect genomes, but the Genome Research study is the first published description of its use in metagenomics.
The Berkeley researchers tested the technology on sediment samples collected from three different depths of an aquifer near the Colorado River. The team had previously sequenced the samples on the Illumina HiSeq 2000. They generated around 200 Gbp of HiSeq sequence data with a median read length of 150 bp and an additional 1.5 Gbp of synthetic long-read data with a median read length of around 8 kbp.
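For a rough sense of scale, the data volumes above can be turned into approximate read counts. The sketch below assumes the median read length is a fair stand-in for the mean, which the article does not state, so the results are order-of-magnitude estimates rather than figures from the study.

```python
def approx_read_count(total_bp, typical_read_bp):
    """Estimate read count as total bases divided by a typical read length."""
    return total_bp / typical_read_bp

# Figures from the article; using the median as the typical length is an assumption.
short_reads = approx_read_count(200e9, 150)    # HiSeq data
long_reads = approx_read_count(1.5e9, 8_000)   # synthetic long reads
base_ratio = 200e9 / 1.5e9                     # short-read bases per long-read base

print(f"short reads: ~{short_reads:.2e}")
print(f"long reads:  ~{long_reads:,.0f}")
print(f"short-read data is ~{base_ratio:.0f}x larger by bases")
```

Under that assumption, the long-read data set is on the order of a couple of hundred thousand reads, against more than a billion short reads.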
The team found that the long reads revealed an incredible amount of microbial diversity in their sample that was invisible when just short-read sequence data was analyzed.
"We knew that microbial communities were complex, but we didn't know how complex," Itai Sharon, a postdoctoral researcher in Jill Banfield's laboratory at UC Berkeley and lead author of the study, told GenomeWeb. "We initially thought there would be a few hundred microbial species, but it turned out we had several thousand species," he added.
Many of the thousands of species identified by the long-read data were present at very low abundances — less than 0.1 percent — but belonged to phyla that were represented by the more abundant organisms.
The researchers assembled the short-read sequence data from each of the three samples, generating 931, 1,456, and 366 Mbp, respectively, in scaffolds longer than 1.5 kbp. Only a minority of the reads, between 18 percent and 33 percent, could be mapped to the assembled scaffolds and contigs, which is indicative of complex communities.
Next, they tested three different assemblers for the synthetic long-read data: Celera, Minimus 2, and an algorithm the group developed for the study called Lola. The Celera assembler failed and the team achieved the best results with the Minimus 2 assembler, but even that one resulted in low assembly rates of between 8 percent and 17 percent of the reads. The low assembly rates could be attributed to "extremely high species richness and community evenness in our samples, leading to relatively few reads from any single genome," the authors wrote.
Sharon said that while the group did not accomplish its main goal of using the long reads to generate multiple complete assemblies, they did make a number of findings that would not have been possible with short-read sequencing technology alone.
"There were thousands of species in each sample, and for most genomes we got only a few long reads," he said, making it difficult to generate assemblies.
For instance, for one of the most abundant species, RBG-1, whose genome has previously been sequenced, only 1x coverage was obtained from the long read data across all three samples.
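Why 1x coverage is too little to assemble a genome can be illustrated with the classic Lander-Waterman approximation, which assumes reads land uniformly at random. The study does not report using this model, and the genome size and read count below are hypothetical values picked only to give 1x coverage.

```python
import math

def coverage(n_reads, read_len_bp, genome_bp):
    """Average depth of coverage: total sequenced bases over genome length."""
    return n_reads * read_len_bp / genome_bp

def expected_fraction_covered(depth):
    """Lander-Waterman estimate of the genome fraction hit by at least one read."""
    return 1.0 - math.exp(-depth)

# Hypothetical example: 500 reads of 8 kbp on a 4 Mbp genome gives 1x coverage.
depth = coverage(n_reads=500, read_len_bp=8_000, genome_bp=4_000_000)
print(f"depth: {depth:.1f}x")
print(f"expected fraction covered: {expected_fraction_covered(depth):.1%}")
```

In this idealized model, 1x depth leaves roughly a third of the genome with no read at all, and most covered positions are seen only once, so there are few overlaps for an assembler to join.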
However, using the long reads the researchers found that they could "characterize parts of the community that we couldn't otherwise characterize using the short reads," he said.
The long reads helped pinpoint the rare species as well as some of their metabolic characteristics, because although the team could not assemble the complete genome of those species, the long reads did provide insight into some of the genes contained in the genomes.
The researchers used both the long-read and short-read data to examine 16S rRNA and rpS3 genes and assess the species composition of the community.
They found that the long-read data was able to identify many more 16S rRNA genes, indicating that short-read sequence data cannot distinguish between closely related species or highly variable strains. Of the 16S rRNA genes identified from the long-read data, approximately one-third were from closely related genomes, the authors reported.
Looking at the rpS3 genes, the short-read sequence data did not pick out the rare species, but nearly all the rpS3 genes covered at less than 2-fold were identified by the long-read data. In addition, even some of the more abundant species were not recovered by the short-read data. For instance, the most abundant species from one of the samples is from the Aminicenantes phylum. The long-read data recovered four copies of the rpS3 gene and five 16S rRNA genes, while none were identified in the short-read assembly.
In the other two samples, short-read assembly failed to detect the most abundant species. Sharon explained that even though the species were the most abundant, slight differences in the individual genomes make assembly fail. In closely related species or strains, nearly 90 percent of the genome can be identical, Sharon said.
"But, using long reads, we got large portions of the genome, and using methods we developed were able to characterize populations that were completely missed in the samples we analyzed with short-read data," he said. One of the most abundant species, which accounted for about 15 percent of the community, "was completely missed by short-read sequence data," he said. "But using long-read data we were able to characterize them and reconstruct the genome. This is the main advantage we found of this technology."
Another advantage of the long-read technology is its accuracy, Sharon said. For the RBG-1 genome, the team generated 87 contigs and 99 unassembled long reads that aligned to and covered about 75 percent of the genome. Of the 186 sequences, 162 aligned at more than 99 percent of their entire length, while 22 sequences only partially aligned. The authors said the discrepancies in those 22 sequences could have come from other genotypes of RBG-1 present at low levels or from errors in the assemblies of the reads. The remaining two sequences had local misassemblies.
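The RBG-1 alignment figures can be checked with a short tally. The numbers come from the paragraph above; the variable names and the percentage breakdown are added here for illustration.

```python
# Counts reported for the RBG-1 comparison.
contigs, unassembled_reads = 87, 99
total = contigs + unassembled_reads

fully_aligned, partially_aligned, misassembled = 162, 22, 2

# The three alignment categories should account for every sequence.
assert fully_aligned + partially_aligned + misassembled == total

for label, count in [
    ("aligned over >99% of length", fully_aligned),
    ("partially aligned", partially_aligned),
    ("local misassembly", misassembled),
]:
    print(f"{label}: {count}/{total} = {count / total:.1%}")
```

By this count, about 87 percent of the sequences aligned over nearly their full length, about 12 percent aligned partially, and about 1 percent had local misassemblies.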
Sharon said that the team is now considering using Illumina's long-read technology to study the human microbiome. Because the human microbiome is less complex than the sediment communities the team initially studied, the technology could potentially allow them to reconstruct more complete genomes.
"One main advantage of this technology is that the reads are very accurate," Sharon said. But, he added, "getting more reads would be better and allow us to get more assemblies" from metagenomic samples.