Researchers from Aalborg University in Denmark have developed a sequence-independent binning approach to assemble genomes of rare, uncultured bacteria from metagenomic samples, a technique that can be used to better understand the bacterial species present in environmental samples.
The method was published online this week in Nature Biotechnology. Its developers claim that it improves upon previous binning approaches for metagenomics.
"It has been nearly impossible to assemble individual genomes from a metagenome," Per Nielsen, a professor in the department of biotechnology, chemistry, and environmental engineering at Aalborg University and senior author of the study, told In Sequence.
To sequence rare bacterial species from metagenomic samples, the team first performed shotgun sequencing of two metagenomes obtained from the same sample using two different DNA extraction techniques. That way, while the bacterial makeup would be the same, the relative abundance of each species would differ slightly, said Nielsen.
The metagenomic paired-end sequencing, done on the Illumina HiSeq 2000, generated 29 gigabases and 57 gigabases of data for the two metagenomes, with an average read length of 124 bases.
Next, the team de novo assembled the larger data set, generating 423 megabases of scaffolds ranging in length from 1 kilobase to 3.6 megabases. They then mapped the reads from each of the two metagenomes to the scaffolds separately, which gave two coverage estimates for every scaffold. The scaffolds were then binned into putative population genomes by plotting the two coverage estimates against each other: scaffolds that cluster together represent putative population bins.
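The coverage-based binning step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scaffold names, lengths, read counts, and coverage windows are invented toy values, and `mean_coverage` and `select_bin` are hypothetical helper names. The sketch assumes that coverage can be approximated as mapped bases divided by scaffold length, and that a bin can be picked by drawing a rectangle around a cluster in the two-coverage plot.

```python
def mean_coverage(mapped_reads, read_length, scaffold_length):
    """Approximate mean coverage: total mapped bases divided by scaffold length."""
    return mapped_reads * read_length / scaffold_length


def select_bin(coverages, cov_a_range, cov_b_range):
    """Return scaffolds whose two coverage values both fall inside a rectangle."""
    lo_a, hi_a = cov_a_range
    lo_b, hi_b = cov_b_range
    return [
        name
        for name, (cov_a, cov_b) in coverages.items()
        if lo_a <= cov_a <= hi_a and lo_b <= cov_b <= hi_b
    ]


READ_LENGTH = 124  # average read length reported in the article

# Toy data (invented): scaffold -> (length, reads mapped from extraction A,
# reads mapped from extraction B)
raw = {
    "scaffold_1": (50_000, 40_000, 10_000),
    "scaffold_2": (20_000, 16_500, 4_000),
    "scaffold_3": (80_000, 6_500, 26_000),
    "scaffold_4": (10_000, 800, 3_300),
}

# Two coverage estimates per scaffold, one for each metagenome.
coverages = {
    name: (
        mean_coverage(reads_a, READ_LENGTH, length),
        mean_coverage(reads_b, READ_LENGTH, length),
    )
    for name, (length, reads_a, reads_b) in raw.items()
}

# Scaffolds from the same population should share both coverage values,
# so each rectangle (an assumed, hand-picked window) captures one cluster.
bin_x = select_bin(coverages, (80, 120), (20, 30))
bin_y = select_bin(coverages, (8, 12), (35, 45))
print(bin_x)  # ['scaffold_1', 'scaffold_2']
print(bin_y)  # ['scaffold_3', 'scaffold_4']
```

In this toy example the two scaffolds in each bin have nearly identical coverage in both metagenomes, which is the signal the approach relies on. Two populations with similar abundance in one extraction can still separate if their abundances differ in the other.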
These bins could then be further refined using sequence-dependent methods, such as principal component analysis of tetranucleotide frequencies, which separates species by their characteristic genomic signatures, so that each bin represents one species. Conserved marker genes and GC content were also used to help refine the bins.
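As a rough illustration of the sequence-dependent features mentioned here, the following Python sketch computes GC content and a tetranucleotide frequency vector for a single sequence. The function names are hypothetical and the input is a toy string. The sketch counts overlapping four-base windows on one strand only, whereas a real analysis might also collapse reverse complements. The resulting 256-value vectors, one per scaffold, are the kind of input that a principal component analysis would then take.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq):
    """Relative frequency of each tetranucleotide over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases such as N are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


toy_scaffold = "ACGTACGTAC"  # invented example sequence
freqs = tetranucleotide_frequencies(toy_scaffold)
print(gc_content(toy_scaffold))         # 0.5
print(freqs[TETRAMERS.index("ACGT")])   # 2 of the 7 windows are ACGT
```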
In total, the team identified 31 population bins, which included rare bacterial populations, the rarest of which was present at 0.02 percent relative abundance. Of those, the 13 most-complete population genomes were further improved.
Four of those refined genomes were from rare populations of candidate phylum TM7, a bacterial group about which little is known but which is suspected to be involved in human disease, said Nielsen. "We were able to assemble four genomes, and one of them completely as a circular chromosome," he said. Before now, only four partial single-cell genomes were available for the phylum, said Nielsen, and all of them are less than 50 percent complete.
The phylum is frequently found as a "minor but persistent constituent of microbial communities," the authors wrote, and TM7 bacteria are "widespread in natural and engineered ecosystems, and are found in humans where they have been implicated in oral and gut inflammation."
The authors found that the genome itself was small, only around 1 megabase.
The approach is similar to other binning methods except that it uses coverage to make the initial bins and sequence-dependent information to refine them, whereas other binning approaches rely more heavily on tetranucleotide frequencies and other sequence-dependent information.
The team compared its approach to another method, published last year in Science, by re-analyzing a set of three metagenomic data sets obtained from an aquifer sample. Compared to the original study, the team found that its method resulted in "more complete genome bins" that had "less contamination from other populations." For instance, in one genome, the Aalborg team identified 101 essential genes versus 89 for the other method.
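The article does not describe how completeness and contamination were scored. One common way to reason about them, assumed here purely for illustration, is to count conserved marker genes that are expected once per genome: more distinct markers suggests a more complete bin, while markers found more than once hint at scaffolds from another population. The marker names and hit lists below are invented, and `marker_summary` is a hypothetical helper.

```python
from collections import Counter


def marker_summary(marker_hits):
    """Return (distinct markers found, markers found more than once).

    Assumes every marker is expected exactly once per genome, so repeated
    hits are treated as a rough sign of contamination.
    """
    counts = Counter(marker_hits)
    distinct = len(counts)
    duplicated = sum(1 for n in counts.values() if n > 1)
    return distinct, duplicated


# Toy marker hits for the same population binned by two methods (invented).
bin_method_1 = ["gene_A", "gene_B", "gene_C", "gene_D", "gene_E"]
bin_method_2 = ["gene_A", "gene_B", "gene_B", "gene_D"]

print(marker_summary(bin_method_1))  # (5, 0): more markers, no duplicates
print(marker_summary(bin_method_2))  # (3, 1): fewer markers, one duplicated
```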
Paul Blainey, a core member at the Broad Institute and assistant professor at MIT, who has worked on single-cell techniques for microbial sequencing, said the method was "important for the field," and a "new twist" on previous metagenomic binning approaches. "They shift the emphasis from nucleotides to coverage depth as a primary means of stratifying the samples," he said.
Blainey said that one characteristic of the approach, and of binning approaches in general, is that they yield "composite" genomes. Because these methods start with a heterogeneous sample, it is "hard to say which allelic variants you might find go with which other allelic variants, so it's hard to make the case that a population genome that you sequence using this method actually represents the genome of any particular cell."
He said that the approach would work well with single-cell sequencing, which has the advantage of "getting rid of heterogeneity within the sample," but has the disadvantage that data quality is poorer.
Nielsen said that his group will continue to use the method to study different ecosystems and assemble relevant reference genomes to enable the study of the function of the bacteria within a specific ecosystem. "We want to look at gene expression, protein expression, all 'omics within a specific ecosystem … and to do that we need the genomes," he said.
Further improvements to the method would be enabled by longer read lengths, he said, which will make it "easier and higher quality." He plans to continue to use Illumina sequencing for the time being rather than the longer reads of 454 or Pacific Biosciences, because depth of sequencing is also important. However, as the PacBio system's throughput continues to increase, Nielsen said he would consider testing it.