NEW YORK (GenomeWeb News) – By combining two pre-assembly filtering approaches, researchers led by Michigan State University's Titus Brown were able to turn complex metagenomic datasets, including data from two soil samples, into ones that are comparatively easier to assemble and use.
As they reported in an early-edition Proceedings of the National Academy of Sciences paper to be published this week, the researchers combined digital normalization and partitioning to process metagenomic sequencing samples before assembling them.
Brown and his colleagues first tested their approach on a characterized gut microbiome sample before applying it to the de novo assembly of two soil metagenome samples, one from a Midwestern cornfield and one from a native prairie site.
Metagenomic sample datasets can be cumbersome and difficult to analyze because of their vast size. For instance, the researchers noted that for the soil experiment, they generated nearly 398 billion basepairs, or the rough equivalent of 88,000 Escherichia coli genomes or 130 human genomes.
This normalization and partitioning method, they reported, allowed the dataset to be compressed with minimal loss of information and separated into biologically relevant subsets for assembly.
"We are actually converting standard, heavyweight approaches in biological sequence analysis to an ultraefficient streaming approach," Brown said in a statement.
Still, the researchers found that their soil assemblies were limited by low coverage, suggesting that even higher sequencing coverage is needed to functionally characterize soil.
"It's one of the most diverse microbial habitats on Earth, yet we know surprisingly little about the identities and functions of the microbes inhabiting soil," said Jim Tiedje from Michigan State.
This approach to scaling and improving metagenomic assembly relies on a combination of digital normalization and partitioning. Digital normalization reduces the size of the dataset by leaving out reads from high-coverage regions and, in this case, scaling the assembly by sample richness rather than by diversity. Partitioning then divides the metagenomic dataset based on de Bruijn graph connectivity to separate sequences broadly by species. Each partition is then assembled separately.
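The normalization step can be illustrated with a minimal Python sketch. The article does not spell out the mechanism, so this assumes the commonly described streaming form of digital normalization: a read's coverage is estimated as the median count of its k-mers among the reads kept so far, and the read is dropped once that estimate reaches a cutoff. The function names, the default `k` and `cutoff` values, and the exact in-memory counter are illustrative assumptions, not the researchers' implementation, and reverse-complement strands are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Assumption of this sketch: coverage is estimated as the median count
    of the read's k-mers among the reads kept so far. An exact Counter
    stands in for whatever memory-efficient counting structure a real
    tool would use.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept
```

Because each read is examined once and only kept reads update the counts, reads from heavily sampled regions stop accumulating after the cutoff while reads from rare regions pass through untouched.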
The researchers first tested this approach on a human-gut mock community (HGMC) dataset containing 21 known genomes with isolates present at varying abundance levels.
After sequencing, the HGMC dataset included about 93 percent of the genomic content of the reference genomes, Brown and colleagues said, and after digital normalization — which included some 40 percent of the total reads — reference genome coverage was about 91 percent.
Then after assembling both datasets using Velvet, the researchers retained 43 percent and 44 percent of the reference genomes from the original and filtered assemblies, respectively. Similarly, the unfiltered and filtered assemblies shared 95 percent of genomic content. Additionally, the researchers noted that the filtered assembly more closely resembled abundances predicted from reference genomes.
The mock metagenome sample was also partitioned into more than 85,800 disconnected partitions containing some 9 million total reads, and fewer than 3 percent of partitions contained reads from more than one genome. The researchers noted, though, that the number of partitions depended on the sequencing coverage — reference genomes with high coverage were associated with fewer partitions.
Partitioning, they added, also did not affect the assembly, as mock metagenome assemblies that were partitioned shared 99 percent of their genome content with those that had not been partitioned.
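As a rough illustration of partitioning by graph connectivity, the Python sketch below groups reads into disconnected partitions. It assumes a de Bruijn graph whose nodes are k-mers, with two k-mers linked when one can follow the other with a k-1 base overlap, and it uses a union-find structure to track connected components. The function name `partition_reads` and the handling of short reads are hypothetical choices for this example, reverse complements are ignored, and none of it is taken from the researchers' software.

```python
from collections import defaultdict


def partition_reads(reads, k):
    """Group reads whose k-mers fall in the same connected component."""
    parent = {}

    def find(x):
        # Locate the component root, then compress the path to it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Register every k-mer seen in the reads as a graph node.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            parent.setdefault(kmer, kmer)

    # Link each k-mer to any observed successor that overlaps it by k-1 bases.
    for kmer in list(parent):
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in parent:
                union(kmer, successor)

    # A read belongs to the component of its first k-mer; shorter reads are skipped.
    partitions = defaultdict(list)
    for read in reads:
        if len(read) >= k:
            partitions[find(read[:k])].append(read)
    return list(partitions.values())
```

Each returned list could then be handed to an assembler on its own, which is what keeps the memory footprint of any single assembly small.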
"Our evaluation of the mock metagenome suggests that this information loss is minimal overall and that our approach results in a comparable assembly whose abundance estimations are slightly improved," the researchers said in the paper.
Brown and colleagues then applied their approach to the de novo assembly of two soil metagenomes. Prior to filtering, the corn and prairie soil datasets contained 1.8 billion and 3.3 billion reads, respectively, and could not, the researchers noted, be assembled by Velvet in 500 GB of RAM.
After normalization to a sequencing depth of 20, the Iowa corn and prairie datasets decreased in size to 1.4 billion and 2.2 billion reads, respectively, and partitioning further decreased them to 1.0 billion and 1.7 billion reads. The datasets were also divvied into 31.5 million and 56.0 million respective partitions.
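The size of these reductions is easier to compare as a fraction of the starting reads. The short Python calculation below uses only the read counts quoted in the article; the dictionary layout and the helper name `retained_fraction` are illustrative.

```python
# Read counts in billions, as quoted in the article.
datasets = {
    "corn": {"raw": 1.8, "normalized": 1.4, "partitioned": 1.0},
    "prairie": {"raw": 3.3, "normalized": 2.2, "partitioned": 1.7},
}


def retained_fraction(counts, stage):
    """Fraction of the original reads still present at a given stage."""
    return counts[stage] / counts["raw"]


for name, counts in datasets.items():
    after_norm = retained_fraction(counts, "normalized")
    after_part = retained_fraction(counts, "partitioned")
    print(f"{name}: {after_norm:.0%} of reads after normalization, "
          f"{after_part:.0%} after partitioning")
```

On these figures, roughly 78 percent of the corn reads and 67 percent of the prairie reads remained after normalization, falling to about 56 percent and 52 percent after partitioning.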
"What this gives us is a 2- to 200-fold decrease in computational requirements for the actual biological analysis," Brown said.
Assembly of the corn and prairie soil metagenomes yielded a total of 1.9 million and 3.1 million contigs, respectively, with respective assembly lengths of 912 million basepairs and 1.5 billion basepairs.
Still, they noted, the coverage of each metagenome was as low as 48 percent, and 31 percent of the total contigs in the Iowa corn and prairie assemblies had a read coverage of less than 10.
Using the MG-RAST pipeline, the researchers annotated the assembled contigs, finding nearly 2.1 million and 3.5 million predicted protein-coding regions in the corn and prairie metagenomes, respectively.
"In our soil assemblies, we identified millions of putative genes, with hundreds of thousands of functions, even though only 10 [percent] of sequences were sufficiently sampled for assembly," the researchers said. "The resulting corn and prairie soil metagenome assemblies resulted in a total length of 912 million [basepairs] and 1.5 billion [basepairs], respectively, equivalent to [about] 500 E. coli genomes' worth of DNA."
Additionally, drawing on the MG-RAST Kyoto Encyclopedia of Genes and Genomes Orthology database, the researchers found some 3,533 unique KO identifiers that were associated broadly with metabolic functions. Of these, 2,201 were shared between the samples and 1,129 were found only in the prairie sample.
"This result may reflect the varying management history of these two soils," Brown and colleagues said. "Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 [years] and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity."
Overall, the researchers reported that their approach scales the data and separates it into partitions that are small enough for a number of genomic analysis tools to be applied to them, using fewer computational resources.
They also noted that their assembly showed that 300 gigabase pairs of read data is not enough to cover even a small soil sample deeply. "[C]onsiderably more data are needed to study the content of soil metagenomes comprehensively," the authors said.