Michigan State University researchers have developed two computational tools that reduce the amount of memory needed to assemble large quantities of sequence data generated by short-read shotgun sequencing technologies.
The developers have published a description of one of the methods, which has been implemented in a software package dubbed khmer, in the Proceedings of the National Academy of Sciences. A separate paper describing the second algorithm, named digital normalization, or diginorm for short, is being reviewed for publication in PLoS One. For now, a preprint of the paper is available online.
Both methods — which were developed primarily for metagenomic data — shrink these datasets prior to assembly: one by subdividing reads into independent subsets, the other by discarding redundant reads and reads that contain sequencing errors.
This cuts down on the large amounts of memory needed by assemblers like Velvet and ABySS, which have high memory requirements because users have to load in all the data before errors can be corrected, C. Titus Brown, an assistant professor of bioinformatics at MSU and a co-developer of both methods, explained to BioInform this week.
This isn’t practical for soil metagenome studies, for example, where researchers can generate about 50 terabases of data from a gram of soil. This is about 1,000 times more data than generated for the initial human genome project and would require the storage capacity of 50 laptops, according to Brown.
However, rather than building a new assembler that has smaller memory needs, “we are preprocessing the data so that other assemblers that people have already written … can take the data and deal with it more efficiently,” he told BioInform.
“We are reasonably happy with the approaches that already exist for assembly, but we are unhappy with the fact that we can’t run them on the actual data that we have” because of the size of the sequence set, he said.
He noted that even though there are algorithms specifically for de novo assembly of metagenomes, such as MetaVelvet and Meta-IDBA, “the real problem with all of those is none of them are focused on the scaling problem so they don’t take a big problem and make it smaller,” which is where the MSU team focused its efforts.
The PNAS paper explains that underlying khmer is a probabilistic data structure called a Bloom filter that partitions de Bruijn assembly graphs into “disconnected graph components” prior to assembly, which can be used to “subdivide sequencing reads into disconnected subsets that can be assembled separately.”
In other words, “partitioning … takes all of the reads that you have and then it breaks them up based on connectivity in the graph,” Brown explained.
As an illustration, in the case of genomes from Escherichia coli and Salmonella, “you could say this set of reads belong to one genome, and this other set of reads belong to the other because they don’t connect to each other,” he explained.
This lets users “take the piles of reads and divide [them] into two different piles that are much smaller than the original pile. If you extend that to a situation where you have, say, a million genomes…if you can do that efficiently, you could really scale that problem because now you are no longer talking about trying to assemble a million genomes all together; you are talking about a million different assemblies of individual genomes,” he said.
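In spirit, that partitioning step can be sketched as a connected-components computation over reads. The toy code below is a hypothetical illustration only, not khmer's implementation (khmer stores the graph in a probabilistic Bloom filter rather than an exact table): here, any two reads that share a k-mer are treated as connected, and each connected component becomes its own, smaller pile that can be assembled independently.

```python
def kmers(seq, k=5):
    """All length-k substrings of a read, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def partition_reads(reads, k=5):
    """Group reads into connected components; reads sharing any k-mer connect."""
    parent = list(range(len(reads)))  # union-find over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in seen:
                union(idx, seen[km])
            else:
                seen[km] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(reads[idx])
    return list(groups.values())

# Two overlapping reads plus one unrelated read -> two partitions.
reads = ["ACGTACGTAC", "TACGTACGTT", "GGGGCCCCGG"]
print(len(partition_reads(reads)))  # 2
```

On real metagenomes the graph is built from hashed k-mers in fixed memory, but the output is the same in principle: disconnected subsets of reads, each small enough to hand to a conventional assembler on its own.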
According to the authors, khmer’s graph-based approach is “convenient” for storing large assemblies because, among other things, it is “efficient and compressible;” has a “fixed-memory data structure with predictable degradation of both local and global structure as more data are inserted;” and because “memory usage is independent of the k-mer size chosen,” which makes the method suitable for “exploring properties at many parameters.”
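The fixed-memory property comes from the Bloom filter itself: a bit array of a size chosen up front, into which each k-mer is hashed several times. Membership queries can yield false positives but never false negatives, and the memory footprint does not grow with the number or length of the k-mers inserted. The following is a minimal generic sketch of the data structure, not khmer's code; the class name and parameters are invented for illustration.

```python
import hashlib

class KmerBloomFilter:
    """Fixed-memory set of k-mers: false positives possible, false negatives not."""

    def __init__(self, num_bits=1_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)  # size fixed at construction

    def _positions(self, kmer):
        # Derive several pseudo-independent bit positions from salted digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

bf = KmerBloomFilter()
bf.add("ACGTACGT")
print("ACGTACGT" in bf)   # True
print("TTTTTTTT" in bf)   # almost certainly False (false positives are rare)
```

Because only hashed bit positions are stored, the same structure handles k=20 or k=64 in identical memory, which is what lets the authors explore many parameter settings without re-budgeting RAM.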
Partitioning is particularly useful because it “exploits underlying biological features of the data to divide the dataset into disconnected subsets” — in this case, it exploits the fact that microbial metagenomes contain many microbes that should not assemble together, the paper states.
Results provided in the paper show that when khmer was applied to a real soil metagenome dataset containing 35 million reads, it reduced the amount of memory required for assembly by almost 40-fold in one case: the assembly required 0.83 gigabytes of memory, compared to the 33 gigabytes that would have been required for the unpartitioned data.
While khmer was developed for use in soil microbial community studies, it could be used for other metagenomic datasets, as well as for transcriptomes, MSU’s Brown said.
Commenting on the MSU paper, Inanc Birol, a bioinformatics group leader at the Genome Sciences Centre of the BC Cancer Agency, agreed with the study’s premise, noting that “de novo assembly is still an enabling technology in biology research, and one of the beneficiaries is metagenomics studies for sure.”
“It takes the field beyond tag counting to reconstruction of the details of contributing genomes, and lets researchers discover novel genes and interactions. Inherently the data structure resulting from an assembly is a faithful description of what is in a sample,” he told BioInform in an e-mail.
Birol, who was not involved in the MSU study, noted that although Bloom filters have been used in bioinformatics tools and are a “promising technology,” they haven’t had “much traction in the field (yet), perhaps due to competition from exact compact representations based on Burrows-Wheeler transformation and Ferragina-Manzini indexing.”
“I think the fresh perspective offered by [the paper] should be welcome in the field,” he said.
In terms of the method itself, Birol said that partitioning a de Bruijn graph before assembly “is a clever trick that trades memory use with run time.” He noted, however, that “before I get really excited … I would love to see the overall resource utilization by applying this procedure. Does it really change the CPU-hours figure for the better in assembly of experimental data?”
He also noted that MSU’s approach to metagenome assemblies, which is “substantially different” from the one his team uses, could be a helpful step for smaller research laboratories with limited compute power.
“We treat metagenome data similar to RNA-seq data. Where one has different expression levels between genes, the other has different abundance levels; one has alternative splicing, the other phylogenetically nearby species,” he explained. “Consequently, in metagenomics studies we apply an assembly protocol trained for transcriptome data,” he said.
“But, of course we assemble our data on compute farms where nodes typically have 48 GB of memory, and distribute the process and the memory load when necessary,” he added.
Brown presented diginorm — which is best suited for microbial genomes, amplified single-cell genomic data, and transcriptome data — last month at the Bioinformatics Open Source Conference prior to the Intelligent Systems for Molecular Biology conference in Long Beach, Calif.
The preprint of the paper describes diginorm as a “single pass” computational algorithm “that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors.”
In other words, it is a method of “lossy compression” that “provides an … efficient way to subselect reads based on whether or not they add new information,” Brown said.
“The idea behind lossy compression is that there is a lot of noise in these datasets … [which] makes it difficult to scale … things like assembly in particular because you have to store all of the [reads with] errors first and then resolve them afterwards,” he explained. With digital normalization, on the other hand, “you are essentially only keeping reads that give you additional coverage in certain regions and, once you have enough coverage, throwing out all the reads that would add additional coverage.”
Diginorm selects which sequence reads to drop by first building a de Bruijn assembly graph and then looking at reads in portions of the graph and asking whether those portions have shown up multiple times, he explained. If that’s the case, then “we don’t need this read,” he said.
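A toy version of that rule can be written in a few lines. This is a hypothetical sketch, not the khmer implementation, which uses a fixed-memory counting structure rather than the exact hash table used here: in a single pass, a read is kept only when the median count of its k-mers, among reads already kept, falls below a coverage cutoff; the cutoff and k values below are illustrative.

```python
from collections import Counter
from statistics import median

def digital_normalize(reads, k=5, cutoff=2):
    """Single-pass digital normalization sketch: keep a read only if the
    median count of its k-mers (over reads kept so far) is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)  # only kept reads contribute to coverage
    return kept

# Five copies of one region plus one novel read: the duplicates are dropped.
reads = ["ACGTACGTAC"] * 5 + ["GGGGCCCCGG"]
print(len(digital_normalize(reads)))  # 2
```

The key property is that the decision is streaming: each read is judged against the coverage accumulated so far, so redundant reads are discarded without ever holding the full dataset in memory.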
With this approach, diginorm “substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly … without significantly impacting content of the generated contigs,” the preprint states.
For many samples, the diginorm algorithm enabled the researchers to discard more than 95 percent of reads prior to assembly, Brown said at BOSC.
During the talk, he provided results of diginorm’s application to some bacterial datasets.
In one case, when the algorithm was applied to genomic data from E. coli, it reduced the number of reads from 31 million to 600,000; cut assembly time from 1,040 seconds down to 63 seconds; and needed only 0.5 gigabytes of memory, compared to the 11.2 gigabytes that would otherwise have been required, Brown said.
Similarly, when the algorithm was applied to data from a single cell of Staphylococcus aureus, the number of reads dropped from 58 million down to 300,000; assembly time fell from 4,749 seconds down to 26 seconds; and only 0.4 gigabytes of memory was needed, compared to the 52.7 gigabytes that would have been required, he said.
Brown and his colleagues plan to publish a paper that applies a combination of digital normalization and the partitioning approach to process metagenomic data.
“It turns out you need both,” he said. That’s because in metagenomes “often you’ll be sequencing different genomes that have a very high abundance and [those] that have a low abundance and so what digital normalization does is even out the abundances and throw away a lot of the redundant data that’s in these datasets” prior to the partitioning step.