A team from Yale University, the University of Chicago, and Argonne National Laboratory has developed a computational approach called Kmerspectrumanalyzer that shows promise for quantifying the sizes and repeat contents of not-yet-assembled bacterial and archaeal genomes.
"Kmerspectrumanalyzer primarily reports repeat content and overall genome size," David Williams, an ecology and evolutionary biology researcher at Yale University, told In Sequence in an email message.
"[T]his information can be useful for further analyses aimed at recovering repeated sequences from read data," he said. "Furthermore, optimizing short-read assembly algorithms can be aided by knowing the total genome size as reported by Kmerspectrumanalyzer."
In a study appearing online last week in BMC Genomics, Williams and his colleagues outlined the rationale behind Kmerspectrumanalyzer, which considers the spectrum of distinct 21-base sequences tallied from available short-read data, and assessed the tool's performance using new and existing microbial datasets.
By using the method to assess whole-genome shotgun sequence data for five Escherichia coli strains sequenced from an environmental source, for instance, the team picked up variations between the genomes, while also obtaining genome size estimates that seemed to jibe with the E. coli reference genome and with genome sizes ascertained experimentally for the new strains.
Likewise, the approach appeared to produce fairly accurate genome size and repeat content estimates when the study's authors plugged in read data for several already-sequenced bacterial genomes, even those as large as 9 million bases or as small as 5,386 bases.
"We introduce a straightforward methodology that provides information about the repeat structure of genomes that is ordinarily missing from assemblies of short reads," they wrote. "This additional information offers new insights about genome diversity and evolution that can be gained through the analysis of novel datasets or through the re-analysis of the large volumes of archived short read [data]."
Such an approach is expected to be especially useful because the short-read sequence data that can be plugged into Kmerspectrumanalyzer tends to be difficult to assemble into complete genomes on its own, without additional read types that can link contigs or resolve complicated or repeat-rich regions.
The notion of relying on k-mers, stretches of sequence that are a pre-determined number of nucleotides long, as a source of information about a genome's sequence and structure is not new, the study's authors noted. Rather, k-mers have long been used for everything from assembling and error-correcting short-read data to finding unusual genome regions and comparing genomes to one another.
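To make the idea of a k-mer spectrum concrete, here is a minimal Python sketch. It is an illustration only, not code from the study: the function names and the toy reads are invented for the example, and a short k of 4 is used so the counts can be checked by hand.

```python
from collections import Counter

def count_kmers(reads, k):
    """Tally every length-k substring across a list of read strings."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

# Toy reads, not real sequencing data.
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)
```

The spectrum is simply a histogram of how many distinct k-mers occur once, twice, three times, and so on. That histogram, computed for 21-mers, is the kind of summary the tool is described as analyzing.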
"Existing algorithms exploit k-mer information from short-read data for many purposes including error correction and assembly," Williams said. "The implementation of our algorithm has been optimized for use with high-coverage data typical of current sequencing technologies in contrast [to] older k-mer-based approaches for estimating genome size and repeat structure."
Because limitations related to computer memory can sometimes make it tricky to tally up k-mers longer than 18 bases or so, the researchers turned to an existing hash-based k-mer counting tool called Jellyfish that can extend the length of countable k-mers to around 31 bases.
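Some rough arithmetic suggests why k-mer length and memory are linked. The sketch below is an illustration under stated assumptions, not a description of how Jellyfish works internally. It assumes a naive counter that reserves one slot for every possible k-mer, and a compact encoding of 2 bits per base.

```python
def possible_kmers(k):
    """Number of distinct DNA strings of length k over A, C, G, T."""
    return 4 ** k

def bits_per_kmer(k):
    """Bits needed to store one k-mer at an assumed 2 bits per base."""
    return 2 * k

for k in (18, 21, 31):
    print(k, possible_kmers(k), bits_per_kmer(k))
```

Under these assumptions, a table with one slot per possible 18-mer would already need about 69 billion entries, which is why counting only the k-mers actually observed, as a hash-based tool does, becomes attractive. A 31-mer at 2 bits per base takes 62 bits and so fits in a single 64-bit word, which is one plausible reading of both the 31-base ceiling and the 64-bit CPU requirement mentioned in the article.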
By plugging short-read bacterial genome sequence data into Jellyfish and then using Kmerspectrumanalyzer to profile the resulting k-mer patterns, the team saw that it was able to glean additional insights into the size, structure, and variability of the genomes.
"Early development of Kmerspectrumanalyzer by [study co-authors] Will Trimble and Folker Meyer at the University of Chicago was aimed at characterizing read data to assess the quality of sequencing runs," Williams told IS.
"We teamed up to continue development towards accurately inferring the total size and the amount of repeated sequence in a target genome directly from raw read data," he said. "This is information that a conventional genome assembly cannot provide."
At the moment, the group is focusing on k-mers that are 21 bases long — a k-mer size that appears optimal for assessing typical microbial genomes.
"A k-mer length of 21 is biologically relevant to bacterial genomes: it is short enough to resolve small repeated elements but long enough to distinguish between single-copy protein-encoding regions by spanning the sequence that defines such regions as unique," Williams said.
For the current study, he and his co-authors took a crack at applying the Kmerspectrumanalyzer-based analysis to whole-genome shotgun sequence reads generated for five E. coli strains collected at a water treatment plant in California.
For each of the five environmental E. coli strains, they used the Illumina HiSeq 2000 to generate 76-base paired-end reads that covered each genome to an average depth of between 55-fold and 85-fold.
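Because the analysis works on 21-mers rather than whole reads, the depth at which each k-mer is seen is somewhat lower than the read coverage. A standard back-of-the-envelope conversion, which is not taken from the study and ignores sequencing errors, scales read coverage by the fraction of positions in a read that can start a full k-mer. The function name below is hypothetical.

```python
def kmer_coverage(read_coverage, read_length, k):
    """Expected k-mer depth for a given read depth, ignoring sequencing errors.

    A read of length L contributes L - k + 1 k-mers, so k-mer depth is
    read depth scaled by (L - k + 1) / L.
    """
    return read_coverage * (read_length - k + 1) / read_length

low = kmer_coverage(55, read_length=76, k=21)
high = kmer_coverage(85, read_length=76, k=21)
print(round(low, 1), round(high, 1))
```

For 76-base reads and 21-mers, the scaling factor is 56/76, so 55-fold to 85-fold read coverage corresponds to roughly 41-fold to 63-fold 21-mer coverage.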
Using their computational strategy to analyze the 21-mer profiles in the datasets, the team picked up fairly subtle size and structural differences between the E. coli strains. It also came up with genome size and repeat content estimates that were on par with those measured by other means such as pulsed-field gel electrophoresis.
Even so, the researchers did see slight differences in the genome sizes predicted by Kmerspectrumanalyzer and those found by PFGE — discrepancies that they attributed to the differences in the way the computational and experimental methods deal with sequences found on E. coli's multi-copy plasmids.
To check the Kmerspectrumanalyzer results, they also compared the size and repeat profiles predicted for the newly sequenced strains with those found in an existing DH1 reference genome sequence. In general, though, a reference sequence is not needed to apply the Kmerspectrumanalyzer approach, making it amenable to uncharacterized organisms, Williams said.
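The general logic of pulling a genome size and repeat fraction out of a k-mer spectrum without a reference can be sketched in a few lines of Python. This is a deliberately crude peak-based heuristic under stated assumptions, not the fitting procedure used by Kmerspectrumanalyzer, which the article does not describe in detail. The function name, the `min_abundance` error cutoff, the 1.5x repeat threshold, and the toy spectrum are all invented for illustration.

```python
def estimate_genome_size(spectrum, min_abundance=5):
    """Crude genome size and repeat fraction from a k-mer spectrum.

    spectrum: dict mapping abundance -> number of distinct k-mers
              observed at that abundance.
    min_abundance: hypothetical cutoff; rarer k-mers are treated as
                   sequencing errors and ignored.
    """
    trusted = {a: n for a, n in spectrum.items() if a >= min_abundance}
    if not trusted:
        raise ValueError("no k-mers at or above min_abundance")
    # Assume the most common abundance is the depth of single-copy sequence.
    peak = max(trusted, key=trusted.get)
    # Total trusted k-mer occurrences divided by single-copy depth.
    total = sum(a * n for a, n in trusted.items())
    size = total / peak
    # Count sequence well above single-copy depth as repeated.
    repeated = sum(a * n for a, n in trusted.items() if a >= 1.5 * peak) / peak
    return size, repeated / size

# Toy spectrum, not real data: error k-mers at depth 1, single-copy
# sequence at depth 40, two-copy and three-copy repeats at 80 and 120.
toy = {1: 50_000, 40: 4_000_000, 80: 100_000, 120: 10_000}
size, repeat_fraction = estimate_genome_size(toy)
```

With the toy spectrum, the estimate comes out to 4,230,000 bases with about 5.4 percent repeated sequence. In a sketch like this, sequence carried on a multi-copy plasmid would be counted once per copy, which is one way a k-mer-based estimate could diverge from a gel-based measurement.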
The team's computational approach appeared to produce fairly accurate size estimates for other microbial genomes, too.
For their follow-up analyses, the researchers applied Kmerspectrumanalyzer to short-read datasets for 19 previously sequenced bugs, including Niastella koreensis, a bacterial species with a genome coming in at around 9 million bases.
For the large N. koreensis genome, Kmerspectrumanalyzer generated a genome size estimate that was off by around 642,000 bases. The relative disparity between documented and predicted genome sizes was a bit higher for another bug, Listeria monocytogenes. For that species, Kmerspectrumanalyzer-based genome size estimates were roughly 12 percent different from the documented genome size.
Again, some of these differences seemed to stem from the presence of bacterial plasmids, the study's authors noted, though Kmerspectrumanalyzer's accuracy also appears to get a boost when the datasets on hand contain reads that cover a given genome fairly evenly.
For those considering the possibility of applying the computational approach to eukaryotic genomes, Williams noted that it would likely be necessary to adjust the lengths of k-mers considered in order to optimize the Kmerspectrumanalyzer tool for the larger genomes.
He also cautioned that it may be difficult to get accurate genome size and repeat content estimates for genomes containing very complex repeat structures and/or for genomes represented by low depths of sequence coverage.
Though their own experiments hinged on HiSeq 2000 reads, Williams noted that Kmerspectrumanalyzer's applicability does not appear to be limited to datasets generated on Illumina instruments.
"Although we've confirmed that Kmerspectrumanalyzer works well with data from the Illumina platform, in principle, the algorithm and tool are compatible with data generated by any sequencing technology," he said.
Kmerspectrumanalyzer is being made available to other members of the research community as an open source tool, Williams noted. Both the software and additional scripts used to retrieve, process, and produce the data presented in the current BMC Genomics study can be downloaded from a metagenomics analysis server on GitHub.
For genomes that are less than 10 million bases or so, the tool can be run on any standard lab computer that has at least a 64-bit Intel CPU, which is required to run the Jellyfish k-mer counting tool used in conjunction with Kmerspectrumanalyzer.
The team also intends to make Kmerspectrumanalyzer available at KBase, a free, online software and data site developed by researchers at Argonne National Laboratory and elsewhere, which provides cloud computing power supported by the US Department of Energy.