Danish researchers have developed a computational search method for identifying organisms from a small subset of raw reads randomly selected from across a newly sequenced genome — a tool that they hope will be applicable in infection tracking, food safety, and other settings.
The approach, known as TAPIR, uses randomly sampled raw reads and a k-mer-based scoring algorithm to identify organisms represented within a large database of genome sequences. It can be used with or without additional software that aligns the newly sequenced reads to the reference genomes of organisms identified during the initial search step for more detailed local analyses.
"As it is now, it's kind of a demonstration that you can send just a small bit [of genome sequence] over the net with limited bandwidth," Ole Lund, a researcher with the Technical University of Denmark's Center for Biological Sequence Analysis, told In Sequence.
Lund and co-author Laurent Gautier introduced TAPIR in a PLOS One study published late last year. In that paper, the pair described the rationale for the approach and demonstrated that it can almost always identify an unknown organism to the species or sub-species level from roughly 100 raw sequence reads drawn from across the new genome.
"What we observed was that sub-samples of sequencing reads are sufficient to identify pure cultures or even identify [species] in mixtures," first author Laurent Gautier told IS.
The latter, multi-species identification, can be done using iterative features of the search scheme, noted Gautier, who was a senior researcher and head of the Technical University of Denmark's core facility when the search scheme was developed. He has since moved to the Novartis Institutes for Biomedical Research.
After identifying the first species in the mix and aligning the sequencing reads to the full genome of that organism, for example, the program can send back the unaligned reads for subsequent rounds of searching.
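That iterative loop can be outlined in a few lines of Python. This is a hypothetical client-side sketch, not the TAPIR code itself; `identify_species` and `align_to_genome` are stand-ins for the server-side k-mer search and a local aligner:

```python
def iterative_identify(reads, identify_species, align_to_genome, max_rounds=5):
    """Repeatedly identify the dominant organism in a read set, then
    resubmit only the reads that failed to align to its genome.

    `identify_species` and `align_to_genome` are placeholders for the
    server-side k-mer search and a local alignment step, respectively.
    """
    found = []
    for _ in range(max_rounds):
        if not reads:
            break
        species = identify_species(reads)  # e.g. a k-mer database search
        if species is None:
            break
        found.append(species)
        # keep only the reads that did NOT align to this organism's genome
        reads = [r for r in reads if not align_to_genome(r, species)]
    return found
```

Each round peels one organism off the mixture, so a two-species sample would typically resolve in two passes before the read set is exhausted.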
Those interested in using the search software can either plug their data into an online search engine or download a more comprehensive version of the software that returns genome sequences from species identified in the initial search. The general approach is expected to be highly scalable, according to Lund, in part due to the type of database back-end being used.
"In the future, if you get 1,000 times as much data, you have to distribute your back-end databases across several computers," Lund said. "Because it uses this kind of database back-end, you could replace the one we have with another one that has this built-in functionality of being able to spread the database over several computers."
The software came about as part of an effort to build up a system for global surveillance of food pathogens and similar infectious microbial agents in collaboration with Frank Aarestrup from the Danish food safety agency, who was not involved in the current paper.
Sequencing-based methods are increasingly being used for detecting and identifying infectious pathogens and food safety culprits and are on the cusp of replacing culture and/or PCR-based assays in that arena as sequencing prices dip.
"In a few years, next-generation sequencing will replace all kinds of other diagnostics and food safety and surveillance [methods]," Lund said.
While sequencing-based testing remains somewhat pricey and involved at the moment, he noted that it is also simpler in some respects because it centers on a single technology rather than relying on complicated, sometimes species-specific culturing approaches and/or assays.
In a study published in the Journal of Clinical Microbiology this month, for example, Aarestrup, Lund, and others explored the practical hurdles of detecting microbes of interest by whole-genome sequencing of clinical samples and considered the bioinformatics obstacles that remain to be overcome.
Anticipating a rise in reliance on sequencing, Gautier and Lund set out to find more efficient ways for accomplishing species identification from whole-genome sequence data.
As large databases of bacterial and other reference genomes continue to grow, there are still questions about how to best match a new genome sequence set to organisms within such databases.
The classical approach for identifying bacteria based on DNA sequence data hinges on 16S sequences, which have been documented and somewhat standardized. Plugging those relatively short stretches of sequence into searching software is akin to typing a short string of words into a typical web search engine, a process that's not all that computationally complex or time-consuming.
With the 16S sequence data, "you just need a few thousand nucleotides to determine what it is," Lund noted. "So the idea here is that you don't have to transfer all 200 megabytes [of genome sequence data]. You just have to transfer some of the data and then you can look up in a database what it is and give an answer back."
But searching an enormous database quickly becomes complicated and cumbersome when the query itself is very large, such as a whole-genome sequence, Lund explained. "It's not comparing a small search string with a big database, but comparing two very, very big databases."
To get around that problem, he and Gautier decided to develop a search approach based on a limited, but still genome-wide, sequence set.
With a subset of randomly selected raw reads, they demonstrated that it was possible to rapidly search a database of genome sequences from tens of thousands of organisms.
"The idea is that we are not shuffling all of the [genome] data through the network," Gautier said. Using this strategy, it becomes possible to do data-driven searches of centralized, or perhaps even distributed, databases in a manner that is not compute-intensive and requires relatively little bandwidth, he explained.
By plugging in around 100 of these randomly sampled reads, each around 100 nucleotides long, Lund explained, researchers can run roughly 10,000 nucleotides of unidentified genome sequence data against the reference database.
That's far more than the 1,600 or so bases of 16S sequence that are often used to query bacterial sequence databases, while still representing just a fraction of the overall sequence data generated.
Through a computational process that breaks the random reads into short stretches of sequence, looks each up in an index of the reference genomes, and tallies the matches per genome, the investigators can identify the source of the original DNA if the organism's sequence is part of the larger genome database.
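In outline, that index-and-tally step resembles a simple shared-k-mer counter. The sketch below is illustrative only, with toy genome strings and a made-up k-mer length; it is not the published TAPIR implementation:

```python
from collections import defaultdict

K = 8  # k-mer length; chosen for this toy example, not TAPIR's actual setting

def kmers(seq, k=K):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(genomes):
    """Map each k-mer to the set of reference genomes containing it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq):
            index[km].add(name)
    return index

def identify(reads, index):
    """Tally shared k-mers per genome and rank genomes by that score."""
    scores = defaultdict(int)
    for read in reads:
        for km in kmers(read):
            for name in index.get(km, ()):
                scores[name] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A production search would hash the k-mers and keep the index in a database back-end, which is what allows the scheme to scale and, as Lund notes, to be spread across several computers.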
In its current form, the approach can reliably distinguish between different bacterial species, for example, sometimes turning up additional data about the sub-species involved.
For example, in their PLOS One study, Gautier and Lund generated synthetic reads based on sequences found in reference database genomes to test the accuracy of the search algorithm.
Those results indicated that the approach can effectively identify organisms from simulated reads with read lengths and error rates on par with those of reads from Life Technologies' SOLiD 5500 and Ion Torrent PGM, Pacific Biosciences' RS, and Illumina instruments.
The algorithm also showed promise for classifying bacteria and viruses in metagenomic sequence data, though it did not necessarily distinguish between different strains of a given species.
Once the species has been identified, Lund noted that users may opt to download the full genome and apply alignment tools such as Bowtie to do more detailed assessments of the newly sequenced organism — from base calling to SNP tree development and phylogenetic analyses.
The TAPIR software is freely available in its current form, though its developers have filed for patents covering some of the underlying technology.
Researchers can download a version of the software that sends their search query of raw sequence reads to the server, returns the top genomes identified in the search, and allows for subsequent genome alignments. That downloadable version of the software is designed to iteratively send reads that did not align back to the server to look for other matching organisms.
Because bacterial genomes are generally much more compact than those of their mammalian or plant counterparts, the reference genome sequence return step is still more feasible for microbes at the moment, Lund noted, though the TAPIR search algorithm can accurately match new sequence reads to larger and more complex genomes for identification purposes.
Alternatively, users can enter their data into an online TAPIR search tool. At the moment, that version of the software returns the name of the most closely matched organism rather than providing the genome sequences.
Lund and his team are currently maintaining the TAPIR database at their institution. It includes all of the publicly available bacterial genome sequences they could get their hands on, along with the human reference genome and some plant, fungal, and viral genomes.
Within the food safety and surveillance community, Lund noted, there is interest in creating a global database containing standardized sequences and metadata. The logistics of developing such a database, for instance, have been a topic of discussion at meetings held by the Global Microbial Identifier group.
In the meantime, Lund said that he and his colleagues are considering a system for passing sequences submitted directly to TAPIR on to an existing database such as the Short Read Archive.
So far the group has taken a crack at running TAPIR searches with sequence reads generated on Ion Torrent and Illumina instruments.
The number of reads required for identifying a given species would likely be lower for sequences generated on long-read platforms, Lund noted, since TAPIR depends largely on the overall number of nucleotides transferred during the search.
The average depth of sequence coverage across a given genome is relatively unimportant to performing the searches and identifying the species, he said, though coverage becomes important for those interested in doing downstream analyses on local sequence alignments.
In terms of future improvements to the software, Lund said that the group hopes to tack on even more analytical tools downstream of the species identification and alignment steps.
The time required for such searches depends, in part, on server load, Lund said. Generally speaking, the search itself takes a matter of seconds, while transferring data and returning results takes roughly a minute.
"It's the first implementation," Lund said. "Different parts of it might still be optimized for speed."