The Los Alamos National Laboratory is seeking industry partners to commercialize Sequedex, a software program developed at the lab for classifying short DNA sequence reads according to phylogeny and function.
By commercializing Sequedex, LANL hopes to make the software more usable for a broader range of researchers including those who aren’t sequence analysis experts, Benjamin McMahon, acting group leader in LANL’s theoretical biology and biophysics lab and one of Sequedex’s developers, told BioInform. Working with an industry partner will also help the lab defray the associated costs of that process, he added.
Sequedex was developed to characterize microbial communities. It searches for exact matches between short-read DNA sequences and a list of pre-determined “signatures” and then maps the hits to the appropriate branches of a phylogenetic tree.
Although it was developed specifically for analyzing metagenomic datasets, McMahon and his colleagues believe Sequedex can be used in numerous clinical and environmental applications that require microbial genome analysis.
So far, LANL has identified some “application-specific” potential partners and collaborators that it has begun interacting with, McMahon said.
The list includes a consumer products company as well as researchers involved in viral diagnostics and biofuels, he said. He could not disclose specific details about these groups since those discussions are still ongoing.
Other potential application areas for the software include identifying and characterizing microorganisms in medicine, biodefense, and pharmaceutical settings. It could also be useful for epidemiological projects in public health and for mining enzymes for use in chemical and manufacturing activities.
A free two-month demo version of Sequedex is available here.
A Substitute for Blastx
McMahon and his colleagues began developing Sequedex about three years ago with funds from the Department of Energy's Laboratory Directed Research and Development program, which supports science and engineering research within DOE's national laboratories.
“In some ways, you could look at it as a substitute for Blastx,” which searches protein databases using a translated nucleotide query, he said. Sequedex “serves a similar role [but] is a lot faster.”
He explained that the software relies on a data module that contains 20 million 10-mer amino acid signatures from “multiple genera of a 400-genome bacterial reference set.”
In Sequedex, “every signature is assigned a node on the bacterial phylogeny according to the diversity of organisms containing it, and most also have a functional assignment,” he told BioInform. “Based on exact matches to this signature list, Sequedex assigns both phylogeny and function to DNA fragments on a read-by-read basis at the rate of 6 Gbp/hour on a single core of a standard laptop computer with a sensitivity and specificity [that is] comparable to Blastx.”
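The exact-match lookup McMahon describes can be illustrated with a minimal Python sketch. It is not Sequedex's code: the six-frame translation step, the function names, and the one-entry signature table are assumptions made for illustration, and the phylogeny and function labels are placeholders.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
K = 10  # signature length in amino acids, per the article


def six_frame_translations(read):
    """Yield the protein translation of each of the six reading frames."""
    read = read.upper()
    reverse_complement = read.translate(COMPLEMENT)[::-1]
    for strand in (read, reverse_complement):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases become "X" and never match a signature.
            yield "".join(CODON_TABLE.get(c, "X") for c in codons)


def classify_read(read, signatures):
    """Return the (node, function) entry for every exact 10-mer hit in a read."""
    hits = []
    for protein in six_frame_translations(read):
        for i in range(len(protein) - K + 1):
            hit = signatures.get(protein[i:i + K])
            if hit is not None:
                hits.append(hit)
    return hits


# Hypothetical one-entry signature table; the labels are placeholders.
signatures = {"MKTAYIAKQR": ("enteric bacteria", "example function")}
read = "ATGAAAACTGCTTATATTGCTAAACAACGT"  # encodes MKTAYIAKQR in frame 0
print(classify_read(read, signatures))
```

Because each 10-mer is resolved with a single hash lookup rather than an alignment, the work per read grows only with read length, which is consistent with the speed advantage described here.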
According to a paper on Sequedex published in August in BMC Research Notes, methods based on Blastx assigned function at a rate of 25 kbp/hour.
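Taken at face value, the two throughput figures imply a large gap. The arithmetic below is a back-of-the-envelope comparison only. It assumes the two rates are directly comparable, which the article does not state, and the 1 Gbp dataset size is a hypothetical example.

```python
SEQUEDEX_BP_PER_HOUR = 6_000_000_000  # 6 Gbp/hour, as quoted by McMahon
BLASTX_BP_PER_HOUR = 25_000           # 25 kbp/hour, as reported in the paper

speedup = SEQUEDEX_BP_PER_HOUR / BLASTX_BP_PER_HOUR
print(f"Implied speedup: {speedup:,.0f}x")  # Implied speedup: 240,000x

# Hypothetical 1 Gbp dataset, chosen only to make the rates concrete.
dataset_bp = 1_000_000_000
sequedex_minutes = dataset_bp / SEQUEDEX_BP_PER_HOUR * 60
blastx_hours = dataset_bp / BLASTX_BP_PER_HOUR
print(f"Sequedex: about {sequedex_minutes:.0f} minutes")  # about 10 minutes
print(f"Blastx-based: about {blastx_hours:,.0f} hours")   # about 40,000 hours
```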
Sequedex works faster than Blastx because important phylogenetic signatures are pre-computed, McMahon explained.
“We … pruned things down by picking one representative organism per genus,” he said. These signatures “make up only five percent of the total number of possible 10-mers in these organisms.” As a result, Sequedex has a much smaller pool to search from, which makes the program run faster.
Also, unlike Blastx, “we pre-compute the phylogenetic specificity for each signature as much as possible,” he said. This means “[we define] for each signature, not only where in the phylogeny does that signature occur but how specifically you can attribute with that signature” — for example, “is it indicative of a particular organism like Yersinia pestis … or is it simply indicative of an enteric bacteria, which has very different implications.”
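One way to picture this pre-computed specificity is to place each signature at the lowest common ancestor of the organisms that contain it. The article does not say that Sequedex uses this rule, so the Python sketch below is an assumption for illustration, and the tiny tree in it is a simplified toy, not a real reference phylogeny.

```python
# Toy child -> parent map; simplified and hypothetical.
PARENT = {
    "Yersinia pestis": "enteric bacteria",
    "Escherichia coli": "enteric bacteria",
    "enteric bacteria": "Bacteria",
    "Bacillus subtilis": "Bacteria",
    "Bacteria": None,
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(organisms):
    """Return the deepest node shared by the lineages of all given organisms."""
    lineages = [lineage(o) for o in organisms]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walking up from the first organism, the first shared node is the deepest.
    return next(node for node in lineages[0] if node in shared)


# A signature seen in only one organism is diagnostic of that organism.
print(lowest_common_ancestor(["Yersinia pestis"]))
# A signature shared across enteric species only indicates the broader group.
print(lowest_common_ancestor(["Yersinia pestis", "Escherichia coli"]))
```

Under this rule, the more diverse the set of organisms carrying a signature, the closer to the root it lands and the less specific a match becomes.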
Another difference between the two programs is Sequedex’s ability to handle short reads, McMahon said.
“Blast has a real problem with short reads because it doesn’t know where the start and the end of the gene is and the similarity scores are all designed to be global similarity scores for long matches,” he explained. “Sequedex is going to be much less fooled by the domain structures of proteins and by the fact that the modern sequencers send out short reads instead of long reads.”
The current version of the software can produce both phylogenetic and functional profiles of bacterial communities, McMahon said.
The developers plan to include additional data modules as well as the ability to annotate individual reads with phylogeny and function in later iterations of the program, he added.