NEW YORK (GenomeWeb) – Scientists from the University of Utah, Arup Laboratories, and bioinformatics startup IDbyDNA have developed Taxonomer, a metagenomics analysis platform for universal pathogen detection and host-response profiling that they say is able to detect and classify pathogens more accurately, and in some cases more quickly, than some existing methods.
The developers have made a beta version of the software freely available to the biomedical community. But IDbyDNA, a spinout from the University of Utah, has licensed the Taxonomer technology from the institution and plans to use it as the basis for a series of commercial metagenomics applications that could be used in clinical diagnostic labs and other arenas.
The Sunnyvale, California-based company is currently in stealth mode, so it is not sharing details about what those products are and when they might launch, IDbyDNA President, Cofounder, and CEO Guochun Liao told GenomeWeb in an email. The company is also not providing details about its commercialization plans at this time. However, Liao did indicate that there are a number of possible development directions that the company might take.
"Taxonomer was designed for NGS metagenomic analysis, and can be used for broad metagenomics-based applications [such as] pathogen detection [and] microbiome studies," he said. Liao also noted that the company is open to collaborating on test projects with interested parties.
In a paper published in Genome Biology this week, Taxonomer's developers provided specific details of the algorithms they use to match sample sequences to reference information in both nucleic acid and protein sequence databases. Given a list of input sequences, the software can return, within minutes, a list of all microorganisms present in the sample, including bacteria, viruses, and fungi, as well as details about the specific family, strain, and subtype of the organisms in question.
Taxonomer features a simple user interface powered by the iobio platform, which was developed in the lab of Gabor Marth, a professor of human genetics at the University of Utah and co-director of the USTAR Center for Genetic Discovery. The platform offers easy-to-use tools for interactively visualizing and exploring genomic data in real-time, according to its creators. The paper also provides a number of case studies that demonstrate the software's ability to identify pathogens in clinical specimens.
Taxonomer was developed in response to a perceived need for faster tools for processing data from metagenomics samples, Mark Yandell, a professor of human genetics at Utah and one of IDbyDNA's co-founders, told GenomeWeb. Yandell and Robert Schlaberg, a medical director at Arup Laboratories and co-founder of IDbyDNA, are co-authors on the Taxonomer paper.
Yandell, who co-directs Utah's USTAR center, was also involved in developing the Variant Annotation, Analysis, and Search Tool, or VAAST software, which was licensed and commercialized by genome interpretation firm Omicia.
Existing metagenomics pipelines took too long to complete projects, with runtimes lasting as long as a week or more in some cases, he said. Advancements in computer science that were being applied to tasks such as monitoring cell phone traffic showed promise in helping researchers parse millions of metagenomics reads in minutes without requiring large quantities of compute power.
Taxonomer and another recent entrant into the metagenomics space called Kraken, which was developed by researchers from the University of Maryland and Johns Hopkins University in 2013, leverage a number of those advancements and have shortened the time required for metagenomics analysis, Yandell said. Incidentally, researchers from JHU and the University of Maryland, some of whom were involved in Kraken's development, published a paper on bioRxiv this week that describes new metagenomic classification software called Centrifuge that they claim is more sensitive than Kraken and uses less memory.
Though both Kraken and Taxonomer share computational DNA, Taxonomer was designed and built with molecular biologists and clinical labs in mind, Yandell said. Users do not need to know how to code to use the software, and they can access and explore data from standard desktop computers and laptops as well as mobile devices, he noted.
Another important differentiator for Taxonomer compared to a tool like Kraken is that it uses both nucleic acid- and protein-level information to group organisms into taxonomic classes, according to the developers. This is crucial for properly classifying phylogenetically distant or novel pathogens in specimens, which may not have homologs in nucleotide databases. In diagnostic settings, often "we are [working] with very incomplete reference databases," Schlaberg told GenomeWeb. "And we routinely encounter novel organisms in the types of specimens that the software is designed to analyze. So, there's much more unknown space that you have to cover and deal with — and getting the classification right under those circumstances is much more challenging."
Full details of the system's architecture and underlying algorithms are provided in the paper, but in essence Taxonomer comprises four components. When reads are uploaded to the system, a binner module compares each read to reference databases of host and microbial data in parallel and assigns it to a broad taxonomic category based on the most likely organism of origin. A second module then uses k-mer matching to classify sequences at the nucleotide level. This module also handles ribosomal RNA-based bacterial and fungal characterization and host mRNA expression profiling, according to the paper.
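As a rough illustration of what k-mer matching means in this context, the following Python sketch indexes the k-mers of a few reference sequences and assigns a read to whichever reference shares the most k-mers with it. It is not Taxonomer's implementation. The reference sequences, the labels, the function names, and the simple majority-vote rule are all assumptions made for the example; the paper describes the actual algorithm.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Return the label sharing the most k-mers with the read, or None.

    Majority voting is an assumption of this sketch, not a description
    of how Taxonomer scores reads.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for label in index.get(kmer, ()):
            votes[label] += 1
    if not votes:
        return None  # no k-mer matched any reference
    return votes.most_common(1)[0][0]


# Toy, made-up reference sequences (hypothetical, for illustration only).
references = {
    "organism_A": "ACGTACGTTAGC",
    "organism_B": "TTGACCGGATCA",
}
index = build_index(references, k=4)
```

With this toy index, `classify("CGTTAG", index, 4)` assigns the read to `organism_A`, while a read with no matching k-mers, such as `"GGGGGG"`, comes back as `None` and would be a candidate for the protein-level steps described next.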
Any reads that cannot be classified using nucleotide sequences are moved into a so-called protonomer module, which uses a non-degenerate mapping scheme to search for similar sequences in the protein databases. This module is used to classify viruses in the protein space because of their high mutation rates, genetic variability, and incomplete reference databases. For reads that are not classified at this point, another module called an afterburner uses a degenerate k-mer matching engine to try to identify the most likely source of the sequence using a reduced amino acid alphabet. The system pulls in information from various public sources such as Ensembl and UniProt, and users can also add their own bespoke databases.
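The idea behind matching in a reduced amino acid alphabet can be sketched in a few lines of Python. Similar residues are collapsed into one symbol, so two diverged protein fragments that share no exact k-mers can still share k-mers after reduction. The grouping below is a hypothetical one chosen for the example, not the alphabet used by the afterburner module, and the function names are likewise invented.

```python
# Hypothetical grouping of the 20 amino acids by rough physicochemical
# similarity. This is an assumption for illustration, not Taxonomer's alphabet.
GROUPS = {
    "A": "AGST",  # small
    "C": "C",     # cysteine
    "D": "DENQ",  # acidic and amide
    "F": "FWY",   # aromatic
    "H": "H",     # histidine
    "I": "ILMV",  # aliphatic
    "K": "KR",    # basic
    "P": "P",     # proline
}

# Lookup table from each residue to its group representative.
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_protein(seq):
    """Rewrite a protein sequence in the reduced alphabet ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def shared_kmers(a, b, k):
    """Count distinct k-mers that occur in both sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


# Two made-up peptides that differ only by conservative substitutions.
query, reference = "MKVLD", "LRIVE"
exact = shared_kmers(query, reference, 3)
degenerate = shared_kmers(reduce_protein(query), reduce_protein(reference), 3)
```

Here the two peptides share no exact 3-mers, but both reduce to the same string, so every reduced 3-mer matches. The trade-off is the usual one: a coarser alphabet tolerates more divergence but also raises the chance of spurious matches.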
The developers claim that Taxonomer improves on existing methods in terms of its accuracy, sensitivity, and speed of pathogen detection. According to internal benchmarks using infected reads from pediatric nasopharyngeal specimens from a US Centers for Disease Control and Prevention study, Taxonomer returned results faster than the Sequence-based Ultrarapid Pathogen Identification (SURPI) pipeline, which was developed by Charles Chiu's team at the University of California, San Francisco, even though both tools search both nucleotide and protein sequence databases. Taxonomer was also more accurate in its classification of some pathogens than SURPI, according to the results.
When both tools were used to analyze 6.5 million reads that contained the Human coronavirus (HCoV), Taxonomer completed its classification in roughly five minutes compared to 92 minutes for SURPI. Both tools were able to classify almost 100 percent of the reads in the sample. In another assessment involving over 7.5 million reads, including sequence from the influenza A virus, it took Taxonomer just over nine minutes to classify 88 percent of the reads. SURPI, on the other hand, needed almost four hours and only managed to classify 78 percent of reads.
Taxonomer is slightly slower than Kraken but is more accurate in its classifications because it searches both nucleotide and protein sequence databases, Yandell said. For the aforementioned HCoV dataset, Kraken completed its classification in roughly one minute with a minor difference in the percentage of reads classified — 99.6 percent of the reads compared to 99.9 percent of reads for both Taxonomer and SURPI. However, the difference in accuracy was far more pronounced in the influenza A analysis. Kraken was only able to classify 66 percent of the reads compared to 88 percent for Taxonomer.
Potential applications for the Taxonomer software include helping infectious diseases researchers shorten the time to results for testing and freeing them from relying solely on time-consuming culture-based methods or testing methods with a limited scope. In fact, Taxonomer will soon be used in at least one new project focused on diagnosing serious infections in children in resource-limited settings. Arup's Schlaberg was recently awarded a $100,000 grant from the Bill and Melinda Gates Foundation for the project.
The software could also be used to identify pathogenic microorganisms responsible for disease outbreaks, or to measure the host organism's expression profile to gauge its response to the pathogenic activity. That, in turn, could help clinicians determine whether a detected pathogen is really causing the infection, including in cases where there are multiple suspected pathogens.
Evidence of Taxonomer's clinical efficacy is provided in the Genome Biology paper through a number of case studies. These studies demonstrate its ability to detect previously unrecognized infections as well as antiviral host mRNA expression profiles, among other use cases. In one scenario, the researchers used Taxonomer to analyze RNA-seq data gleaned from serum from a patient with hemorrhagic fever caused by a novel rhabdovirus; a throat swab from a patient with avian influenza; and plasma from a patient with Ebola virus.
According to the researchers, Taxonomer was able to correctly identify all three viruses or close relatives even though the actual matching sequences were removed from the reference databases prior to the start of the analysis. In another study, the researchers used the software to determine that a cohort of patients with Ebola-like symptoms actually did not have the disease but had severe bacterial infections — caused by Chlamydophila psittaci and Elizabethkingia meningoseptica — that were most likely responsible for their symptoms.
Meanwhile, a separate study published in the Journal of Clinical Microbiology showed that Taxonomer in combination with RNA sequencing can reliably detect disease-causing pathogens and improves on at least one existing commercial test. For the study, Schlaberg and others used an RNA-seq-based metagenomics approach and Taxonomer to analyze respiratory virus-positive pediatric nasopharyngeal swabs. They compared this approach to GenMark's Respiratory Viral Panel (RVP). They looked at 42 known samples and 67 unselected samples, and according to their results, the RNA-seq-based method detected 86 percent of the known respiratory virus infections, which is on par with the commercial test used for the comparison. It also detected an additional 12 viruses in the samples that RVP missed.
Researchers in Yandell's lab are also using the software in a number of internal projects. These include using DNA and RNA sequencing to identify and characterize new venom genes from the marine cone snail, Conus bullatus. They are also using it in evolution and demographics studies of fungal pathogens that infect pine trees as well as to study sepsis in mice, he said.
The developers are also adding new features to Taxonomer, expanding the breadth of reference databases accessible to researchers to include non-human host data from mouse and other organisms, Schlaberg said.
When it launches its products, IDbyDNA will compete with offerings from firms such as One Codex and CosmosID, both of which have commercial metagenomics platforms that leverage similar computational techniques to Taxonomer and have set their sights on supporting clinical testing.
One Codex's platform offers mapping algorithms and curated reference databases of bacteria, viruses, protists, archaea, and fungi genomes. In 2015, the company, then called Reference Genomics, won a $200,000 award from the CDC for its platform for strain-typing Shiga toxin-producing Escherichia coli. It was able to demonstrate that its platform could identify STEC from complex clinical samples and provide meaningful information about its strain type and characteristics even at low levels. Last year, the company began adding capabilities for clinical users including better sample management and HIPAA-level security measures.
For its part, CosmosID offers desktop, appliance, and cloud-based options of its Genome Identification Universal System (GENIUS) platform, which offers algorithms and curated databases of bacteria, viruses, fungi, parasites, and antibiotic resistance and pathogenicity markers. Earlier this year, the company raised $6 million in a Series B funding round, a portion of which it said will be used to develop clinical applications of its platform.