Researchers from Boston University School of Medicine and George Washington University have developed Pathoscope, statistical software that accurately distinguishes between closely related pathogenic species and strains in genomic data collected from infected tissues.
According to a Genome Research paper published last week, Pathoscope uses a Bayesian statistical framework to compare sequence reads collected from infected samples to a reference database of known microbial organisms. In cases where reads map to multiple genomes in the database, it uses sequence and mapping quality information to determine which of the target genomes the reads most likely originated from.
"Pathoscope is like completing a complex jigsaw puzzle," Evan Johnson, a BUSM assistant professor of medicine and a co-author on the paper, said in a statement. "Instead of manually assembling the puzzle, which can take days or weeks of tedious effort, we use a statistical algorithm that can determine how the picture should look without actually putting it together."
Its developers claim that Pathoscope improves on existing computational methods like the Trinity assembler, MetaPhlAn, and MEGAN, because it needs much less genomic coverage — less than 1x compared to 50-100x coverage required by assembly-based algorithms — to get results, and does not require "time-consuming and labor-intensive steps" such as multiple alignment steps, homology searches, or genome assembly. "Our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database," the researchers wrote.
It's also more accurate than existing methods, according to the team. In one software comparison test reported in the Genome Research paper that compared data from Escherichia coli O104:H4 to 30 other E. coli strains, Pathoscope showed "substantial improvement over naïve mapping, context mapping, and assembly-based methods for species identification and strain attribution," with the software "[reassigning] on average 99.4 percent of the read probability directly to the O104:H4 strain" in at least one scenario.
Finally, this method, which relies on next-generation sequence data, improves on wet lab pathogen-identification methods such as tissue cultures and polymerase chain reaction-based detection, which are often imperfect and time-consuming, the researchers said. Culture-based methods, for example, "usually require four or five days to identify species," Johnson told BioInform. Also, "culture- or PCR-based methods ... really are limited to one species at a time," but by using sequencing data, "we can actually look for thousands of pathogens at the same time. That was the primary motivation."
Johnson and his colleagues also found the approaches used by existing computational methods lacking in terms of speed and accuracy. For instance, tools that align reads based on clade-specific markers often "throw away a lot of useful information" that could be used later to distinguish between strains, because the "markers may only be one percent of the genome," he said. On the other hand, methods that identify pathogens by simply assembling the genomes present in the samples are more specific but need much more coverage of the pathogenic genome in order to be successful.
The goal for the team, therefore, was to "develop a method that is just as specific as assembly but doesn’t require you to actually do the assembly" so that "if we have reads coming from specific strains or species, we don’t actually have to fully assemble those species and those strains in order to fully know what we have," he said.
Before they can use Pathoscope, researchers have to create a database that contains the target pathogenic genomes that they want to search against. Next, they can use a variety of alignment tools, such as Bowtie or BLAST, to align the sample sequences to the reference resource. Sequence reads that map uniquely to a source genome are left alone. Pathoscope is then used to "reassign" reads that map to multiple closely related genomes to their most likely source. It uses a Bayesian mixture model that "re-weights the read assignment probabilities using the mapping qualities and the parameters of the model," the researchers wrote. In other words, the software penalizes reads that map to multiple species and then uses read alignment scores from the previous alignment step to assign the ambiguous reads to their most likely source, Johnson explained to BioInform.
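The reassignment idea described above can be illustrated with a small expectation-maximization loop over a mixture model. The sketch below is not Pathoscope's code and leaves out parts of the published model; it assumes each read comes with a positive alignment weight per candidate genome, and the function name reassign_reads, the strain labels, and the weights are all hypothetical.

```python
from typing import Dict, List, Tuple


def reassign_reads(
    read_scores: List[Dict[str, float]], n_iter: int = 50
) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Toy EM reassignment of multi-mapped reads (illustrative only).

    read_scores[i] maps a genome name to a positive alignment weight
    for read i. Returns the estimated genome proportions and the
    per-read assignment probabilities.
    """
    genomes = sorted({g for scores in read_scores for g in scores})
    # Start by assuming every genome is equally abundant.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    posteriors: List[Dict[str, float]] = []
    for _ in range(n_iter):
        # E-step: re-weight each read's candidate genomes by the
        # current proportions and its alignment weights.
        posteriors = []
        for scores in read_scores:
            weighted = {g: pi[g] * s for g, s in scores.items()}
            total = sum(weighted.values())
            posteriors.append({g: w / total for g, w in weighted.items()})
        # M-step: a genome's proportion is the average probability
        # mass assigned to it across all reads.
        pi = {g: 0.0 for g in genomes}
        for post in posteriors:
            for g, p in post.items():
                pi[g] += p / len(read_scores)
    return pi, posteriors


# Hypothetical input: 6 reads unique to strain A, 1 unique to strain B,
# and 3 reads that align equally well to both.
reads = (
    [{"A": 1.0}] * 6
    + [{"B": 1.0}]
    + [{"A": 1.0, "B": 1.0}] * 3
)
proportions, assignments = reassign_reads(reads)
print(proportions)
print(assignments[-1])  # probabilities for one ambiguous read
```

In this toy input the uniquely mapped reads stay where they are, while the ambiguous reads shift toward strain A because the unique reads make A the more abundant genome. That is the same intuition as using evidence from the rest of the sample to resolve reads that alignment alone cannot place.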
Basically, the combined approach lets users accurately identify species "[in] any circumstance whether there are unique genomes or not very unique [and in] cases where there are multiple species in the sample," he said. The other advantage is that it does this extremely fast, he added. In speed tests reported in the Genome Research paper, after the initial alignment step (which took about 38 minutes using BLAST and 13 minutes using Bowtie), Pathoscope needed only seven minutes to realign reads that mapped to multiple genomes. "The computational burden for this comes in the alignment step," Johnson said. "If you are using Bowtie … an eight-[central processing unit] server will be enough to run the alignment," and then Pathoscope could be run on a single CPU.
The developers believe Pathoscope could be useful in a variety of settings including bioforensics, biosurveillance, and clinics. For example, it could be used to simultaneously screen thousands of infectious pathogens in clinics and to monitor disease outbreaks. It could also be used to identify bioterrorism agents and harmful pathogens in soil, water, or food products. Johnson and his co-authors are already using it to explore bacterial and viral strains in pulmonary diseases like pneumonia, and also viral causes of cancer, he said. They're also looking into using it for a water quality surveillance project.
Over the next few months, they plan to launch a second version of the software, and publish an associated paper, that will automatically extract pathogenic genomes from databases run by the National Center for Biotechnology Information. Right now, users have to download data from these resources to generate a reference database for the alignment part of the procedure. But with Pathoscope 2.0, a user would only need to enter the NCBI species taxonomic ID for bacteria, for instance, and the algorithm would extract all genomes from NCBI's database associated with that ID, Johnson said. Down the road, the developers plan to add capabilities that will enable Pathoscope to extract data from more than one resource at a time, he said.