NEW YORK (GenomeWeb) – An international team of researchers has developed a computational pipeline for identifying pathogens in sequenced samples that runs on both local and cloud infrastructure and that they claim is accurate and fast enough to produce results in a clinically actionable time frame.
The researchers, from the University of California, San Francisco and several institutions in the US and abroad, said the pipeline could be particularly useful in infectious disease diagnostics and treatment.
The sequence-based ultrarapid pathogen identification (SURPI) pipeline maps the millions of sequences generated from patient samples to the human and pathogen reference databases maintained by the National Center for Biotechnology Information (NCBI), the researchers explained in Genome Research. It uses a computational subtraction approach to separate host sequences from non-host ones and then uses two alignment algorithms to match the latter to candidate pathogen data stored in NCBI's repositories. The first algorithm, SNAP, matches incoming sequences to NCBI's human and pathogen databases, while the second, RapSearch, compares sequences to NCBI's protein databases.
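The subtract-then-match flow described above can be sketched with a toy example. This is a minimal illustration and not SURPI's code: exact shared k-mers stand in for the SNAP and RapSearch aligners, and the function names, the sequences, and the value of k are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return every substring of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read: str, reference_kmers: Set[str], k: int) -> bool:
    """True if the read has at least one k-mer in common with the reference."""
    return any(read[i:i + k] in reference_kmers
               for i in range(len(read) - k + 1))


def subtract_and_match(
    reads: List[str],
    host_reference: str,
    pathogen_references: Dict[str, str],
    k: int = 8,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Drop host-like reads, then assign the rest to pathogen references."""
    host_kmers = kmers(host_reference, k)
    pathogen_kmers = {name: kmers(seq, k)
                      for name, seq in pathogen_references.items()}

    # Step 1: computational subtraction removes reads that look like host.
    non_host = [r for r in reads if not shares_kmer(r, host_kmers, k)]

    # Step 2: match the remaining reads against each pathogen reference.
    hits: Dict[str, List[str]] = {}
    unmatched: List[str] = []
    for read in non_host:
        names = [n for n, ks in pathogen_kmers.items()
                 if shares_kmer(read, ks, k)]
        if names:
            for name in names:
                hits.setdefault(name, []).append(read)
        else:
            unmatched.append(read)
    return hits, unmatched


# Made-up sequences, for illustration only.
HOST = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCG"
PATHOGENS = {"virus_x": "TTTTGGGGCCCCAAAATTTTGGGGCACACACA"}
READS = ["TAGCTAGCTAGG", "GGGGCCCCAAAA", "GAGAGAGAGAGA"]

hits, unmatched = subtract_and_match(READS, HOST, PATHOGENS)
# hits == {"virus_x": ["GGGGCCCCAAAA"]}; unmatched == ["GAGAGAGAGAGA"]
```

The first read is discarded as host, the second is assigned to the hypothetical `virus_x`, and the third is left unmatched. A real pipeline would use tolerant alignment rather than exact k-mer lookups.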
Both algorithms, the researchers wrote, are as accurate as existing sequence alignment tools but orders of magnitude faster. SNAP, for instance, is about 1,000 to 10,000 times faster than Blast and 10 to 100 times faster than BWA and Bowtie, according to Charles Chiu, an assistant professor in UCSF's department of laboratory medicine and one of the authors on the Genome Research paper. RapSearch, for its part, is about five to 10 times faster than the comparable Blastx software, he told BioInform. The paper describes in detail the results of tests that compared the SURPI aligners with Blast, BWA, and Bowtie in terms of accuracy and speed using both in silico and real data.
SURPI has two modes of operation: a fast mode, in which it matches reads specifically to NCBI viral and bacterial databases, and a comprehensive mode, in which reads are compared to the entire NCBI nucleotide repository. In cases where the pathogenic sequences don't match up with any known organisms, SURPI assembles the sequences into longer contigs and then compares them to viral and/or NCBI protein databases. According to statistics provided in the paper, when SURPI is run in the fast mode, it takes minutes to process a dataset consisting of about 50 million reads, "while in comprehensive mode, all potential pathogens (viruses, bacteria, fungi, and parasites), as well as novel emerging viruses with high sequence divergence, can be identified in [approx] one-to-five hours."
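To make the contig-assembly fallback concrete, here is a minimal sketch of how overlapping reads can be merged into a longer contig. The article does not say which assembler SURPI uses, so a simple greedy overlap merge is swapped in as a stand-in; the function names, the reads, and the minimum overlap length are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no such overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 4) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three made-up 8-base reads that overlap by 4 bases each.
contigs = greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGATTGC"])
# contigs == ["ATGCGTACCTGATTGC"]
```

The longer contig gives a downstream protein search more sequence to work with than any single short read, which is the motivation the paragraph above describes for divergent, previously unseen organisms.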
A solution like this is potentially useful not just for infectious disease diagnosis but also in fields such as public health surveillance, outbreak investigation, and clinical and environmental metagenomics studies, said Chiu, who is also an infectious disease physician. It offers a way to take advantage of current technological advances in the field of genomics, which promise to improve disease diagnosis and shorten time to treatment.
That's important because other studies have shown that "conventional diagnostic testing for pathogens is narrow in scope and fails to detect the etiologic agent in a significant percentage of cases," which "contributes to continued transmission and increased mortality in hospitalized patients." For example, Chiu said, up to 30 percent of pneumonia cases in patients in the intensive care unit are not detected in spite of extensive testing. Also, up to 60 percent of encephalitis cases show up negative when samples are tested.
Part of the problem, he said, is that physicians typically test for whichever pathogens they believe are most likely causing the symptoms they see in their patients. But infections caused by different microorganisms can have identical symptoms, and delays in treatment because of a misdiagnosis could be fatal. Also concerning is the rise of infectious diseases such as Middle East Respiratory Syndrome that are caused by novel pathogens, the researchers noted in their paper. What's needed, according to the researchers, are "rapid, broad-spectrum diagnostic assays that are able to recognize these emerging agents."
Next-generation sequencing has the potential to fill that role and effectively address the shortcomings of existing pathogen detection methods. Technological advances have brought the cost of owning and operating a sequencer to a more manageable level for clinical and public health laboratories, but there are still challenges on the computational side, especially a lack of tools that can process the massive quantities of sequence data spit out by instruments from companies like Illumina and get results back to clinicians in minutes or, at most, hours.
That's what motivated Chiu and his colleagues to put SURPI together, he said. Furthermore, enabling it to run on cloud infrastructure (it's available as a machine image on Amazon but can be ported to other clouds) as well as on local servers ensures that end users who don't have the in-house infrastructure needed to run SURPI locally, and those who simply want to save money, can still use the pipeline in their labs at relatively low cost, Chiu noted. Roughly, the cost of analyzing 30 to 50 million reads using the cloud version of SURPI is under $10, he said. By way of comparison, analyzing the same datasets using Blast on the cloud would cost over $800, he added. In addition, running SURPI on the cloud is a much simpler process than installing it locally. "You don't have to know any programming at all," he said. "It's basically a one click installation."
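For a sense of scale, the cost figures Chiu quotes can be turned into rough bounds. The calculation below is only back-of-the-envelope arithmetic on the numbers in the paragraph above ("under $10", "over $800", 30 to 50 million reads); the variable names are illustrative, and no actual cloud pricing is assumed.

```python
# Figures quoted in the article, treated as bounds.
SURPI_COST_USD_MAX = 10.0    # "under $10"
BLAST_COST_USD_MIN = 800.0   # "over $800"
READS_MILLIONS_MIN = 30
READS_MILLIONS_MAX = 50

# For the same dataset, Blast costs at least this many times more.
min_cost_ratio = BLAST_COST_USD_MIN / SURPI_COST_USD_MAX

# Upper bound on SURPI's cost per million reads (smallest dataset).
surpi_per_million_max = SURPI_COST_USD_MAX / READS_MILLIONS_MIN

# Lower bound on Blast's cost per million reads (largest dataset).
blast_per_million_min = BLAST_COST_USD_MIN / READS_MILLIONS_MAX

print(f"Blast costs at least {min_cost_ratio:.0f}x more per dataset")
print(f"SURPI: under ${surpi_per_million_max:.2f} per million reads")
print(f"Blast: over ${blast_per_million_min:.2f} per million reads")
```

On these figures, the cloud version of SURPI comes in at under about 33 cents per million reads, against more than $16 per million reads for Blast.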
Getting the local version up and running is a bit more involved, but it's still relatively easy to do. "We have a single script that you have to launch that will set up the program, the pipeline, and all the dependencies," he said. "The one thing that's missing right now in the local version is we don't have an easy way to regenerate the databases … We are looking for solutions for that [including] automated installations, as well as potentially providing condensed databases for download" — this particular update to the pipeline should be available this week. There are also specific hardware requirements for local implementation in terms of required memory size, number of cores, and so on. Details on these are provided in the paper.
For their next steps, the researchers are improving SURPI's user interface so that it's even easier to access and use, especially for folks with no bioinformatics expertise, Chiu said. They also hope to develop new visualization tools and incorporate them into the pipeline, and they are exploring new algorithms for potential incorporation into SURPI. An example of the latter is Kraken, software for classifying sequences from metagenomic samples, which was developed by researchers at the University of Maryland's computer science department and Johns Hopkins University's Center for Computational Biology. Other development plans include enabling SURPI to run tasks in parallel, which could shorten the time frame for analysis from hours to minutes. "Hopefully, we can go down from say an hour, which is what it is for 10 million reads, to six minutes," he said.
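Going from an hour to six minutes implies a tenfold speedup. As a rough guide to what that demands of a parallel design, the sketch below applies Amdahl's law, a standard textbook model that is an assumption here and not something the article or the SURPI authors cite. The function names and the example fractions are hypothetical.

```python
def required_speedup(current_minutes: float, target_minutes: float) -> float:
    """Speedup factor needed to go from the current to the target runtime."""
    if current_minutes <= 0 or target_minutes <= 0:
        raise ValueError("runtimes must be positive")
    return current_minutes / target_minutes


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only part of the work is parallel.

    parallel_fraction is the share of runtime that can be spread across
    workers; the remainder is assumed to run serially.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


target = required_speedup(60, 6)        # one hour down to six minutes
ideal = amdahl_speedup(1.0, 10)         # perfectly parallel on 10 workers
realistic = amdahl_speedup(0.9, 10)     # 10 percent of runtime stays serial

print(f"needed: {target:.1f}x, ideal: {ideal:.1f}x, "
      f"with 10% serial work: {realistic:.2f}x")
```

Under this model, if even 10 percent of the runtime stays serial, no number of workers quite reaches the tenfold target, so hitting six minutes would depend on parallelizing nearly all of the pipeline.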
SURPI is available under a BSD license. Interested researchers are invited to try it out and give the developers feedback on possible improvements, bugs, and new features.
The pipeline has already been implemented by researchers at the US Centers for Disease Control and Prevention, who plan to use it to investigate outbreaks as well as for pathogen discovery, Chiu said. The group has also been holding discussions with some companies that have expressed interest in potentially incorporating SURPI into their platforms.