NEW YORK (GenomeWeb) – Researchers from Imperial College London, New York University School of Medicine, and elsewhere have developed software for analyzing bacterial population genomic datasets that they claim shortens the time to identification and tracking of bacterial pathogens in the context of research and public health applications.
As explained in a paper published last week in Genome Research, the researchers developed Population Partitioning Using Nucleotide K-mers, or PopPUNK, a machine learning-based computational tool designed to help researchers analyze tens of thousands of bacterial genomes in a single run at speeds up to 200-fold faster than existing methods.
Nicholas Croucher, a professor of bacterial genomics in Imperial College London's School of Public Health and a co-author, said in an interview that his team is currently working with an unnamed public health agency to use PopPUNK as part of efforts to integrate surveillance data with large, well-characterized genome datasets.
"We are working with online surveillance databases to see if we can implement this in a way [where] all of the software is run [on] cloud computing … to enable people [to] easily share data" as well as create "a centralized global database that is automatically updated as new datasets come in," he said.
Disease surveillance is important for several reasons including monitoring outbreaks and tracking the spread of virulent or multi-drug resistant bacteria. The reduced cost and improved performance of whole-genome sequencing have made it possible to use the technology to investigate potential outbreaks as well as track microbes' movement.
Efforts to use WGS for bacterial disease surveillance have yielded dividends. A Nature Genetics study published last year used the technology to identify cholera sublineages from isolates in Bangladesh. A 2016 Lancet study showed that bacterial genome sequencing was an effective method for tracking gonorrhea transmission and identifying potential resistance genes. In recent months, researchers at hospitals in the US and Europe, for example, have implemented sequencing-based pathogen surveillance programs to track infectious disease outbreaks.
Large-scale projects such as the Global Pneumococcal Sequence consortium are sequencing tens of thousands of samples and producing large quantities of data that need to be processed quickly. The scale of this and similar projects necessitates the development of tools like PopPUNK because current disease surveillance methods were not designed for use with such large datasets, according to Croucher. Specifically, existing methods "do not fully exploit core and accessory genomic variation, and cannot both automatically identify, and subsequently expand, clusters of significantly similar isolates in large datasets and across species," he said.
Furthermore, existing software has to be retrained to take in new data as it becomes available. Part of the challenge here is that the tools use complex mathematical models that take a long time to run and require a lot of computational effort. "It makes it quite slow if you have to use surveillance datasets where you keep getting new data every day, week, month and so on," Croucher said. In contrast, he added, PopPUNK offers a fast method for bacterial surveillance using genomics that is "flexible to being applied across many different species that have quite different population structures."
For example, when PopPUNK and two other methods, Roary and RhierBAPS, were used to analyze data from 284 Staphylococcus aureus samples, PopPUNK completed the analysis in less than an hour, compared with more than 11 hours for Roary and more than six hours for RhierBAPS. PopPUNK also used less memory: less than 1 gigabyte, compared with nearly 4 GB for Roary and nearly 5 GB for RhierBAPS.
Croucher attributes PopPUNK's speed to its memory- and CPU-efficient use of k-mer-based sequence comparisons. "We run these comparisons at multiple lengths of k, which can be done in parallel, so it doesn't actually add any computational time," he explained. Then "we use a mathematical relationship to extract the variation in terms of base substitutions in the shared genome and the differences in the genes in terms of the accessory genomes. So, it's estimating the quantities very quickly based on sampling the genome, running k-mer comparisons, and then fitting a simple mathematical relationship."
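The fitting step Croucher describes can be illustrated with a minimal sketch. It assumes, as one plausible form of the "simple mathematical relationship," that the fraction of shared k-mers at length k follows (1 - a) * (1 - pi)^k, where pi is the core (base substitution) distance and a is the accessory (gene content) distance. That form, the function name, and the example values are assumptions made for illustration and are not taken from PopPUNK's code.

```python
import math


def fit_core_accessory(shared_fraction_by_k):
    """Estimate (core, accessory) distances from shared k-mer fractions.

    Assumes shared_fraction(k) = (1 - a) * (1 - pi) ** k, so that
    log(shared_fraction) is a straight line in k:
        slope     = log(1 - pi)
        intercept = log(1 - a)
    All fractions must be greater than zero.
    """
    ks = sorted(shared_fraction_by_k)
    ys = [math.log(shared_fraction_by_k[k]) for k in ks]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares fit of a straight line through (k, log fraction).
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    core = 1.0 - math.exp(slope)
    accessory = 1.0 - math.exp(intercept)
    return core, accessory


# Synthetic example: build fractions from known distances, then fit them.
true_core, true_accessory = 0.01, 0.10
observed = {
    k: (1 - true_accessory) * (1 - true_core) ** k
    for k in (13, 17, 21, 25, 29)
}
core, accessory = fit_core_accessory(observed)
print(f"core={core:.4f} accessory={accessory:.4f}")
```

Because each k-mer length contributes one independent point to the line, the comparisons at different lengths can run in parallel before a single inexpensive fit, which is consistent with the speed argument in the quote.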
As explained in Genome Research, PopPUNK uses k-mers — short sections of DNA of a given length — to estimate the proportion of sequences that are shared by genomes. Differences in k-mer content in otherwise similar stretches of DNA between genomes can represent important base-pair changes or differences in gene content. And these changes could correlate with clinically important factors such as virulence and antimicrobial resistance.
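As a rough illustration of what "the proportion of sequences that are shared" means in k-mer terms, the sketch below collects every k-mer in two toy sequences and reports the fraction they have in common. It is a simplification under stated assumptions: it compares complete k-mer sets rather than a sample, it ignores the reverse strand, and the function names and sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k):
    """Fraction of distinct k-mers found in both sequences (Jaccard index)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two toy sequences that differ by a single base substitution (T -> A).
seq_a = "ACGTTGCA"
seq_b = "ACGTAGCA"
for k in (2, 3):
    print(k, shared_kmer_fraction(seq_a, seq_b, k))
```

In this toy case the single substitution removes a larger share of the k-mers as k grows, which is why comparisons at several lengths of k carry information about how densely two genomes differ.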
The input to PopPUNK is a set of bacterial genomes which can be gleaned from projects using different sequencing technologies and assembly methods. For each pairwise comparison of sequences, PopPUNK uses the proportion of shared k-mers of different lengths to calculate two types of distance: the density of mutations in the sequences shared between the pair, and the proportion of their genomes that are unique to each. It then uses machine learning approaches to process the information into networks of clusters where connections show the relationships between isolates.
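The network step can be sketched in a few lines, with one important substitution: where the article describes machine learning approaches for deciding which pairs are closely related, the example below uses fixed distance cutoffs as a stand-in. The cutoff values, isolate names, and function name are all hypothetical. Pairs that pass the cutoffs are linked, and each connected group of isolates is reported as a cluster.

```python
def cluster_isolates(names, distances, max_core=0.01, max_accessory=0.10):
    """Group isolates into clusters by linking closely related pairs.

    distances maps (name_a, name_b) -> (core_distance, accessory_distance).
    The fixed cutoffs are a placeholder for a fitted model.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster root, shortening paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), (core, accessory) in distances.items():
        if core <= max_core and accessory <= max_accessory:
            parent[find(a)] = find(b)  # add an edge: merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


names = ["iso1", "iso2", "iso3", "iso4"]
distances = {
    ("iso1", "iso2"): (0.002, 0.03),
    ("iso2", "iso3"): (0.004, 0.05),
    ("iso1", "iso4"): (0.030, 0.40),
    ("iso3", "iso4"): (0.028, 0.35),
}
print(cluster_isolates(names, distances))
```

Note that iso1 and iso3 end up in the same cluster even though they are never compared directly, because both are linked to iso2. That transitive behavior is what makes a network representation convenient for growing clusters over time.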
As new sequences come in, they are efficiently compared to a reduced set of references selected from the clusters rather than to all isolates in the database. The required distances are calculated between the new sequence and the references, and new nodes added to the appropriate cluster. "We've tailored the software to generate different outputs," Croucher said. This includes "phylogenetic trees, sets of clusters and representations of how these sets of genes within each genome have diverged."
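The efficiency gain from comparing against references can be seen in a small sketch. It assumes each cluster is represented by one or more reference isolates and that a new isolate joins the cluster of its closest reference if that reference falls within fixed distance cutoffs; otherwise it is treated as the start of a new cluster. The cutoffs, names, and tie-breaking rule are hypothetical simplifications and do not describe PopPUNK's actual assignment logic.

```python
def assign_to_cluster(query_distances, reference_cluster,
                      max_core=0.01, max_accessory=0.10):
    """Assign a new isolate using distances to reference isolates only.

    query_distances maps reference name -> (core, accessory) distance.
    reference_cluster maps reference name -> cluster label.
    Returns the cluster label of the closest qualifying reference,
    or None if no reference is close enough (a new cluster).
    """
    best = None
    for ref, (core, accessory) in query_distances.items():
        if core <= max_core and accessory <= max_accessory:
            candidate = ((core, accessory), reference_cluster[ref])
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best[1] if best else None


reference_cluster = {"refA": "cluster_1", "refB": "cluster_2"}
new_isolate = {"refA": (0.003, 0.04), "refB": (0.045, 0.50)}
print(assign_to_cluster(new_isolate, reference_cluster))
```

The number of comparisons here scales with the number of references rather than with the size of the whole database, which is the point of selecting a reduced reference set.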
To determine how it might work in a surveillance setting, the researchers tested PopPUNK on a previously published dataset of Escherichia coli isolates collected over the course of a 10-year study that ran from 2001 to 2011. According to their results, PopPUNK successfully classified the prevalence of different strains in the population each year and identified the emergence of antibiotic-resistant strains.
In addition to working with public health agencies to quickly identify and track harmful bacterial strains in outbreaks, PopPUNK could also be used to surveil other kinds of microorganisms such as viruses and parasites using their whole-genome sequences. With viruses, for example, "you don't usually see differences in gene content, but this method is still a very quick way of calculating differences in terms of the shared genes and it would allow you to correct for any differences in length," Croucher said. "[So] it will still work and it would be very fast."
PopPUNK also has potential for use in clinical contexts. "If you know the properties of a strain, you can very quickly see if this is a bacterial strain that we are concerned about [or] is this something that is associated with severe disease or antibiotic resistance, for instance," Croucher said. "It doesn't allow you to infer those properties directly, but it tells you if it's closely related to something with those properties."
In terms of future improvements to the software, the researchers are exploring ways to further optimize the k-mer-based calculations so that PopPUNK remains fast and flexible as datasets continue to grow. "Similarly, as new cluster identification machine learning techniques are developed, we aim to include them as options to the user," Croucher said.