NEW YORK (GenomeWeb) – Researchers in the Prokaryote Super Program at the Department of Energy's Joint Genome Institute and collaborators at the University of North Carolina, Chapel Hill, recently published the details of a computational tool they've developed to quickly and automatically remove contaminants from sequenced microbial genomes.
The so-called Protocol for Decontamination of Genomes (ProDeGe), described in a short communication in The ISME Journal published this month, combines homology- and sequence composition-based approaches to clean up draft assemblies generated from high-throughput sequencing of single amplified genomes.
The novelty of the method, according to the developers, is that it uses an automated protocol to both identify and remove potential contaminants. That's a step up from existing computational methods for sequence decontamination, some of which require significant manual effort and time, and some of which require prior knowledge of contaminants in order to actually remove them from the sequence.
There are multiple points in the sequencing and analysis pipeline where errors can be introduced into the genomes. Contaminants could come from the samples themselves or they could be introduced during the extraction, isolation, amplification, and sequencing steps. There could also be errors and artifacts introduced during the sequence binning and assembly steps. Over time, these contaminants are making their way into public databases as reference sequences.
Manual methods can work well when just a few genomes are involved but they become increasingly inefficient as the number of genomes that need to be checked and cleaned grows. It's a realization that really hit home to researchers at the JGI as they were targeting and sequencing about 200 microbial genomes representing different phyla, Nikos Kyrpides, head of JGI's Prokaryote Super Program, told GenomeWeb. A paper on this project, which was part of the Microbial Dark Matter (MDM) initiative, was published in 2013 in Nature. While those genomes were manually cleaned, it became clear that this would not be feasible in the long run, he said.
That’s where ProDeGe comes in. Essentially, the software has two modules, a homology-based module and a k-mer-based module, which are used to categorize sequences as either "clean" or "contaminant," Kristin Tennessen, a researcher in the JGI's Prokaryote Super Program and the first author on the study, explained to GenomeWeb. Inputs to the system are the draft genome assembly up for decontamination and its corresponding NCBI taxonomy, according to the paper.
In the first step of its protocol, ProDeGe's homology module annotates genes on the input contigs and then uses Blast to assign those genes to a phylogenetic lineage, Tennessen said. Those gene assignments make it possible to determine the most likely lineage for each of the input contigs. The tool then compares the taxonomy it has selected for each contig with the input taxonomy provided by the user and, based on that comparison, categorizes the contig as clean or contaminant or tags it as "unknown." Unknown contigs are the input to the second module. Here, the tool calculates k-mer frequencies for each contig and then uses principal component analysis to assign the previously unknown contigs to the clean or contaminant bins, she said.
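To make the two-stage idea concrete, here is a minimal Python sketch of that workflow. It is not ProDeGe's code: the function names, the `min_hits` threshold, the choice of k = 2, the use of only the first principal component, and the nearest-centroid rule for resolving unknown contigs are all assumptions made for illustration. The sketch also takes gene-level taxon labels as given, whereas the tool derives them with Blast.

```python
from collections import Counter
from itertools import product
import math


def classify_by_homology(gene_taxa, target_taxon, min_hits=2):
    """Label one contig from the taxa assigned to its genes.

    gene_taxa holds one taxon name per gene with a hit (e.g. a phylum).
    min_hits is a hypothetical cutoff below which the contig stays "unknown".
    """
    if len(gene_taxa) < min_hits:
        return "unknown"
    top_taxon, _ = Counter(gene_taxa).most_common(1)[0]
    return "clean" if top_taxon == target_taxon else "contaminant"


def kmer_frequencies(seq, k=2):
    """Return the relative frequency of every A/C/G/T k-mer in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid dividing by zero
    return [counts[m] / total for m in kmers]


def first_pc_scores(rows, iterations=200):
    """Project each row onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Start from the centered row with the largest norm so the start vector
    # lies in the span of the data.
    v = max(centered, key=lambda r: sum(x * x for x in r))
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # all rows identical: no variance to explain
            break
        v = [x / norm for x in w]
    return [sum(a * b for a, b in zip(r, v)) for r in centered]


def resolve_unknowns(sequences, labels, k=2):
    """Assign each "unknown" contig to the nearer labeled centroid on PC1.

    Needs at least one "clean" and one "contaminant" label to define centroids.
    """
    scores = first_pc_scores([kmer_frequencies(s, k) for s in sequences])

    def centroid(label):
        vals = [s for s, l in zip(scores, labels) if l == label]
        return sum(vals) / len(vals)

    clean_c, contam_c = centroid("clean"), centroid("contaminant")
    resolved = []
    for score, label in zip(scores, labels):
        if label == "unknown":
            near_clean = abs(score - clean_c) <= abs(score - contam_c)
            label = "clean" if near_clean else "contaminant"
        resolved.append(label)
    return resolved
```

In this sketch, each contig is first passed through `classify_by_homology`, and the full list of contig sequences and labels then goes to `resolve_unknowns`, which changes only the entries still marked "unknown." A real pipeline would likely use a larger k and more than one principal component.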
According to the paper, ProDeGe removes, on average, 84 percent of sequences that derive from the non-target organism and retains 84 percent of the sequence that derives from the target organism. These percentages were chosen in an attempt to balance the number of clean contigs that the method accepts with the number of contaminated contigs that are removed. Performance tests used data from 182 manually screened single amplified genomes gleaned from two studies, an Arabidopsis endophyte sequencing project and the MDM project. They showed that ProDeGe successfully identified and removed 5,311 potential contaminants from the data that might otherwise have become part of public repositories. The tool completed its categorizations at a rate of 0.30 CPU core hours per megabase of sequence, according to the results.
It is an improvement on the status quo but is not as accurate as manual methods, Kyrpides noted. ProDeGe is programmed to look for, essentially, "foreign" contigs, but these are not always caused by errors. Regions of the genome that are the result of horizontal transfer between organisms would have different k-mer frequencies from the rest of the contigs; ProDeGe doesn't have a mechanism to distinguish such regions from true contaminants and so errs on the side of caution and removes them.
It's a drawback of the method and one the researchers have yet to adequately address. On the one hand, it is good to be overcautious and remove as many potential sources of contamination as possible, Kyrpides said. That's because in a number of cases, these cells are part of novel genetic lineages and serve as phylogenetic anchors for future studies of those lines. As such, keeping them contaminant-free is critical.
On the other hand, it would be valuable to increase the accuracy of the method, he noted. Moving forward, "we'll need to do a little bit more investigation ... in order to figure out ways of reducing the removal of contigs that are truly part of the genome of the organism even though they have different signatures," he said. One potential course of investigation could be to try to find a unique sequence pattern that indicates horizontal transfer that the software could identify. Another approach might be to come up with methods of generating longer contigs that encompass these horizontal transfer regions, which could also be helpful in clarifying what is and isn't part of the genome, he said.
ProDeGe's developers have provided a web interface for researchers to upload and analyze their datasets remotely. It lets users export clean and contaminant contigs and explore gene calls and their taxonomies, among other features. ProDeGe is also available as standalone software that can be downloaded and run locally. The software can be run on a system with Perl, R, and NCBI Blast installed.
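For those planning a local installation, a quick way to confirm that these prerequisites are on the system path is a short check like the Python sketch below. The executable names are assumptions: the paper's requirements are given here only as Perl, R, and NCBI Blast, so the specific Blast program and any minimum versions would need to be confirmed against ProDeGe's own documentation.

```python
import shutil

# Executable names are assumptions; the stated requirements are only
# "Perl, R, and NCBI Blast".
REQUIRED = {
    "Perl": ["perl"],
    "R": ["Rscript", "R"],
    "NCBI Blast": ["blastn", "blastp", "blastx"],
}


def find_prerequisites(required=REQUIRED):
    """Map each prerequisite to the path of the first executable found, or None."""
    found = {}
    for name, candidates in required.items():
        paths = (shutil.which(c) for c in candidates)
        found[name] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for name, path in find_prerequisites().items():
        print(f"{name}: {path or 'not found on PATH'}")
```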
The JGI researchers are already using the system internally to clean all the single-cell genomes that are sequenced at the institute, Kyrpides said, and to assess and clean genome sequences imported from publicly available repositories like GenBank. They are also using ProDeGe to try to clean up possible contamination in genomes assembled from metagenomic sequences. Other efforts are focused on creating a better user experience for researchers who have begun adopting the tool into their research pipelines, Tennessen said.