NEW YORK (GenomeWeb) – Sequences from contaminants like bacteria may remain in mammalian genome assemblies even after they've been deposited into public databases, researchers led by Johns Hopkins University's Steven Salzberg reported in PeerJ this week.
Using microbiome analysis software, Salzberg and his colleagues found that the draft assembly of the domestic cow genome contains some 173 small contigs that actually are microbial contaminants, including many from the Acinetobacter, Pseudomonas, and Stenotrophomonas genera.
This contamination, though, wasn't limited to mammalian genomes. The researchers also reported that one Neisseria gonorrhoeae genome, though uploaded to GenBank as a complete genome, contains a number of sequences that are really from cow or sheep.
"These results illustrate the importance of performing a thorough search for contamination before submitting a genome sequence to a public archive," Salzberg and his colleagues wrote in their paper. "The rapidly growing number of draft genomes represents both a valuable resource and also, as we show here, a cautionary tale."
There are more than 27,000 prokaryotic and 1,600 eukaryotic genomes housed in GenBank, and though the researchers noted that many of these are draft genomes, some 3,000 of the prokaryotic genomes are listed as complete. Such a set of complete genomes, they added, has enabled a number of microbiome analyses that use such data to determine which bacteria are present in a sample of DNA isolated, for instance, from the human body or from soil. Any errors or contaminants in the banked genomes could influence those studies' results.
"If scientists cannot assume that the sequence of a species truly comes from that species, then analyses that use this data may be fundamentally flawed," the researchers added. "Contamination from other species may masquerade as lateral gene transfer, an event that is relatively common between some bacteria but extremely rare otherwise."
Salzberg and his team said that they were intrigued by a recent finding that samples obtained from the cow, Bos taurus, for microbiome analysis seemed to contain a human pathogen that doesn't infect cows. Using microbiome sequence analysis software, they examined the B. taurus genome for signs of microbial contamination.
The draft cow genome had been assembled from some 35 million Sanger reads, most of which mapped to chromosomes. The assembly the researchers used had nearly 3,300 unmapped contigs containing nearly 9.5 million nucleotides. Using Kraken, a method that assigns DNA sequences to species using k-mers, and a database the researchers built from 2,757 bacterial and archaeal genomes and 2,335 viral genomes from RefSeq, they determined that 138 of the B. taurus contigs were bacterial. A BlastX search confirmed those 138 contigs and uncovered a further 35 that were due to contamination, for a total of 173.
The most common contaminants belonged to the Acinetobacter, Pseudomonas, and Stenotrophomonas genera.
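The k-mer matching idea described above can be illustrated with a toy sketch. This is not Kraken's implementation, which is considerably more involved. It assumes a simple majority vote over k-mers that occur in only one reference, and the sequences, labels, and function names below are made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(contig, index, k):
    """Return the label with the most uniquely matching k-mers, or None."""
    votes = Counter()
    for kmer in kmers(contig, k):
        labels = index.get(kmer, ())
        if len(labels) == 1:  # skip k-mers shared by several references
            votes[next(iter(labels))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Made-up sequences for illustration only; not real genomic data.
references = {
    "cow": "ATGGCCTTAGGCTAACCGGTTAGC",
    "bacterium": "TTGACGGCATCGATCGGATCCAAG",
}
index = build_index(references, k=5)
label = classify("GGCATCGATCGG", index, k=5)
```

In this toy setup, a contig filed under the cow assembly whose k-mers match only the bacterial reference would be flagged as a candidate contaminant, which a second search (BlastX in the study) could then confirm.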
One contaminant, the researchers noted, could be traced to the bovine herpesvirus 6, isolate Pennsylvania 47. This cattle virus, which causes a range of diseases, is a herpesvirus rather than a bacterium, and it could conceivably have integrated itself into the cow genome.
Using Mummer to align the bovine herpesvirus against the cow genome, Salzberg and his colleagues found that those original five contigs remained, indicating that the virus had not integrated into the cow genome. They hypothesized the sequenced animal was instead infected with the herpesvirus.
To check themselves, Salzberg and his colleagues also used Kraken to search through all of the cow chromosomes. From this, they found some 2,885 contigs that appeared to align to the bacterium N. gonorrhoeae.
However, they noted that all these contigs aligned to one of four regions of this N. gonorrhoeae strain. When they then aligned these sequences on their own to all sequences in GenBank, they came back as B. taurus.
To try to determine how those foreign sequences became embedded in the N. gonorrhoeae genome, the researchers went back to the original publication of the strain and its GenBank entry. Though it was listed as a finished genome, Salzberg and his team concluded that it was likely uploaded as such by mistake: It contained 180 contigs that were concatenated together.
Using nucmer, they aligned the strain to two of its close relatives, giving 181 separate alignments, but 67 small segments didn't align to either of the related strains.
These small segments, the researchers noted, included the four that had matched the cow genome and a fifth that aligned to Ovis aries, or sheep.
After removing these contaminant contigs from the strain, Salzberg and his colleagues developed a new draft genome of the strain consisting of 165 contigs.
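The step of finding segments that align to neither relative amounts to computing the stretches of a sequence left uncovered by a set of alignment intervals. A minimal sketch follows, assuming the alignments have already been reduced to half-open (start, end) coordinates on the concatenated genome. The function name and the coordinates are hypothetical and are not taken from the paper or from nucmer's output format.

```python
def unaligned_segments(length, alignments):
    """Return half-open (start, end) intervals of [0, length) that no alignment covers."""
    gaps = []
    cursor = 0  # end of the region covered so far
    for start, end in sorted(alignments):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        gaps.append((cursor, length))
    return gaps


# Hypothetical coordinates, pooled from alignments to both relatives.
alignments = [(0, 400), (350, 900), (1000, 1500), (1600, 2000)]
gaps = unaligned_segments(2100, alignments)
```

Each uncovered interval is only a candidate. As in the study, it would still need to be searched against a broad database to tell foreign sequence apart from genuine strain-specific sequence.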
"However, because we did not have access to the original TCDC-NG08107 data and because the original submitters did not respond to any requests for data, we cannot be confident that these contigs are the best representation of the genome," the researchers said. "As a result of our findings, GenBank has temporarily suppressed the entry for this genome."
To gauge how widespread contamination is in publicly available genomes, the researchers randomly chose eight genomes from the NCBI database for further evaluation. Three of those eight genomes had between two and four contaminant contigs, and one had upwards of 225.
While Salzberg and his colleagues attributed these instances of contamination to causes such as viral infection of the cow chosen for sequencing, other recent studies have indicated that reagents and other steps of the sample prep process may also be sources of contamination, as GenomeWeb has reported. For instance, a study appearing in the Journal of Virology traced what was first thought to be a new hybrid DNA virus to DNA contamination of silica spin column products that had been used for nucleic acid purification.