CHICAGO – Bioinformaticians with the Australian e-Health Research Centre at the Commonwealth Scientific and Industrial Research Organisation have developed a computing algorithm to detect foreign DNA in whole-genome sequences by looking for shifts in k-mer signatures. Unlike previous alignment-free methods, this algorithm works without prior knowledge of the genome or the inserted DNA.
Called Inserted Sequence Information Detector, or INSIDER, the method converts sequences of variable lengths into genomic signatures, then analyzes the signatures of genomic segments to identify the origins of sequence clusters. INSIDER does not require a reference genome, nor does it need to know the genetic material that was actually inserted.
The CSIRO team described the method in a paper published last month in the Computational and Structural Biotechnology Journal. The paper calls INSIDER "the first tool specifically designed to specifically function with no prior knowledge about the genome, meaning it can be readily used to [analyze] completely novel genomes."
Denis Bauer, bioinformatics group leader at CSIRO, said that the issue of foreign DNA has popped up not only in genetics research but in biosecurity, which CSIRO has an interest in as well. She noted that foreign DNA could have negative or positive effects on the rest of an organism; the latter might happen in the case of a CRISPR gene edit.
Bauer said that INSIDER originated from COVID-19 research early in 2020, in which CSIRO was tasked with differentiating strains of SARS-CoV-2. In that work, described in an April 2020 paper in Transboundary and Emerging Diseases, Bauer and colleagues calculated k-mer frequencies from whole-genome sequences of viral isolates, then plotted the distances between strains, which can differ from traditional phylogenetic-tree distances that only show mutations.
"This one allows us to look at deletions and complex rearrangements as well," Bauer said. "Well, if we can differentiate between strains of viruses, can we then go deeper and differentiate between blocks of foreign DNA?"
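As a rough illustration of the k-mer comparison Bauer describes, the Python sketch below turns each sequence into a vector of k-mer frequencies and measures the distance between two such vectors. It is not the CSIRO code: the function names, the choice of k = 3, the Euclidean distance, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3):
    """Return normalized frequencies for all 4**k DNA k-mers, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other letters (such as N) are ignored
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

def profile_distance(seq_a, seq_b, k=3):
    """Euclidean distance between two k-mer profiles (an assumed metric)."""
    return math.dist(kmer_profile(seq_a, k), kmer_profile(seq_b, k))

# Toy sequences, invented for illustration only
strain_a = "ACGTACGTTAGCTAGCTAGGATCCA"
strain_b = "ACGTACGTTAGCTAGGATCCA"      # strain_a with a short deletion
unrelated = "GGGCGCGCCGCGGGCCGCGCGCGGC"

print(profile_distance(strain_a, strain_b))
print(profile_distance(strain_a, unrelated))
```

Because the profile counts every k-mer in the sequence, a deletion or rearrangement changes the vector even where no single base has mutated, which is the property Bauer points to.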
In the Computational and Structural Biotechnology Journal article, the researchers wrote that the ability to identify foreign DNA for health and biosecurity purposes, including gene drives and antimicrobial resistance, "does not exist for poorly characterized host genomes or with limited information about the integrated sequence."
They said that the identification of foreign DNA with previous methods was "time-consuming and complicated, requiring significant manual processing."
The researchers decided to take a metagenomic strategy, looking at genomic signatures to identify sequences that came from a different species than the host organism. "This approach significantly reduces the search space from an entire genome to a more focused selection of potential sequences of interest," according to the paper.
Indeed, the CSIRO bioinformaticians chose gene drives and antimicrobial resistance as use cases for testing the INSIDER algorithm. They also built a synthetic dataset to simulate the placement of an RNA-guided CRISPR-Cas gene drive into the genomes of wild-type yeast and bacteria.
"Being able to monitor the acquired sequences by distinguishing foreign from host genome is vital for a range of health, ecological, and environmental applications, such as monitoring the spread of antimicrobial resistance (AMR) or monitoring genetic changes in wild populations," the researchers wrote. "INSIDER is therefore a powerful tool that will streamline the process of identifying integrated DNA of unknown origin in poorly characterized wild species, allowing for enhanced monitoring of emerging biosecurity threats."
After confirming that k-mer signatures can help them determine that sequences came from different organisms, the researchers generated short sequences from yeast, fruit fly, zebrafish, mouse, and human genomes. They were able to determine that their method is accurate with clusters as small as 2 kilobases.
"INSIDER can streamline the process of identifying integrated DNA, reducing the search space from the entire genome to only targeted sequences, and requires no prior knowledge about the genome or inserted sequence," they wrote.
The CSIRO team created three use cases for its study: one looking at gene drives, one examining antimicrobial resistance, and a third made up of synthetic data. The synthetic data was necessary because INSIDER struggled to differentiate between human and mouse cells, according to Bauer.
INSIDER builds genetic profiles by counting the frequency of k-mers in each sequence, then compares each profile to the average profile of a particular organism. This allows the algorithm to determine whether something in the sequence is abnormal.
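The profile-and-compare idea can be sketched in a few lines of Python. This is a simplified illustration, not INSIDER itself: the fixed 100-base windows, k = 3, the z-score cutoff, the function names, and the toy genome are assumptions chosen to keep the example small, and the published method may segment and score sequences differently.

```python
from collections import Counter
from itertools import product
import math
import statistics

K = 3  # assumed k-mer length for the example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_profile(seq):
    """Normalized frequency of every DNA k-mer of length K in seq."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[m] for m in KMERS) or 1
    return [counts[m] / total for m in KMERS]

def flag_unusual_windows(genome, window=100, z_cutoff=2.0):
    """Return indices of windows whose profile is far from the genome average.

    The z-score rule is an illustrative stand-in, not the published method.
    """
    windows = [genome[i:i + window]
               for i in range(0, len(genome) - window + 1, window)]
    profiles = [kmer_profile(w) for w in windows]
    # Average profile across all windows, one value per k-mer
    mean_profile = [statistics.mean(col) for col in zip(*profiles)]
    dists = [math.dist(p, mean_profile) for p in profiles]
    mu, sigma = statistics.mean(dists), statistics.pstdev(dists)
    if sigma == 0:
        return []  # every window looks the same
    return [i for i, d in enumerate(dists) if (d - mu) / sigma > z_cutoff]

# Toy genome: an AT-rich "host" with a GC-rich block inserted at position 500
host_unit = "ATTATAATTA"
insert = "GCCGGCGCGC" * 10
genome = host_unit * 50 + insert + host_unit * 50

print(flag_unusual_windows(genome))  # the inserted block fills window 5
```

Neither a reference genome nor the inserted sequence is supplied here; the only baseline is the average computed from the sample itself, which mirrors the no-prior-knowledge property the researchers emphasize.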
"It is the first method that really [allows] us to do that," Bauer said. "I think there is a lot of need for an algorithm like that."
Steffen Pallarz, bioinformatics officer at the German Federal Office of Consumer Protection and Food Safety, called the translation of genetic code to k-mer profiles "very promising" as a means of understanding unknown genetic samples.
"Being able to quickly compare and analyze samples, without the need for reference genomes, and therefore without the resource-intensive alignment or assembly steps, is a great advantage for numerous analysis aims," Pallarz said via email. "Also, using profiles to filter a metagenomics dataset can be very efficient."
Pallarz, a former postdoctoral researcher in bioinformatics knowledge management at Humboldt University of Berlin, noted that he employed a similar technique in his doctoral dissertation to look for infections in plant transcriptome samples.
Bauer stopped short of calling INSIDER a major breakthrough or some other superlative, and noted that there is always room for improvement. She added that the CSIRO bioinformatics team is now looking for limitations in the technology, such as by making k-mers larger to test the sensitivity of INSIDER.
"Potentially, we could differentiate between mouse and human or potentially even closer-related species," Bauer said. "We have to find the best practice for application cases because it would differ between application cases," she added.
The CSIRO researchers said that future work might look to incorporate common genomic signatures into INSIDER to be more precise at distinguishing between significant and expected signature variation when examining closely related species. They surmised that longer k-mers could help with creating this separation.
Bauer believes that this method can have applicability far beyond the three use cases in the published research. "I'm still quite surprised at such a simplistic method," she said.
"It's a simplistic idea, but it is so powerful. We can apply it to any area," Bauer explained. "Antimicrobial resistance and gene drives are the first things that we were thinking of, but now the sky's the limit, I guess."