Bioinformatics Pipeline Annotates CRISPR Repeats, Cas Genes; Assembles Data in One Repository


NEW YORK (GenomeWeb) – Given the great versatility of the CRISPR-Cas proteins that have been discovered to date, it's perhaps not surprising that various research groups are now turning their attention to finding more of the enzymes. And considering the wealth of sequencing data emerging from microbiome projects that span sites from the human navel to gas-filled craters in Turkmenistan, it's not a stretch to think there are new Cas proteins to be found.

Indeed, barely a month ago, early-stage life sciences company Arbor Biotechnologies and the team of Patrick Hsu at the Salk Institute for Biological Studies published concurrent studies in Molecular Cell and Cell, respectively, in which they each described their separate discoveries of a new Type VI CRISPR system called Cas13d. But it wasn't just the new protein that the respective teams were eager to tout — it was how they found it.

Arbor has a proprietary biomolecule discovery platform which the firm said "employs a diverse set of technologies and techniques — including artificial intelligence, genome sequencing, gene synthesis, and high-throughput screening — to curate and mine the natural genetic diversity for impactful peptides, proteins, and enzymes" in order to "enable the high-throughput discovery and identification of enzymes that provide new protein functionalities and catalytic activities."

As for the Salk team, the study's first author Silvana Konermann told GenomeWeb in March that the team built its computational search program to look for the "core feature" of any CRISPR system: the array, the contiguous stretch of DNA that distinguishes one Cas from another. Once the program found what it thought were proteins, she added, the researchers had to cluster them into families to see whether they matched any families that had been previously described. Konermann and Hsu also said their program has found additional Cas proteins.

But just because there's data lying around and new Cas proteins waiting to be discovered, that doesn't mean they're easy to look for. "I've been looking for and looking at CRISPR-Cas systems for 14 years, and unless you're an expert and you've done it before, it's kind of hard to get started. It is pretty complicated, it is pretty technical, and there are … surprisingly still very few resources online for the non-expert to get started," Rodolphe Barrangou, associate professor at North Carolina State University, told GenomeWeb.

In fact, he added, the lack of resources, coupled with the sheer amount of data that regularly appears and waits to be mined for new CRISPR-Cas systems, was the impetus for Barrangou, Alexandra Crawley (a researcher in his lab), and James Henriksen, a bioinformatician at agricultural biotechnology company AgBiome, to develop a new automated pipeline called CRISPRdisco (CRISPR discovery). The pipeline identifies CRISPR repeats and Cas genes in genome assemblies, determines their type and subtype, and describes how complete or incomplete the systems are.

"The cost of sequencing isn't a limiting factor anymore, but the analytics and interpretation of the data are," Barrangou noted. "There was a need [for CRISPRdisco] and certainly this was the time."

A different kind of pipeline

There are some differences between CRISPRdisco and the pipelines from the Arbor and Salk teams. For one thing, Barrangou noted, those programs are focused on looking for and finding new Cas proteins whereas CRISPRdisco searches for all components of a CRISPR-Cas system.

"Homology between reference sets of proteins and the detection of CRISPR repeats, along with typing logic, are used to categorize systems…. The CRISPR arrays are identified using minCED (mining CRISPRs in environmental data sets), a derivative of CRISPR Recognition Tool that is more conservative in repeat calling and allows more flexible user outputs," the authors wrote in their paper describing CRISPRdisco, which was published April 9 in The CRISPR Journal.

Custom code determines the orientation of the repeats, generates the consensus repeat sequences, and returns the number of repeats, indicating the size of the array, the researchers added. Once CRISPR loci have been identified, the presence and absence of genes are used to assign type and subtype, detect multiple systems in a genome, and determine the completeness of the system through the identification of missing repeats and Cas proteins.
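To make those steps concrete, here is a minimal Python sketch of two of them: building a consensus repeat and using the presence or absence of genes to suggest a type and flag missing components. It is not CRISPRdisco's code. The function names, the three-entry signature-gene table, and the rule that a complete system needs cas1 and cas2 are simplifying assumptions made only for illustration; the actual pipeline relies on curated reference protein sets and more detailed typing logic.

```python
from collections import Counter

# Hypothetical, heavily simplified signature-gene table (an assumption for
# illustration only; it is not the reference set shipped with CRISPRdisco).
SIGNATURE_GENES = {
    "cas3": "Type I",
    "cas9": "Type II",
    "cas10": "Type III",
}

# Assumption: treat the adaptation genes cas1 and cas2 as required for a
# "complete" system. Real completeness rules are more nuanced.
CORE_GENES = {"cas1", "cas2"}


def consensus_repeat(repeats):
    """Return the majority-vote consensus of equal-length repeat sequences."""
    if not repeats:
        raise ValueError("at least one repeat is required")
    length = len(repeats[0])
    if any(len(r) != length for r in repeats):
        raise ValueError("this sketch only handles repeats of equal length")
    # Walk the repeats column by column and keep the most common base.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*repeats))


def summarize_locus(repeats, genes):
    """Summarize one locus: consensus repeat, array size, type, missing genes."""
    found = {g.lower() for g in genes}
    types = sorted({SIGNATURE_GENES[g] for g in found if g in SIGNATURE_GENES})
    missing = sorted(CORE_GENES - found)
    return {
        "consensus": consensus_repeat(repeats) if repeats else None,
        "n_repeats": len(repeats),  # number of repeats indicates array size
        "types": types,             # more than one entry hints at multiple systems
        "missing_core_genes": missing,
        "complete": bool(repeats) and len(types) == 1 and not missing,
    }


if __name__ == "__main__":
    locus = summarize_locus(
        ["GTTTTAGA", "GTTTTAGA", "GTTCTAGA"],  # toy repeats, not real data
        ["cas9", "cas1", "cas2"],
    )
    print(locus)
```

In this toy version, a locus with repeats but no Cas genes, or with Cas genes but no repeats, is reported as incomplete, which mirrors the idea of judging a system by whether all of its pieces are present.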

In other words, Barrangou said, "Up until now there's no tool that looks for both CRISPR and Cas. If you think of CRISPR-Cas systems as puzzles, you need multiple pieces to be able to confidently understand what system you have, not just one CRISPR-Cas protein or one CRISPR repeat, but the combination of all the Cas genes you need for the various steps of the CRISPR-Cas systems to actually operate and function, with regard to acquiring new spacers, to transcribing guide RNAs, targeting DNA. So often times, each piece of that puzzle can be hard to identify or annotate. And that's the value here."

For a researcher to know that he or she has all those pieces in place not only lends confidence that what they have in hand is a legitimate Cas protein or CRISPR-Cas system, but also that it works and that it's worth taking the time to study, he added.

Other pipelines look only for new Cas proteins, which is a valuable activity in its own right, but what they are doing is looking for proteins that are similar to known Cas proteins yet different enough that they could represent a different subtype. CRISPRdisco, on the other hand, takes all the "puzzle pieces from genomic data" and tries to piece them together into complete CRISPR-Cas systems, he said.

A single database

The other main objective of the project, according to Barrangou and Crawley, was to create a single database to house all information related to CRISPR research, including all annotations, discoveries of new Cas proteins, and so on. As of yet, no such repository exists and all the available research is in disparate locations, making it hard for any one researcher to look up all the information that may be pertinent to them.

"We took the current published literature and created a database out of that. That is one thing that was lacking — a consolidated, single-source database of knowledge for CRISPR-Cas systems — and that's what we started with this tool," Crawley said. "That's really the meat behind the program — taking all these sources and consolidating them into a curated database."

Indeed, the team began with a collection of 5,201 replicons from bacterial and archaeal genomes categorized as full and complete in the RefSeq database, and used them to build and fine-tune the pipeline. To determine the pipeline's accuracy, they compared its output with tools that are currently in use and with publications considered to be the gold standard of CRISPR classification.

"The pipeline showed agreement with CRISPR repeat detection in 94 percent of genomes analyzed with the CRISPRdb," the authors wrote in their paper. "When our pipeline disagreed with the presence or absence of CRISPR components in genomes, we used the entire CRISPR-Cas locus to determine which annotation software was more likely to be accurate…. When using the whole locus to determine the accuracy of the pipeline, we are more than 98 percent accurate in CRISPR repeat calling and 99 percent accurate in Cas detection relative to these other sources."

Further, they noted, of the 1,963 CRISPR elements they detected in the 5,201 genomes and plasmids from the RefSeq database, they identified only 1,065 complete CRISPR-Cas systems where Cas genes co-localized with CRISPR repeats.
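As a quick check on those figures, the short Python snippet below works out what share of the detected CRISPR elements belonged to complete systems, which comes to roughly 54 percent. The variable names are illustrative, and the calculation assumes that each complete system corresponds to one detected CRISPR element, which the paper's counts may not strictly imply.

```python
# Counts as reported in the article.
replicons = 5201          # RefSeq genomes and plasmids analyzed
crispr_elements = 1963    # CRISPR elements detected
complete_systems = 1065   # systems where Cas genes co-localized with repeats

# Assumption: one complete system maps to one detected CRISPR element.
complete_fraction = complete_systems / crispr_elements
not_complete = crispr_elements - complete_systems

print(f"Share of elements in complete systems: {complete_fraction:.1%}")
print(f"Elements not counted as complete systems: {not_complete}")
```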

"I think the keyword here is curation. There's a lot of data out there, but not all that data is good, or correctly annotated, or insightful, or even useful to the users," Barrangou said. "And in addition to the tool itself, much of the value of the paper lies in the actual curated set of data that comes with it that enables the user to correctly not just find or discover CRISPR-Cas systems, but also name them correctly; describe them in the right class, type, and subtype; and to see whether they're new, novel, or not, and whether they match canonically curated reference sets or not. There's value there both for the expert and the non-expert."

Open access

Another big difference between CRISPRdisco and other recently described CRISPR discovery platforms is that CRISPRdisco is free: the developers are offering it as open-access software on GitHub.

Barrangou and Crawley hope that by providing the tool to anyone who needs it, not only will other researchers use it to add to the CRISPR database they've created, but they'll also find ways to improve the tool itself. "We want people to use it, we want people to be able to take it and make it better. As our knowledge of CRISPR grows, we want the tool to grow with it," Crawley said. "We really want this to be a place where CRISPR experts can pour knowledge behind the tool. But anyone — the casual microbiologist, biotech scientists — anyone can take that knowledge that the CRISPR community has published ... and use that for their own research."

And while users are not required to give back any data or make improvements to CRISPRdisco, she noted, the trend in bioinformatics is moving toward a more collaborative environment where researchers share each other's code and help each other improve programs to the benefit of the entire community. "As new [Cas protein] types or subtypes are discovered, we would love for people to make this part of the tool so we can continue to build on the knowledgebase," Crawley added. "And that's something that we'll probably keep abreast of. But we hope that as people learn more about CRISPR, they contribute back."

And GitHub is the perfect place for that kind of sharing and collaboration, Barrangou noted, adding, "By its nature, it is a sharing portal and you can download, access, update, upgrade, alter, comment, share, and reshare, all those tools accordingly. Or bring up issues with the tool that need to be addressed."

In the end, the researchers stressed, this is a tool that anyone can use, regardless of their level of expertise or their possible application of interest.

"What we hope this will do is help us understand the path between finding new systems in bacteria and predicting their functionality and whether or not they can be used for potential genome editing and beyond," Crawley said. "One thing is that if you look in the literature, people are really only using about six proteins. When people say Cas9, they don't mean all Cas9, they mean SpCas9. I think that's the thing that has really gotten lost — we only know a lot about a very small number of systems, and… I would argue that anyone who is interested in CRISPR, either from a microbiology standpoint or from a technology standpoint, should be interested in the tool."