NEW YORK (GenomeWeb) – A new algorithm for calculating on-target guide RNA efficiency is taking off in the field, both in home-grown and commercial CRISPR/Cas9 experiment design software.
The fruit of a collaboration between scientists at the Broad Institute and machine learning experts at Microsoft Research, the algorithm has uncovered new rules and considerations for how well a single guide RNA (sgRNA) will work for gene knock-down experiments, with results that can be broadly applied.
"You can measure these things in the laboratory, but to systematically do this genome-wide to develop some library that the community can use, it would take years and millions of dollars," Jennifer Listgarten, a senior researcher at Microsoft Research in Cambridge, Massachusetts, told GenomeWeb. "The idea is you do things much more quickly and cheaply by using in silico modeling of things you could have done in the lab."
While scientists had tried to tackle sgRNA design using probabilistic models before, those models were often too simple to account for the true complexity of the system. Some had even tried to throw so-called off-the-shelf machine learning techniques at the problem, with positive but limited results, Listgarten said.
Using data supplied by sgRNA guru John Doench of the Broad Institute, Listgarten and her team applied complex non-linear models that can capture the interactions going on in the system that affect on-target activity. "We were able to sit down and start with zero assumptions and build up a model that could predict on-target for single guide RNAs," Nicolo Fusi, a researcher at Microsoft who worked with Listgarten and Doench, said. They published their results in February in Nature Biotechnology.
The Microsoft team developed an algorithm called Azimuth that scores guides for on-target activity, which Doench used to build a new library of sgRNAs for gene expression knock-down. "If you want to knock down some gene, you can deploy a CRISPR system to do that in hundreds if not thousands of places in a gene," Listgarten said. "Only some of those actually work well. You could go out to the lab and say, test all of those out for the one gene, or, instead you can do that for a handful of genes. We use a modeling technique that captures that biological knowledge in the form of a machine learning predictive model. And then we employ that model genome-wide where people had never done the wet lab experiments."
Now, CRISPR experiment design software packages like Benchling and Deskgen are incorporating the results, spreading the improvements throughout the field.
The way the collaboration began is the kind of story that validates so-called "innovation hubs" such as Cambridge's Kendall Square. Listgarten recalled that she and Fusi got an announcement that Doench would be giving a talk on CRISPR at the Broad Institute, just around the corner from Microsoft's offices. "We heard this CRISPR stuff is changing the world, so we thought, 'wouldn't it be cool to learn more?'"
Doench is well known in the CRISPR community for developing sgRNA on-target activity algorithms, but Listgarten said it was clear that he could use her expertise. "We approached him at the end and said, 'We think this is cool science and we'd love to work together if that makes sense.' Within a week he visited us and within another week gave us data" to work with. Within a few days, the Microsoft team demonstrated that it could help generate better sgRNA scoring and the collaboration began in earnest.
The machine learning experts quickly found ways to improve the rules for selecting sgRNAs.
Fusi said the single most impactful change was switching from classifying guides as either working or not working for gene knock-down to scoring them on a continuous scale. "This boosted our performance immediately and it was a huge improvement," he said.
Previously, scientists had been using a lab-based assay with a continuous readout on gene expression, but were applying thresholds to sort guides into a binary classification. Listgarten said she wasn't entirely sure why, but she and Fusi saw that it was affecting the results. "This creates a huge bottleneck in the way the model assimilates information," she said. "[The model] doesn't get to see the nuances of the guides that work halfway."
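The bottleneck Listgarten describes can be illustrated with a minimal Python sketch. The activity values and the 0.8 cutoff below are invented for illustration and are not taken from the study; the point is only that thresholding makes guides with very different measured activity indistinguishable to a model.

```python
# Hypothetical normalized activity measurements for five guides
# (0 = inactive, 1 = fully active). Invented for illustration only.
activities = [0.05, 0.45, 0.55, 0.85, 0.95]

# Assumed cutoff; the article does not say which thresholds earlier work used.
THRESHOLD = 0.8

def binarize(values, threshold):
    """Collapse continuous activity scores into working (1) / not working (0)."""
    return [1 if v >= threshold else 0 for v in values]

binary_labels = binarize(activities, THRESHOLD)

# After thresholding, the guide at 0.05 and the guide at 0.55 share a label,
# so a model trained on these labels cannot tell them apart.
distinct_continuous = len(set(activities))
distinct_binary = len(set(binary_labels))
print(binary_labels)
print(distinct_continuous, distinct_binary)
```

A model trained on the continuous values keeps five distinct targets here, while one trained on the binary labels sees only two.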
Allowing for complexities created an explosion of criteria, many of which are not readily translated back into English.
"They're all entangled together," Listgarten said. "The most you can do in explaining it is to tease it apart in simpler ways than the model actually accounts for it." But by taking one feature out of the model, the researchers could see how much worse it performed, partially isolating important aspects of sgRNA design.
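The leave-one-feature-out idea can be sketched in a few lines of Python. This is not the researchers' code or model: the feature names, the toy data, and the simple lookup-table "model" are all assumptions made for illustration, and error is measured on the training data rather than on held-out guides.

```python
from collections import defaultdict

# Toy data, invented for illustration: each guide has three binary features
# and a measured activity. "batch" is deliberately uninformative.
FEATURES = ["gc_high", "pos3_A", "batch"]
GUIDES = [
    ((0, 0, 0), 0.2), ((0, 0, 1), 0.2),
    ((0, 1, 0), 0.3), ((0, 1, 1), 0.3),
    ((1, 0, 0), 0.8), ((1, 0, 1), 0.8),
    ((1, 1, 0), 0.9), ((1, 1, 1), 0.9),
]

def mse_with(feature_idx):
    """Fit a lookup-table model on the chosen features and return its MSE.

    The model predicts the mean activity of all guides that share the same
    values for the chosen features. Error is computed on the training data,
    a simplification; a real analysis would use held-out guides.
    """
    groups = defaultdict(list)
    for values, activity in GUIDES:
        key = tuple(values[i] for i in feature_idx)
        groups[key].append(activity)
    means = {key: sum(ys) / len(ys) for key, ys in groups.items()}
    errors = [
        (activity - means[tuple(values[i] for i in feature_idx)]) ** 2
        for values, activity in GUIDES
    ]
    return sum(errors) / len(errors)

all_idx = list(range(len(FEATURES)))
baseline = mse_with(all_idx)

# Remove one feature at a time and record how much the error grows.
ablation = {}
for i, name in enumerate(FEATURES):
    kept = [j for j in all_idx if j != i]
    ablation[name] = mse_with(kept) - baseline

# Features whose removal hurts most come first.
ranked = sorted(ablation, key=ablation.get, reverse=True)
print(ranked)
```

In this toy setup, dropping the uninformative feature costs nothing, while dropping the strongest feature raises the error the most. As Listgarten notes, this only partially isolates a feature's role when features interact.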
Not only did Azimuth find things people had already deemed important, such as the guanine and cytosine content in the guide region, it found many more.
"GC content is a proxy for thermodynamics," Listgarten said. "We actually used more proper thermodynamics."
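For readers unfamiliar with the simpler proxy, GC content is just the fraction of G and C bases in the guide. The Python sketch below computes it for a made-up 20-nucleotide sequence; it does not reproduce the fuller thermodynamic features Listgarten refers to, which the article does not specify.

```python
def gc_content(guide):
    """Return the fraction of G and C bases in a guide sequence."""
    guide = guide.upper()
    if not guide:
        raise ValueError("guide sequence is empty")
    return sum(base in "GC" for base in guide) / len(guide)

# Hypothetical 20-nucleotide guide, made up for this example.
example_guide = "GACGTTACCGGATCAAGCTG"
print(gc_content(example_guide))
```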
Fusi said that Doench's intuition was also helpful in finding important features. The canonical protospacer-adjacent motif (PAM), or recognition site, for Cas9 from Streptococcus pyogenes is NGG. Doench thought that there might be an interaction between the N and whichever letter came after the GG. "We included this feature into our model and it turned out to be predictive, and in a very particular way," Fusi said. "You need to know both what follows the NGG and what the N in the NGG is. And you have to know these at the same time. Depending on what the pair is, it makes [the guide] more or less active in knock down."
There are many different features, and some rank higher in importance than others. "We can't do effect sizes, because there is no such notion," Listgarten explained, but there is a proxy for effect size, called Gini importance. Different classes of features are ranked and can be drilled down into. The model allows a scientist to ask very specific questions, such as "Does an A in position three matter?" Listgarten said.
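Importance measures of this kind come from tree-based models, where they are built up from how much each split on a feature reduces the spread of the outcome. The Python sketch below computes that reduction for a single yes/no question, using variance as the spread measure for a continuous outcome. It is a simplified illustration with invented data and hypothetical function names, not the published model; a real importance score would aggregate such reductions over many splits and trees.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(guides, position, base):
    """Variance reduction from splitting guides on 'is `base` at `position`?'.

    guides   : list of (sequence, activity) pairs
    position : 1-based position in the guide (an assumed convention)
    """
    activities = [a for _, a in guides]
    yes = [a for seq, a in guides if seq[position - 1] == base]
    no = [a for seq, a in guides if seq[position - 1] != base]
    n = len(activities)
    weighted_child = (len(yes) * variance(yes) + len(no) * variance(no)) / n
    return variance(activities) - weighted_child

# Invented toy data: short sequences and activities, for illustration only.
toy_guides = [
    ("GCAT", 0.9), ("TTAG", 0.8), ("CGAC", 0.9),
    ("GCTT", 0.2), ("TTCG", 0.1), ("CGGC", 0.2),
]
print(impurity_decrease(toy_guides, 3, "A"))  # large: A at position 3 tracks activity
print(impurity_decrease(toy_guides, 1, "G"))  # small: G at position 1 does not
```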
Once the paper hit the field, people scrambled to incorporate the algorithm into their guide RNA design tools. Saji Wickramasekara, CEO of lab notebook developer Benchling, told GenomeWeb his firm has already incorporated the findings as an advanced feature. Edward Perello, co-founder of Desktop Genetics, said in an email that the firm's Deskgen platform has implemented a particular version of the model.
They were able to do so because Fusi and Listgarten made sure that the algorithm was available in a number of ways, for both research and commercial use.
Benchling's engineers have been in direct contact with Doench and the Microsoft team, Fusi said, while Deskgen noted on its blog that it has used code provided by the Microsoft researchers on the GitHub code repository.
The Broad Institute has also incorporated the model into its web-based sgRNA design tool and Microsoft has implemented it on its Azure cloud platform.
"People can interact with it via Excel or their language of choice," Listgarten said. "There's a bit of code you can insert [into your program] to ping our cloud server. We very purposefully made it available in all these different ways because it suits different people with different expertise and desires in how they want to use it."
Fusi added, "The more people that use it, the better it is for everyone."