NEW YORK – Researchers affiliated with the Arc Institute, Stanford University, and Nvidia have released what they say is the largest biological artificial intelligence (AI) model to date and have made it free to use and open to all interested researchers.
The model, called Evo 2, autonomously predicts the functional consequences of genetic variation in both coding and noncoding sequences, and can be used to design new mitochondrial, prokaryotic, and eukaryotic sequences at genome scale.
A study of Evo 2's capabilities has been submitted as a preprint to bioRxiv and is currently available on the website of the Arc Institute, a nonprofit research institute that was founded in 2021 and operates in partnership with Stanford, the University of California, Berkeley, and UC San Francisco.
Evo 2 was trained on some 9.3 trillion DNA base pairs from 128,000 genomes curated from organisms spanning all domains of life, building on its predecessor, Evo, which was trained on 80,000 prokaryotic genomes.
Importantly, Evo 2 was also trained on both DNA and RNA sequences, enabling it to learn features relevant to the molecular fitness of both, such as mRNA decay rates, from which researchers can infer gene regulatory features such as transcription factor activity.
Another key feature of the model is that it is trained to recognize distant regulatory DNA sequences. These stretches of noncoding DNA often influence the expression of far-flung genes. In the study, Evo 2 identified regulatory sequences as far as 1 million nucleotides away from the genes they control. This, the study's authors argue, could help researchers understand functional relationships between distant parts of a genome that might otherwise be very difficult to study.
As a test of the model's accuracy, the investigators used it to predict the effects of known mutations in the breast cancer-associated BRCA1 and BRCA2 genes. Overall, Evo 2 performed similarly to other models with respect to coding variants and outperformed competitors when evaluating noncoding variants and combined coding and noncoding variants.
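A common way to score variants with a genomic language model, and plausibly how this kind of evaluation works in principle, is to compare the likelihood the model assigns to the reference sequence against the likelihood of the mutated sequence. The sketch below illustrates that idea only; `DummyModel` and its `log_likelihood` method are hypothetical stand-ins, not Evo 2's actual API.

```python
class DummyModel:
    """Toy stand-in for a genomic language model. It assigns higher
    log-likelihood to sequences that match a fixed 'consensus' motif;
    a trained model would instead score sequences it learned from data."""

    CONSENSUS = "ATGGCCATTGTAATGGGCCGC"

    def log_likelihood(self, seq: str) -> float:
        # Zero for a perfect match to the consensus, negative otherwise.
        matches = sum(a == b for a, b in zip(seq, self.CONSENSUS))
        return float(matches - len(seq))


def variant_effect_score(model, ref_seq: str, pos: int, alt_base: str) -> float:
    """Delta log-likelihood of the variant vs. the reference sequence.
    More negative scores suggest a more disruptive (deleterious) variant."""
    mut_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model.log_likelihood(mut_seq) - model.log_likelihood(ref_seq)


model = DummyModel()
ref = DummyModel.CONSENSUS
score = variant_effect_score(model, ref, 3, "T")  # substitute base at position 3
```

Under this scoring scheme, a substitution the model considers unlikely yields a negative delta, while a neutral change scores near zero; the published BRCA1/BRCA2 benchmarks compare such scores against clinically annotated variant labels.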
The team continued testing Evo 2's capabilities at increasing levels of complexity, such as genome organization. In the Escherichia coli genome, Evo 2 identified features corresponding to open reading frames and intergenic regions, among others, as well as features associated with protein secondary structures such as alpha-helices and beta-sheets, suggesting that the model captures higher-order structural information beyond the DNA sequence itself.
Extending this analysis to the human genome, the researchers found further evidence to suggest that the model can discern regulatory elements, such as transcription factor binding motifs and variations in exon-intron architecture in addition to gene-coding features.
In light of these findings, the authors wrote that Evo 2 effectively captures a "broad spectrum of biologically relevant signals, from mobile genetic elements and regulatory motifs to protein secondary structure and mutational severity."
Because Evo 2 is fundamentally a generative model trained to predict the next base pair in a sequence, the researchers also tested its ability to build whole genomes based on genomic prompts, akin to the textual prompts given to generative AI engines such as ChatGPT.
Supplied with portions of human mitochondrial DNA, Evo 2 produced mitochondrial genomes with the correct number of coding sequences, tRNA, and rRNA, as well as diverse mitochondrial genes with varying degrees of sequence identity to naturally occurring mitochondrial proteins. This task was repeated with several bacterial genomes and the eukaryotic yeast Saccharomyces cerevisiae with similar results.
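The generation procedure described above is standard autoregressive sampling: given a prompt, the model repeatedly samples the next base from its predicted distribution and appends it. The sketch below shows that loop with a toy stand-in distribution; `toy_dist` is a hypothetical placeholder for a trained model's conditional next-base probabilities, not Evo 2's interface.

```python
import random


def generate(prompt: str, n_new: int, next_base_dist, seed: int = 0) -> str:
    """Autoregressively extend a DNA prompt by n_new bases, sampling each
    next base from the distribution conditioned on the sequence so far."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(n_new):
        bases, weights = next_base_dist(seq)
        seq += rng.choices(bases, weights=weights, k=1)[0]
    return seq


def toy_dist(seq: str):
    """Hypothetical stand-in for a learned model: a fixed distribution
    mildly favoring G/C. A real model would condition on `seq`."""
    return "ACGT", [1, 3, 3, 1]


out = generate("ATG", 10, toy_dist)  # prompt plus 10 sampled bases
```

In the actual study, the "prompt" is a stretch of real genomic DNA (such as part of a mitochondrial genome), and the model's learned distribution, rather than a fixed one, drives each sampling step.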
Importantly, the team also showed that the model can generate long genomic sequences replete with regulatory regions functionally similar to those that govern chromatin accessibility, a key feature of complex, large-scale genomic architecture.
Overall, Evo 2 appears capable of genome-length sequence design at the scale of whole human mitochondrial genomes, minimal bacterial genomes, and yeast chromosomes, including the ability to produce complex epigenomic structures.
The model's developers noted that in addition to genome analysis, Evo 2 could potentially be used to engineer new biological tools and treatments, such as CRISPR gene editors.
Mariano Álvarez, CSO of precision oncology company DarwinHealth, described Evo 2 in an email as "learning, interpreting, and then speaking the language of life."
"The scale and capabilities are impressive and Evo 2 is an anticipated application of language AI models," he said.
Álvarez cautioned that the model still needs to be validated "through extensive experimental work" and that due to its open-source nature, safeguards must be in place to prevent misuse, such as designing new human pathogens.
The Arc Institute said in a statement that human pathogens and some other complex organisms were intentionally excluded from Evo 2's base dataset, and that steps were taken to ensure that the model returns no "productive answers" to queries on these organisms.
Additionally, the authors noted that Evo 2 has yet to be validated in experimental settings, but that plans are in place to do so. Meanwhile, it is already freely available, along with its OpenGenome2 training dataset, to scientists who might want to use it in their own research. The code is publicly accessible at the Arc Institute's GitHub site and is integrated in Nvidia's BioNeMo framework, also on GitHub.