NEW YORK – A California-based research team has developed a generative genomic foundation model, dubbed Evo, for predicting and producing DNA sequences at the prokaryotic genome level.
"Using information learned over whole genomes, Evo learns how small changes in nucleotide sequence affect whole-organism fitness and can generate DNA sequences with plausible genomic architecture more than 1 megabase in length," researchers wrote in a paper published in Science on Thursday.
There, co-senior and co-corresponding authors Brian Hie, a data science and chemical engineering researcher affiliated with the Arc Institute and Stanford University, and Patrick Hsu, with the Arc Institute and the University of California at Berkeley, and their colleagues noted that Evo's current "prediction and generation capabilities span molecular and genomic scales of complexity, advancing our understanding and control of biology."
Building on a deep signal processing architecture for long sequences known as StripedHyena, the team put together the Evo machine learning model for interrogating relatively long stretches of sequence in prokaryotic genomes, along with the interactions between the DNA, RNA, and proteins encoded by these sequences.
The team trained Evo on sequence data for 2.7 million prokaryote or phage genomes, including genome sequences for more than 80,000 bacteria or archaea. The training set did not include viruses known for infecting eukaryotic cells. The investigators then used Evo to translate between DNA, RNA, and protein modalities, to examine the functional systems these molecules form, and to predict the consequences of DNA mutations.
"The central dogma integrates DNA, RNA, and protein with a unified code and predictable information flow, whereas evolution unifies the vastly different length scales of biological function represented by molecules, pathways, cells, and organisms," the authors explained. "Evo learns both of these representations from the whole-genome sequences of millions of organisms to enable prediction and design tasks from the molecular to genome scale."
Beyond the model's predictive capabilities, the investigators outlined a series of experiments highlighting Evo's applicability for designing and producing synthetic CRISPR-Cas molecular complexes and transposon systems capable of prompting specific genetic cuts or transposition events, respectively. For those experiments, the team fine-tuned Evo with published data on more than 72,800 CRISPR-Cas loci and with sequence data from specific mobile genetic element families.
By bringing in insights from the Genome Taxonomy Database and the US Department of Energy's Integrated Microbial Genomes, the team also took a crack at using Evo to predict essential gene sets, uncover problematic changes in those essential genes, and generate new bacterial genomes.
Their results "suggested that Evo can generate genome sequences containing plausible high-level genomic organization at an unprecedented scale without extensive prompt engineering or fine-tuning," the authors reported. "These samples represent a 'blurry image' of a genome that contains key characteristics but lacks the finer-grained details typical of natural genomes."
The team noted that further work will be needed to expand Evo's capabilities to deal with larger and more complex eukaryotic genomes.
"Future models may learn from diverse human and other eukaryotic genomes, using larger context lengths to capture distant genomic interactions over larger genomic scales," Gladstone Institutes and University of California at San Francisco researcher Christina Theodoris noted in an accompanying perspective article in Science.
"The ability to predict the effects of mutations across all layers of regulation in the cell and to design DNA sequences to manipulate cell function would have tremendous diagnostic and therapeutic implications for disease," she wrote, adding that it may eventually be possible to "develop methods to prompt Evo with environmental cues or cell states that contextualize DNA, which is identical across cells within multicellular organisms and yet directs variable functions of each cell across space and time."