
NEW YORK – Generative artificial intelligence, which exploded into the public's consciousness after the November 2022 release of ChatGPT, is starting to find its way into genomics and bioinformatics.
A large language model (LLM) meant to mimic human conversation and writing patterns, ChatGPT was initially often seen as a novelty and used for entertainment purposes. However, businesses and life science organizations have also been harnessing or experimenting with it, or other generative AI tools, to improve efficiency.
Christopher Mungall, head of biosystems data science at Lawrence Berkeley National Laboratory in Berkeley, California, said that generative AI has been "hugely successful" as an assistant to coders and software developers. "It's like autocomplete on steroids," he said. "It's giving you suggestions of lines to write and maybe even entire blocks of code."
In July, Japanese genetic research and testing firm Genesis Healthcare introduced GenesisGaia, a generative AI software platform for omics data. A population genomics platform at heart, GenesisGaia applies generative AI to genomic datasets in an effort to reduce pharmaceutical R&D bottlenecks.
Working with Tokyo-based analytics company Xebral, Genesis has collected more than 31 million data points from public-sector and open-source datasets on genetics, variants, proteins, drug compounds, scientific publications, and clinical trials, and has combined that information with its own collection of more than 400,000 genomes to create GenesisGaia.
Genesis Healthcare is at its core a genetic testing company that has performed more than 1.4 million tests over the last 20 years, primarily on Japanese populations.
Michel Mommejat, the firm's chief innovation officer, said that GenesisGaia works on two levels. First, it facilitates and accelerates early-stage research by taking some of the manual labor out of identifying relationships between proteins, compounds, and diseases. It also provides analytics services, including genome-wide association studies (GWAS).
This GWAS capability means that GenesisGaia might be suitable for non-pharma genomic research in the future.
Genesis Healthcare claims to be the first commercial company to release generative AI software specifically for genomics data, though there are numerous methods deployed or under development in academia. They include GeneGPT, developed by the US National Library of Medicine, which teaches large language models to use the web application programming interfaces (APIs) of the National Center for Biotechnology Information to answer genomics questions, and the Genomic Pre-trained Network (GPN), developed at the University of California, Berkeley, a model that predicts genome-wide variant effects through unsupervised pre-training on genomic DNA sequences. The two academic methods have been described in preprints on arXiv and bioRxiv, respectively.
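The NCBI services that GeneGPT targets are ordinary web endpoints, so the kind of call it teaches a model to compose can be illustrated in a few lines of Python. The sketch below is not GeneGPT's own code, and the gene symbol and query terms are arbitrary examples; it simply shows a lookup against the public Entrez E-utilities search endpoint:

import requests

# Illustrative E-utilities lookup of the sort GeneGPT trains an LLM to issue:
# search NCBI's Entrez Gene database for a human gene symbol.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "term": "CFTR[sym] AND human[orgn]", "retmode": "json"},
)
resp.raise_for_status()
print(resp.json()["esearchresult"]["idlist"])  # Entrez Gene IDs matching the query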
Others in the commercial sector are quickly adding their own spins on predictive and generative technology to address challenges with genomics data.
Ben Busby, a computational biologist at DNAnexus who works on scientific strategy with large biobank customers such as the UK Biobank, pointed to Nvidia's NeMo framework for building generative AI models. It includes a life sciences-specific version, called BioNeMo, that can, among other things, generate new molecules and predict the structure and function of molecules in silico.
This week, in a paper published in Science, investigators at Google DeepMind described AlphaMissense, a machine learning-based tool for predicting the pathogenicity of missense variants in protein-coding genomic regions. Pushmeet Kohli, VP of research in AI for science at Google DeepMind, and colleagues used the AlphaMissense tool to predict likely pathogenic variants among the missense mutations that might arise in more than 19,200 canonical human protein-coding sequences.
First author Jun Cheng, a research scientist and team leader with Google DeepMind in London, explained during a press briefing about the paper that the tool is trained on known protein sequences. "By training, it sees millions of protein sequences and learns what a regular protein sequence looks like."
Cheng noted that the machine-learning model builds on Google DeepMind's previous protein structure prediction tool, AlphaFold. In contrast to AlphaFold, though, AlphaMissense is also trained with missense variant data from humans, nonhuman primates, and other organisms. "When it’s given a protein sequence with a mutation, it can tell us whether this looks bad or not," he said.
Work like this prompted Gonzalo Benegas, a computational biologist in the laboratory of Yun Song at UC Berkeley, to develop GPN, a DNA language model. "We're very inspired by protein language models" like AlphaMissense, said Benegas, lead author of the BioRxiv preprint.
"You could predict pathogenicity of a variant just using the language model probabilities fully unsupervised," Benegas said. "That was the spark. We wanted to try it on the whole genome, not just the protein-coding [part of it]."
He noted that his team unexpectedly found "great results" in terms of variant effect prediction.
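As a rough sketch of the idea Benegas describes, and not the GPN codebase itself, a masked DNA language model can be asked how plausible an alternate allele looks relative to the reference at a given position. The checkpoint name below is a placeholder, and the code assumes single-nucleotide tokens and one leading special token, both of which vary by model:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder name; substitute a real DNA masked language model checkpoint.
MODEL = "example-org/dna-masked-lm"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

def zero_shot_variant_score(sequence, position, ref, alt):
    """Unsupervised variant score: log P(alt) - log P(ref) at a masked site."""
    inputs = tokenizer(sequence, return_tensors="pt")
    site = position + 1  # assumes one leading special token such as [CLS]
    inputs["input_ids"][0, site] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits[0, site]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)  # assumes single-base tokens
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    # More negative scores mean the model finds the alternate allele less
    # expected, a proxy for potential deleteriousness.
    return (log_probs[alt_id] - log_probs[ref_id]).item()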
While the manuscript specifically looks at using AI to analyze the functional impact of genetic variants in plant genomes, the authors wrote that there is applicability to human and animal genomes as well. Benegas said that a human model is under development, but it will be far more complicated than the one for the Arabidopsis thaliana plant the Berkeley team studied for their preprint because the human genome is far larger.
"If we apply the same model directly to humans, it doesn't work as well yet, so we're working on extensions," Benegas said.
"The truth is, we basically took off-the-shelf modeling and applied it, and we got results very quickly," he said. "The model learned a lot of plant genomics with a few days of training."
Generative or predictive?
Busby has been watching generative AI for several years. "People are already using this stuff in small, relatively research-oriented ways," he said, stressing that he was speaking on behalf of himself and not DNAnexus. He recalls seeing such models as long ago as 2018. "It's just that many more people are so super interested that there's a lot of money and initiative behind these things [today]."
There is a distinction between predictive and generative AI and machine learning, UC Berkeley's Benegas said, though that line is blurring.
Google DeepMind's AlphaMissense falls more into the predictive category. Benegas also said that he would consider his GPN technology more unsupervised machine learning than generative AI because the model is predictive rather than trying to generate DNA sequences.
"Right now, we're focusing on prediction problems rather than generation," he said. "But for sure, I think they're related to each other, and I'm very hopeful for both in the near future."
A common characteristic of generative AI is "zero-shot learning," the ability of a computational model to recognize and classify concepts it has not been explicitly trained on with specific, labeled data. "It means that you do not have to feed any examples to the algorithm," Benegas said.
"For standard machine learning, you always have to amass a large, specialized dataset," Mungall explained. "But these generative AI tools have already been pre-trained on a large corpus of literature … which means they essentially come out of the box with quite generalized abilities."
In genomics, for example, a researcher can enter a text query into a generative AI model and the computer ought to be able to find gene-disease relationships without any specialized training on gene-disease associations.
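A minimal sketch of such a query, using OpenAI's Python client, might look like the following; the prompt wording, model name, and disease are illustrative, and any genes the model returns would still need to be checked against curated resources:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot gene-disease query; no task-specific training data is supplied.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a genomics research assistant."},
        {"role": "user", "content": "Which genes have reported associations with "
                                    "familial hypercholesterolemia? List gene symbols only."},
    ],
)
print(response.choices[0].message.content)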
The GPN investigators are trying to improve variant effect prediction, which can help with fine-mapping for GWAS and developing polygenic risk scores.
"One of the grand visions we have would be to have a single model for all the species on earth," Benegas said.
There is currently a divide between algorithms based on language models and those that predict gene expression, he noted, adding that he sees "a future where these models will be multimodal," incorporating medical images as well as phenotypic data.
Hype vs. hope
ChatGPT and generative AI more broadly have been riding the peak of inflated expectations this year and are moving toward the trough of disillusionment on the Gartner Hype Cycle, a measure of technology evolution created by consulting firm Gartner.
"People's individual reactions follow the same kind of pattern," Mungall said. After the initial rush of excitement, users started to realize that they cannot always trust the results.
When people become disillusioned, though, they risk ignoring or missing opportunities to harness the technology in responsible ways.
"I think if we see generative AI as more of a partner in our information ecosystem," Mungall said. "It pairs very naturally with ontologies and knowledgebases. There's a variety of ways you can put these two things together and build some kind of hybrid agent-based system" for tasks such as applying ontologies or validating the output of generative AI.
A recent preprint study from researchers at Stanford University and UC Berkeley found that ChatGPT creator OpenAI's GPT-4 LLM became less accurate between March and June 2023, while its supposedly less advanced predecessor, GPT-3.5, improved.
This highlights the importance of validation and curation to create trust in large language models and generative AI, according to Busby, who noted that pharma companies necessarily err on the side of caution when working with new informatics technologies. "None of these companies are going to do stuff without a validatable outcome," he said.
Busby suggested that reinforcement learning from human feedback, or RLHF, will become central to many generative AI implementations. "RLHF is a big part of this fine model-tuning ecosystem," he said. "I think that's going to be a really huge deal for folks maintaining these models."
Also required, in his opinion, is a technology platform for integrating data and ontologies, perhaps in the form of what Busby called a "metathesaurus."
He does not believe there is a need for a formal standards-making process because ontologies already exist. However, he said there is more work to do to make the ontologies easy to use for AI purposes.
He said he suspects that drug companies and research institutions might develop their own in-house standards to validate AI through RLHF.
Importance of ontologies
Mungall, a principal investigator for many ontology-related projects, including the Gene Ontology Consortium, the Monarch Initiative, and Phenomics First, sees ontologies and knowledgebases as a "source of truth that can help guide generative AI." He led development of a new tool called OntoGPT, a GPT-based framework for matching LLMs with ontologies.
One component of OntoGPT is a knowledge extraction method called Structured Prompt Interrogation and Recursive Extraction of Semantics, or SPIRES. Mungall and colleagues described SPIRES in a preprint posted to arXiv in April.
SPIRES is the flagship OntoGPT tool, designed for extracting knowledge from text. Mungall said he hopes to apply this kind of technology to the extraction of molecular and pathophysiological models of disease. "We're not quite there yet," he said, but a strength of generative AI is its generalizability across many different tasks.
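In broad strokes, a SPIRES-style interaction asks the model to fill in the slots of a predefined schema from free text and then grounds each extracted label to an ontology term. The sketch below is a simplified stand-in rather than OntoGPT's actual templates or code; the schema, prompt wording, and parsing are hypothetical:

from openai import OpenAI

client = OpenAI()

# Hypothetical two-field template; OntoGPT defines its own richer schemas.
SCHEMA = "disease: <disease name>\ngenes: <semicolon-separated gene symbols>"

def extract(passage):
    prompt = (
        "From the passage below, fill in this template, one field per line.\n"
        + SCHEMA + "\n\nPassage: " + passage
    )
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    fields = dict(line.split(":", 1) for line in reply.splitlines() if ":" in line)
    # In SPIRES, each extracted label is then grounded to an ontology identifier
    # (for example, a MONDO disease term) in a separate step.
    return {key.strip(): value.strip() for key, value in fields.items()}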
SPIRES was built on OpenAI's models, which are not as open as the company's name might suggest because accessing the API requires a subscription, and Mungall said he does not know what data OpenAI trains its models on.
If he had his way, Mungall would rather depend on truly open generative AI, and he has begun looking at other technology, including an open-source LLM from Hugging Face, a for-profit company. However, he said that OpenAI's GPT-4 is a more advanced generalizable LLM than anything he has seen elsewhere.
Shawn O'Neil, a data engineer at the Translational and Integrative Sciences Lab (TISLab) at the University of Colorado School of Medicine and training coordinator with the US National Institutes of Health's National COVID Cohort Collaborative, recently joined the Monarch Initiative, an open-source bioinformatics platform for matching phenotypes to genotypes.
His first project for Monarch was to lead development of Monarch Initiative Explorer, a plug-in to ChatGPT that adds a biomedical knowledge graph to the generative AI system. "I thought that seemed like an obvious thing to do, to hook it up to the Monarch knowledge graph and see what we can do," O'Neil said, reflecting the freewheeling nature of innovation in an emerging technology like generative AI.
Monarch Initiative Explorer is currently only available to subscribers of ChatGPT Plus, the paid version of OpenAI's popular chatbot, but O'Neil said that the integration was rather seamless, thanks to the OpenAI API.
With this connection, if a ChatGPT user asks, for example, which genes are associated with cystic fibrosis, the program can call the Monarch server to search for gene-disease and symptom-disease associations, he explained. The plug-in returns a list of search results, including Monarch's ontology identifiers, which can aid in looking up additional information about cystic fibrosis in the medical literature, for instance by taking users directly to the Online Mendelian Inheritance in Man (OMIM) page for the disease.
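The pattern is straightforward to sketch: the chatbot recognizes that a question needs curated data and issues a web request on the user's behalf. The endpoint, parameters, and helper function below are hypothetical placeholders, not the Monarch API's documented interface; the point is grounding a chat answer in a curated knowledge graph:

import requests

def disease_gene_associations(disease_curie):
    # Hypothetical knowledge-graph lookup of the kind the plug-in performs;
    # the real Monarch API defines its own paths, parameters, and fields.
    resp = requests.get(
        "https://example-knowledge-graph.org/associations",
        params={"subject": disease_curie, "predicate": "gene_associated_with_condition"},
    )
    resp.raise_for_status()
    return resp.json()

# e.g., disease_gene_associations("MONDO:0009061")  # cystic fibrosis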
O'Neil called the plug-in "pretty experimental" currently, in part because it needs improvements to the user interface but also because it can only handle so much information at once. "If a disease, for example, has 1,000 genes that are associated with it, that's too much data for the AI at this stage to summarize completely," he said.
"In this kind of interface, I think that we need to figure out how to identify which of those [genes] are most relevant for the query the user is asking," O'Neil added.
Safety first
Safety is also a concern when it comes to machine learning and generative AI. "A lot of these AI [systems] will dispense advice as if they are a practitioner of medicine or law or something like that," Busby said.
While generative AI algorithms can be unreliable, experienced developers should be able to spot and correct errors.
"If it's only 50 percent correct, that can be a huge timesaver," Mungall said. He explained that generative AI cuts out the "tedium" of the kind of pattern-based work typical of ontology development.
Mungall said that he and his colleagues are only using generative AI for research ontologies, not for the type of diagnostic work that might make such technology subject to regulatory review.
Safety is also a reason why Mungall is not a fan of OpenAI's closed model.
"There's a lot of misinformation in general, but in the area of genomics in particular, there is a lot of misinformation about genetics and whether intelligence is hereditary and segregated in different groups, and so on," he said. "It really doesn't sit well that we don't know what their curation process is for what they include and what they exclude."
Japan's Genesis Healthcare is being vigilant. While stringent Japanese privacy laws have forced it to put AI guardrails in place, Mommejat said that the company also has its own ethics committee to oversee privacy and security practices. The firm only uses patented or validated generative AI models, with data scientists on hand to check the accuracy of machine outputs, he added.
"I do think that there's need for more rigorous evaluation of how good the results are," TISLab's O'Neil said. That might require creation of benchmarking applications to, in the case of cystic fibrosis, make sure the AI comes up with results about CFTR mutations.
One benchmarking tool O'Neil is evaluating is GeneTuring, an experimental method that its creators described in a bioRxiv preprint.
O'Neil said he has not seen any fear of a "Skynet"-type scenario — from the "Terminator" movie franchise — where machines become sentient and threaten humanity.
"I think that there is a lot of justifiable concern about how we do this in the right way, for accuracy, to make sure that AI isn't either deliberately or inadvertently giving misinformation," he said. "But for the most part, there has been very little discussion in the circles that I run in about existential threats."
"I think generative AI can help a lot with interpretation of these big, high-quality, curated knowledgebases like Monarch to help make the connection between us regular humans and the really powerful information that's stored in the ontologies," O'Neil said.
Mungall said he sees another opportunity in genome informatics workflows. "Often, if you want to analyze a sequence or a set of sequences to achieve some result, you've got to have a lot of domain knowledge expertise in terms of which tools to use, how to take the output of one tool, and put it in as the input of another and essentially chain together a whole workflow," he said. "I think we're going to see more and more workflow assistant managers," similar to ChatGPT Assistant, a Google Chrome browser extension.
Mungall said he is also doing some preliminary work combining GPT technology with the Exomiser and Genomiser tools for identifying pathogenic variants in noncoding regions, adding that there could be an initial release in the next few weeks.
Another project underway at Lawrence Berkeley National Laboratory is called SPINDOCTOR, for Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting. That method, described in a preprint posted in May, applies GPT models to summarizing gene-set functions in support of enrichment analysis.
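As a loose illustration of that idea, and not the published SPINDOCTOR method, an LLM can be prompted to summarize what a set of genes has in common. The gene list and prompt below are examples; the preprint's ontology-based variant additionally supplies curated term annotations for each gene rather than relying on the model's memory alone:

from openai import OpenAI

client = OpenAI()

# Example gene set (p53/DNA-damage-response genes) and a simple summarization prompt.
genes = ["TP53", "MDM2", "CDKN1A", "ATM", "CHEK2"]
prompt = (
    "Summarize, in two sentences, the most prominent shared biological function "
    "of these human genes: " + ", ".join(genes)
)
summary = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
).choices[0].message.content
print(summary)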
"Not surprisingly, we find that using the ontology-based summarization technique gives the best results," Mungall explained. "But there's still potential there for using this as an aid to assist in interpretation of genomics experiments."