CHICAGO – Bioinformaticians at the University of Texas MD Anderson Cancer Center have developed an automated platform that uses natural-language processing augmented by artificial intelligence to simplify many standard omics analysis processes and to make analytics more efficient, intuitive, and collaborative.
The system, called DrBioRight, features a chat-style web interface touted as "user-friendly" with a single field for data input and output, backed by AI technology. "All the interactions with users are based on human languages," according to a recently published paper in Cancer Cell.
Corresponding author Han Liang, deputy chair of MD Anderson's Department of Bioinformatics and Computational Biology, said DrBioRight goes beyond typical NLP to what he called natural-language analytics and understanding.
"Natural-language processing usually just helps identify certain key words," he explained. "With us, we understand your sentence, and then we [perform] some action, and then we give you some feedback. Then people essentially begin to talk to DrBioRight."
Liang said his team has historically provided bioinformatics support for colleagues at the Houston cancer center. The usual process is for researchers to set up a meeting with him and ask questions, after which the bioinformatics team identifies the relevant datasets, writes code, and generates and returns results.
"Usually, turnaround time is very long," often weeks or months, Liang said. "I recognized that may not be efficient."
He also realized that although each research project is unique, there are some common elements to the analysis, including data science and pathways.
"I was thinking, you can start out doing this person-to-person interaction by email or phone, but don't we just set up this natural language process [to help us] understand the question?" Liang explained. "After that, we can generally identify the datasets, call the scripts, and return the results using this dialog format."
DrBioRight has what its creators called a "flexible modularized framework, based on which a new computational analysis can be added with just two simple steps," according to the Cancer Cell article. Those steps include choosing and curating relevant modules from datasets including the Encyclopedia of DNA Elements (ENCODE) project, the now-concluded Genotype-Tissue Expression (GTEx) effort, The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia, and the International Cancer Genome Consortium.
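The paper does not spell out the code behind that framework, but the general idea of a pluggable analysis registry can be sketched in a few lines of Python. Everything named here (the `ANALYSIS_MODULES` registry, the `register_analysis` decorator, and the toy `mean_expression` module) is a hypothetical illustration, not DrBioRight's actual implementation:

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping an analysis name to the function that runs it.
ANALYSIS_MODULES: Dict[str, Callable[..., dict]] = {}


def register_analysis(name: str):
    """Decorator that adds an analysis function to the registry under `name`."""
    def decorator(func: Callable[..., dict]) -> Callable[..., dict]:
        ANALYSIS_MODULES[name] = func
        return func
    return decorator


# Adding a new analysis: write the function, then register it.
@register_analysis("mean_expression")
def mean_expression(values: List[float]) -> dict:
    """Toy analysis: average a list of expression values."""
    return {"analysis": "mean_expression", "mean": sum(values) / len(values)}


def run_analysis(name: str, **kwargs) -> dict:
    """Look up a registered module by name and run it with the given inputs."""
    if name not in ANALYSIS_MODULES:
        raise KeyError(f"No analysis module registered as {name!r}")
    return ANALYSIS_MODULES[name](**kwargs)


result = run_analysis("mean_expression", values=[2.0, 4.0, 6.0])
print(result)
```

In a design like this, the dispatcher never needs to change when a module is added, which is one plausible way to keep the cost of a new analysis low.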
They singled out ENCODE and TCGA in particular as ever-growing sets of "rich" omics data that have challenged data analysts to derive usable insights.
With DrBioRight, MD Anderson has built 10 analytics modules to process and visualize various datasets, then trained those modules with natural language. DrBioRight also supports analysis of raw next-generation sequencing data.
The web interface provides dialog-type interactions, so users can check on the analysis at each step, from quality control to read mapping to gene enrichment analysis. Researchers also can analyze the reproducibility of previously published studies, a feature that the MD Anderson team demonstrated in the paper.
Users type in an analysis query, such as, "perform survival analysis in breast cancer on TP53 gene expression." The DrBioRight system then identifies a specific analysis to perform, asking the user to confirm if this matches the intended query before the platform schedules the task.
"Whenever you input a sentence, we can identify whether this keyword represents a cancer type or gene name or specific analysis," Liang explained.
Once the job is confirmed and scheduled, the cloud-based computing nodes call the datasets and perform the analysis to, for example, predict the degree of correlation between TP53 gene expression level and the survivability of breast cancer.
After the analysis is complete, DrBioRight pulls up a visualization module from the cloud to deliver the results to the output area, usually in the form of an interactive plot or chart. Users can then rate the quality of the work so DrBioRight's creators can refine the NLP and AI in the system.
"This feedback helps us continuously improve the accuracy our model," Liang said.
Liang likened the dialog to an online chat on a platform such as WhatsApp or Facebook Messenger. "You can just open the software and type your question just like a question to a friend, and DrBioRight will understand the question, do the analysis, and return the result in the dialog," he said.
If something is missing or DrBioRight does not fully understand a gene name, the system can ask the user for clarification. "It's just like you are talking to a bioinformatician collaborator," he explained. "You just Q&A. You have to confirm that everything looks fine, then DrBioRight does the analysis and returns the results."
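The confirm-or-clarify loop Liang describes can be sketched as a simple check on which parts of a request are filled in. This is a hedged illustration only: the `REQUIRED_FIELDS` tuple, the `next_reply` function, and the wording of the replies are all hypothetical and not taken from DrBioRight:

```python
from typing import Dict, Optional

# Hypothetical set of fields a request needs before it can be scheduled.
REQUIRED_FIELDS = ("analysis", "cancer_type", "gene")


def next_reply(request: Dict[str, Optional[str]]) -> str:
    """Ask for missing details, or ask the user to confirm a complete request."""
    missing = [field for field in REQUIRED_FIELDS if not request.get(field)]
    if missing:
        readable = ", ".join(field.replace("_", " ") for field in missing)
        return f"Could you clarify the following: {readable}?"
    return (
        f"Run {request['analysis']} on {request['gene']} "
        f"in {request['cancer_type']}? (yes/no)"
    )


incomplete = next_reply(
    {"analysis": "survival analysis", "cancer_type": "breast cancer", "gene": None}
)
complete = next_reply(
    {"analysis": "survival analysis", "cancer_type": "breast cancer", "gene": "TP53"}
)
print(incomplete)
print(complete)
```

Only once the user answers the confirmation would a system built this way schedule the job.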
According to the MD Anderson bioinformaticians, early omics analytics software was written in general-purpose programming languages, including Python, R, and Perl, requiring users to have at least a minimal set of programming skills, a limiting factor in genomics research laboratories.
Later, some web-based and bioinformatics-specific platforms arose. "These tools, however, are of limited use, as they only support a predefined set of analyses," the authors wrote in Cancer Cell.
Graphics-based "module hubs" such as Galaxy and GenePattern and "interactive data portals" including cBioPortal and GTEx simplified the work for end users but still left some holes.
"Despite these impressive efforts, users still have to spend considerable time identifying appropriate tools and learning distinct user interfaces and procedures, in addition to keeping track of the status and updates for the quickly evolving tools and datasets," they said in the paper.
The authors said that data generated from high-throughput omics technologies has "ushered in a golden era for biomedical research while at the same time presenting us with unprecedented challenges in digesting these data and formulating new hypotheses."
They called DrBioRight an early attempt at applying NLP and natural-language understanding to managing large bioinformatics pipelines. "Such an analytics platform with the aforementioned features will generate a new research paradigm that maximizes the utility of omics data, accelerates biomedical research, and ultimately leads to better health for everyone," the authors wrote.
They said that all "next-generation" data analytics should have five features: natural-language understanding, a form of NLP that addresses machine comprehension of text or spoken words; transparency of datasets, methodologies, and algorithms; compatibility with mobile devices and social media such as chat interfaces; crowdsourcing of data and algorithm development; and artificial intelligence.
Eventually, according to the paper, the technology could even be integrated with lab technology to create a "self-governing system, where the analytics even guides robots to generate new research data to test specific hypotheses."
Since the MD Anderson researchers submitted the paper for publication, they have continued to enhance DrBioRight with an analytical report showing where the system gathered its data from and how it processed the data. This, according to Liang, simplifies reproducibility processes.
In the future, Liang wants to open DrBioRight to the general informatics community so bioinformaticians can contribute their own modules to the platform.
He also would like to make the system more smartphone-friendly by building a mobile app that can mine interactions on social media to gauge how popular newly published medical knowledge is.
"Now, DrBioRight just does the analysis. In the future, we want DrBioRight to say this result is consistent, that this kind of pattern has already been reported in the three papers," Liang said.
This would help cut down on the need for time-intensive manual literature reviews.
"Our ultimate goal is to try to make DrBioRight become an interactive partner ... so that researchers can talk to DrBioRight just like they talk to a very capable bioinformatics collaborator," Liang said.