CHICAGO (GenomeWeb) – French pharmaceutical firm Sanofi has turned to natural language processing to sort through massive amounts of genomic data and published medical literature in its quest to find new biomarkers for multiple sclerosis.
Specifically, the drug maker enlisted Linguamatics, a Cambridge, UK-based vendor of a natural language processing platform called I2E that extracts information from unstructured text, to find associations between human leukocyte antigen (HLA) alleles, diseases, and hypersensitivity to potential treatments. A proof-of-concept study, which Sanofi discussed at last month's Bio-IT World Conference in Boston, found more than double the number of previously known disease associations.
Since 2016, Linguamatics has offered I2E with a query language called the Extraction and Search Language (EASL), which allows text mining queries to be described and written in a human-readable text format. EASL queries can be generated outside the I2E platform and can support custom interfaces and enhanced workflow automation.
The text might come from literature searches, trial data, patient forum posts on social media, public databases, or regulatory documents. "In order to make decisions, you need to pull information out from unstructured text," explained Jane Reed, head of life science strategy for Linguamatics.
Some Linguamatics customers use I2E to comb Twitter to look for clues about side effects and other patterns, such as during an influenza outbreak in the UK several years ago. "Some of the pharma [companies] wanted to monitor the reaction of the general population to that," Reed explained.
Drug companies also can glean information from calls to toll-free patient hotlines. "All those things need to be captured to give the right response back, but also to make sure there is something they can [feed] back into the product development," Reed said.
Reed noted that next-generation sequencing technologies produce long lists of variants that researchers must wade through. "There's a lot of structured information that you can use to try and get annotations, but if you want to get the most up to date [information] from the literature," there needs to be order among all the unstructured data, she said.
For its MS biomarker work, an internal project, Sanofi turned to I2E to build a catalog of HLA allele annotations by mining millions of pieces of scientific literature. The pharma company worked with Linguamatics to establish a workflow for HLA typing and analysis, based on whole-exome sequencing. This, the partners said, has resulted in the identification of more than 400 HLA-related alleles.
"We are using I2E to extract information from the literature," including 22 million abstracts indexed on PubMed, and other texts from ClinicalTrials.gov, according to Dongyu Liu, associate director of translational sciences at Sanofi.
"It's easy to plug in any ontology or dictionary," Liu said. "You can extract any domain knowledge."
Liu said that Sanofi is using I2E in the early stages of research, including target discovery, mapping of gene mutations to diseases, and drug repositioning. The company is looking for gene-disease, gene-mutation, and gene-drug associations, among other correlations.
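The kind of association mining described above can be illustrated in miniature. The following Python sketch is not I2E or EASL, and it does not reflect Sanofi's actual pipeline. It is a simplified stand-in that assumes a plain regular expression for HLA allele names, a tiny hand-written disease list (`DISEASES`), and sentence-level co-occurrence as the signal. The function name `extract_associations` and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

# Assumed pattern for HLA allele names such as HLA-DRB1*15:01 or HLA-B*57:01.
HLA_ALLELE = re.compile(r"\bHLA-[A-Z]+[0-9]*\*\d{2}(?::\d{2,3})*")

# Hypothetical mini-dictionary; a production system would plug in a curated ontology.
DISEASES = ("multiple sclerosis", "narcolepsy", "celiac disease")


def extract_associations(abstracts):
    """Count allele-disease pairs that are mentioned in the same sentence."""
    counts = Counter()
    for text in abstracts:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            alleles = set(HLA_ALLELE.findall(sentence))
            lowered = sentence.lower()
            diseases = {d for d in DISEASES if d in lowered}
            for allele in alleles:
                for disease in diseases:
                    counts[(allele, disease)] += 1
    return counts


if __name__ == "__main__":
    # Invented example text, not taken from any real abstract.
    sample = [
        "HLA-DRB1*15:01 is associated with multiple sclerosis. "
        "HLA-DQB1*06:02 is linked to narcolepsy.",
        "Carriers of HLA-DRB1*15:01 had higher multiple sclerosis risk.",
    ]
    for (allele, disease), n in extract_associations(sample).most_common():
        print(allele, disease, n)
```

Co-occurrence counting of this sort only flags candidate pairs. It cannot tell a positive association from a negated or speculative one, which is part of what a full natural language processing platform is meant to handle.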
"We have been working with Linguamatics to develop different, specific queries to pipelines to extract the information and then feed it into our database" or others, Liu reported.
"There are a lot of databases out there, but not all of them are available for the use case" of HLA for multiple sclerosis, Liu said. In fact, Sanofi had none. "[Natural language processing] is the perfect device to use," he said.
Sanofi has applied I2E to different areas before the current MS study, Liu said, though this is the first allele use case. In this instance, HLA has been able to predict side effects of the Sanofi Genzyme drug Lemtrada (alemtuzumab) in MS patients, Liu said.
The study has not moved into clinical trials yet, nor is there a definitive start date, according to Liu, but Sanofi is using the NLP technology across multiple research groups and projects at various stages.
"It's not a matter of only using I2E. There are a lot of these molecular profiling data, from all different data we are trying to find biomarkers," Liu said. "For any of these kind of big projects, we can use I2E from Linguamatics to help to extract a lot of very useful information."
Sanofi eventually expects to be able to feed the results of data extraction into its machine-learning algorithms.
"There is a lot of data there," Liu said. "We use I2E to help to find a piece of the puzzle, like what's the relationship between the allele and the disease."
For this project, Sanofi included several hundred alleles, according to Liu. "Theoretically, you can do all the genes," he said.
Liu said that he is also working with colleagues at Sanofi to incorporate NLP into studies looking for associations between metabolites and certain diseases.
To address the European Union's new General Data Protection Regulation, Sanofi is piloting I2E as a way of identifying which patient-specific information is private and therefore needs to be protected.
Additionally, Sanofi is looking to use I2E to extract information from electronic health records for phenome-wide association studies, Liu said.