Text Mining at Sanofi for Genotype-Phenotype Associations in Multiple Sclerosis
This webinar discusses how Sanofi used literature mining to annotate the association of human leukocyte antigen (HLA) alleles with diseases and drug hypersensitivity as part of a multiple sclerosis (MS) biomarker discovery project.
For any drug development project, it is important to have a comprehensive understanding of the genetic associations for the disease of interest. While public databases of genomic variants provide valuable information, they can leave many gaps in the biological knowledge. For its internal MS biomarker project, Sanofi needed a comprehensive catalogue of HLA allele annotations and turned to Linguamatics I2E to text-mine the scientific literature.
The HLA region is the most polymorphic region of the human genome. HLA alleles have been associated with more than 40 different autoimmune diseases, various types of cancer, infectious diseases, and adverse drug events. However, there are no known resources that systematically annotate the association of HLA alleles with diseases.
For the Sanofi MS project, a workflow was established for HLA typing and analysis based on whole-exome sequencing. This workflow identified more than 400 HLA alleles. The Linguamatics I2E platform was then used to search the literature and annotate the association of these HLA alleles with diseases and drug hypersensitivity. The project more than doubled the previous number of disease associations, and the curated annotations were fed into a knowledge base for broad use within the Sanofi team.
What will you learn?
- How natural language processing (NLP) text mining can extract structured data from unstructured text in scientific papers
- How text mining is used at Sanofi to extract the most up-to-date published knowledge for a gene or group of genes, including information on diseases and specific allele variations
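
To give a sense of what "structured data from unstructured text" can mean in practice, here is a minimal Python sketch that pairs HLA allele mentions with disease terms found in the same sentence. It is an illustration only and does not represent how Linguamatics I2E or the Sanofi workflow operates: the regular expression, the `DISEASE_TERMS` list, the `extract_associations` function, and the sample sentences are all hypothetical, and simple co-occurrence is a much cruder technique than a full NLP platform.

```python
import re
from collections import defaultdict

# Simplified pattern for allele names such as HLA-DRB1*15:01 or HLA-B*57:01.
# Assumption: real nomenclature has more variants than this covers.
ALLELE_PATTERN = re.compile(r"HLA-[A-Z]+[0-9]*\*[0-9]{2,3}(?::[0-9]{2,3})*")

# Hypothetical term list; a real project would use a curated vocabulary.
DISEASE_TERMS = ["multiple sclerosis", "abacavir hypersensitivity"]


def extract_associations(sentences):
    """Map each allele to the disease terms that co-occur with it in a sentence."""
    associations = defaultdict(set)
    for sentence in sentences:
        alleles = ALLELE_PATTERN.findall(sentence)
        lowered = sentence.lower()
        terms = [term for term in DISEASE_TERMS if term in lowered]
        if not terms:
            continue  # no known disease term in this sentence
        for allele in alleles:
            associations[allele].update(terms)
    return dict(associations)


# Illustrative sentences written for this sketch, not taken from the webinar.
sample_sentences = [
    "HLA-DRB1*15:01 is a major genetic risk factor for multiple sclerosis.",
    "Carriers of HLA-B*57:01 are at risk of abacavir hypersensitivity.",
    "The HLA region is highly polymorphic.",
]

for allele, terms in extract_associations(sample_sentences).items():
    print(allele, sorted(terms))
```

Each output row is a structured (allele, disease) record that could be reviewed by a curator before being loaded into a knowledge base. Co-occurrence alone cannot distinguish a positive association from a negated or protective one, which is one reason curated review remains necessary.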