CHICAGO (GenomeWeb) – When health analytics and contract research giant IQvia acquired Linguamatics in January, it purchased a company with roots in the early days of what is now known as natural language processing, but also one in transition.
Last year, Cambridge UK-based Linguamatics introduced a scientific search engine called iScite to augment an NLP platform called I2E that extracts information from unstructured text.
The iScite search interface is kind of an "entry-level" tool, according to John Brimacombe, who leads the company's NLP practice. "The business has been very heavily focused on large pharma since inception, and that now gives us an entry for small pharma," he said.
"We've got enormous search power. The question is how do we reveal that to the audience? iScite is designed to sort of lead people on [to the NLP platform]," Brimacombe said.
"People who aren't informaticians largely search using 'keywordese.' They will try a few terms and see what they get and iterate. That's OK as a start, but in a very large knowledge space, you are just going to get hundreds of thousands of hits back," he noted.
That many results can be overwhelming and can push people back to manual extraction.
"What we try to do with iScite is walk them from starting with keyword terms which they are already familiar with, but revealing natural language queries and matching them to queries which are going to use the power of the linguistic engine in terms of understanding concepts, understanding ontologies, and understanding relationships between terms," Brimacombe said.
For example, a user might be looking for information on a disease-associated gene mutation, but queries tend not to be as precise as search engines are built to handle.
"People aren't used to precisely saying what they mean," Brimacombe explained. Does a search for "acetaminophen" mean exactly "acetaminophen," or should it look for misspellings, related trade names, and various formulations?
"In most keyword search engines, you will get just the direct hits for 'acetaminophen,' which will be a tiny portion of the knowledge base, whereas if we can walk you up to an ontological search, then you're going to get massively improved results," Brimacombe said.
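The difference Brimacombe describes can be illustrated with a minimal sketch. It is not Linguamatics code: the synonym table, the sample documents, and the function names `keyword_search` and `ontology_search` are all hypothetical, and a real ontology would be far larger and would also cover misspellings and formulations.

```python
# Hypothetical, hand-written mini-ontology for illustration only.
# Paracetamol is another name for acetaminophen; Tylenol is a trade name.
SYNONYMS = {
    "acetaminophen": {"acetaminophen", "paracetamol", "tylenol"},
}

# Made-up sample sentences standing in for a document collection.
DOCUMENTS = [
    "Patient took Tylenol for fever.",
    "Paracetamol overdose can cause liver injury.",
    "Acetaminophen 500 mg was administered.",
    "Ibuprofen was given for pain.",
]


def keyword_search(term, docs):
    """Return documents that contain the literal term (case-insensitive)."""
    term = term.lower()
    return [d for d in docs if term in d.lower()]


def ontology_search(term, docs, synonyms=SYNONYMS):
    """Expand the term to its known synonyms, then match any of them."""
    terms = synonyms.get(term.lower(), {term.lower()})
    return [d for d in docs if any(t in d.lower() for t in terms)]


if __name__ == "__main__":
    print(len(keyword_search("acetaminophen", DOCUMENTS)))
    print(len(ontology_search("acetaminophen", DOCUMENTS)))
```

On this toy collection the literal search finds only the sentence that spells out "acetaminophen," while the expanded search also picks up the trade-name and alternate-name mentions.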
The Linguamatics customer base includes pharmaceutical companies, the US Food and Drug Administration, hospitals, cancer institutes, and academic research centers.
"We work with payors and providers, usually looking at predictive risk models, risk stratification, and looking at social determinants of health," said Simon Beaulah, the company's senior director of healthcare. "A lot of the information is trapped in clinical notes, and [our technology] is helping free the data up."
Beaulah said that Linguamatics helps improve the efficiency of its clients' work processes by extracting data to feed predictive modeling.
Lately, Linguamatics has been supporting phenotype-genotype matching for populations. "If we can take a large amount of deidentified population health data and start to associate that with a genetic record, then we can start to model outcomes, model cause and effect," Brimacombe said. "That's a significant one."
The Linguamatics technology goes beyond sources such as ClinVar and the Human Gene Mutation Database, which tend to contain data from large-scale studies, usually involving common ailments. Those are not particularly useful for rare diseases and variants of unknown significance, according to Beaulah.
"You do a search and you get thousands of hits back again, and you've got to read the papers. That's the only way to get into this," Beaulah said.
The I2E platform scans literature for lists of phenotypes and associated variants, then normalizes the findings to reduce the amount of "noise" returned in searches, according to Beaulah.
He discussed one yet-unpublished study the company participated in, noting that there were 77 different ways to express one variant for mucopolysaccharidosis type II, commonly known as Hunter Syndrome. Linguamatics was able to normalize those findings. "This is a really important way that we help in a rare disease to correlate that," Beaulah said.
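To give a sense of what normalizing variant mentions involves, here is a rough sketch that maps several spellings of a single amino acid substitution to one HGVS-style protein-level form. It is an assumption-laden illustration, not the I2E method: the function `normalize_protein_variant` and the example strings are hypothetical, the example substitution is not taken from the study, and only simple substitutions are handled.

```python
import re

# Standard one-letter to three-letter amino acid codes.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Lowercased three-letter code -> canonical capitalization.
THREE = {v.lower(): v for v in AA3.values()}

# Optional "p." prefix, optional parentheses, then residue, position, residue.
PATTERN = re.compile(r"^(?:p\.)?\(?([A-Za-z]{1,3})(\d+)([A-Za-z]{1,3})\)?$")


def _to_three(code):
    """Map a one- or three-letter residue code to three-letter form, or None."""
    if len(code) == 1:
        return AA3.get(code.upper())
    return THREE.get(code.lower())


def normalize_protein_variant(text):
    """Return a canonical 'p.Xxx123Yyy' string, or None if unrecognized."""
    match = PATTERN.match(text.strip())
    if not match:
        return None
    ref = _to_three(match.group(1))
    alt = _to_three(match.group(3))
    if ref is None or alt is None:
        return None
    return f"p.{ref}{match.group(2)}{alt}"


if __name__ == "__main__":
    # Illustrative spellings of one substitution; not drawn from the study.
    mentions = ["R468W", "p.R468W", "Arg468Trp", "p.(Arg468Trp)", "ARG468TRP"]
    print({normalize_protein_variant(m) for m in mentions})
```

Collapsing the five spellings into a single key is what lets hits from different papers be counted as the same finding. A production system would also have to handle DNA-level notation, free-text descriptions, and legacy numbering, which this sketch does not attempt.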
"This is great for relatively early-stage work in life sciences, but then we're doing the same thing in the precision medicine side in healthcare," he added. To date, Linguamatics has signed up one academic medical center in precision medicine, but he declined to name it.
On the clinical side, the company has noticed that teaching hospitals have been exploring the Human Phenotype Ontology as a means of identifying phenotypes associated with variants. But many are manually matching phenotypes to genotypes.
"It's manual extraction of those records, teams of people reading clinical notes, pulling the phenotypes out, and then using that in literature," Beaulah said. "So, it's really cool seeing how both sides — life science doing the drug discovery and clinical care — can use the same technology."
Linguamatics dates to the early 2000s, according to Brimacombe, who was employee number seven. The NLP business unit now has a staff of about 120.
Brimacombe and other early Linguamatics employees, including some of the founders, studied computational linguistics at Cambridge University in the 1990s under Karen Sparck Jones, whose work established the basis for modern search engines. The late Sparck Jones recently was featured in the New York Times' "Overlooked" series of "remarkable people" for whom the newspaper didn't write a proper obituary at their time of death.
"She did some of the founding theory of how you ingest language," Brimacombe noted. Some of her students and protégés went on to create Linguamatics, and Brimacombe joined them a couple of years later.
Since then, algorithms have improved, but so has computing power. "It is a journey of continued research and algorithmic improvement combined with the underlying processor availability," Brimacombe said. "The algorithms were great, but many of them didn't seem tractable when we began."
Now, like so many other tech companies, Linguamatics turns to Amazon Web Services for cloud computing power as necessary. "You're really going to get enormous amounts of compute at a sensible price on demand. That's also been very transformational," he said.
Notably, Linguamatics is helping Sanofi sort through massive amounts of genomic data and published medical literature in a quest to find new biomarkers for multiple sclerosis.
Although new parent company IQvia entered the genomic informatics market a month ago with the introduction of E360 Genomics, an integrated genomic-clinical build of its established E360 data-mining platform, IQvia is for now allowing Linguamatics to operate fairly independently.
Still, there is plenty of room for Linguamatics to evolve. It already has, but there is a long way to go. "We're still on a long journey," Brimacombe said.