Incyte’s Proteome division has spent nearly a decade manually curating more than 80,000 proteins to create its BioKnowledge Library. But this era of human-only annotation has come to an end: The Beverly, Mass.-based group recently licensed text-mining software from Reel Two to help streamline the painstaking task.
Laura Selfors, principal bioinformatics scientist at Incyte’s Proteome division, said that Reel Two’s Classification System “has not eliminated any step in our process.” Instead, it will be used to streamline the company’s current curation pipeline, which relies on a distributed team of PhD-level scientists who read through reams of scientific literature to update the database, she said.
Reel Two’s software is being implemented in the so-called “triage” stage of the curation process, in which Proteome’s core staff sifts through thousands of journal articles to find the most relevant papers to pass along to the curators. The company already uses an in-house filtering system that relies primarily on “sophisticated keywording” during this stage, Selfors said, “but we find that Reel Two is kind of a last step to take what we find as good and prioritize it.”
The company licensed the software in April, and Selfors said that it’s still too early in the implementation process to quantify any time or cost savings. She added, however, that the group carried out a detailed technical and financial analysis before purchasing the software, “and we felt that it would pay for itself relatively quickly.”
Neither Incyte nor Reel Two provided financial details for the agreement.
Text-Mining Test Drive
Matt Crawford, director of curation technologies at Incyte, said that text-mining software wasn’t the company’s first choice when it decided to add a new level of filtering capability to its in-house system. “We thought, to be perfectly honest, that this would be easy, and all we’d have to do is get a spam filter from the public domain,” he said. The filter — which enables users to tag incoming messages as “spam” or “not spam” based on keywords or phrases — “worked pretty well,” Crawford said, but it had its limitations. “It’s sort of a binary thing — this is spam or it isn’t spam — and in some ways it’s more important to know what kind of spam it is, and to have several different categories that you can sort the information into, so you get a much more accurate picture of it,” he said. Selfors added another major drawback: “When the spam filter comes across something that it doesn’t know, it calls it junk, which in our world is really bad.”
After an evaluation process that compared the company’s in-house system with the spam filter and Reel Two’s system, Incyte found that its internal system actually worked better than the spam filter, “but what Reel Two was really good at doing was giving you back a confidence score for each reference,” Selfors said.
Reel Two’s Classification System is based on a supervised machine-learning approach that builds predictive models to classify documents in real time. Users first train a model on an initial set of documents that fall into several known categories, quickly assigning each reference to the proper grouping so that the program “learns” to recognize the text-based patterns that distinguish each category. The software then classifies new data as it is run through the model, assigning each document a confidence score that reflects how well the document fits its assigned category. This score can be used to set the cut-off threshold for a set of ranked documents, or can serve as a quick gauge of the software’s performance. To continually improve the system’s performance, users can easily reassign documents that the software miscategorizes.
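The train, classify, score, and threshold loop described above can be illustrated with a minimal sketch. The article does not disclose Reel Two's algorithm, so this example substitutes a toy naive Bayes classifier. The class name, the category labels, the sample documents, and the 0.6 cut-off are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


class NaiveBayesTriage:
    """Toy multinomial naive Bayes. A stand-in, NOT Reel Two's algorithm."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of docs
        self.vocab = set()

    def train(self, text, category):
        """Add one labeled document. Calling this with a corrected label is
        how a reassigned document would feed back into the model."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return (best_category, confidence), with confidence in [0, 1]."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[cat].values())
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                # Laplace smoothing so unseen words do not zero out a category.
                count = self.word_counts[cat][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[cat] = score
        # Normalize the log scores into posterior probabilities.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def triage(model, documents, threshold):
    """Rank documents by confidence and flag whether each clears the cut-off."""
    scored = [(doc, *model.classify(doc)) for doc in documents]
    scored.sort(key=lambda item: item[2], reverse=True)
    return [(doc, cat, conf, conf >= threshold) for doc, cat, conf in scored]


# Hypothetical training set with several categories rather than a binary label.
model = NaiveBayesTriage()
model.train("kinase binds receptor in human cells", "relevant")
model.train("protein phosphorylation pathway in human tissue", "relevant")
model.train("gene expression in maize seedlings", "wrong_species")
model.train("maize root growth under drought", "wrong_species")
model.train("conference announcement and editorial notes", "not_specific")

incoming = [
    "drought response in maize",
    "receptor kinase pathway in human cells",
]
ranked = triage(model, incoming, threshold=0.6)
for doc, category, confidence, confident in ranked:
    print(f"{confidence:.2f}  {category:14s}  confident={confident}  {doc}")
```

Sorting by confidence gives the ranked list to which a cut-off can be applied, and a run of uniformly low scores would signal that the training categories share no usable pattern.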
Crawford said the software’s ability to rate its own performance was an important factor in Incyte’s decision to go with Reel Two. “There might be some cases where the training set is a little disparate or mixed up, or there’s really no common thread, and in that case [the software] is forthright and says, ‘Hey, I can’t really make any sense of this stuff and everything I have is sort of a low confidence score.’ That’s incredibly valuable to know when something isn’t working, because then we know we can’t use the technique,” he said. “Even more important is when it says, ‘Hey, I can confidently say that I’m catching all the good stuff.’”
Another factor that sweetened the deal was the size and quality of the training data that Incyte had on hand to build its models. So far, the company’s curators have examined nearly 367,000 literature references. Of these, around 160,000 have information that was placed into the database, and around 90,000 were marked as “inappropriate,” Crawford said, because they contained data on the wrong species or didn’t contain specific enough information. He added that this negative training data is just as important as the positive training data in building accurate models. “Often what’s more important is what you didn’t read and why you didn’t read it,” he said.
Crawford acknowledged that the quality of the company’s manually curated training set was an important factor in the level of performance it has seen with Reel Two’s software. “I think if you’re coming at it from just one side of the equation — just as a process without any sort of data to back it up — it’s probably not as effective,” he said. The combination of Incyte’s data and Reel Two’s software “was a match that was bound to happen,” he said.
Nicko Goncharoff, Reel Two’s senior vice president, agreed that the companies’ products are “complementary.” Goncharoff noted that outside parties, including CTC Life Sciences, had previously suggested that the firms work together. Reel Two recently selected CTC as its exclusive reseller in Japan. “Before they even knew we had signed a deal with Incyte, they had told us there would be strong interest in a product [that] mixed the quality of Incyte’s data and analysis with the functionality of our text mining tools,” he said. “We believe there is potential for us to partner with them to develop products that put even more customized information and data analysis into the hands of researchers.”
Interactions on the Horizon
For now, Incyte is focused on improving the quality of its existing databases with the new software, “but there are some new projects we’re working on where [Reel Two’s software] is going to be really critical,” Crawford said. One of these projects is to add protein-protein interaction data to its offering — a feature the company plans to make available before the end of the year.
Culling protein-protein interaction data from the literature is one of the more common challenge tasks in the biomedical text-mining competitions that have arisen in the past few years, such as the KDD (Knowledge Discovery and Data Mining) Cup, TREC (Text Retrieval Conference), and BioCreative. The problem is viewed as the “holy grail” of biomedical text mining, Crawford said, because manual curation of interaction records presents an “exponential problem.” Considering that there are 80,000 proteins in the company’s database, and some of them might interact with up to 50 other proteins, “you suddenly end up with an exponential gain in the number of things that you’re looking at … and going after every single one of those is a huge task,” he said.
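A rough calculation gives a sense of the scale Crawford describes. The sketch below uses the article's two figures (80,000 proteins and up to 50 partners) and makes the deliberately pessimistic assumption that every protein reaches that 50-partner ceiling, which the article does not claim. It also counts every possible pairing a curator could in principle be asked to check; that count grows with the square of the number of proteins.

```python
from math import comb

PROTEINS = 80_000     # proteins in the database, per the article
MAX_PARTNERS = 50     # ceiling on partners per protein, per the article

# Assumption: every protein has the maximum number of partners.
# Each interaction A-B is counted once from A and once from B, so halve it.
max_interactions = PROTEINS * MAX_PARTNERS // 2

# Every unordered pair of proteins that could in principle be examined.
all_pairs = comb(PROTEINS, 2)

print(f"{max_interactions:,}")  # 2,000,000
print(f"{all_pairs:,}")         # 3,199,960,000
```

Even the bounded estimate runs to millions of candidate records, which is the kind of workload that makes automated prioritization attractive.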
Text mining is an area of great interest in the bioinformatics sector, but a breakthrough solution in the field has been slow in coming. Crawford suggested that the issue is not so much the technical limitations of natural language processing and other approaches, but the proper application of the tools — balanced with a realistic level of expectations. “People are expecting [text mining] to be this magic bullet … but they don’t realize that there’s a lot of work that goes into training these things and getting them to work,” he said. Text mining tools such as Reel Two’s “do a very good job of what they’re supposed to do,” he said, “but they’re not magic.”