Researchers at the Polytechnic University of Madrid have developed a tool that can discover and automatically classify bioinformatics resources from the scientific literature and have used it to compile a searchable index of software and databases.
The index, called the Bioinformatics Resource Inventory, or BIRI, was compiled using a combination of natural language processing and other technologies to extract and classify key information from the literature. It is automatically updated as new resources are published, according to its developers — a potentially useful feature considering the rapid pace of publication of new bioinformatics tools.
Victor Maojo, director of the biomedical informatics group at UPM and leader of the team that developed BIRI, told BioInform that he and his colleagues began developing the automatic indexing tool in 2008, when most bioinformatics resource collections — such as the Molecular Biology Database Collection, the Online Bioinformatics Resources Collection, and the Bioinformatics Links Directory — were compiled via manual approaches.
“Someone needed to actually verify the different publications of the tools, check the webpages, check the websites, and also look for different updated publications,” he said. “We were trying to do that automatically by means of text-mining techniques.”
In a paper published in BMC Bioinformatics last October, the UPM team described the process used to develop BIRI. First, the team analyzed a collection of papers to produce “structured surrogates” containing information such as titles, authors, and abstracts, which they used to “simplify the pattern-matching process.”
The abstracts were then divided into sentences and pre-processed by a lexical analyzer to extract words, or “tokens,” which were then stemmed to reduce each token to its root form, or lexeme.
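The pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UPM team's implementation: the function names, the naive sentence splitter, and the toy suffix list are all hypothetical, and a production system would more likely use an established stemming algorithm.

```python
import re

# Hypothetical suffix list for a toy stemmer (an assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(abstract):
    """Naively split an abstract after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def tokenize(sentence):
    """Extract lowercased alphanumeric tokens from a sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(abstract):
    """Return one list of stemmed tokens per sentence."""
    return [[stem(t) for t in tokenize(s)] for s in split_sentences(abstract)]


if __name__ == "__main__":
    print(preprocess("We present AlignX. It aligns protein sequences."))
```

The toy stemmer is deliberately crude: it turns “sequences” into “sequenc,” which is acceptable for pattern matching as long as every occurrence of the word is reduced the same way.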
Next, five experts with backgrounds in bioinformatics and text mining used a training set of 100 abstracts to create a set of linguistic patterns to automatically extract information from the abstracts. The team identified three types of patterns: resource-naming patterns, which would extract the name of the resource and a URL; functionality patterns, which would extract descriptions of the resource functionality; and classification patterns, which would extract either the category of the resource (such as whether it is a database, an alignment tool, or a visualization tool) or the target domain of the resource (such as DNA, protein, or expression).
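To give a sense of what the three pattern types might look like, here is a minimal Python sketch built on regular expressions and keyword lists. Every pattern, keyword, and name in it is a hypothetical stand-in chosen for illustration; the article does not list the actual patterns the experts wrote.

```python
import re

# Hypothetical resource-naming patterns: a capitalized name after an intro verb, and a URL.
NAME_RE = re.compile(r"\b[Ww]e (?:present|introduce|describe) (?P<name>[A-Z][A-Za-z0-9-]*)")
URL_RE = re.compile(r"https?://\S+")

# Hypothetical functionality pattern: the phrase following "a tool for ...".
FUNCTION_RE = re.compile(
    r"\ba (?:tool|database|server|program) (?:for|that|to) (?P<function>[^.,;]+)"
)

# Hypothetical classification keywords for categories and target domains.
CATEGORIES = {
    "database": ("database", "repository"),
    "alignment tool": ("align",),
    "visualization tool": ("visualiz", "viewer"),
}
DOMAINS = {
    "DNA": ("dna", "genom"),
    "protein": ("protein",),
    "expression": ("expression", "microarray"),
}


def classify(text, keyword_map):
    """Return the first label whose keywords occur in the text, else None."""
    lowered = text.lower()
    for label, keywords in keyword_map.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def extract(sentence):
    """Apply the naming, functionality, and classification patterns to one sentence."""
    name = NAME_RE.search(sentence)
    url = URL_RE.search(sentence)
    function = FUNCTION_RE.search(sentence)
    return {
        "name": name.group("name") if name else None,
        "url": url.group(0).rstrip(".,;)") if url else None,
        "function": function.group("function").strip() if function else None,
        "category": classify(sentence, CATEGORIES),
        "domain": classify(sentence, DOMAINS),
    }


if __name__ == "__main__":
    print(extract("We present AlignX, a tool for aligning protein sequences, "
                  "available at http://example.org/alignx."))
```

The example sentence, the resource name “AlignX,” and the example.org address are invented for the demonstration.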
Once the team had identified the patterns, they translated them into two “transition networks”: one that detects and extracts resource names and descriptions of their functionalities, and a second that classifies the resources into previously defined categories and domains.
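A transition network of the first kind can be pictured as a small state machine that walks through a sentence token by token. The Python sketch below is a simplified illustration only: the states, trigger words, and function name are assumptions, since the article does not describe the structure of the real networks.

```python
# Hypothetical trigger words that label the arcs between states.
INTRO_VERBS = {"present", "introduce", "describe"}
FUNCTION_CUES = {"for", "that", "to"}


def run_network(tokens):
    """Walk a token list through a four-state network, extracting a name and a function."""
    state = "START"
    name = None
    function = []
    for token in tokens:
        if state == "START":
            # Wait for an introductory verb such as "present".
            if token.lower() in INTRO_VERBS:
                state = "EXPECT_NAME"
        elif state == "EXPECT_NAME":
            # The token right after the verb is taken as the resource name.
            name = token
            state = "SEEK_FUNCTION"
        elif state == "SEEK_FUNCTION":
            # Skip tokens until a cue word opens the functionality description.
            if token.lower() in FUNCTION_CUES:
                state = "IN_FUNCTION"
        elif state == "IN_FUNCTION":
            # Collect every remaining token as the functionality description.
            function.append(token)
    return {"name": name, "function": " ".join(function) or None}


if __name__ == "__main__":
    sentence = "We present AlignX a tool for aligning protein sequences"
    print(run_network(sentence.split()))
```

A second network for classification could follow the same design, with arcs triggered by category and domain keywords rather than by verbs and cue words.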
A team of curators then examined the final results of the extraction process.
“[Experts] validate the final outcome,” Maojo said. “But that happens in all data and text mining processes. [It] cannot be completely automated; someone has to verify the information to avoid mistakes.”
In the paper, the team described several experiments it conducted to validate BIRI’s methodology. In a test of 400 abstracts from the ISI Web of Knowledge, BIRI retrieved 376 out of 392 possible resources, amounting to a success rate of about 94 percent. In addition, it successfully extracted 88 percent of the functionalities from the analyzed abstracts. Of the remainder, the functionality was incompletely extracted in 10 percent of the cases, and incorrect in the other 2 percent.
The authors also compared BIRI to other resource indexes, such as the Bioinformatics Links Directory, resources from the European Bioinformatics Institute, caBIG, and the National Institutes of Health’s iTools, using several criteria: whether the index is generated automatically, whether it covers external resources, whether its interface provides advanced search capabilities, whether the resources are annotated, and whether the index classifies resources.
Their results showed that BIRI was comparable to other indexes across these criteria, but was the only one other than iTools to offer automatic index generation; the others are all manually curated.
The authors also compared the indexes according to the number of resources that they contain. While the prototype version of BIRI, which contained 316 different resource names after curation, included fewer resources than many of the manually curated indexes, it automatically discovered more than 230 resources that also appear in the Bioinformatics Links Directory, which indexes 1,350 resources, or the Online Bioinformatics Resources Collection, which includes 2,368 resources.
In addition, BIRI contains several resources not included in other indexes, which the authors said is because existing indexes are updated "only considering manuscripts published in a reduced set of journals," while BIRI relies on all of PubMed and the ISI Web of Knowledge.
Guillermo de la Calle, a PhD student at UPM and one of BIRI’s developers, told BioInform via e-mail that BIRI can also incorporate information about tools published in other sources, with “minor changes” to the tool.
In addition to its use with bioinformatics resources, Maojo said that his team is currently using the tool to extract information from other types of scientific literature, such as research focused on nanoinformatics and nanomedicine.
The team is also planning to add information about open source bioinformatics resources to BIRI’s knowledge base and to improve the tool’s advanced search functionalities.
Finally, de la Calle said that the team is building a plug-in that will integrate BIRI with resources such as BioMoby and BioPortal. He said that making the plug-in useful would require extending some features of BIRI, such as information about the inputs and outputs of resources.
Maojo noted that the team would also need to come to some sort of agreement with the owners of these resources since these groups will have “their own strategies and policies of collaboration.”
The research was funded in part by the European Commission.