Bio-NLP Community Says It’s Ready for a Challenge; CASP-Like Evaluation on the Way

Now that the natural language processing community has begun to coalesce as a subset of bioinformatics, it’s ready to begin testing its capabilities, according to Lynette Hirschman.

Hirschman, head of the intelligent information access section at Mitre, is on an organizing committee to prepare a formal challenge problem that will assess the state of the art of the field on an ongoing basis. Pointing to evaluations already in place, such as the text-mining community’s TREC (Text Retrieval Conference) and structural biology’s CASP (Critical Assessment of Techniques for Protein Structure Prediction), Hirschman said a similar assessment would both stimulate and strengthen the nascent biological text-mining community.

Hirschman, along with Jong Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu, organized the text-mining track at last month’s Pacific Symposium on Biocomputing, where Hirschman said she saw “a lot of enthusiasm and interest in a challenge evaluation.” However, she noted, there was some concern that any problems chosen for such an evaluation be significant biological problems. Sample problems in the natural language community have traditionally been based on data sets drawn from broadcast news reports, not the biomedical literature.

“They’ve chosen sample problems that don’t have a user community directly associated with them,” explained Hirschman. “It’s sort of like working in a test tube as opposed to working in vivo.”

But while biologists may be wary of the technology’s capabilities, the data- and text-mining community has already set its sights on biological text as the next frontier. This year’s Human Language Technology Conference, March 24-27, will feature a special track on bioinformatics and natural language processing; the Knowledge Discovery and Data Mining (KDD-2002) Challenge Cup, July 23-26 in Edmonton, Alberta, Canada, will use biology problems for the first time this year; and there are discussions to include a track on biological information retrieval in November’s TREC 11, Hirschman said.

As with most other areas of bioinformatics, it seems the real trick will be bringing the biological and NLP communities together. “Because these are different subdisciplines, people don’t normally talk to each other. It’s a networking issue,” said Hirschman, who sees the emerging ontology effort within bioinformatics as the primary interface between the two disciplines.

Biological ontologies and nomenclatures provide a structured lexicon and knowledge base that can then be combined with biological databases, which serve as “schemas” to indicate what relationships are of interest. The result, according to Hirschman, is “an incredibly valuable resource for the natural language processing community.”

Hirschman said that she and the other challenge problem organizers have “seen significant progress” in bringing the NLP, ontology, and biological database communities together. In particular, she said, there has been a great deal of interest from database developers in using text-mining technology as a database curation aid. “If you could make sense of free text in databases, then there are various knowledge representation techniques that can be used to check for consistency and completeness and so on,” she said.

With this application in mind, one possible challenge problem under consideration is an automated curation system to extract biologically relevant text from the literature and place it in the appropriate database field. Such a test would require the cooperation of an existing database research group that would keep its human-curated data “blind” for several months. Hirschman said that one such group has expressed interest in cooperating already.

Other possible challenge problems include the discovery of protein-protein interactions within the literature or the use of natural language technology in conjunction with microarray data to help organize and classify genes. “Whoever puts out a corpus and a challenge problem first will get a fair amount of attention and, in fact, takers,” said Hirschman. But while there’s no shortage of possible test scenarios, a formal evaluation effort will not become a reality without funding.

“A critical part of all this is to get a funding agency to step up and fund some kind of evaluation scheme because people can’t do this for free,” said Hirschman. “Hopefully if there’s enough evidence of interest and need then that will be forthcoming.”

— BT
