Bio-NLP Community Says It’s Ready for a Challenge; CASP-Like Evaluation on the Way


Now that the natural language processing community has begun to coalesce as a subset of bioinformatics, it’s ready to begin testing its capabilities, according to Lynette Hirschman.

Hirschman, head of the intelligent information access section at Mitre, is on an organizing committee to prepare a formal challenge problem that will assess the state of the art of the field on an ongoing basis. Pointing to evaluations already in place within the text-mining community, such as TREC (Text Retrieval Conference), and to structural biology’s CASP (Critical Assessment of Techniques for Protein Structure Prediction), Hirschman said a comparable assessment would both stimulate and strengthen the nascent biological text-mining community.

Hirschman, along with Jong Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu, organized the text-mining track at last month’s Pacific Symposium on Biocomputing, where Hirschman said she saw “a lot of enthusiasm and interest in a challenge evaluation.” However, she noted, there was some concern that any problems chosen for such an evaluation be significant biological problems. Sample problems in the natural language community have traditionally been based on data sets drawn from broadcast news reports, not the biomedical literature.

“They’ve chosen sample problems that don’t have a user community directly associated with them,” explained Hirschman. “It’s sort of like working in a test tube as opposed to working in vivo.”

But while biologists may be wary of the technology’s capabilities, the data- and text-mining community has already set its sights on biological text as the next frontier. This year’s Human Language Technology Conference, March 24-27, will feature a special track on bioinformatics and natural language processing; the Knowledge Discovery and Data Mining (KDD-2002) Challenge Cup, July 23-26 in Edmonton, Alberta, Canada, will use biology problems for the first time this year; and there are discussions to include a track on biological information retrieval in November’s TREC 11, Hirschman said.

As with most other areas of bioinformatics, it seems the real trick will be bringing the biological and NLP communities together. “Because these are different subdisciplines, people don’t normally talk to each other. It’s a networking issue,” said Hirschman, who sees the emerging ontology effort within bioinformatics as the primary interface between the two disciplines.

Biological ontologies and nomenclatures provide a structured lexicon and knowledge base that can then be combined with biological databases, which serve as “schemas” to indicate what relationships are of interest. The result, according to Hirschman, is “an incredibly valuable resource for the natural language processing community.”

Hirschman said that she and the other challenge problem organizers have “seen significant progress” in bringing the NLP, ontology, and biological database communities together. In particular, she said, there has been a great deal of interest from database developers in using text-mining technology as a database curation aid. “If you could make sense of free text in databases, then there are various knowledge representation techniques that can be used to check for consistency and completeness and so on,” she said.

With this application in mind, one possible challenge problem under consideration is an automated curation system to extract biologically relevant text from the literature and place it in the appropriate database field. Such a test would require the cooperation of an existing database research group that would keep its human-curated data “blind” for several months. Hirschman said that one such group has expressed interest in cooperating already.
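A blind evaluation of this kind could be scored by comparing each system's extracted entries against the withheld human-curated records. The sketch below is only an illustration under stated assumptions: the record shape (document ID, database field, value), the function name `score_against_blind_gold`, and the toy data are all hypothetical, and the article does not say which metrics the organizers would use. Precision, recall, and F1 are shown because they are common in text-mining evaluations.

```python
from typing import Dict, Set, Tuple

# Hypothetical record shape: (document_id, database_field, extracted_value).
Annotation = Tuple[str, str, str]


def score_against_blind_gold(
    predicted: Set[Annotation], gold: Set[Annotation]
) -> Dict[str, float]:
    """Score system output against withheld human-curated entries.

    An entry counts as correct only if document, field, and value all
    match exactly (an assumed, deliberately strict matching rule).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data invented for illustration; not real curation records.
gold = {
    ("doc1", "gene", "geneA"),
    ("doc1", "function", "kinase"),
    ("doc2", "gene", "geneB"),
    ("doc2", "function", "transporter"),
}
predicted = {
    ("doc1", "gene", "geneA"),         # matches the curated entry
    ("doc2", "gene", "geneB"),         # matches the curated entry
    ("doc2", "function", "receptor"),  # wrong value for this field
}

scores = score_against_blind_gold(predicted, gold)
print(scores)
```

Exact matching keeps the sketch simple; a real evaluation would likely need rules for synonyms and partial matches, which is one place where the ontologies and nomenclatures mentioned earlier could come in.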

Other possible challenge problems include the discovery of protein-protein interactions within the literature or the use of natural language technology in conjunction with microarray data to help organize and classify genes. “Whoever puts out a corpus and a challenge problem first will get a fair amount of attention and, in fact, takers,” said Hirschman. But while there’s no shortage of possible test scenarios, a formal evaluation effort will not become a reality without funding.

“A critical part of all this is to get a funding agency to step up and fund some kind of evaluation scheme because people can’t do this for free,” said Hirschman. “Hopefully if there’s enough evidence of interest and need then that will be forthcoming.”

— BT
