Biological Text Pushes the Boundaries of Data Mining Technology at Annual KDD Cup


Last week’s release of training data sets marked the kick-off of the KDD Cup 2002, the annual data mining competition held in conjunction with the Association for Computing Machinery's Conference on Knowledge Discovery and Data Mining.

For the second year in a row, the competition set tasks in the area of biology. “Biology is especially compelling because it has become a very data-rich field in the last few years,” involving different sources and types of data, said Mark Craven, assistant professor of biostatistics, medical informatics, and computer sciences at the University of Wisconsin in Madison. He and Alexander Yeh from Mitre Corporation co-chair the 2002 cup and set the two tasks.

Unlike last year's challenge, this one includes text among other data types. “There is now a lot of interest in the biological community in trying to do text mining,” Craven commented. Biological texts, he said, pose different challenges than the news articles that are commonly used as fodder for text mining technology, because they contain so much technical jargon. Moreover, the same molecules can have different names or spellings.

The first task involves extracting information from scientific articles in order to automate the curation of a database. Participants will be given a set of articles and gene names and will have to develop a system to determine whether each paper contains relevant information about the genes’ expression and then list the gene products. Curators for the FlyBase Drosophila genomic database contributed the training and test data for the task.

The second challenge relates to the effect of gene knockouts on yeast cells. Contestants will receive about 15,000 abstracts from Medline, as well as protein-protein interaction data, protein subcellular localization data, and gene functional annotation data from the Munich Information Center for Protein Sequences database and the Saccharomyces Genome Database. Based on this information, they will be required to predict whether knocking out a specific gene in yeast affects an undisclosed subcellular system. This task draws from unpublished data from an unnamed research group that has studied about 4,500 strains from a yeast deletion library. Two-thirds of their results were made available for training purposes.

Craven said he expects participants to take many different novel approaches because “it's not a simple matter of taking off-the-shelf data mining tools and applying them.” Although he hopes for a good response (both from the data mining community and from the text mining camp, which has not been much involved in prior KDD cups), “it might be a lot less than last year, just because the task is more at the frontiers of what the state of the art in data mining really is.” In the past, participants have been fairly evenly split between academic groups and companies, which, if successful, see the competition as good advertising.

The test data will become available on June 10, the deadline for cup entries is June 26, and winners will present their approaches at the KDD conference in Edmonton, Canada, July 23-26. For more information, visit

