Biological Text Pushes the Boundaries of Data Mining Technology at Annual KDD Cup


Last week’s release of training data sets marked the kick-off of the KDD Cup 2002, the annual data mining competition held in conjunction with the Association for Computing Machinery’s Conference on Knowledge Discovery and Data Mining.

For the second year in a row, the competition set tasks in the area of biology. “Biology is especially compelling because it has become a very data-rich field in the last few years,” involving different sources and types of data, said Mark Craven, assistant professor of biostatistics, medical informatics, and computer sciences at the University of Wisconsin in Madison. He and Alexander Yeh from Mitre Corporation co-chair the 2002 cup and set the two tasks.

Unlike last year’s challenge, this one includes text, among other data types. “There is now a lot of interest in the biological community in trying to do text mining,” Craven commented. Biological texts, he said, pose different challenges than the news articles commonly used as fodder for text mining technology, because they contain so much technical jargon. Moreover, the same molecule can have several different names or spellings.

The first task involves extracting information from scientific articles in order to automate the curation of a database. Participants will be given a set of articles and gene names and will have to develop a system to determine whether each paper contains relevant information about the genes’ expression and then list the gene products. Curators for the FlyBase Drosophila genomic database contributed the training and test data for the task.

The second challenge relates to the effect of gene knockouts on yeast cells. Contestants will receive about 15,000 abstracts from Medline, as well as protein-protein interaction data, protein subcellular localization data, and gene functional annotation data from the Munich Information Center for Protein Sequences database and the Saccharomyces Genome Database. Based on this information, they will be required to predict whether knocking out a specific gene in yeast affects an undisclosed subcellular system. This task draws from unpublished data from an unnamed research group that has studied about 4,500 strains from a yeast deletion library. Two-thirds of their results were made available for training purposes.

Craven said he expects participants to take many different novel approaches because “it’s not a simple matter of taking off-the-shelf data mining tools and applying them.” He hopes for a good response, both from the data mining community and from the text mining camp, which has not been much involved in prior KDD Cups. Still, he said, “it might be a lot less than last year, just because the task is more at the frontiers of what the state of the art in data mining really is.” In the past, participants have been fairly evenly split between academic groups and companies, which, if successful, see the competition as good advertising.

The test data will become available on June 10, the deadline for cup entries is June 26, and winners will present their approaches at the KDD conference in Edmonton, Canada, July 23-26. For more information, visit

— JK
