Beginner’s luck? As first-time participants in the Association for Computing Machinery’s Knowledge Discovery and Data Mining Cup competition, Celera Genomics and New York-based data mining company ClearForest took home a first-prize trophy in one of two categories last week.
The Celera/ClearForest team beat out 32 other participants in a competition to build a system that would automatically curate thousands of scientific articles on Drosophila melanogaster for the FlyBase database. The data mining algorithms had to accurately indicate which articles included results on expression of gene products, and which genes and proteins were involved.
Adam Kowalczyk and Bhavani Raskutti of Australia’s Telstra Research Laboratories won the other KDD Cup category, in which 52 teams used Medline abstracts to predict the effect of knockout genes on different sub-cellular components in yeast cells.
The eighth annual KDD Cup, held in Edmonton, Canada, and co-chaired by Mark Craven of the University of Wisconsin and Alexander Yeh of Mitre Corporation, focused on data sets in biology for the second year running. “Biology is especially compelling because it has become a very data-rich field in the last few years,” Craven said previously [BioInform 05-06-02].
Barak Pridor, CEO of ClearForest, agreed that the biological domain offered several “unique challenges” that the company hadn’t encountered in its previous work in competitive intelligence, intellectual property research, and federal intelligence applications. Taking a step away from the needs of current customers such as Kodak and Dow Chemical, ClearForest drew on the domain expertise of several researchers from Celera to gain the edge it needed in its first KDD Cup appearance, Pridor said.
The two companies had an “existing relationship” prior to their involvement in the competition, Pridor said, but he was unable to provide further details of their collaborative efforts.
ClearForest’s approach to text mining combines three common methodologies: statistical analysis, structural analysis, and semantic analysis. Pridor said the company’s natural language processing technology draws primarily on the last of these to assess the patterns among textual entities, events, and facts, but also includes statistical and structural elements.
The key, Pridor said, is knowing when to apply which type of method. For example, he noted, gene-based information can be extracted using a controlled vocabulary such as the Gene Ontology, but information about proteins requires a discovery-based approach because the goal is to uncover previously unknown relationships. The ClearForest/Celera team pooled its expertise to apply the best combination of tools to the problem, he said.
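The distinction Pridor draws between vocabulary-driven and discovery-based extraction can be illustrated with a minimal Python sketch. It is not ClearForest’s or Celera’s method, which the article does not describe in detail. The term list, the relation pattern, the function names, and the sample sentence are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical controlled vocabulary: a tiny stand-in for a curated term list.
GENE_VOCABULARY = {"hedgehog", "wingless", "engrailed"}

# Hypothetical relation pattern: finds "A <verb> B" statements whether or
# not A and B appear in any vocabulary (the discovery-style case).
RELATION_PATTERN = re.compile(r"\b(\w+) (binds|activates|represses) (\w+)\b")


def match_vocabulary(text):
    """Return the vocabulary terms that occur in the text, sorted."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & GENE_VOCABULARY)


def discover_relations(text):
    """Return (subject, verb, object) triples matched by the pattern."""
    return RELATION_PATTERN.findall(text)


# Made-up sentence, used only to exercise both functions.
sentence = "Expression of wingless was reduced, and ProteinX binds engrailed in embryos."

print(match_vocabulary(sentence))
print(discover_relations(sentence))
```

The vocabulary lookup can only report terms it already knows, while the pattern surfaces "ProteinX" even though it appears in no list, which is the kind of trade-off the passage describes.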
ClearForest is mulling commercialization options within the life science sector for its technology, but was mum on whether Celera would play any part in this effort. ClearForest has already seen “considerable interest” from the biotech and pharma community following its success at the KDD Cup, Pridor said.