HINXTON, UK--Data comparability and the need for novel algorithms were key issues addressed by Data Mining in Bioinformatics--Towards In Silico Biology, a conference held at the Wellcome Trust Genome Campus here November 10-12. The meeting, organized by the European Bioinformatics Institute, attracted more than 275 attendees and featured 15 speakers from academia and industry and 40 posters.
Alvis Brazma of EBI, chair of the meeting’s organizing committee, said that bioinformaticists have become overwhelmed by an increasing volume of data, new databases, and new types of data. "In addition to complete genome sequences, we are learning about gene expression patterns and protein interactions on genomic scales," Brazma said. "Old ways of dealing with data, item-by-item, are no longer sustainable and it is necessary to create new opportunities for discovering biological knowledge in silico by datamining."
Heikki Mannila of Helsinki University of Technology and Nokia Technology kicked off the meeting with an overview of the issues and algorithms involved in traditional datamining. Usually, he said, datamining tasks involve either finding some global structure, as in clustering or mixture modeling, or finding a local structure, as in discovering motifs or patterns. The trend now is toward combining database methods with statistical procedures.
Mannila said genomics is creating types of data, such as huge sets of sequence letters, that statisticians have never before encountered. "That simply isn’t a traditional statistical data type. The same is true for gene expression data: That kind of data hasn’t been around that long and consequently there are no methods for analyzing it. There is much to be done in creating new types of concepts," he concluded.
Aris Floratos of the IBM TJ Watson Research Center spoke about discovering and exploiting patterns in biological databases. He focused on the use of the Teiresias algorithm to find so-called seqlets in the GenPept database. These seqlets can then be used in homology searches, but also to describe 3D structure.
Inge Jonassen of the University of Bergen discussed methods for the automatic discovery of patterns in sequences, giving particular attention to algorithms used in the Pratt line of programs.
Rolf Apweiler of EBI presented InterPro, a new, integrated resource of protein sites and functional domains that can be used for large-scale protein characterization. The database uses various algorithms to automatically collect and unify different types of data from a range of protein-related databases. EBI uses InterPro, for example, to assign common annotation to unannotated entries in TrEMBL, thus preventing overpredictions and standardizing annotation.
Continuing on the topic of mining databases, Phil Bourne, of the University of California’s San Diego Supercomputer Center, presented recent results of mining the Protein Data Bank and other macromolecular structure databases.
David Westhead of the University of Leeds briefed the audience on simplified descriptions of protein 3D structures and their use in searching and structural pattern recognition. Topology diagrams are visual aids that are simple to create and a powerful way to scale down the complexity of a 3D structure without losing biologically relevant information.
Turning to a use for datamining that is more common in other fields, but relatively new to bioinformatics, Christos Ouzounis of EBI presented promising results from a pilot project in which 2,500 MedLine abstracts were automatically analyzed and thousands of protein-protein interactions were found. Being able to easily extract data from abstracts or full articles can lead to prediction of similar interactions in other species where sequence homology allows, he suggested.
To close the day, Beatriz de la Iglesia of the University of East Anglia discussed induction of simple, understandable, and interesting rules from large data sets--nugget discovery--an application that is certain to arise soon in the biological domain.
The meeting’s second day focused mainly on gene expression data analysis. Martin Vingron of the German Cancer Research Center, or the DKFZ, presented examples of comparing and complementing heterogeneous information to answer bioinformatics questions.
Paul Spellman of Stanford University gave an animated presentation of various techniques that can be used to characterize function when looking at gene expression data generated from microarrays. He presented data generated from 25 different stress conditions for yeast, a total of 400 microarray hybridizations.
John Aach of Harvard University Medical School made a strong case for the standardization of data in a presentation about the development of integrated databases and analysis tools for functional genomics. He said that, besides more traditional bottlenecks, the data comparability issue makes integrating various information resources for further computation or information extraction a difficult chore.
Ronald Taylor of the US National Cancer Institute suggested that a Bayesian similarity measure for gene expression array experiments is a more accurate alternative to the Euclidean distance or correlation measures that are traditionally used.
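For readers unfamiliar with the two traditional measures mentioned here, the following is a minimal sketch of how Euclidean distance and Pearson correlation can disagree about which expression profiles are "similar." It does not attempt to reproduce the Bayesian measure described in the talk; the gene names and expression values are invented purely for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical expression levels across four conditions (illustrative only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, twice the amplitude
gene_c = [1.5, 1.5, 3.5, 3.5]  # similar magnitude to gene_a, different shape

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "distance to gene_a: %.3f" % euclidean(gene_a, other),
          "correlation with gene_a: %.3f" % pearson(gene_a, other))
```

In this toy case Euclidean distance ranks gene_c as the closer match to gene_a, while correlation ranks gene_b as the better match, because correlation ignores differences in amplitude. Which behavior is preferable depends on the biological question being asked.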
"There is a lot to be done in the area of the algorithm development; perhaps the most important thing is to have algorithms that produce robust answers in an understandable form so that the biologist who is using the algorithm really can understand the result," concluded Mannila.
--Jean-Jack M. Riethoven