An Indiana University data-mining expert is using a $140,000 grant from Pfizer to develop the first-ever public chemo-genomics resource. This resource will use semantic Web technology to combine vast amounts of data with analysis tools to enable researchers to pinpoint potential drug targets. David Wild, a professor at the university's Bloomington School of Informatics and Computing, is basing the design for the new resource on Chem2Bio2RDF, a prototype semantic Web resource that integrated data on compounds, genes, pathways, diseases, side effects, and scholarly publications, along with some initial tools for mining the data. The prototype, however, was not very user-friendly. "The main emphases of the work funded by this grant are the development of user-end integrative tools that use the resource and the integration of the PubMed literature with the other data sources," Wild says. "In particular, we are developing path-finding tools where, for instance, you can specify a gene and a pathway, or a drug and a disease, and the system will identify and rank all the paths through the semantic network between these points."
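The path-finding idea Wild describes can be illustrated with a small sketch. The triples below are made-up placeholders, not Chem2Bio2RDF data, and the helper names (`build_graph`, `find_paths`) and the shortest-first ordering are assumptions chosen for illustration rather than the project's actual method.

```python
from collections import defaultdict

# Toy, made-up (subject, predicate, object) triples; placeholders only.
TRIPLES = [
    ("drug:D1", "binds", "gene:G1"),
    ("drug:D1", "binds", "gene:G2"),
    ("gene:G1", "participates_in", "pathway:P1"),
    ("gene:G2", "participates_in", "pathway:P1"),
    ("pathway:P1", "implicated_in", "disease:X"),
    ("gene:G2", "associated_with", "disease:X"),
]


def build_graph(triples):
    """Index triples as an adjacency list, treating every link as undirected."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
        graph[obj].append((predicate, subject))
    return graph


def find_paths(graph, start, goal, max_hops=4):
    """Return all simple paths from start to goal with at most max_hops links.

    Paths are ordered shortest first (an assumed, simplistic ranking).
    """
    found = []

    def walk(node, path):
        if node == goal:
            found.append(tuple(path))
            return
        if len(path) - 1 == max_hops:  # hop budget used up
            return
        for _predicate, neighbor in graph[node]:
            if neighbor not in path:  # never revisit a node on this path
                walk(neighbor, path + [neighbor])

    walk(start, [start])
    return sorted(found, key=lambda p: (len(p), p))


if __name__ == "__main__":
    for path in find_paths(build_graph(TRIPLES), "drug:D1", "disease:X"):
        print(" -> ".join(path))
```

In this toy network, specifying a drug and a disease yields every chain of genes and pathways that connects them. A real resource would need a more informative ranking than path length.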
Wild and his colleagues plan to enable the new semantic Web resource to extract and integrate PubMed abstracts and develop a Bio-LDA topic model. From there, the resource can identify latent topics in PubMed and associate them with biological terms at different levels of probability. These associations provide a way to rank the paths through the network. The prototype resource has already demonstrated promise for predicting off-target and multi-target interactions and for discovering new uses for existing drugs, Wild says.
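One way to picture how probabilistic associations could rank paths is sketched below. It is not the Bio-LDA model: the link weights are invented numbers standing in for association probabilities, and scoring a path by the product of its link weights (computed as a sum of logs) is an assumption made for illustration.

```python
import math

# Hypothetical association strengths in (0, 1] for each link; invented values.
EDGE_WEIGHT = {
    frozenset(("drug:D1", "gene:G1")): 0.9,
    frozenset(("gene:G1", "disease:X")): 0.8,
    frozenset(("drug:D1", "gene:G2")): 0.5,
    frozenset(("gene:G2", "disease:X")): 0.6,
    frozenset(("gene:G1", "pathway:P1")): 0.7,
    frozenset(("pathway:P1", "disease:X")): 0.95,
}

# Candidate paths between the same two endpoints (placeholder entities).
PATHS = [
    ("drug:D1", "gene:G2", "disease:X"),
    ("drug:D1", "gene:G1", "pathway:P1", "disease:X"),
    ("drug:D1", "gene:G1", "disease:X"),
]


def path_score(path, weights):
    """Log of the product of link weights along the path (higher is better)."""
    return sum(math.log(weights[frozenset(pair)]) for pair in zip(path, path[1:]))


def rank_paths(paths, weights):
    """Order paths from strongest to weakest combined association."""
    return sorted(paths, key=lambda p: path_score(p, weights), reverse=True)


if __name__ == "__main__":
    for path in rank_paths(PATHS, EDGE_WEIGHT):
        score = math.exp(path_score(path, EDGE_WEIGHT))
        print(f"{score:.3f}  " + " -> ".join(path))
```

Under these assumed weights, a longer path through strongly associated terms can outrank a shorter path through weak ones, which is the kind of distinction a purely length-based ranking would miss.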
Semantic Web approaches to unifying life sciences data have remained on the fringes of bioinformatics for most researchers, mainly because of the expertise required to create a semantic Web-enabled solution, but Wild says this project aims to make the technology's potential for drug discovery clear. "I think until the last couple of years, the semantic Web was a good idea in theory, but difficult to make useful in practice," he says. However, he adds that there has been progress, pointing as examples to SPARQL, which allows powerful cross-dataset querying, and to an increased number of tools and algorithms. "These in combination make an extremely powerful framework for translational medicine and for searching which is not tied to a specific kind of biological entity, such as genes and pathways," Wild says.