Add scientific literature to the growing list of data sources that can be mined for genomic information. Eivind Hovig, Tor-Kristian Jenssen, and their colleagues at the Norwegian Radium Hospital successfully created a network of gene interactions based entirely on information found in the Medline database of published literature.
The network, annotated with terms from the Medical Subject Headings (MeSH) index and the Gene Ontology database, was combined with a set of web tools to form the PubGene database, available at www.pubgene.org. Hovig said the database and tools are a useful complement to conventional clustering when analyzing gene expression data.
Researchers can use PubGene to obtain a list of genes that any given gene is likely to interact with, in order of probability, as well as a list of medical areas in which the gene is likely to be involved.
To create PubGene, Jenssen first developed a collection of Perl-based routines to extract 13,712 gene names, gene alias names, and gene symbols from human gene nomenclature databases, including HUGO, LocusLink, the GDB, and Genatlas. The team then indexed the extracted names across the titles and abstracts of over 10 million Medline records, and mapped all occurrences to primary gene symbols.
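The indexing step can be illustrated with a short Python sketch. This is not the team's Perl code, which the article does not show. The alias table, the function names, and the case-insensitive whole-token matching rule are all assumptions made for illustration.

```python
import re

# Hypothetical alias table (illustrative only): every name, alias, or symbol
# points to one primary gene symbol.
ALIAS_TO_SYMBOL = {
    "TP53": "TP53",
    "p53": "TP53",
    "tumor protein p53": "TP53",
    "MDM2": "MDM2",
    "HDM2": "MDM2",
    "BRCA1": "BRCA1",
}


def build_pattern(aliases):
    """Compile one regex that matches any alias as a whole token."""
    # Longest aliases first, so multiword names win over their substrings.
    ordered = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in ordered)
    # The lookarounds stop "TP53" from matching inside "TP53BP1".
    return re.compile(r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])",
                      re.IGNORECASE)


def genes_in_text(text, alias_to_symbol=ALIAS_TO_SYMBOL):
    """Return the set of primary symbols mentioned in a title or abstract."""
    lookup = {alias.lower(): symbol for alias, symbol in alias_to_symbol.items()}
    pattern = build_pattern(lookup)
    return {lookup[match.group(1).lower()] for match in pattern.finditer(text)}


def index_records(records):
    """Map each record ID to the primary symbols found in its text."""
    return {record_id: genes_in_text(text) for record_id, text in records.items()}
```

Case-insensitive matching is a simplification. Short gene symbols often collide with ordinary words and abbreviations, so a production system would need stricter disambiguation than this sketch provides.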
Assuming that there is some biological relevance when two genes are mentioned in the same Medline record, the team co-indexed all gene pairs mentioned together in any Medline article. Each gene was represented in the database by a node in the network, and a connecting link was created between every pair of genes that co-occurred. Each link was given a weight equal to the number of articles in which the pair was found.
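The network construction described here amounts to counting co-occurrences. The following is a minimal Python sketch, assuming each record has already been reduced to a set of primary gene symbols. The function names and sample data are hypothetical. Ranking neighbors by raw article count is a simplification, since the article does not say how PubGene orders its results.

```python
from collections import Counter
from itertools import combinations


def build_network(record_genes):
    """Count, for every gene pair, the number of records mentioning both.

    record_genes maps a record ID to the set of primary symbols found in it.
    The result maps an alphabetically ordered pair (a, b) to its link weight.
    """
    weights = Counter()
    for genes in record_genes.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(genes), 2):
            weights[(a, b)] += 1
    return weights


def neighbors(weights, gene):
    """List the genes linked to `gene`, strongest link first."""
    linked = [(b if a == gene else a, w)
              for (a, b), w in weights.items() if gene in (a, b)]
    return sorted(linked, key=lambda item: (-item[1], item[0]))


# Illustrative input only, not real Medline data.
records = {
    "r1": {"TP53", "MDM2"},
    "r2": {"TP53", "MDM2", "BRCA1"},
    "r3": {"TP53"},
}
network = build_network(records)
print(neighbors(network, "TP53"))
```

A record that mentions only one gene adds no links, and a record that mentions n genes adds one to the weight of each of its n(n-1)/2 pairs.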
In a recent Nature Genetics paper, Jenssen, Hovig, and their colleagues compare the gene pairs in PubGene with those described in the Online Mendelian Inheritance in Man database and the Database of Interacting Proteins. Their results indicate that the method is a useful tool for identifying gene interactions.
But according to Hovig, “The most important tool scientifically is the option to perform microarray analysis. With this approach, it is possible to perform analysis including the knowledge contained in Medline, as opposed to the common strategies used today that mainly rely on pattern recognition of a statistical type, that is, nonsupervised.”
The team validated the approach by applying it to two publicly available microarray datasets. In both cases, they were able to examine the biological relationships between similarly up- or down-regulated genes using PubGene to identify their literature associations.
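One way to picture this kind of analysis is to take a group of similarly regulated genes and list the literature links among them. The Python sketch below assumes a dictionary of pair weights keyed by alphabetically ordered gene pairs. The function name, the data, and the ranking are hypothetical and do not reproduce the team's published procedure.

```python
def literature_links(cluster, weights):
    """Return the literature links that connect genes within one cluster.

    cluster is an iterable of gene symbols, for example genes that were
    up-regulated together. weights maps a gene pair (a, b) to the number
    of articles mentioning both genes.
    """
    members = set(cluster)
    links = [(a, b, w) for (a, b), w in weights.items()
             if a in members and b in members]
    # Strongest literature association first, then alphabetical order.
    return sorted(links, key=lambda link: (-link[2], link[0], link[1]))


# Illustrative weights only, not real PubGene data.
pair_weights = {
    ("MDM2", "TP53"): 2,
    ("BRCA1", "TP53"): 1,
    ("BRCA1", "MDM2"): 1,
}
print(literature_links(["TP53", "MDM2", "EGFR"], pair_weights))
```

Genes in the cluster that have no links to other members, such as the hypothetical EGFR entry here, do not appear in the output, which may itself be a useful signal when interpreting a cluster.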
Hovig said the team plans to further integrate and refine the information in the current version of PubGene and add other sources of relevant meta-information. They also plan to develop similar databases for all model organisms for which there are curated name databases, such as Drosophila and Mus, and to add special services such as the generation of custom-made chips based on keyword searches.
The researchers intend to keep the public PubGene site freely available to academic users, but are in the process of setting up a company to commercialize the system.
“We believe a number of companies would be interested in setting it up in-house, instead of submitting data across the Internet,” Hovig said. He added that they have already been contacted by several interested bioinformatics providers.