NCBI Makes its Cheminformatics Debut with Launch of PubChem Small-Molecule Database


The National Center for Biotechnology Information has taken a step beyond its traditional role as the custodian of molecular biology data with the recent launch of PubChem, a new database designed to serve as a repository for the chemical structures of small molecules along with information on their biological activities.

PubChem, which was developed to support the Molecular Libraries and Imaging component of the NIH Roadmap for Medical Research initiative, has been under development since the roadmap was announced last September. A prototype version of the resource went live last month, almost exactly a year later, at NCBI’s website.

This preliminary version of the database serves as a testing ground for massive amounts of bioactivity data that will eventually be generated for one million small molecules by several NIH-sponsored chemical genomics screening centers. Stephen Bryant, a senior investigator at NCBI who led the development of PubChem, said that the first data from these centers won’t be available until mid-2005.

In the meantime, Bryant and his team have populated the database with “legacy data” from several publicly available resources — primarily screening data from the National Cancer Institute’s Developmental Therapeutics Program, but also structural information from data sources like BioCyc, ChemBank, ChemID, KEGG, PDB, NIAID, and the NIST WebBook.

Bryant said that PubChem currently contains around 850,000 records, of which around 650,000 are non-redundant. Released quietly through NCBI’s Entrez interface in mid-September, the site recorded about 30,000 hits per day by the end of the month, Bryant said. That is not much compared to the nearly 30 million hits that NCBI sees across all of its resources on a typical day, but it is considerable for a resource launched with next to no fanfare.

Stephen Heller, a guest researcher at NIST who serves on the advisory panel for PubChem, said the developers are taking a “low-key approach” because “while there are an enormous amount of resources going into this, everybody realizes that it’s not coming out very quickly, so they don’t want to oversell a mostly empty vessel.” However, he added, “once they fill it up, it’s going to be great.”

The PubChem Triumvirate

PubChem is actually three linked databases: PubChem Substance, PubChem Compound, and PubChem BioAssay. A “substance” in PubChem’s terms, Bryant said, is essentially “some stuff in a well, test tube, or vial” that has certain chemical properties, while a “compound” must have a chemical structure. The compound database, therefore, is a subset of the substance database, because not all substances — which can include plant extracts, “cocktails” of anti-HIV drugs, or other samples — have chemical structures associated with them. Users can query PubChem Substance and PubChem Compound using descriptive terms, chemical properties, or structural similarity.
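The relationship between substances and compounds can be illustrated with a minimal Python sketch. This is not PubChem's actual schema: the field names, the identifiers, and the use of a plain string as a structure key are all assumptions made for illustration, and the three sample records are invented. It only shows the idea that every deposited sample is a substance, while compounds are the distinct structures found among those substances.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Substance:
    """A deposited sample; all field names here are hypothetical."""
    substance_id: int
    source: str
    description: str
    structure: Optional[str] = None  # None when no structure is known


def compounds_from(substances: Iterable[Substance]) -> Dict[str, List[int]]:
    """Group substance IDs by structure, skipping structure-less samples."""
    compounds: Dict[str, List[int]] = {}
    for s in substances:
        if s.structure is None:
            continue  # e.g. an extract or a mixture with no single structure
        compounds.setdefault(s.structure, []).append(s.substance_id)
    return compounds


# Invented sample records: two depositors submit the same structure,
# and a third record has no structure at all.
substances = [
    Substance(1, "source A", "ethanol", "CCO"),
    Substance(2, "source B", "ethyl alcohol", "CCO"),
    Substance(3, "source C", "plant extract"),
]
compounds = compounds_from(substances)
```

In this toy data, three substance records collapse to a single compound, which mirrors how a count of deposited records can exceed the count of non-redundant structures.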

The third database, PubChem BioAssay, was designed as a standalone resource “because the bioactivity data is separate from the chemical structure,” Bryant said. Entries in the PubChem BioAssay database include screening data for all the samples described in PubChem Substance, along with descriptions of the conditions and readouts specific to each screening experiment.

These parameters can be a bit tricky, Bryant said, and comparing screening experiments is in many ways as difficult as comparing different microarray experiments. “To understand the data, you have to understand the experiment, you have to understand the experimental conditions,” he said.

One of the challenges in designing the bioassay database was coming up with “a uniform set of data fields” that would apply to all the experimental data that will be generated in the NIH chemical genomics screening program, in which even the criteria for identifying a compound as “active” are not predefined. “It’s not one-size-fits-all,” Bryant said. “The submitter of the assay data can define a simple activity measure that would be uniform across the whole set of data, but I’m not sure if the screening centers are going to do that yet.”

One feature of the new resource that Bryant said should be particularly useful for biologists is the tight integration between PubChem and PubMed. Biologists — the target user base for PubChem — will want to search the literature for information on the bioactivity of chemical compounds, Bryant said, “but almost none of the legacy sources gave us citations.”

His team addressed this problem by developing a “neighboring” system called “PubMed via MeSH” that pre-computes links between the compounds in the database and their citations in the literature using MeSH (Medical Subject Headings) terms. Many biologists overlook the fact that “a big part of the MeSH tree is substance names,” he said. “A substantial portion of the biomedical literature has to do with pharmacology and toxicology.”
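The general idea of pre-computing such links can be sketched in a few lines of Python. This is an illustration only, not NCBI's implementation: it assumes exact, case-insensitive matching between compound names and indexing terms, and the citation IDs, compound IDs, and function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_term_index(citations: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Invert {citation_id: [terms]} into {lowercased term: {citation_id}}."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for citation_id, terms in citations.items():
        for term in terms:
            index[term.lower()].add(citation_id)
    return index


def precompute_links(
    compound_names: Dict[str, List[str]],
    term_index: Dict[str, Set[int]],
) -> Dict[str, List[int]]:
    """For each compound, collect the citations indexed under any of its names."""
    links: Dict[str, List[int]] = {}
    for compound_id, names in compound_names.items():
        hits: Set[int] = set()
        for name in names:
            hits |= term_index.get(name.lower(), set())
        links[compound_id] = sorted(hits)
    return links


# Invented data: citation IDs mapped to their indexing terms.
citations = {
    101: ["Rofecoxib", "Myocardial Infarction"],
    102: ["Rofecoxib"],
    103: ["Aspirin"],
}
compound_names = {"cmpd-1": ["rofecoxib", "Vioxx"], "cmpd-2": ["caffeine"]}
links = precompute_links(compound_names, build_term_index(citations))
```

Because the links are computed ahead of time, looking up the literature for a compound at query time becomes a simple dictionary read rather than a fresh text search.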

As an example of the value of the PubMed links, Bryant demonstrated how a search in PubChem on the very newsworthy compound Vioxx brings up the chemical structure of rofecoxib, the generic name of the Merck drug. Clicking on the structure brings up a summary page noting that there are 698 links to PubMed via MeSH. Another click goes straight to PubMed, where refining the search to include “myocardial infarction” brings up 28 articles dating back to 2001 that link Vioxx and other COX-2 inhibitors with adverse cardiovascular events.

The scientific literature is the “mother lode” of information on the biological activity of chemical compounds, Bryant said.

Heller said that the integration with PubMed is “very impressive,” and should be a valuable tool for researchers. He urged caution, however, for biologists interested in diving into the cheminformatics resource in its current form. “It’s not ready yet,” he said, adding that Bryant and his team “are trying to be a little bit realistic” about its capabilities.

“I think they would like people to test the existing data in the small prototype, because it’s not so small — there are around 700,000 compounds, and if you don’t ask the right questions, you get thousands of answers.” Nevertheless, he added, he and the other PubChem advisory panel members are “pleased” with what they’ve seen so far.

Biologists “can really learn a lot from this and begin to see what’s going on there,” he said. “It looks like the right way to go.” Nevertheless, he estimated that PubChem won’t be considered a “major resource for biochemistry and pharmaceutical resource … for at least two or three years.”

— BT

