Mount Sinai Hospital's Blueprint Initiative has added another database to its catalog of bioinformatics resources: a collection of more than 9.4 million predicted interactions between small molecules and protein binding sites for more than 1,500 organisms.
The database, an extension of Blueprint's SMID (Small Molecule Interaction Database) called SMID-Genomes, "bridges the gap between structural proteomics and genomics projects," said Blueprint's Michel Dumontier, who led the development of the database.
Dumontier said he views the database as "an incredibly useful tool in the initial drug-discovery phase" because it enables researchers to quickly identify small molecules that are likely to bind to a target of interest in one organism, but not to targets in so-called "innocent bystander" organisms.
As an example, Christopher Hogue, who leads the Blueprint Initiative, said, "You can do a subtractive analysis of all the small molecules that bind to Drosophila from Arabidopsis and look at candidate small molecules for pesticides that perhaps wouldn't hurt plants. Or you can throw the human genome into that analysis at the same time and find out things that are unique to Drosophila that won't affect plants or humans."
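The subtractive analysis Hogue describes amounts to a set difference over per-organism hit lists. The following is a minimal sketch of that idea, not SMID-Genomes' own interface: the function name, the dictionary layout, and the molecule identifiers are all hypothetical and made up for illustration.

```python
def subtractive_candidates(hits_by_organism, target, bystanders):
    """Return molecules predicted to bind in the target organism
    but in none of the bystander organisms."""
    candidates = set(hits_by_organism[target])
    for organism in bystanders:
        # Remove anything that is also predicted to bind in a bystander.
        candidates -= set(hits_by_organism.get(organism, ()))
    return candidates


# Hypothetical hit lists; real ones would come from the database.
hits = {
    "Drosophila": {"mol_A", "mol_B", "mol_C"},
    "Arabidopsis": {"mol_B"},
    "Homo sapiens": {"mol_C", "mol_D"},
}

# Molecules that hit Drosophila but not Arabidopsis.
pesticide_candidates = subtractive_candidates(hits, "Drosophila", ["Arabidopsis"])

# Add the human hit list to the subtraction as well.
stricter_candidates = subtractive_candidates(
    hits, "Drosophila", ["Arabidopsis", "Homo sapiens"]
)
```

Adding an organism to the bystander list can only shrink the candidate set, which is why including the human genome yields the more conservative list.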
Hogue said that the database would be a good first step for high-throughput, or even virtual, screening projects. "If you're looking for a set of small molecules to screen all the genes in a pathogen, simply taking a quick look at the SMID-Genomes web page for the list of small molecules that hit those pathogens is a quick way of coming up with a starter library to test with, and that comprises small molecules that should bind to things in the proteome of that organism," he said.
Hogue said that the database grew out of "frustration" at the widening gap between experimental structures in the Protein Data Bank and sequence data. "When you look at newly sequenced genomes, there's not one of the amino acid binding sites that is annotated properly," he said. "There's not one amino acid out of that set of proteins that is actually delimited on a sequence that says this residue binds to the amino acid, not one out of 20."
Hogue said that the Blueprint team decided to take advantage of the information available in the PDB by mapping that structural information to the sequence data.
To do that, Dumontier and colleagues wrote a program called SMID-Blast that is based on NCBI's RPS (Reverse Position-Specific) Blast. "It takes a hit from a sequence that's a query to a protein family and then extrapolates that into what are actually the binding residues from the crystal structure," Hogue said.
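The extrapolation step can be pictured as carrying annotated positions across an alignment. The sketch below assumes a simple gapped pairwise alignment between a structural template and a query sequence; it is an illustration of the general idea rather than SMID-Blast's actual code, and the function name, inputs, and sequences are hypothetical.

```python
def transfer_binding_sites(template_aln, query_aln, template_sites):
    """Map binding-residue positions from a template onto a query.

    template_aln, query_aln: aligned sequences of equal length, '-' for gaps.
    template_sites: 1-based template residue numbers known to contact the ligand.
    Returns the 1-based query residue numbers aligned to those positions.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")

    template_pos = query_pos = 0
    mapped = []
    for t_char, q_char in zip(template_aln, query_aln):
        # Advance each residue counter only on non-gap characters.
        if t_char != "-":
            template_pos += 1
        if q_char != "-":
            query_pos += 1
        # A site transfers only if both sequences have a residue in this column.
        if t_char != "-" and q_char != "-" and template_pos in template_sites:
            mapped.append(query_pos)
    return mapped


# Made-up alignment: template residues 3, 4 and 6 contact the ligand.
predicted = transfer_binding_sites("MK-HDEL", "MRAH-EL", {3, 4, 6})
```

Here template residue 4 falls opposite a gap in the query, so only two of the three sites are transferred.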
The key to the method, Dumontier said, is the scoring scheme, which evaluates how well the binding site for the small molecule is conserved between multiple organisms. "When we make the prediction for any given protein, we ask the question, 'How similar or how identical are the residues in the binding site compared to the domain?' and that really helps us fish out which hits are real hits, and which ones are likely not to really occur naturally."
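The article does not give the scoring formula, so the following is only a rough sketch of the comparison Dumontier describes: percent identity over the binding-site columns set against percent identity over the whole aligned domain. The function name, the 0-based column convention, and the example sequences are assumptions made for illustration.

```python
def identity(template_aln, query_aln, columns=None):
    """Fraction of identical residues over the given alignment columns.

    columns: 0-based alignment column indices; None means every column.
    Columns where the template has a gap are ignored.
    """
    if columns is None:
        columns = range(len(template_aln))
    pairs = [(template_aln[i], query_aln[i]) for i in columns]
    pairs = [(t, q) for t, q in pairs if t != "-"]
    if not pairs:
        return 0.0
    return sum(t == q for t, q in pairs) / len(pairs)


# Made-up alignment with binding-site residues in columns 2, 4 and 6.
template = "ACDEFGHIKL"
query = "ACDQFGHAKL"
site_columns = [2, 4, 6]

site_identity = identity(template, query, site_columns)
domain_identity = identity(template, query)
```

In this toy case the binding site is fully conserved even though the domain as a whole is not, the kind of pattern that would support keeping a predicted hit.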
Hogue said that the "true positives" in the PDB provide an effective starting point for the predicted interactions in the database, but noted that the Blueprint team is currently looking into ways to validate those predictions. He said that the database will be updated as new structures are entered into the PDB, "but we'd like to come up with some other ways of validating the data set."
Those predicted interactions from SMID-Genomes that are validated experimentally will be deposited in Blueprint's BIND (Biomolecular Interaction Network Database), Hogue said.
Dumontier said that Blueprint is also going to look at other sources of data for small molecules, such as NCBI's PubChem. "We're going to compare the small molecules in the structure database with that of PubChem, and that way we can show all our users a list of small molecules that might be more drug-like, for instance, or more interesting in a pharmacological sense," he said.
Hogue said that Blueprint has an ongoing collaboration with PubChem, and has already deposited around 1,000 curated small-molecule BIND records in the database.
SMID-Genomes will also be available for installation behind company firewalls through Blueprint's recently spun-out commercial arm, Unleashed Informatics (see Bioinformatics Briefs). The web-based version of the database will remain free to academic and commercial users through Blueprint, Hogue said.