Director, Institute for Genomics & Bioinformatics
University of California, Irvine
Pierre Baldi is developing a method that applies the Blast sequence-alignment approach to small molecules. The aim is to help scientists plow through publicly available chemical databases, both to find molecules that are chemically similar to a query molecule and to ascertain the statistical significance of their search results.
Baldi, the director of the Institute for Genomics & Bioinformatics at the University of California, Irvine, is the principal developer of a number of bioinformatics and cheminformatics-oriented tools, software, and databases, including the ChemDB small-molecule database, of which he is the principal founder.
Baldi spoke at the Intelligent Systems for Molecular Biology conference in Toronto last month on his cheminformatics-oriented quest to Blast small molecules. In the conference proceedings published online in Bioinformatics, Baldi and co-author Ryan Benz of UCI wrote that because small molecules play a “fundamental role” in chemistry, biology, and medicine, and are increasingly filling chemical repositories, the tools to search them are also gaining in importance.
Noting that Blast and its statistical significance scores have become “de facto standards of modern biology” and have driven the “exponential” expansion of bioinformatics, Baldi and Benz point out that no such search tool has emerged in cheminformatics.
While chemical search engines do exist, what has been lacking, they wrote, is a systematic, large-scale, open study of molecular similarity scores, along with their statistical distributions and significance levels, so that scientists can put their search engine results into perspective. Baldi believes he has addressed this gap with his approach.
BioInform spoke with Pierre Baldi about his new cheminformatics tool idea. The following is an edited version of that conversation.
How far along are you in figuring out how to Blast small molecules?
Several algorithms for searching molecules are available. The question is how do you rank, how do you assess the quality of what you are retrieving [with your search]?
Blast does alignments, so you have an alignment score that tells you how good the alignment is between your query and the sequences you are searching. But then Blast also gives you an E [expectation] score that tells you how confident you can be that what you are getting is not due to chance.
The Tanimoto [score for chemical fingerprints] is like an alignment score. It tells you how similar two molecules are, and there is fairly good agreement that Tanimoto is a good measure. Tanimoto has been around for a while; I cannot say it is universal, but it is widely used.
The part that has been missing is the E score, telling you how significant the Tanimoto value is: ‘Could it have happened by chance?’
Before we get to the statistics, how do you create fingerprints so you can search through the database space? Small molecules are very unlike genomic sequences, right?
Fingerprints are just ways of representing your molecules … a way of taking the three-dimensional or two-dimensional chemical object and transforming it essentially into a vector of zeroes and ones. Basically the ones mean that a particular [functional] group is in the molecule.
So if a molecule contains a benzene ring, you put a one in the first position, otherwise you put a zero. If it contains an alcohol group you put a one in second position, otherwise you put in a zero, and so forth.
Depending on the system you are using, you could have thousands if not hundreds of thousands of positions and indicate a zero or one depending on whether the corresponding group or feature is present in the molecule or not. That is all there is to fingerprints; it is a very simple idea. … You flatten this three-dimensional object into a binary vector of zeroes and ones. Computer scientists are very good with vectors and can do all kinds of things … and use the Tanimoto score to compare binary vectors.
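As a minimal illustration of the fingerprint idea described above, the following Python sketch maps a molecule's list of features to a vector of zeroes and ones. The feature list, the `fingerprint` function, and the two example molecules are hypothetical and chosen only for illustration; real fingerprint systems use far longer vectors and derive the features from the chemical structure itself.

```python
# Hypothetical feature dictionary: one vector position per feature.
# Real systems use thousands of positions or more.
FEATURES = ["benzene ring", "alcohol group", "carboxylic acid", "amine", "halogen"]


def fingerprint(present_features):
    """Return a list of 0/1 bits, one per entry in FEATURES.

    A bit is 1 when the corresponding feature is present in the molecule.
    """
    present = set(present_features)
    return [1 if name in present else 0 for name in FEATURES]


# Feature lists are supplied by hand here (an assumption for this sketch).
benzyl_alcohol = fingerprint(["benzene ring", "alcohol group"])
ethanol = fingerprint(["alcohol group"])

print(benzyl_alcohol)  # [1, 1, 0, 0, 0]
print(ethanol)         # [0, 1, 0, 0, 0]
```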
In a [sequence] alignment, you are looking at the letters in common between the two sequences. If there are a lot of letters in common at the same position, you are going to say it is the same sequence or they are evolutionarily related.
If you have two binary vectors, and they have a lot of ones in common … you are going to say ‘those molecules must be similar because they have the same functional groups.’ For example, [that can mean] they both have two benzene rings, they both have alcohol groups.
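The comparison of shared ones can be made concrete with the Tanimoto score, which for binary vectors is the number of positions set in both vectors divided by the number of positions set in either. The sketch below is a plain-Python illustration with made-up vectors; returning 0.0 when both vectors are empty is a convention assumed here, not something stated in the interview.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 vectors.

    Counts bits set in both vectors and divides by bits set in either.
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Made-up fingerprints for illustration.
query = [1, 1, 0, 1, 0, 0]
hit = [1, 1, 0, 0, 1, 0]

# Two ones in common, four positions set in either vector.
print(tanimoto(query, hit))  # 0.5
```

A score of 1.0 means the two fingerprints are identical, and a score of 0.0 means they share no features.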
Could you explain how the statistics enter into the searching of small molecules?
The key thing is that to assess similarity, you need a chance model. Because you are asking the question, ‘[Could] the level of similarity I am observing … arise by pure chance?’ And so when you are looking at sequences, chance means changing letters, nucleotides or amino acids at random, flipping a coin.
The same thing holds for small molecules. You have these fingerprints, [so] now you have to define a chance model: How can you generate at random, by flipping coins, these fingerprint [matches]? There are slightly different ways of doing that. But basically it is your background chance model against which you are trying to assess the quality, the significance, of the similarity you are observing.
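One way to picture such a background model is a simulation in which every bit of a random fingerprint is an independent coin flip. The Python sketch below assumes that simple model, with a single hypothetical probability `p_on` for every bit; it is only one of the possible chance models and is not necessarily the one used in the paper. It scores a fixed query against many random fingerprints and reports how often chance alone reaches a given Tanimoto value.

```python
import random


def tanimoto(a, b):
    """Bits set in both vectors divided by bits set in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def random_fingerprint(n_bits, p_on, rng):
    """Flip a biased coin for each position (assumed chance model)."""
    return [1 if rng.random() < p_on else 0 for _ in range(n_bits)]


def chance_scores(query, p_on, n_samples, seed=0):
    """Tanimoto scores of the query against random fingerprints."""
    rng = random.Random(seed)
    return [
        tanimoto(query, random_fingerprint(len(query), p_on, rng))
        for _ in range(n_samples)
    ]


def empirical_p_value(scores, observed):
    """Fraction of chance scores at least as high as the observed score."""
    return sum(1 for s in scores if s >= observed) / len(scores)


# Illustrative settings: 512-bit fingerprints, 10 percent of bits on.
query = random_fingerprint(512, 0.1, random.Random(1))
scores = chance_scores(query, p_on=0.1, n_samples=2000)

print("highest chance score:", max(scores))
print("fraction of chance scores >= 0.6:", empirical_p_value(scores, 0.6))
```

Simulation can only resolve probabilities down to about one over the number of samples, which is why a formula for the tail probability is needed to report values such as one in a billion.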
The Tanimoto score is between zero and one, so if, for example, you are trying to do drug design, you search your database with a query. You might find a molecule that has a Tanimoto score of 0.6 in terms of similarity to your query.
You now want to know if this value is highly significant, because if it happened by chance, you shouldn’t study this molecule further. That is what you really want to know as a user.
The key to finding the solution here, [and that was our discovery], is that you can approximate it with a ratio-of-correlated-Gaussians approach. That is what allows you to calculate the probability; say one in a billion or one in two. It’s relatively obvious now but it is something we discovered while thinking about this problem. It is the key to estimating or calculating the probability that what you observe could have happened by chance or not.
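To give a feel for why a ratio of correlated Gaussians appears, note that the Tanimoto score is the intersection count I divided by the union count U, and that both counts are sums over many bit positions. The Python sketch below is a simplified illustration, not the derivation from the paper: it assumes two random fingerprints whose bits are independent and on with the same hypothetical probability `p_on`, treats I and U as jointly Gaussian, and uses the fact that I/U >= t is equivalent to I - t*U >= 0 when U is positive. The function name and parameters are assumptions made for this example.

```python
import math


def tanimoto_tail_probability(threshold, n_bits, p_on):
    """Approximate P(Tanimoto >= threshold) for two random fingerprints.

    Assumes independent bits, each on with probability p_on, and a
    Gaussian approximation for the intersection and union counts.
    """
    p_both = p_on * p_on                 # bit on in both fingerprints
    p_either = 2 * p_on - p_on * p_on    # bit on in at least one

    mean_i = n_bits * p_both
    mean_u = n_bits * p_either
    var_i = n_bits * p_both * (1 - p_both)
    var_u = n_bits * p_either * (1 - p_either)
    # A bit in the intersection is always in the union, so the counts
    # are positively correlated.
    cov_iu = n_bits * p_both * (1 - p_either)

    # D = I - threshold * U is Gaussian under the approximation.
    mean_d = mean_i - threshold * mean_u
    var_d = var_i + threshold ** 2 * var_u - 2 * threshold * cov_iu

    # P(D >= 0) is the upper tail of a standard normal at z.
    z = -mean_d / math.sqrt(var_d)
    return 0.5 * math.erfc(z / math.sqrt(2))


for t in (0.05, 0.1, 0.2, 0.6):
    print(t, tanimoto_tail_probability(t, n_bits=512, p_on=0.1))
```

Gaussian approximations tend to be least accurate far out in the tails, so the very small probabilities this sketch returns should be read as rough orders of magnitude rather than exact values.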
You want to get the Tanimoto score and then at the same time you want to get a degree of confidence … in that score. That is the same as in Blast; it gives you an alignment score but it gives you the E-value, which tells you how likely it is that this could have happened by chance.
Let’s say many years ago a researcher working on tamoxifen, a new molecule at the time, was looking for similar small molecules. That wasn’t possible then, but how could that kind of search play out now?
Suppose you have tamoxifen and you want to search ChemDB or PubChem for molecules that are similar to it. In your search you find a molecule that has a Tanimoto score of 0.55 to your query molecule tamoxifen.
What my system is going to tell you is that there is one chance in tens of billions that this could happen by chance. So I am giving you an E-value or a [kind of] P-value that tells you how likely you are to get this Tanimoto by chance. If I tell you it is one in a billion or one in some astronomical number, then you will think it is highly significant. … Alternatively, if I tell you [that] you have one chance in two that this happened by chance, then you won’t find the result you are retrieving highly significant. A 50:50 chance is pure randomness, so it can’t be very significant.
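The interview does not spell out how the system defines its E-value. In the Blast convention, though, an E-value is roughly the number of hits expected by chance across the whole search space, which for small probabilities is about the per-comparison P-value times the number of entries searched. The Python sketch below assumes that convention; the function name and the database sizes are illustrative and are not the actual sizes of ChemDB or PubChem.

```python
def expected_chance_hits(p_value, database_size):
    """Blast-style E-value sketch: expected number of chance hits.

    Assumes each database entry is an independent comparison with the
    same per-comparison probability p_value of scoring this high.
    """
    return p_value * database_size


# One chance in ten billion per comparison, illustrative database sizes.
for size in (1_000_000, 10_000_000, 100_000_000):
    print(size, expected_chance_hits(1e-10, size))

# One chance in two per comparison: chance hits are everywhere.
print(expected_chance_hits(0.5, 1_000_000))
```

The same Tanimoto score therefore becomes less remarkable as the database grows, which is why the statistics have to be adjusted to the size of the database being searched.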
So why does biology have Blast but chemistry doesn’t have a similar tool?
Biology as a science is very open, so you have these very large repositories like GenBank where everybody is depositing sequences of DNA, proteins. … Blast has been developed as the tool that every biologist and computational bioinformatician around the world uses to search these large repositories. It is sort of the standard and has had tremendous influence in biology.
In chemistry, in small-molecule research, there is no such open environment, with very large databases available on the internet that everybody can search with a tool that everybody more or less agrees on. We are 20, 30 years behind in the world of cheminformatics, at least in my view.
Perhaps researchers, for example in a biotech, do not want people walking around in the space occupied by their precious molecules until they have patents on them?
That’s one of the reasons, but you could say the same for your genes or your genomes, your proteins. … It’s very interesting. Why is chemistry so different from biology? … It goes back to alchemy, I think. In the Middle Ages if you found a recipe to make gold, you were not going to place the recipe [in a place] where your friends could steal it from you. Chemists are very secretive.
We are trying to change that and we have created public databases with millions of molecules. NIH has created PubChem, which is a very large repository. So for the first time we are starting to have large public databases of small molecules. Now we need standardized tools for searching them in the same way we have Blast.
That hasn’t happened yet for a number of reasons. For one, the math or the statistics are not completely in place for small molecules. My paper is an attempt to fill that gap and develop the statistics that allow you to decide when you search and find something whether it is statistically significant or not. … It’s a step toward creating a Blast [for small molecules].
GenBank hasn’t been compiled the same way ChemDB has. Without wanting to sound cranky, isn’t ChemDB biased as a database?
Of course it is biased, because ChemDB or even PubChem contains all chemicals that chemists have synthesized and which you find in the catalogs of vendors. … It is not evolution but it is biased in the sense that these are man-made molecules or molecules that have been found in biological systems. It’s a different kind of bias. GenBank is biased by nature; it has proteins and DNA sequences that nature has found useful. They are both biased in different ways. The statistical theory takes that into account.
What is the next step? When will there be a tool that people can use to query ChemDB or PubChem?
Our system is scalable to any database size … you just need to adjust your statistics, which is very similar to Blast.
We have implemented the algorithms in my lab. Internally, the next step is to put them into ChemDB. Right now, with ChemDB we just have the Tanimoto [score]. We are in the process of [setting up] the next step. So you will be able to search ChemDB on the Web and get these new Blast-like scores. We are implementing it in-house.
If you really want a planetary Blast, to have this tool widely available, you need NIH or some large organization to push it, not a little academic lab. So it will take some time … but we are very excited about this.