A team of researchers has developed an integrated analysis pipeline to coax vital mechanism-of-action information from complex high-content screening data sets — an approach that they believe could clear a significant bottleneck in drug discovery.
In a recent study published in Nature Chemical Biology, current and former researchers at the Novartis Institutes for Biomedical Research and Harvard Medical School described how they combined a methodology dating back to 1904 with automated image analysis, computational tools, and high-content small-molecule screening to gain new insight into the structure-activity relationships of compounds.
This method “gives you another dimension of information to get your mind around what is happening with these compounds,” said Daniel Young, formerly a postdoctoral fellow at Novartis and now an intellectual property technology specialist at Boston-based firm Wolf, Greenfield & Sacks. Using the approach, the information from a high-content experimental screen could now be mined more fully, he said.
Although there are no magic ingredients in this recipe for drug discovery, there are some familiar facets, including compound libraries, databases, target-binding prediction algorithms, and statistical software.
The team merged two techniques — a systems-based approach and a mechanistic approach — to understand compound action, capitalizing on the strengths of each one. The result is a “better profile of what a compound does,” Andreas Bender, a co-author of the study and a former postdoctoral fellow at Novartis who is now assistant professor for cheminformatics and pharmaceutical IT at the Leiden/Amsterdam Center for Drug Research, wrote in an e-mail to BioInform.
The combination merged “a systems-based readout, namely high-content screening, which does not give mechanistic information about compound action, with in silico ligand-target prediction tools, which are able to give precisely this mechanistic explanation of compound action, but on the other hand not the systems response whole-cell screening is capable of,” said Bender.
High-content screening delivers a wealth of data, and approaches such as this one will “maximize the use of such content-rich information,” wrote Paul Lang in an accompanying editorial. The approach should not only be “of tremendous value for discovering new drugs” but could also help find new applications of known drugs, wrote Lang, who directs assay development and molecular pharmacology at the Merck Serono Geneva Research Center.
Although not the first computational method to analyze phenotypic data, this is the first case study in which “complex imaging data from a several-thousand compound screen have been merged with additional databases in order to infer compound mechanism of action,” Lang wrote.
Image-based screening promises to deliver rich insight, he wrote, but the challenge is to select the criteria most relevant to understand a compound’s given mode of action. Lang also noted that “evaluating biological relevance earlier in the discovery process would help reduce the loss of golden nuggets that could be the next billion-dollar molecules.”
Bender and his supervisor Jeremy Jenkins in Novartis’s Lead Discovery Informatics section worked on the computational ligand-prediction side of the project, while Young and researchers Jonathan Hoyt, Elizabeth McWhinnie, Gung-Wei Chirn, and Charles Tao worked under Yan Feng on automated image analysis in Novartis’s developmental and molecular pathways section, and John Tallorico contributed from Novartis’s global discovery chemistry section. Harvard’s Timothy Mitchison provided systems biology insight, said Young.
Screening for Hits
Using a proprietary library, the scientists screened more than 6,000 drug candidates that affect cell proliferation. The treated cells were stained with dyes, imaged, and automatically analyzed with the Cellomics Morphology Explorer algorithm. These types of screens deliver a rich dataset with many parameters and Young said that many research groups pick “the most obvious things to look at” in their image analysis, for example, the amount of DNA in the cell. “We knew there was a lot more information there and we needed a better way to capture that,” he said.
Young recalled hearing about factor analysis during his graduate studies, so he and other team members decided to adapt the approach to this study. Applying factor analysis to the 36 cellular parameters that arose in the imaging analysis, they found that six factors — nuclear size, replication, mitosis, nuclear morphology, 5-ethynyl-2′-deoxyuridine texture, and nuclear ellipticity — were sufficient to describe the biological responses.
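The reduction the team describes — many correlated image descriptors explained by a handful of latent factors — can be illustrated with a minimal sketch using scikit-learn. The data here are synthetic stand-ins, not the study's measurements; the shapes (1,000 wells, 36 parameters, 6 factors) simply mirror the numbers reported above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for per-well image descriptors:
# 1,000 wells x 36 correlated cellular parameters, generated from
# six hidden "biological responses" plus measurement noise.
latent = rng.normal(size=(1000, 6))
loadings = rng.normal(size=(6, 36))
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 36))

# Reduce the 36 measured parameters to six interpretable factors.
fa = FactorAnalysis(n_components=6, random_state=0)
scores = fa.fit_transform(X)     # per-well factor scores, shape (1000, 6)

# The loadings show how strongly each factor drives each raw parameter,
# which is what lets a biologist name the factors (mitosis, nuclear size, ...).
print(scores.shape)
print(fa.components_.shape)      # (6, 36)
```

In practice the factor count and the biological labels come from inspecting the loadings, not from knowing the generative model in advance as this toy example does.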
In a screen of this scale, researchers typically obtain terabytes of image information, explained Young. Next, scientists extract information from those images, essentially converting them into a series of image descriptors that are associated with particular cells identified in the image. “That is really the tier that people struggle with; that is probably at the gigabyte level of screens, but even that is a gigantic amount of data for somebody to deal with,” he said.
Drug-discovery scientists typically pare down that data by selecting one phenotypic criterion, but the Novartis team wanted to avoid that problematic step. By taking that approach, “you are probably bringing it down to the megabyte level, but you are losing an incredible amount of information,” Young said. “With our approach, you are probably taking it down to the megabyte level, but the important point is that you are not losing the information.”
With factor analysis, he explained, important facets are selected in a data-driven way. “You are letting the data say, ‘these are the important things you should be considering.’”
Integrating factor analysis into the automated image analysis and computational validation yielded 211 hits. Bender and Jenkins next applied additional computational tools to determine the chemical similarity of the compounds in question and a structure-based method to predict the targets for these compounds. The WOMBAT (WOrld of Molecular BioAcTivity) database was used as a training knowledge base.
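The two computational steps mentioned here — scoring chemical similarity and inferring targets from an annotated knowledge base — can be sketched in a few lines. This is a hedged, toy illustration: the fingerprints are arbitrary bit sets, the "knowledge base" is invented, and a nearest-neighbor rule stands in for the study's actual structure-based prediction method; only the general idea (similar ligands suggest shared targets) is taken from the text.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprint bit sets:
    shared on-bits divided by the union of on-bits."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Toy annotated knowledge base mapping fingerprints to known targets
# (a stand-in for ligand-target annotations such as those in WOMBAT).
knowledge_base = [
    ({1, 4, 7, 9, 12}, "kinase A"),
    ({1, 4, 7, 9, 15}, "kinase A"),
    ({2, 5, 20, 33},   "tubulin"),
    ({2, 5, 21, 33},   "tubulin"),
]

def predict_target(query_fp: set) -> str:
    """Nearest-neighbor target prediction: assign the target of the
    most chemically similar annotated ligand."""
    return max(knowledge_base, key=lambda entry: tanimoto(query_fp, entry[0]))[1]

# A query fingerprint resembling the "kinase A" ligands is assigned that target.
print(predict_target({1, 4, 7, 10, 12}))  # -> "kinase A"
```

Real pipelines compute fingerprints from molecular structures (e.g., with a cheminformatics toolkit) and use statistical models trained on thousands of annotations rather than a single nearest neighbor, but the similarity-to-target logic is the same.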
Their analysis showed that similar structures did indeed show similar activity in the high-content screen. “This is the chemogenomics paradigm that is beginning to drive drug discovery, and we exploit that all the time now,” said Jenkins.
As Bender explained, drug discovery typically examines how compound activity changes when molecular structure is altered. This view is restricted to a single target at a time. In contrast, with high-content screening, the whole system is queried, not just one target, he said.
In the study, the scientists compared the predicted targets of a compound with the imaged phenotypic readout of the whole cell response. “Quite often, compounds [that] are predicted to hit the same target cause a similar phenotype — this is somewhat the 'ordinary' case,” Bender said. “However, we also observed cases where similar compounds, which are expected to bind to similar targets, create very different phenotypes, and vice versa … different compounds with predicted different targets create a similar response in the HCS readout.”
This is the surprise of the study, explained Young: phenotypes from the image screen, grouped through factor analysis, correlate better with target prediction than with chemical structure. “That is sort of an embodiment of the concept: that the targets are perhaps a better predictor than the structures alone,” he said. Although compounds are structurally different, they may affect the same target, albeit in different ways, perhaps by different pocket-binding mechanisms that are not revealed from structural analysis, he said.
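The comparison Young describes — asking whether predicted-target similarity or raw structural similarity tracks phenotypic similarity more closely — amounts to correlating pairwise similarity matrices. The sketch below uses entirely synthetic matrices constructed so that the "target" matrix tracks the phenotype and the "structure" matrix does not; it shows the mechanics of the comparison, not the study's data or result magnitudes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # hypothetical number of hit compounds

# Synthetic symmetric pairwise similarity matrices.
pheno = rng.uniform(size=(n, n))
pheno = (pheno + pheno.T) / 2                    # phenotypic similarity
target = pheno + 0.1 * rng.normal(size=(n, n))   # built to track phenotype
target = (target + target.T) / 2
struct = rng.uniform(size=(n, n))                # built to be unrelated
struct = (struct + struct.T) / 2

iu = np.triu_indices(n, k=1)  # each compound pair counted once

def corr_with_pheno(sim: np.ndarray) -> float:
    """Pearson correlation between a similarity matrix and the phenotypic
    one, computed over the upper triangle (unique compound pairs)."""
    return float(np.corrcoef(sim[iu], pheno[iu])[0, 1])

print(corr_with_pheno(target))  # high: predicted targets track phenotype
print(corr_with_pheno(struct))  # near zero: structure alone does not
```

With real data, rank-based (Spearman) correlation or a permutation test on the matrix entries would be the more cautious choice, since similarity values are rarely normally distributed.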
While the tools used in this study, for example for ligand-target prediction, were pretty much “standard operating procedure in our group,” as Jenkins explained, the important difference is that the tools were applied to a novel dataset: phenotypic readouts on compounds for which targets were known only in some cases.
“It was kind of a nice black box screen, with a lot of visual data that we could almost synchronize with our target-prediction methods,” Jenkins said. “I never put these matrices together before; compared structure similarity with target similarity,” he added.
“If we compare any two compounds and get a measure of similarity, you might guess they’re doing similar things, but a better description is what they are actually targeting,” he said. “The opportunity here was really to put the two together; when you do that you get some emergent information.”
Just using data mining and statistical methods in cheminformatics, one might not see that molecules with similar structure act differently in the cell, he said. With the added data, scientists can discriminate differences in the activity of compounds with structural similarity.
As Bender explained, high-content screening delivers the “systems readout” by providing information about the signaling networks in the cell. “But you don’t know which targets the compound acts on,” he said. Computational ligand-target prediction gives an “idea why the compound does what it does, but only associates ligands and targets, nothing else, so the systems response is missing.” With this combined method “we know what a compound does in a living system, and also why that is the case given target predictions,” he said.
One crucial part of the approach, said Young, was the integration of factor analysis, a method developed in 1904 by British psychologist and psychometrician Charles Spearman to statistically correlate measures of a quality inherently difficult to measure: intelligence. Spearman’s work helped pave the way for IQ tests.
In this instance, said Young, factor analysis is one of many possible techniques applied to automated image analysis. “What I have found is that a lot of time people doing these things are high-level mathematicians, able to grasp things in a very abstract way,” he said. “What I think factor analysis does is allow you to accomplish the same goal but do it in a way that a biologist, or a systems biologist, somebody who is not as quantitatively or mathematically inclined, can sort of say, ‘I see what you are doing.’ I think that is what has been missing,” he said.
Another essential ingredient, said Young, was a collaborative atmosphere. He noted that he and Bender met at an internal post-doctoral symposium intended to foster cross-disciplinary communication, where they quickly realized their methods showed synergistic promise and so they started to collaborate.
“Drug discovery going forward requires a culture shift in the sense that having chemists, biologists, [and] informaticians all talking [will be necessary] because the problem is so much bigger than any one of the disciplines trained in one area generally,” said Jenkins.
Young explained that the motivating factor behind this work was handling the complex data coming out of high-content screens. “We wanted to come up with some better ways to do that — some more practical ways that could be used in the lab, in the drug discovery environment to sort of really get as much insight as one could get out of a screen and in the most efficient manner,” he said.
Building this process into the pharmaceutical or biotech workflow may be interesting to any number of companies, and the team hopes that the method will be integrated into the drug discovery process at Novartis. As Yan Feng explained, there is generally much interest in harnessing multidimensional data from imaging and phenotypic screens. “Although technically feasible to widely implement it in an organization, I think it is still going to take some time,” he said.