The US National Institutes of Health has found itself on the horns of a dilemma as it prepares to move forward on its Molecular Libraries initiative, one aspect of the interdisciplinary “Roadmap” strategy that it announced last fall. Although the agency is committed to funding a network of high-throughput screening centers and chemical genomics research under this initiative, it is still struggling to find the best model for providing the cheminformatics tools to manage the small-molecule data that these programs will produce.
The problem is of a historical and cultural nature, rather than a scientific or technical one. Bioinformatics had its beginnings in the context of the publicly funded genome sequencing projects, which led to a culture of shared resources. Cheminformatics, however, is a product of the pharmaceutical industry, which is fiercely protective of its intellectual property. While bioinformatics tools were developed with public funds to meet the growth of sequence data, cheminformatics projects — even academic ones — have traditionally been conducted with industry support. Predictably, this model has led to a wealth of freely available bioinformatics software and databases (and a shaky commercial market for these tools), and few — if any — publicly available cheminformatics resources (but a relatively well-established commercial software sector).
As the NIH moves into the high-throughput screening space traditionally inhabited by drug companies, it must defend against the criticism that it is undercutting the business models of these commercial cheminformatics firms, said Chris Austin, senior advisor for translational research at the National Human Genome Research Institute. “The whole objective here is to facilitate the use of small molecules in public-sector and private research,” he said. “Our objective here is not to be competitive with anybody. The government’s role is to be enabling to everybody, so we don’t want to compete with the private sector, whether on informatics or drug development or anything else. Our role is to provide enabling research tools and technologies to the community.”
The problem, Austin said, is that the cheminformatics community “has not had a history of making these things publicly available,” so NIH is still debating whether it can devise a financial means to make these existing cheminformatics tools available to the broader research community, or whether it will have to support a development effort to re-write similar resources for public-sector distribution. “That’s still an open question,” Austin said.
“To the degree that good cheminformatics tools have been developed already, and would be available to academic and non-profit researchers at reasonable cost, we don’t want to redo what’s already been done. That’s not a good use of taxpayer dollars,” Austin said. “On the other hand, if access to these tools can’t be obtained under reasonable financial terms, or if we discover as we get into this that we have needs that aren’t met by currently available packages, then yes, there will be very much a need for public-sector development of tools.”
Austin said that NIH is in discussions with several private-sector cheminformatics firms to evaluate the terms of their current licensing strategies and determine whether — and how — they might fit within the needs of the publicly funded effort. In addition to the lower cost requirement for academic research, he said, the agency will also consider whether the source code can be made available. In some cases, “it could be that what we really need is to make the underlying code available,” which might rule out some commercial packages.
The agency is not rushing into a decision, Austin said: “We’re trying to carefully look at what’s out there already in terms of commercial packages.”
First Step: Data
Even as the NIH ponders its role in funding cheminformatics software development, some public-sector projects are already underway on the database side of the informatics equation.
Last week, NCBI hosted the first advisory committee meeting for PubChem — a publicly available database that will contain chemical structures and biological data for small molecules tested by a network of NIH-funded high-throughput screening centers that are expected to begin operations in late 2005. PubChem will help integrate and unify several chemical databases at the National Library of Medicine, Austin said.
In addition, Harvard University is developing another database, called ChemBank, supported by a $40 million grant from the National Cancer Institute. Stuart Schreiber, chair of the chemistry and chemical biology department at Harvard, is principal investigator on the project, which is still “at an early stage,” according to the database website. ChemBank currently contains chemical structures and biological activity data for more than 2,000 bioactive small molecules.
Another new effort, out of the department of pharmaceutical chemistry at the University of California, San Francisco, called ZINC (for ZINC is Not Commercial), is specially designed for docking and virtual screening. The free resource, which came online earlier this year, is based on information from the catalogs of eight compound suppliers, including Sigma-Aldrich, Specs, and ChemBridge. John Irwin, a research scientist in Brian Shoichet’s group at UCSF and a co-developer of the database, said the UCSF team plans to double the number of catalog suppliers over the next year, and increase the size of the database from its current 700,000 molecules to more than 7 million.
Irwin said that the database is only “step one of our master plan, which is to offer a free virtual screening service — basically to make virtual screening as easy to use as Blast is in the sequence alignment and sequence database searching world.” Irwin and his colleagues are developing virtual screening software called DockBlaster that will sit on top of the database and allow users to dock their molecules on the UCSF servers.
The project grew out of UCSF’s popular DOCK algorithm, Irwin said. “People were asking for the software, but they were also saying, ‘We’ve got your program, but now what do we do, because the chemical databases are too expensive for us to buy.’” Irwin said that the free database was developed “in the spirit of Linux and free software tools, and because we wanted to get these people off our backs.” So far, 35 research groups have downloaded some or all of the database, he said.
The molecules available in ZINC are the same as those found through MDL’s Available Chemicals Directory and other commercial resources, Irwin said. “We’re not claiming that we have a one-to-one overlap with MDL, but we’re using the same compound suppliers. Our goal will certainly be to have exactly the same compounds that are available through MDL, maybe a 90 percent overlap.” Irwin noted that the university doesn’t see itself as cutting into MDL’s business model, however, because “MDL serves the corporate clients, and the people who can really afford to pay get a beautiful quality database professionally managed by MDL, it goes into Oracle, and it’s got this level of care that a company would expect if they paid for it, whereas we’re offering an option for academics who weren’t going to buy MDL anyway.”
NHGRI’s Austin agreed that even if the public sector ramps up its cheminformatics development, the future may not be all that dire for commercial software companies in the space. While he acknowledged that a large number of private bioinformatics companies have folded or changed their business models over the last few years, Austin noted, “I don’t think that’s going to happen with cheminformatics, because I think there are some things that the pharma community is going to need to do that are different from what we want to do.”
Public-Sector Small-Molecule Databases
- ChemBank (Harvard): http://chembank.med.harvard.edu
- Klotho Biochemical Compounds Database (University of Missouri): http://www.biocheminfo.org/klotho/
- National Cancer Institute 3D Structure Database (NCI): http://dtp.nci.nih.gov/docs/3d_database/dis3d.html
- PDSP Drug Database (Case Western Reserve University): http://kidb.bioc.cwru.edu/pdsp.php
- Pharmabase (Woods Hole Marine Biological Laboratory): http://zeus.mbl.edu/public/BRC/subj.php?func=explode&myID=181
- PubChem (NCBI): No URL available yet
- ZINC (UCSF): http://blaster.docking.org/zinc/