Researchers from Washington University School of Medicine in St Louis have developed a free resource called the Drug Gene Interaction database, or DGIdb, that aggregates data on drug-gene interactions from multiple scientific databases and published literature.
The database provides a single resource accessible through a user-friendly interface that makes it easier for researchers and clinician scientists to search for targeted therapies and potentially druggable genes.
DGIdb, which was developed over two years with a grant from the National Institutes of Health's National Human Genome Research Institute, has information gleaned from about 15 different sources including repositories like DrugBank, PharmGKB, the Gene Ontology, the Cancer Commons, and My Cancer Genome. It uses a combination of manual curation and automated parsing that has pulled and linked information such as drug and gene names, other identifiers, and metadata.
The database, which is intended for research use only and not for making treatment decisions, has over 14,000 drug-gene interactions involving 2,600 genes and 6,300 drugs, as well as 6,700 genes that are potential drug targets. These are primarily cancer-associated interactions, but the database also has information related to Alzheimer's, heart disease, and diabetes, among other ailments. Also, interactions in DGIdb are linked back to their source material so that users can access additional information that the database does not provide.
DGIdb's development was led by identical twin brothers, Obi and Malachi Griffith. The brothers said that the initial idea for creating the resource grew out of having to handle repeated requests from colleagues about whether lists of genes identified through cancer genome sequencing could be targeted with existing drugs. They discovered that many existing resources contained incomplete data, meaning that researchers had to search multiple sources to obtain the most accurate interaction information. In some cases, accessing the data proved problematic for basic researchers and physician scientists who would have to rely on informatics experts to write bits of code to extract information from things like spreadsheets.
The temporary solution to that problem, the Griffiths told BioInform, was to develop ad hoc scripts that could pull and summarize information from multiple sites and sources. But, ultimately, they wanted to create a simple, robust tool that researchers could use to search for the information they wanted themselves. The program would automate the data search process and provide a more comprehensive first look at the drug-gene interaction space by making all the relevant information available via a single source: "something along the lines of a Google search engine for disease genes," Malachi Griffith, a research instructor in genetics in WUSTL's Genome Institute, explained in a statement.
According to a Nature Methods paper published last week that describes the database's development and an application to data collected from 1,273 breast cancer patients, data in DGIdb is stored in a Postgres database that has a simple web interface through which users can input genes, apply filters, and obtain results that can be exported as tab-delimited text files. It also has an application programming interface that researchers can use to incorporate DGIdb into their existing analysis pipelines.
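As a rough illustration of how a tab-delimited export of this kind might be consumed downstream, the following Python sketch groups drugs by gene. The column names (`gene`, `drug`, `interaction_type`, `source`) and the sample rows are placeholders invented for this example; they are not DGIdb's actual export format, which the paper and the site itself define.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export text. The header names and every value below are
# placeholders for illustration only, not DGIdb's real column layout or data.
SAMPLE_EXPORT = (
    "gene\tdrug\tinteraction_type\tsource\n"
    "GENE_A\tDRUG_X\tinhibitor\tSourceOne\n"
    "GENE_A\tDRUG_Y\tinhibitor\tSourceTwo\n"
    "GENE_B\tDRUG_X\tantagonist\tSourceOne\n"
)


def drugs_by_gene(tsv_text):
    """Group drug names by gene from tab-delimited text that has a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["gene"]].append(row["drug"])
    return dict(grouped)


if __name__ == "__main__":
    for gene, drugs in drugs_by_gene(SAMPLE_EXPORT).items():
        print(gene, ", ".join(drugs))
```

In a real pipeline the text would come from a downloaded file rather than an inline string, and the column names would need to match whatever header the export actually carries.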
DGIdb organizes data into two categories: genes with known drug-gene interactions mined from scientific literature and databases, and genes that aren't targets of current therapies but have the potential to become targets because they belong to categories of genes that are associated with druggability, such as kinases. Users can search for interactions by gene names or by categories like clinical actionability, drug metabolism, kinases, and hormone activity. They can search for information on single genes or large lists of genes, and they can also browse the lists of potentially druggable genes without inputting any specific search criteria.
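The two-category organization described above can be sketched in a few lines of Python. This is a minimal illustration, not DGIdb's implementation: the function name `classify_genes`, the dictionary layout, and the gene and drug names are all assumptions made for the example.

```python
def classify_genes(query_genes, known_interactions, druggable_categories):
    """For each queried gene, list its known drugs and its druggable categories.

    known_interactions: dict mapping gene -> set of drug names (assumed layout)
    druggable_categories: dict mapping category name -> set of genes (assumed layout)
    """
    report = {}
    for gene in query_genes:
        drugs = sorted(known_interactions.get(gene, set()))
        categories = sorted(
            name for name, members in druggable_categories.items() if gene in members
        )
        report[gene] = {"drugs": drugs, "categories": categories}
    return report


# Placeholder data for illustration only; these are not real genes or drugs.
known = {"GENE_A": {"DRUG_X"}}
categories = {"kinase": {"GENE_A", "GENE_B"}, "hormone activity": {"GENE_C"}}

report = classify_genes(["GENE_A", "GENE_B", "GENE_D"], known, categories)
for gene, entry in report.items():
    print(gene, entry)
```

Here `GENE_A` stands for a gene with a known interaction, `GENE_B` for one that is potentially druggable but has no recorded drug, and `GENE_D` for one that appears in neither category.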
Having access to this sort of information is crucial as "we move toward personalized medicine," Malachi Griffith noted, because "there's a lot of interest in knowing whether drugs can target mutated genes in particular patients or in certain diseases, like breast or lung cancer."
DGIdb also provides an opportunity to survey how far the targeted therapy field has come and to identify gaps in current knowledge that still need to be plugged. According to statistics included in the Nature Methods paper, of the genes categorized as potentially druggable only 25.2 percent have a known drug-gene interaction and 5.8 percent are targeted by an anti-neoplastic agent. Also, "despite the tremendous interest in kinases as potential drug targets, 68.3 percent remain untargeted," the researchers wrote.
The database also provides a clearer picture of what interaction information has been verified by the community and what might be fodder for a new hypothesis, Obi Griffith, a research assistant professor of medicine in WUSTL's Genome Institute, pointed out. By looking at the agreement across different sources that have overlapping data, "you start to get this [interaction] picture that seems to be well supported across the diversity of sources of information [as well as] other things that are quite unique to one or two sources," he said. "One can infer that perhaps that is correlated with how solid that information is versus how speculative it might be. … It's useful to have information on how much agreement there is between sources so you can get a sense of how risky your hypothesis is."
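One simple way to picture the source-agreement idea is to count how many distinct sources report each drug-gene pair. The Python sketch below does that under stated assumptions: the record layout `(gene, drug, source)`, the function names, the `min_sources` threshold of two, and the sample records are all hypothetical and are not taken from DGIdb or the paper.

```python
from collections import defaultdict


def source_support(records):
    """Map each (gene, drug) pair to the sorted list of distinct sources reporting it."""
    support = defaultdict(set)
    for gene, drug, source in records:
        support[(gene, drug)].add(source)
    return {pair: sorted(sources) for pair, sources in support.items()}


def split_by_agreement(support, min_sources=2):
    """Separate pairs backed by at least min_sources sources from the rest.

    The threshold is an arbitrary illustrative choice, not a DGIdb rule.
    """
    well_supported = {p: s for p, s in support.items() if len(s) >= min_sources}
    sparse = {p: s for p, s in support.items() if len(s) < min_sources}
    return well_supported, sparse


# Placeholder records for illustration only.
records = [
    ("GENE_A", "DRUG_X", "SourceOne"),
    ("GENE_A", "DRUG_X", "SourceTwo"),
    ("GENE_A", "DRUG_X", "SourceOne"),  # a repeated report from the same source
    ("GENE_B", "DRUG_Y", "SourceThree"),
]

support = source_support(records)
well_supported, sparse = split_by_agreement(support)
print("well supported:", well_supported)
print("single source:", sparse)
```

Counting sources is only a crude proxy, since sources may draw on one another, but it mirrors the intuition in the quote: pairs seen in several places look more established, and pairs seen in one place look more speculative.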
All of the code used to develop DGIdb is currently available on GitHub. Also included are instructions for researchers who want to develop their own instances of the database that are adapted to address questions they are interested in. For example, a group could reuse the code to create a database that provides access to genes that are known tumor suppressors or one that aggregates data on genes that are involved in regulation. They would have to mine the relevant literature and available repositories themselves to populate the database, but would not have to worry about developing new infrastructure.
The brothers Griffith and their colleagues plan to explore some new questions themselves, they told BioInform. They're also working on making future versions of DGIdb more interactive by creating mechanisms through which members of the scientific community can identify problems and suggest corrections to the information in DGIdb as well as submit their own data, according to Obi Griffith. Making DGIdb more interactive also gives the WUSTL team access to the largely untapped wealth of information that exists in the brains of scientists who have studied the druggable genome at great length and could make valuable contributions, the brothers said.
Additionally, the team is exploring new sources of information, one of which is the National Library of Medicine's clinical trials database, which has records of over 153,000 studies in all 50 US states and in 185 countries. Obi Griffith said that the team's approach to the task will likely use a combination of machine learning algorithms and crowdsourcing.
This particular project is of special interest to the group because "a lot of new clinical trials now are being designed with an emphasis on … defining cohorts of patients that have [or do not have] a particular mutation," and decisions about treatment strategies are influenced by this information, Malachi Griffith said. However, because this information isn't stored in a structured way in the clinical trials database "it's very difficult to actually get this information out of there," he said. Making it available via DGIdb could help ease the burden of access, the developers say.
Other resources up for consideration for future inclusion in DGIdb, according to the Nature Methods paper, include the ChEMBL database, the Comparative Toxicogenomics Database, and commercial sources of information such as Thomson Reuters MetaDrug and NextBio Research's PharmacoAtlas database.
The team is also looking into including "empirical drug-gene association mapping based on compound screening datasets such as ConnectivityMap, BindingDB, the Sanger Institute's Genomics of Drug Sensitivity in Cancer, and Broad Institute's Cancer Cell Line Encyclopedia," the paper states. Other areas for improvement, the researchers wrote, "include capturing information regarding genes that mediate adverse responses and pharmacogenetic relationships."