NEW YORK (GenomeWeb) – Researchers at the National Institutes of Health's National Institute of Diabetes and Digestive and Kidney Diseases, the Broad Institute, University of Oxford, and elsewhere have released the first full version of the Type 2 Diabetes (T2D) Knowledge Portal, an online repository and discovery engine that offers search functionalities and comprehensive datasets related to type 2 diabetes.
The collaborators launched a beta version of the portal, built by researchers at the Broad Institute, last year, hoping to get feedback from early users about what worked in the platform and what features they would like to see in future releases. "We've used that to make a large number of improvements over the last year," said Philip Smith, co-chair of the type 2 diabetes arm of the Accelerating Medicines Partnership (AMP T2D). The Accelerating Medicines Partnership is a five-year public-private partnership between the NIH, the US Food and Drug Administration, biopharma companies, and academic institutions that aims to identify and validate biological targets for new therapeutics.
Those improvements include doubling the amount of data available from patients with diabetes. In the initial release, the database contained about 13,000 exomes, while the updated system now holds data from roughly 26,000 diabetes patients' exomes. It also includes additional datasets from patients with other metabolic diseases, patients with neurological conditions such as schizophrenia, and healthy individuals. In total, the repository contains data from some 300,000 exome sequencing chips, according to Smith, including exomes gleaned from the Exome Aggregation Consortium. It also has datasets from several genome-wide association studies as well as some information from dbGaP.
Within the next year, the researchers plan to once again double the amount of data available in the repository, and they will continue to add more datasets after that, he said. Members of the community are also encouraged to submit relevant datasets from their internal projects to the database. Potential submitters have to agree to make the data freely available to the community with no strings attached, Smith said. As an incentive to share, users who agree to include their data in the repository will have six months to analyze their data in the context of other datasets in the knowledgebase before the information is released to the community.
Also available in this release are a number of customizable navigation tools for running research queries and visualizing their results. Users can search for information by gene, genetic variant, and region, as well as access summaries of genetic variants. The tools let users segment populations using genetic traits as criteria. Users can also zoom in on specific regions of the genome that are known hot spots for specific traits and compare these regions across healthy individuals and patients with other health conditions, and they can tailor searches to get more fine-grained answers to their queries.
Sample queries include whether or not a particular gene or variant known to be associated with diabetes is associated with additional traits such as weight gain or waist circumference. A user could also search for whether or not mutations that truncate gene function in some way are associated with particular phenotypes in the sample population.
"We've built in a lot of infrastructure software that allows this to actually operate on the entire dataset in real time and provide you answers," Smith said. Also, anyone with a Google account can now query detailed data from the portal — previously only approved researchers could access that content. The database is hosted on Amazon Web Services, but the underlying infrastructure is flexible enough to be deployed on multiple clouds systems, according to Smith.
Additional functionality currently in development includes a tool for incorporating expression data from sources such as the Genotype-Tissue Expression project. Later this year, the researchers plan to roll out a federated node for the portal that will be hosted in Europe, so that users can query data that is stored in multiple locations and cannot be shared across national borders, for example, or cannot be shared for privacy reasons, Smith said. That node is being built and will be maintained by researchers at the European Bioinformatics Institute. The AMP T2D team has already begun testing the node using existing datasets in the portal and some newer datasets from the EBI. They hope to have it online sometime in the fall, Smith said.
Type 2 diabetes is one of three disease areas that the AMP consortium chose as pilot projects. The rationale for choosing type 2 diabetes is "primarily because of the real wealth of genetic data on diabetes risk that might lead to the identification of new targets," Smith explained to GenomeWeb. Since members of the diabetes research community were willing to share information with each other, the most logical step was to combine all of the disparate genotype, phenotype, and clinical datasets into a single source accessible to anyone in the community and simultaneously develop analysis tools for querying and mining that data.
AMP T2D is funded by two NIDDK grants to the Broad Institute of MIT and Harvard and a grant from the Foundation for the National Institutes of Health to the University of Michigan to support the portal infrastructure and the expansion of analytical and visualization tools. In total, the project is expected to garner about $40 million in funding, according to Smith, half of which will come from industry and the other half from the NIH. But, he added, there are funded ancillary programs associated with AMP T2D, so the final funding amount could be much higher.
The T2D group hopes the knowledgebase will benefit a wide range of users, from researchers in pharmaceutical companies to high school students. It could even help clinicians make better treatment decisions for diabetic patients and serve as a valuable resource for individuals seeking additional information on variants and traits highlighted in results from direct-to-consumer companies, according to Smith. "[These] kind of tools democratize this data, for the first time in taking it out of the hands of data generators or informaticists" and opening it up to "a very broad user base," he said.