NEW YORK (GenomeWeb) – Researchers from the University of Illinois at Urbana-Champaign will use their portion of the National Institutes of Health's Big Data to Knowledge funding to develop a cloud-based solution that supports integrated analysis of different kinds of genomic information and scales as these biomedical datasets grow.
The scientists, according to their grant abstract, will establish a Center of Excellence at UIUC's Institute for Genomic Biology and build a cloud-based solution called the Knowledge Engine for Genomics (KnowEng), a tool that will help researchers bring the "full breadth" of publicly available knowledge about genes and gene function to bear on their research projects, and will provide scalable infrastructure to support their analysis needs. This integrated genomic environment will also allow scientists and medical practitioners to add their own datasets to the engine and explore models that are generated by incorporating their data into the existing knowledgebase.
Central to UIUC's proposed infrastructure is the concept of "community knowledge-guided analysis of user data," Saurabh Sinha, a UIUC associate professor of computer science and one of the principal investigators of the NIH grant, told BioInform. That essentially means connecting biomedical researchers to analysis tools and large heterogeneous data resources that could help them extract useful patterns and gain deeper insights into genetic activity than would be possible if they analyzed their data in isolation.
The BD2K funds will support UIUC's efforts to expand on that paradigm, according to Sinha, by addressing the associated data integration challenges and the need for scalable analysis, including access to affordable computing resources that can handle large quantities of data and scale as needed. Specific project steps listed in the UIUC grant abstract include building a single network composed of community-accumulated knowledge on genes, their functions, and interactions between genes and proteins.
The researchers also intend to develop computational methods that will allow users to analyze their data in the context of rich third-party information and to make these methods available as scalable software components that can be deployed on public or private clouds. Specifically, the KnowEng system, according to its developers, will use data mining and machine learning techniques to extract and combine gene function and gene interaction information from multiple repositories and databases. Users will also have a mechanism for uploading their own internally generated datasets to the system in order to analyze them in the context of the much larger pool of community-generated information. The system will also include a simple-to-use interface based on the HUBzero toolkit, through which users will be able to interact with and use the tools and data.
As part of the grant, Sinha and his colleagues will work with internal and external collaborators to test-drive and improve KnowEng in three pilot projects. One of these, run with researchers at the Mayo Clinic, extends an existing collaboration between the two institutions. This project is exploring the pharmacogenomics of breast cancer and assessing genomic information from cell lines and tumor samples to better understand patients' responses to treatment.
Two other test projects will be run by researchers at UIUC. One of these, led by Lisa Stubbs, a UIUC professor of cell and developmental biology, explores the molecular basis of behavioral patterns in animals using genomic data collected from the brains of mice, fish, and bees. The third project, led by William Metcalf, a professor of molecular and cellular biology and microbiology, focuses on novel antibiotic discovery. Specifically, researchers here are using genomic data to predict microorganisms' capacity to synthesize biologically active compounds.
The UIUC grant is part of a larger $32 million investment from the NIH in strategies and solutions that will help members of the biomedical community analyze and make use of the complex datasets generated as part of federally funded projects. UIUC is one of several institutions that the agency tapped to establish Centers of Excellence, each of which will tackle one of a series of data science challenges. One such center is being set up at the University of California, Santa Cruz. Researchers there, along with collaborators at other institutions and centers, are working on standardized protocols and tools for handling, sharing, and analyzing both genomic and clinical data (see related story this issue).