NEW YORK (GenomeWeb) – Multiple informatics researchers have been selected to receive a share of $21 million in grant funding from the Gordon and Betty Moore Foundation, which they will use to develop innovative solutions for analyzing, sharing, and using omic datasets.
The Moore Foundation selected a total of 14 investigators from various disciplines, each of whom will receive $1.5 million in grant funding over five years. The awards are part of the larger $60 million Data-Driven Discovery Initiative, an effort within the foundation's Science Program that aims to enable new types of scientific breakthroughs by supporting interdisciplinary, data-driven researchers.
Award winners for this year's grants come from the University of California, Davis; University of California, Berkeley; Dartmouth College; University of Washington; Carnegie Mellon University; Stanford University; University of Texas Southwestern Medical Center; Princeton University; University of Chicago; North Carolina State University; University of Illinois; and University of Florida.
One of the winning researchers is C. Titus Brown, currently an assistant professor of bioinformatics at Michigan State University. He'll be moving to UC Davis in January to lead the institution's Laboratory of Data Intensive Biology and will also serve as a visiting associate professor of population health and reproduction.
Brown explained to BioInform that he'll use the award to develop cloud-based data analysis tools and interfaces that will enable users to share their raw or analyzed sequence data, as well as infrastructure for mining and querying datasets stored in disparate locations; he describes the proposal in more detail here.
Brown will also focus on providing these solutions for non-model mRNA-seq, environmental metagenomics, and semi-model genomics, which involves organisms, such as the horse, chicken, and cow, whose genomes aren't yet well defined. He is exploring open source infrastructure such as Galaxy and the iPlant consortium's platform, as well as other kinds of workflow and data management software, and is considering multiple cloud infrastructure options including Amazon, Rackspace, and OpenStack.
The infrastructure would provide tools and protocols for analyzing and exploring data including assembly, homology assignment, differential expression, and phylogeny analysis; a mechanism for integrating and mining large sequence datasets; and a distributed graph database system that would enable researchers to connect and explore patterns in disparate datasets. Brown and colleagues first discussed some details about the proposed infrastructure in a 2009 paper. Since it's cloud-based, users would simply spin up their own server instances; upload, analyze, and share their own data; and access servers where datasets from other researchers — who've agreed to share their data — reside.
The planned system, he said, would hopefully make sharing data a far less painful prospect and would give researchers an incentive to share. Under the current system, few such incentives exist, and the unfortunate upshot is that data is either hoarded, with just enough information made available post-publication to reproduce the analysis per funding agency requirements, or released in an unusable format.
With his system, Brown explained, researchers willing to share their raw data would not only have access to the available tools but would also be connected to colleagues who have complementary datasets, have likewise consented to use the system, and could help interpret their sequences, providing an immediate return for sharing. In addition, researchers who use results from unpublished datasets will have citation information to include in their publications, a second benefit that will hopefully entice researchers to be more open with their data.
Brown has two pilot projects in mind for his first steps and is currently hiring postdocs to work with him. One of these will involve data from the Woods Hole Oceanographic Institution's DeepDOM cruise. Researchers performed multi-omic sampling at multiple sites in the Atlantic Ocean and sent the samples to the Joint Genome Institute for sequencing. The project is generating various kinds of datasets, making it an ideal testbed for the sort of system Brown is proposing.
"We have a pretty good idea how to analyze metabolomic, metagenomic, [and] transcriptomic data in isolation, but what we don't know how to do is cross-correlate between the datasets to pull up patterns that we wouldn't necessarily see without being able to correlate between the different kinds of data," he explained to BioInform.
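In code, the kind of cross-dataset correlation Brown describes can be sketched very simply: line up measurements from two omic layers taken at the same sampling sites and look for profiles that track each other. Everything below is invented for illustration; a real multi-omic analysis would also involve normalization, batch correction, and multiple-testing control.

```python
# Toy sketch: cross-correlating two omic layers measured at the same sites.
import math

def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Abundance profiles across five sampling sites (hypothetical values):
metabolites = {"M1": [0.1, 0.9, 0.4, 0.8, 0.2]}
transcripts = {"geneA": [0.2, 1.0, 0.5, 0.9, 0.1],
               "geneB": [0.9, 0.1, 0.6, 0.2, 0.8]}

# Report metabolite-transcript pairs whose profiles track each other.
for m, mprof in metabolites.items():
    for g, gprof in transcripts.items():
        r = pearson(mprof, gprof)
        if abs(r) > 0.7:
            print(f"{m} ~ {g}: r = {r:.2f}")
```

Patterns like these, invisible within any single dataset, only emerge once the layers are joined on a shared axis such as sampling site.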
Data mining and cellular systems
Meantime, Casey Greene, an assistant professor of genetics at Dartmouth's Geisel School of Medicine, intends to use his portion of the grant funding to develop data-mining techniques and web servers that will help researchers explore relationships in publicly available genomic data. Specifically, his lab will build on existing methods to combine multiple genome-wide association datasets and identify complex gene-gene interactions associated with disease, he explained to BioInform.
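The gene-gene interaction problem can be pictured with a toy scan (an illustration only, not Greene's actual method): flag a pair of SNPs when their joint genotype predicts case/control status better than either SNP alone, here scored with an interaction-information heuristic. All SNP names and genotypes below are invented.

```python
# Toy epistasis scan: flag SNP pairs whose joint genotype is more
# informative about case/control status than the two SNPs separately.
import math
from collections import Counter
from itertools import combinations

def mutual_info(xs, ys):
    """Mutual information I(X;Y) in bits from paired observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical genotypes (minor-allele counts) for eight subjects, and a
# phenotype that depends on snp1 XOR snp2 -- a pure pairwise interaction
# that neither SNP reveals on its own.
genotypes = {
    "snp1": [0, 0, 1, 1, 0, 0, 1, 1],
    "snp2": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp3": [2, 0, 1, 2, 0, 1, 2, 0],
}
phenotype = [a != b for a, b in zip(genotypes["snp1"], genotypes["snp2"])]

for s1, s2 in combinations(genotypes, 2):
    joint = list(zip(genotypes[s1], genotypes[s2]))
    gain = (mutual_info(joint, phenotype)
            - mutual_info(genotypes[s1], phenotype)
            - mutual_info(genotypes[s2], phenotype))
    print(f"{s1} x {s2}: interaction gain = {gain:.2f} bits")
```

The snp1 x snp2 pair scores highly even though each SNP is individually uninformative, which is exactly the kind of signal that single-marker GWAS analyses miss.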
Greene began studying these sorts of disease-linked gene-gene interactions as a graduate student at Dartmouth and continued as a postdoctoral researcher at Princeton. Historically, he and his colleagues have developed and used supervised machine learning methods in their research, but they are also exploring the potential of unsupervised machine learning approaches for studying gene expression data. One such technique that's shown promise, he said, is the denoising autoencoder. He's presenting a paper in January at the Pacific Symposium on Biocomputing on applying this method to breast cancer datasets.
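The core idea of a denoising autoencoder is to corrupt the input and train the network to reconstruct the clean version, forcing it to learn features that are robust to noise. A minimal NumPy sketch on synthetic "expression" data might look like the following; this illustrates the general technique, not the model in Greene's paper, and all dimensions and parameters are invented.

```python
# Minimal denoising autoencoder in NumPy (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic "expression" matrix: two latent sample groups, 20 features.
n, d, h = 200, 20, 4
groups = rng.integers(0, 2, size=n)
X = np.clip(0.2 + 0.6 * groups[:, None] * (np.arange(d) < 10)
            + 0.05 * rng.standard_normal((n, d)), 0, 1)

# Untied weights: encoder (W1, b1) and decoder (W2, b2).
W1 = 0.1 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, d)); b2 = np.zeros(d)

lr = 1.0
for epoch in range(1000):
    noisy = X * (rng.random(X.shape) > 0.2)   # mask 20% of entries
    H = sigmoid(noisy @ W1 + b1)              # encode the CORRUPTED input
    Y = sigmoid(H @ W2 + b2)                  # decode / reconstruct
    err = Y - X                               # target is the CLEAN input
    # Backpropagate the mean squared reconstruction error.
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY / n; b2 -= lr * dY.mean(axis=0)
    W1 -= lr * noisy.T @ dH / n; b1 -= lr * dH.mean(axis=0)

recon = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("reconstruction MSE:", float(np.mean((recon - X) ** 2)))
```

Because the network must fill in masked values from what remains, the small hidden layer is pushed toward compact features, such as the two sample groups here, rather than memorizing individual measurements.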
Now, armed with the grant funds, Greene intends to create unsupervised methods "that do a better job of integrating publicly available data [even] with all the problems that these data have [such as] batch effects and lab effects, [and] pull important biological features out of it," he said. Furthermore, "once we develop these methods, we want to put them into every molecular biology laboratory," he continued. Part of the funds will therefore go towards developing web servers that make it easier for researchers to access and use the newly developed tools to explore publicly available data and to provide feedback to improve the methods. Greene's team will initially work with gene expression data from ArrayExpress, but the goal is eventually to apply the methods to a much broader spectrum of omics data types, including genomic, proteomic, and methylation data, he said.
Meanwhile, Kimberly Reynolds, an assistant professor in UT Southwestern's Center for Systems Biology, and her team will try to identify general rules that govern how cellular systems are constructed and operate, experimentally test their observations in cell models, and develop broadly available open source software encapsulating the statistical approaches they use for the project. "[We'll] conduct a comparison across genomes to try to find some statistical rules for how things are put together … using a few different measures of conservation and correlation across species," she explained to BioInform.
Reynolds' interest in the mechanics of cellular systems dates back to her days as a biophysics graduate student, when she studied protein structure and design, and to her postdoc, when she used statistical analysis to understand how proteins work. Her own lab builds on that postdoc experience, using statistical methods to try to understand how not just proteins but entire cellular systems are designed and constructed.
For this DDD grant, Reynolds and her team will start with bacterial genomes, both because a large number of strains have been sequenced, which provides good fodder for statistical analyses, and because bacteria make good models for high-throughput experimental testing. However, the goal is to make tools that are also applicable to eukaryotic systems, she said. For their first steps, the UT Southwestern researchers are defining genome models using measures such as gene presence or absence, and exploring well-defined model systems that they can use to test their ideas experimentally; for instance, they are interested in folate metabolism and the bacterial flagellum as potential model systems, she said.
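A gene presence/absence comparison of the kind Reynolds describes can be illustrated with a toy phylogenetic-profiling scan: genes whose presence/absence patterns match across sequenced strains are candidates for acting together, for example in the same pathway. The gene names and strain profiles below are invented.

```python
# Toy phylogenetic profiling over gene presence/absence patterns.
from itertools import combinations

# Presence (1) / absence (0) of each gene across six bacterial strains.
profiles = {
    "folA": [1, 1, 0, 1, 0, 1],
    "folC": [1, 1, 0, 1, 0, 1],   # co-occurs perfectly with folA
    "fliC": [0, 1, 1, 0, 1, 0],
    "fliD": [0, 1, 1, 0, 1, 1],   # nearly matches fliC
    "recA": [1, 1, 1, 1, 1, 1],   # present everywhere: uninformative
}

def hamming(p, q):
    """Number of strains in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

# Rank gene pairs by profile similarity (fewer mismatches = stronger link).
pairs = sorted(combinations(profiles, 2),
               key=lambda gg: hamming(profiles[gg[0]], profiles[gg[1]]))
for g1, g2 in pairs[:3]:
    d = hamming(profiles[g1], profiles[g2])
    print(f"{g1} - {g2}: {d} mismatches across 6 strains")
```

Note that a ubiquitous gene like recA matches everything fairly well by accident, which is why real analyses weight such measures by how informative each profile is and correct for shared phylogenetic history.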