NEW YORK (GenomeWeb) – Researchers at the Buffalo Institute for Genomics and Data Analytics (BIG) at the University at Buffalo have begun offering services based on the Genomics Data Warehouse, a tool that they developed to help researchers in academia and industry sift through, query, and analyze large quantities of digital information.
The Genomics Data Warehouse, which was developed in collaboration with UB's genomics core, provides a database and search technology for efficiently storing and querying genomic and other kinds of data. When they began building it a little over a year ago, "we didn't have a way of cataloguing any of the sequencing that was done here or have any type of structure to put to the data," Adrian Levesque, a senior programmer/analyst with UB's Center for Computational Research and project manager for the warehouse, told GenomeWeb. Researchers' data was stored on the university's cluster but they needed significant computing expertise in order to access and use their information. "We wanted a very user-friendly method where [the researcher] doesn't have to be very computationally savvy to actually work with their data."
This first iteration of the solution builds on resources from the BioMart project, which provides free software and data services for scientific collaboration and discovery. Data within the warehouse is structured in a MySQL database, and the system draws on UB's cloud compute infrastructure to allow researchers to quickly query data.
The next version, however, will use the Elasticsearch open-source search engine on the backend, which will offer a lot more flexibility and support even faster searches, Levesque said. That iteration of the warehouse will use a non-structured database that will make it possible to query large quantities of data faster than is currently possible, at more than 800 million records per second, he added. This way, "as more samples come in here and the university starts to work on bigger projects, we are not bogged down by the computer needing to take more time to do searches."
The input to the warehouse consists of annotated variants gleaned from high-throughput sequencing experiments. When they receive the raw sequence, computational scientists at the institution run quality-control protocols, call and annotate variants, and then transform the annotated data into the required format for the warehouse database models. They then load the transformed data into the warehouse, where the contributing researchers can run queries using the filtering criteria provided by the warehouse.
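As a rough illustration of that transform, load, and query sequence, here is a minimal Python sketch. It is not the warehouse's actual code: the table layout, the field names, the sample records, and the `to_row`, `load_variants`, and `filter_variants` helpers are all hypothetical, and Python's built-in `sqlite3` module stands in for the MySQL database so the example is self-contained.

```python
import sqlite3

# Hypothetical annotated variants, as they might look after variant
# calling and annotation. None of these records come from the warehouse.
annotated_variants = [
    {"sample": "S1", "chrom": "1", "pos": 12345, "ref": "A", "alt": "G",
     "gene": "GENE_A", "effect": "missense"},
    {"sample": "S1", "chrom": "2", "pos": 67890, "ref": "C", "alt": "T",
     "gene": "GENE_B", "effect": "synonymous"},
    {"sample": "S2", "chrom": "1", "pos": 12345, "ref": "A", "alt": "G",
     "gene": "GENE_A", "effect": "missense"},
]


def to_row(variant):
    """Transform one annotated variant into the (assumed) table row format."""
    return (variant["sample"], variant["chrom"], variant["pos"],
            variant["ref"], variant["alt"], variant["gene"], variant["effect"])


def load_variants(conn, variants):
    """Create the (assumed) variant table if needed and load the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "gene TEXT, effect TEXT)"
    )
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?, ?)",
        [to_row(v) for v in variants],
    )
    conn.commit()


def filter_variants(conn, gene, effect):
    """Run a parameterized query using two example filtering criteria."""
    cursor = conn.execute(
        "SELECT sample, chrom, pos, ref, alt FROM variant "
        "WHERE gene = ? AND effect = ? ORDER BY sample",
        (gene, effect),
    )
    return cursor.fetchall()


# In-memory SQLite database used as a stand-in for the MySQL backend.
conn = sqlite3.connect(":memory:")
load_variants(conn, annotated_variants)
hits = filter_variants(conn, "GENE_A", "missense")
```

Here `hits` holds the rows for the two samples that carry the hypothetical missense variant in `GENE_A`. A user-facing tool would presumably build such queries from form fields rather than exposing SQL directly.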
For example, they could search for frequency information related to specific alleles or they might look to see if specific variants of interest are present in their uploaded samples, Levesque said. Researchers can also choose whether to keep their data private or to allow other researchers to query it as part of their own studies. Also, clients have the option to store their data within the UB infrastructure for possible reuse at a later time, or they can ask the warehouse development team to delete their information once the queries have been run, Levesque told GenomeWeb.
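The two query types Levesque describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the warehouse's implementation: the genotype data, the diploid two-allele representation, and the `allele_frequencies` and `samples_carrying` functions are hypothetical.

```python
from collections import Counter

# Hypothetical diploid genotypes at a single site: each sample maps to
# the pair of alleles observed for it.
genotypes = {
    "S1": ("A", "G"),
    "S2": ("G", "G"),
    "S3": ("A", "A"),
    "S4": ("A", "G"),
}


def allele_frequencies(genotypes):
    """Return each allele's share of all observed allele copies."""
    counts = Counter(allele for pair in genotypes.values() for allele in pair)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def samples_carrying(genotypes, allele):
    """Return the sorted sample IDs with at least one copy of the allele."""
    return sorted(sample for sample, pair in genotypes.items() if allele in pair)


freqs = allele_frequencies(genotypes)
carriers = samples_carrying(genotypes, "G")
```

With this toy data each allele accounts for half of the eight observed copies, and three of the four samples carry at least one `G`.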
The warehouse technology supports BIG's mission to drive innovation and job creation in the genomics domain in New York state. UB's BIG is one component of the $100 million genomic medicine initiative launched by Governor Andrew Cuomo in 2014. The funds were allocated to establish the NYS Genomic Medicine and Big Data Center, a genomics data partnership that connects institutions in Buffalo with the New York Genome Center.
"One of the things that we are doing across the entire university is really looking at how we can have more impact on the local environment and community and promote economic development," BIG's Executive Director Brian McIlroy told GenomeWeb. "The genomic data warehouse is a great example of where we've taken a tool which is highly useful for the university and ... [are looking at] how we package that in such a way that we can offer that to the local community ... to advance their own initiatives."
Companies that choose to work with BIG will benefit from access not just to the warehouse but also to the genomics core at the institution. Because both resources are integrated, these companies will be able to access and use their sequence data faster without worrying about data transfer times and costs, the researchers said. Furthermore, access to UB software and cloud compute frees companies from paying the upfront costs of implementing the requisite infrastructure and hiring the staff necessary to run their own informatics solutions, which can be onerous for smaller companies.
BIG is offering use of the warehouse to UB researchers for free and also is reaching out to interested local companies in the Buffalo area working in genomics. Although the warehouse was developed with genomic data analysis in mind, the researchers said that the technology can help clients in other domains, such as drug discovery and materials development, manage and query their large datasets.
The institute is offering flexible pricing to industry users, with price points that vary depending on the size of the company in question and the nature of the project. Customers do need to create accounts with the university in order to access the warehouse infrastructure. There are also flexible pricing options for companies that want to store their data longer term on the UB infrastructure. Clients with small quantities of data who also ink partnerships with UB could be eligible to store their data for free. Clients with larger datasets, however, will be charged a fee calculated based on the size of the data that they need to store.
Specifically, UB will offer the first terabyte of storage for free and then charge $1,000 per terabyte of storage after that — those costs cover data backup and redundancy if users want those services. The $1,000 price currently covers storage costs for four years. The fee is negotiable if the company in question makes its home in Western New York State and creates jobs, the researchers said.
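The list pricing described above works out as in the following Python sketch. The `storage_fee` function is hypothetical, and it assumes that partial terabytes beyond the free allowance are billed proportionally, which the article does not specify. It also ignores any negotiated discounts or subsidies.

```python
FREE_TERABYTES = 1       # first terabyte is free, per the article
PRICE_PER_TERABYTE = 1000  # US dollars per additional terabyte


def storage_fee(terabytes):
    """Return the list-price fee in dollars for storing `terabytes` of data.

    Assumption: usage beyond the free terabyte is billed proportionally.
    """
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    billable = max(0, terabytes - FREE_TERABYTES)
    return billable * PRICE_PER_TERABYTE


fee_small = storage_fee(1)   # within the free allowance
fee_large = storage_fee(5)   # four billable terabytes
```

Under these assumptions, a client storing 5 terabytes would pay $4,000, which at current pricing covers four years of storage.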
BIG intends to offer subsidies to attract local industry clients and encourage them to use its platform, McIlroy said. Also, clients who are willing to include their data in a common pool within the resource to benefit the broader research community would be able to offset some of the costs associated with using the warehouse.
"The aim is not for this to be a cash generator; it is to provide a service that is used," he said. "As long as the service is used, we are not going to lose money on it."
Income earned from the use of the warehouse will be invested in BIG's ongoing efforts to build infrastructure that will support research efforts at UB.
So far, the institute has an agreement with at least one company that is looking to use the resource to identify a subset of patients in a larger cohort that may benefit from genetic testing. The center also has some internal projects lined up to use the warehouse that will focus on identifying genetic variants associated with particular diseases.
BIG could in the future open up the warehouse to researchers at other institutions outside of the university, McIlroy said. "There is some discussion about opening up to other SUNY institutions, which would probably be the first step. [But] if we got interest [from other universities] then I can't imagine why we wouldn't open it up," he said.