The US Department of Energy is developing a systems biology knowledgebase, dubbed Kbase, that will support efforts to study microbes, microbial communities, and plants.
Kbase is currently in alpha release and is targeted for full production release in early 2013.
The community-driven and scalable open-source system is intended to provide a framework for evidence-based functional annotation of genome sequences. It will also enable the creation of metabolic and regulatory models that may be used to generate scientific hypotheses, and it will put in place standards for data and metadata representation.
The resource is expected to foster the development of open-source data analysis, visualization, and modeling and simulation tools. It will incorporate information and tools from public resources such as MicrobesOnline, RegPrecise, RegPredict, and Phytozome, as well as relevant datasets from the National Center for Biotechnology Information and the DOE's Joint Genome Institute.
The developers of Kbase have completed several pilot projects. One project focused on developing interfaces for navigating metabolic networks and experimental functional -omic data in MicrobesOnline using the Genome-Linked Application for Metabolic Maps, or GLAMM, program.
A second project investigated mechanisms for storing and accessing biological data in a cloud computing environment. The researchers created a use case scenario to identify and curate published genome annotations, which was implemented using the same federated cloud architecture proposed for Kbase.
Last week, BioInform spoke with Susan Gregurick, a computational biology program manager at the DOE's Biological and Environmental Research office and one of Kbase's developers, about the status of the Kbase project and what users can expect when the institute officially launches the resource. What follows is an edited version of the conversation.
Let's start with some background on what Kbase is and why you are developing it.
It's designed to be an integrative platform that couples together what we know about certain organisms that are of interest to the Department of Energy in microbial, plants, and meta communities, to integrate the data, the information, the experimental work and the computational models in order to basically provide a platform for researchers, developers, and bioinformaticians to interact with existing data [and] existing models or to derive new models from data and to collaborate together. So it's an interactive cyberinfrastructure and it's all built around our missions in bioenergy for clean energy and environmental processes that have to do with things like carbon cycling and nutrient cycling, [and] bioremediation.
What are some specific components of the platform and, along those lines, what sort of infrastructure are you putting in place to support the system?
It leverages a lot of the capabilities that DOE has already invested in. For example, we have invested in the Energy Sciences Network, ESNet, and this is a high-speed, tier one internet ... that allows for large amounts of data to be transferred across the country. Kbase is hosted at four sites: [Lawrence Berkeley National Laboratory, Argonne National Laboratory, Oak Ridge National Laboratory, and Brookhaven National Laboratory] and ESNet runs between all these labs as well as throughout some universities. By using ESNet we can transfer [at a] sustained 10 gigabits per second but we've recently been able to go up to 90 gigabits per second. That’s one of the components.
The other component is that we've leveraged a lot of the infrastructure that DOE has invested in for scientific computing, both high-performance computing through our [Advanced Scientific Computing Research] office as well as DOE’s Magellan cloud infrastructure ... There is a cloud site at Argonne and one at Berkeley so the Kbase folks have right now about 252 nodes but they are expected to grow up to about 700 nodes, and this is all cloud infrastructure, as well as smaller nodes at Oak Ridge and Brookhaven. As we develop the knowledgebase the number of nodes we are bringing online is still under development.
The infrastructure is all built on a service-oriented infrastructure; we utilize OpenStack ... There is definitely support for high-performance calculations as well as calculations using Hadoop. So what we are hoping is to build upon a lot of the work that Amazon has made publicly available with EC2 and perhaps at a later time S3 and utilize their [application programming interfaces] to help us leverage our scientific APIs.
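For a sense of scale, the two ESNet rates Gregurick quotes can be turned into rough transfer times. The sketch below is a back-of-the-envelope calculation only: the 1-terabyte dataset size and the function name are illustrative assumptions, not figures from the interview, and the result ignores protocol overhead and network congestion.

```python
def transfer_seconds(size_bytes: float, rate_gbps: float) -> float:
    """Ideal transfer time in seconds, ignoring overhead and congestion."""
    bits = size_bytes * 8              # bytes -> bits
    return bits / (rate_gbps * 1e9)    # gigabits per second -> bits per second


# Hypothetical dataset size (1 decimal terabyte); not a figure from the interview.
DATASET_BYTES = 1e12

# The two rates quoted in the interview: 10 Gbps sustained, 90 Gbps peak.
for rate in (10, 90):
    secs = transfer_seconds(DATASET_BYTES, rate)
    print(f"{rate} Gbps: {secs:.0f} s (about {secs / 60:.1f} min)")
```

Under these assumptions, a 1-terabyte dataset would move in roughly 13 minutes at 10 gigabits per second and in about a minute and a half at 90 gigabits per second.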
It almost sounds like Kbase will only accept data from the four national laboratories you mentioned. Is that the case or can anyone submit and access the data?
The knowledgebase is being built and mirrored at the four laboratories; however, data is being integrated from many sources, including international sources, and Kbase will be able to accept data from many sources. All users will be able to access the public data from wherever they are coming from, including universities, industry, and the national laboratories.
[Although] there are four nodes [that will] mirror each other, the node at Brookhaven may be more focused on plant research. So if you are a person who is accessing the knowledgebase and you want to look at the regulatory metabolic models of plants you might end up doing a lot of your work at Brookhaven.
Have the four labs in question had to increase their compute capacity in any way to support increased use when Kbase gets underway?
The national labs are set right now because we are leveraging a lot of the investments that ASCR and DOE have already put forward for computing, but ... it's quite possible that within five years from now we might actually hit a barrier in terms of computing ... or maybe the technology drives in a slightly different way that we have to think about how to invest in [more] computing hardware.
Everything that we are doing in the knowledgebase should hopefully be transportable because it's all industry standard ... We are building [standard] APIs so that we can transport and leverage what's happening generally in the computing field.
Are you developing and distributing any software through the Kbase website?
Yes. Absolutely everything is being developed not just within Kbase but [also by] the university collaborators who are also working [on] and developing software. There is a certain API standard that the group has developed and everybody will work within that.
What we are seeing is a lot of people in the plant regulatory network focus [area] developing APIs in collaboration with the microbial folks so that they can use the same APIs to look at regulatory networks for plants and microbes. There are differences between the organisms but some of the underlying API structures are the same and they can look at what's common and share that and then build out and up from that. So the modeling framework [is] synergistic and [they can add] organism-specific [components] later rather than developing silos of excellence and trying to merge the two later.
Are there any ongoing efforts similar to Kbase that you are looking to partner with?
There are. We are hoping to partner with a National Science Foundation-funded effort called iPlant which is very focused on plants ... we are hoping to partner with what they develop and bring that into the knowledgebase. There is the Galaxy effort and we are hoping to make our platform compatible with Galaxy so that people can run the knowledgebase from Galaxy. There is an effort in the UK called the European Life-science Infrastructure for Biological Information, ELIXIR ... that we could potentially partner with in the future but they don’t come online until 2016 and we will be live in 2013. We hope that because we are doing things to industry standard that it will be much easier for people to plug into our effort and there won't be a big barrier to entrance.
We are also partnering with efforts within our own program — [for example,] the Joint Genome Institute, which is a high-throughput sequencing institute and they do an awful lot of bioinformatics so we can leverage and link with their capabilities as well.
Kbase will officially launch in 2013. What have you been able to put in place so far and what do you plan to do prior to its debut?
We funded the project in August 2011 and in February 2012 we had our first community workshop to tell people what the scientific goals were and to make sure those were aligned with our principal investigators and researchers. They have already had two builds [and] will make things publicly available in a very limited fashion by August of 2012. By November, we should be able to have people start working with the system but they won't ... really engage the community until February 2013 and that would be a beta ... The entire project is supposed to be fully functional in three years ... probably August 2014.
Are these pre-releases meant to test the system with specific groups?
Yes. They have a subset of scientific researchers who are fairly savvy in bioinformatics but are experimentalists so these people will start working with the knowledgebase this summer. By November, a bit wider group can start using it. In February, anyone can use [it].
How much did the DOE provide for the project?
DOE has invested $12 million to date.