A research team based in Germany has developed a computational platform that addresses a need to integrate different data types — for example gene expression, mass spectrometry, and protein interaction data — in collaborative systems biology projects.
A recently published BMC Bioinformatics paper describing the Data Integration Platform for Systems Biology Collaborations, or DIPSBC, noted that such multi-partner projects require standardized storage and exchange formats in order to store and cross-link heterogeneous datasets.
DIPSBC provides "a flexible representation of collaborative data" that is based on the extensible markup language, XML, which is used in several data domains including proteomics, genomics, molecular interactions, and mathematical models, the paper states.
The system comprises a web server, a search index, and a series of Java-based helper applications that provide data-specific analysis capabilities, such as the Broad Institute's Argo genome browser, which is used for visualizing and annotating whole genomes.
According to its developers, DIPSBC is best suited for small- to medium-sized research collaborations although larger partner projects could potentially use the platform as well.
BioInform spoke with Felix Dreher, a researcher in the bioinformatics arm of the vertebrate genomics department at the Max Planck Institute for Molecular Genetics and one of DIPSBC's developers, about the platform's capabilities and where it best fits in the systems biology tool landscape.
What follows is an edited version of the conversation.
Let's start with a brief summary of the DIPSBC platform, its components, and capabilities.
It is a data integration platform for consortia, for example, that do research on a certain topic with different experimental setups. Typically, systems biology collaborations [involve] several groups that work together; one group does microarray analysis, another group does mass spectrometry, and a third one does cell assays and so on. The platform that we developed give[s] researchers the possibility to collaborate and to integrate all these different data types.
The first idea was to use XML as the general data type and to integrate the different experimental data with a text search engine. We combined XML with a search engine — Apache Solr, which is comparable to Google — and a content management system, Foswiki, and we developed the code that [connects] all these components to provide the data integration platform. People can upload their data, and the researchers that are participating in the consortium can see what others do by querying the website; they can see which experiments are available and which are done already.
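The workflow Dreher describes — normalizing heterogeneous experiment records to XML and feeding them to a Solr index — can be sketched roughly as follows. This is an illustrative mock-up, not DIPSBC code: the record layout and field names (`experiment_type`, `gene`, and so on) are assumptions; only Solr's standard `<add><doc><field>` XML update format is taken as given.

```python
# Illustrative sketch (not DIPSBC's actual code): wrap heterogeneous
# experiment records in Solr's XML update format so that one full-text
# index spans all data types. Field names here are assumptions.
import xml.etree.ElementTree as ET

def to_solr_update_xml(records):
    """Build a Solr <add> document from a list of flat record dicts."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Two different experiment types share one flat, searchable representation.
records = [
    {"id": "ma-001", "experiment_type": "microarray", "gene": "TP53", "log2fc": 1.8},
    {"id": "ms-042", "experiment_type": "mass_spec", "protein": "P04637", "peptides": 12},
]
xml_doc = to_solr_update_xml(records)
```

In a real installation, the resulting document would be POSTed to the Solr update endpoint; here it simply shows how unlike data types end up side by side in one index.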
How would an interested user access and use the platform?
That’s part of the Foswiki component. People can ... register, provide an email and password, and after confirming their registration, they get access to the system. [That's] one possibility for providing a more open system. One could also [make] it more restricted. An administrator who runs the DIPSBC software could send out passwords to certain people so that only [those] ... participating in the project can get access to the system.
I should [note that] the system that is mentioned in the paper ... is more of an example installation. One could use it to get information [but] the idea is that ... groups of researchers do their own installations. The website that is mentioned in the paper [has] a link to installation instructions.
Following up on that point, how easy is it to install this system locally?
It's not trivial because it contains some Perl code. The average computer user would need [a] bit of time for it, but it's possible. An experienced programmer can install it quite easily. It's not like clicking an installation file and then 10 minutes later it's up and running ... it needs some hours altogether to set up the whole system. One has to install the server, the Foswiki, and the Solr index.
You talk about this next point in the paper but let's address it here as well. Resources such as the ISA infrastructure and BioMart are also intended to support collaboration and data sharing among systems biologists. How is DIPSBC different?
The big advantage of our system is that it's very flexible in comparison to, for example, BioMart, because it's based on XML while BioMart is based on MySQL, i.e., more or less large relational databases. I think especially for systems biology collaborations, it's an advantage because the technological progress is very fast [on] the experimental side. So for example, if one sets up [our] system and after one year a new experiment type comes in because a new instrument is used, it's quite easy to adapt. It's easy to add new data types to the index via XML during a running project. In database-based systems, this is not so easy.
It's also comparable [to the ISA infrastructure] but I think our system adds some more functionality in terms of analysis because we have the option to quite easily add plugins — in our case, they are Java applets. So one [could] analyze different data types, for example, mass spectrometry data can be viewed with an applet; yeast two-hybrid experiments like protein-protein interactions can be viewed in a graph browser; [and] we have a genome browser incorporated to view genome annotations.
[A user] could query [the system] for a protein and [would] get a list with different results [from] different experiment types and ... just click on one ... and for each one, view different results.
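The two points above — schema-free indexing and one query spanning several experiment types — can be illustrated with a toy, purely in-memory sketch. Nothing here comes from DIPSBC itself; the record fields (`type`, `gene`, `bait`, and so on) are invented for the example.

```python
# Toy illustration (assumed field names, not DIPSBC internals): a document
# index imposes no fixed schema, so a new experiment type added mid-project
# needs no relational-style migration, and one query spans all types.
def find(index, term):
    """Return ids of documents whose field values mention the search term."""
    return [d["id"] for d in index
            if any(term.lower() in str(v).lower() for v in d.values())]

index = [
    {"id": "ma-001", "type": "microarray", "gene": "TP53", "log2fc": 1.8},
    {"id": "y2h-07", "type": "yeast_two_hybrid", "bait": "TP53", "prey": "MDM2"},
]

# A new instrument arrives a year into the project: its records are simply
# indexed as-is, with previously unseen fields, and are searchable at once.
index.append({"id": "seq-01", "type": "rna_seq", "gene": "TP53", "tpm": 93.4})
```

A query for "TP53" now returns hits from all three experiment types, each of which could open its own viewer in a full installation.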
So perhaps DIPSBC complements existing tools such as ISA and BioMart in the sense that it is better suited for some research cases and not others?
Our system is probably best suited for medium-scale collaborations, for example, for five to 15 partners or institutes ... and their data. I think for bigger collaborations with 20, 30, or more different participating institutes, there might be others better suited.
Earlier you mentioned that it's possible to use Java plugins to extend the platform's capabilities. Besides those, what other features do you intend to include in DIPSBC?
We are in the process of adding a more automated data upload [capability]. Right now, data is uploaded, transformed to XML, normalized, and indexed by the administrator. This has the advantage that possible errors can be corrected, but it would be favorable to automate the system [to make] it faster. That's one advantage of the Solr index: it can be changed, updated, and queried at the same time. For example, one institute might be uploading and extending the index while another institute is searching it at the same time.
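The concurrent update-and-query behavior described here can be mimicked with a minimal in-memory inverted index. This is only an analogy for the Solr property Dreher mentions, not Solr's implementation; all names are invented for the sketch.

```python
# Minimal in-memory analogue (illustrative only) of a live search index:
# it stays queryable while new documents are added, so one partner can
# extend it while another is already searching it.
from collections import defaultdict

class LiveIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id, text):
        """Index a document's text under each whitespace-separated token."""
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def query(self, token):
        """Return the sorted ids of documents containing the token."""
        return sorted(self._postings.get(token.lower(), set()))

idx = LiveIndex()
idx.add("ms-042", "mass spectrometry run for protein P04637")
first = idx.query("p04637")   # new document is queryable immediately
idx.add("ma-001", "microarray profile for protein P04637")
second = idx.query("p04637")  # later additions are visible at once
```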
Another [update is] to add more fine-grained user management. This is an issue for larger collaborations [where] only parts of the consortium can see certain results. One could create user groups [where] one group can see all results and another group can see only the results of some other partners.
Are you hoping, as is the case with the ISA group, to encourage the growth of a community of users that both use and contribute code to DIPSBC?
I think so. All the components are open source ... it would be great if people that use [the system and develop] extensions for their datasets...would provide them for the community so the system could [evolve].
Speaking more generally now, what other computational challenges do you see on the systems biology horizon that you think ought to be addressed?
In general ... what is the definition of systems biology? For example, if one uses the definition ... [that] systems biology tries to create a computational or mathematical model of a cell or an organism, this is highly complex and there are so many levels of information ... genome, RNA, proteins, epigenetics, and so on. I think [that’s] still the main computational problem...that it's so complex and therefore needs huge computer resources.
In terms of our system ... another definition of systems biology would be to integrate all of the different levels of information about a biological problem like the integration of DNA and protein experiments, cell assays, and so on to make it possible to get a more complete view of what's going on in an experiment or a cell type. There is so much data generated but it's not connected. [That's] what we are trying to address with our system, by using XML standardized data sets.