The database proliferation problem is no secret. For years researchers have talked about ways to combine or manage data so that the sheer number of data repositories out there would be invisible to the average user. In fact, just in the last few months, two new efforts have come out to tackle this problem — they’re different approaches, but both highlight the underlying problem of having just too many databases to handle.
Nitin Baliga, an assistant professor at the Institute for Systems Biology in Seattle, wanted to help researchers get a global view of data without having to sift through data banks separately. Using a Java-based platform, Baliga pulled together databases containing various data types into a single system — a one-stop shop called Gaggle. “It doesn’t really make sense for us to rebuild everything — it’s an inefficient use of limited time and resources,” he says.
Tracing interactions in systems biology can produce a lot of data, especially when using high-throughput techniques. A single environmental cue can lead to changes seen at the gene, protein, and protein-interaction levels. Not only do the data differ, but so do the software tools that capture and analyze them. “The problem is that there are many types, formats, and dimensions of data, and, therefore as many or even more types of software. It’s a two-fold problem,” Baliga says.
Furthermore, he adds, databases are usually only well-equipped to handle one type of data. To answer many systems biology questions, researchers need information from various databases in a handy and accessible format.
That’s what Baliga hopes Gaggle does. The tool shares information across different, already available platforms, rather than creating a new database. “What we’ve done, then, is we’ve figured out what technology exists that enables communication across different types of software databases,” Baliga says. “Further, we’ve discovered that the use of such communication protocols to pass defined packets of data from one tool to another achieves extraordinary integration for little effort.”
Using Java’s Remote Method Invocation (RMI) API, Gaggle brings together a variety of databases and tools, or “geese.” Baliga and his lab have connected the NCBI sequence database, STRING, and DAVID, among other databases, to Gaggle. Once connected, the databases share information so that a user may search them with a single query, such as a gene name. That search returns information about the DNA sequence, protein sequence, microarray data, or nodes in a network related to that gene, Baliga says. Soon, Gaggle may even provide the back story of how the experiments were conducted.
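To make the idea of passing “defined packets of data from one tool to another” concrete, here is a minimal sketch in Java. It is not Gaggle’s actual code: the names `Goose`, `Hub`, `RecordingGoose`, `handleNameList`, and the gene name `geneA` are hypothetical, chosen only for illustration. The interface extends `java.rmi.Remote` the way an RMI service would, but for simplicity everything here runs inside one process, with no RMI registry or exported objects.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical remote interface: each connected tool ("goose")
// can accept a small, well-defined packet, here a list of names.
interface Goose extends Remote {
    String getName() throws RemoteException;
    void handleNameList(String source, String[] names) throws RemoteException;
}

// Hypothetical hub that relays a packet from one goose to all the others.
class Hub {
    private final List<Goose> geese = new ArrayList<>();

    void register(Goose goose) {
        geese.add(goose);
    }

    // Sends the names to every registered goose except the sender
    // and returns how many geese received the packet.
    int broadcast(String source, String[] names) throws RemoteException {
        int delivered = 0;
        for (Goose goose : geese) {
            if (!goose.getName().equals(source)) {
                goose.handleNameList(source, names);
                delivered++;
            }
        }
        return delivered;
    }
}

// Toy goose that only records the names it is sent. A real tool
// would look them up in its own database instead.
class RecordingGoose implements Goose {
    private final String name;
    final List<String> received = new ArrayList<>();

    RecordingGoose(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void handleNameList(String source, String[] names) {
        received.addAll(Arrays.asList(names));
    }
}

class GaggleSketch {
    public static void main(String[] args) throws RemoteException {
        Hub hub = new Hub();
        RecordingGoose sequences = new RecordingGoose("sequences");
        RecordingGoose network = new RecordingGoose("network");
        hub.register(sequences);
        hub.register(network);

        // One query (a placeholder gene name) reaches every other tool.
        int delivered = hub.broadcast("sequences", new String[] {"geneA"});
        System.out.println("delivered to " + delivered + " goose(s): " + network.received);
    }
}
```

In an actual RMI deployment, each goose would typically be exported (for example with `UnicastRemoteObject.exportObject`) and located through an RMI registry, so the same method calls could cross process boundaries. The sketch leaves that out to keep the message-passing pattern visible.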
Though Baliga found a way to integrate information, he is cautious. “It’s just going to get worse with time as you have more and more data being generated,” he says.
Meanwhile, at the University of Pittsburgh, Ansuman Chattopadhyay has developed a bioinformatics portal called the Online Bioinformatics Resource Collection to help investigators get a better handle on this massive number of databases.
This searchable collection is hosted on the university’s library site and contains more than 1,600 links to open-source bioinformatics databases and software tools. With a mouse click, researchers can access databases and tools in categories ranging from PCR primers and DNA sequences to proteomes and organelles.
In addition to providing a one-stop research solution, Chattopadhyay wants to attract investigators who might not be familiar with what all of these databases have to offer. “My target audience is bench-top scientists, not traditional computational biologists,” he says. “They are not aware of all this text, but they should use all of the advancements available in bioinformatics because it will help them in their research.”
Later this year, Chattopadhyay plans to add a “How Do I” section to the OBRC that will give users a step-by-step guide on where to go and what kind of information is available for their particular query.