Hear that cacophony? It’s the sound of bioinformaticists trying desperately to communicate with their — and everyone else’s — data.
Genesis XI. There was a time when all people spoke the same language, making it easy to organize them into large project teams. An enterprising fellow named Nimrod took advantage of this and organized a project to build a tower tall enough to reach the heavens. God wasn’t pleased. She snapped her fingers (or whatever God does when she needs a miracle) and suddenly the project team spoke many different languages. Unable to communicate effectively, the team quickly dissolved, and the project screeched to a halt.
Today we are faced with a Tower of Data, as new laboratory technologies and analytical methods give rise to vast new datasets. The people who produce these datasets think that each is a majestic creation, but in reality, each is just another brick in the tower. It falls to the bioinformatics team to bind the bricks together — with tar if we follow the biblical narrative — to make a tower that will stand. The tar I’m talking about, of course, is data integration.
Our tower, like the one of yore, is afflicted with a communication problem. Each dataset is implemented in its own peculiar way, and we must labor tirelessly just to overcome these idiosyncratic differences. It’s unfair to blame God for this curse: we, the bioinformatics community, have brought it on ourselves by refusing to obey the basic commandments of data compatibility.
Data integration poses both technical and scientific challenges, and a number of products try to help with them. It should come as no surprise that there isn't a magic cure.
The Bottom Floors
Integration occurs through a series of phases that I call technical, conceptual, scientific, and user-specific. The boundaries between these phases are not hard and fast, but the division is a good way to explore the issues.
Technical integration deals with differences that affect the format of data rather than its meaning. Some of these differences are computer-oriented (even geeky): text vs. HTML vs. XML vs. SOAP vs. relational. Others are specific to bioinformatics, such as GenBank vs. EMBL formats for sequence data, or BLAST vs. FASTA formats for alignments. Technical integration can be accomplished by bodily converting all external data to a common format, e.g., loading all the data into your own relational database or XML files, or through on-the-fly converters that extract data from external files or databases as needed.
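The on-the-fly approach can be sketched in a few lines. Here is a toy example, in Python, of converters that normalize records from two hypothetical flat-file layouts (loosely in the spirit of GenBank-style and EMBL-style records, but invented for illustration) into one common shape:

```python
# A minimal sketch of on-the-fly technical integration: records from two
# hypothetical flat-file layouts are normalized into one common dictionary.
# The field names and both toy formats are invented for illustration.

def parse_genbank_like(text):
    """Parse a toy GenBank-style record: 'KEY  value' lines."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("  ")
        record[key.strip().lower()] = value.strip()
    return {"id": record.get("accession"), "seq": record.get("sequence")}

def parse_embl_like(text):
    """Parse a toy EMBL-style record: 'XX;value' lines."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(";")
        record[key.strip()] = value.strip()
    return {"id": record.get("AC"), "seq": record.get("SQ")}

CONVERTERS = {"genbank": parse_genbank_like, "embl": parse_embl_like}

def load_record(text, fmt):
    """Dispatch to the right converter so callers see one common format."""
    return CONVERTERS[fmt](text)
```

Downstream code calls only `load_record` and never sees the source format, which is the whole point: the idiosyncrasies are quarantined in the converters.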
The conceptual phase copes with the diverse ways bioinformatics programmers have chosen to represent biologically equivalent concepts, or more accurately, biological concepts that are equivalent for the purpose at hand. Consider, for example, two of the many ways one could represent the alternative splice forms of a given gene: you could simply list the multiple transcripts as independent sequences, or you could get fancy and list the exons of the gene first and then describe each transcript in terms of those. To integrate such data, you have to pick one representation for the concept of “alternative splice form” and convert the given data into that representation. Another approach, using object-oriented methods, is to leave the data in its original format and provide automatic, on-the-fly conversion routines.
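To make the splice-form example concrete, here is a sketch of converting the "fancy" exon-based representation into the simple independent-sequences one. All of the sequences and isoform names are invented for illustration:

```python
# A sketch of conceptual integration for alternative splice forms: the
# source lists exons plus transcripts defined as lists of exon indices;
# the target representation is a plain mapping of transcript sequences.
# The exon sequences and isoform names below are invented examples.

def transcripts_from_exons(exons, transcript_defs):
    """Convert the exon-based representation into independent sequences."""
    return {name: "".join(exons[i] for i in exon_ids)
            for name, exon_ids in transcript_defs.items()}

gene_exons = ["ATG", "GGC", "TAA"]      # exon sequences, in genomic order
gene_transcripts = {
    "isoform-1": [0, 1, 2],             # uses all three exons
    "isoform-2": [0, 2],                # skips the middle exon
}

flat = transcripts_from_exons(gene_exons, gene_transcripts)
# flat["isoform-1"] == "ATGGGCTAA", flat["isoform-2"] == "ATGTAA"
```

Note that the conversion loses information (the shared exon structure), which is typical: the representation you standardize on should be the one that is equivalent for the purpose at hand.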
Stairway to Heaven
At this point in the process, your software is operationally capable of integrating the data, and in simple cases you’re just about done. In many real cases, though, you’re still faced with the very hard problem of scientific integration. I recommend Keith Allen’s online presentation from this year’s O’Reilly bioinformatics conference for a great example of scientific integration.
The purpose of scientific integration is to ensure that what you’ve integrated makes sense scientifically and is not just computerized gibberish. Suppose you’re trying to integrate several gene index databases, such as UniGene from the US National Center for Biotechnology Information and STACKdb from the South African National Bioinformatics Institute. Both of these databases contain clusters of transcribed sequences that overlap and presumably come from the same gene. Since the databases use different clustering algorithms, the clusters they produce frequently differ in detail — and sometimes differ dramatically. If you don’t reconcile these discrepancies, your users will continuously trip over the inconsistencies.
Let’s dig into this example. The first step might be to align each cluster to the genome. Next you might verify that all members of each cluster land in the same part of the genome, and that the alignments have reasonable intron/exon boundaries. Inevitably, some sequences will align to multiple places in the genome, and you’ll need some scheme to choose the right one or, better, to work with multiple tentative assignments. Finally you might look for clusters that overlap in the genome and merge their contents according to some criteria.
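The final merging step can be sketched as interval merging. In this toy version, each cluster is reduced to a genomic interval (from its alignment) plus its member sequence IDs, and clusters whose intervals overlap are merged; real scientific integration would also check chromosome, strand, and intron/exon structure. The member IDs are invented examples:

```python
# A sketch of the merging step: clusters on one chromosome, each a tuple of
# (start, end, set_of_member_ids); clusters whose genomic intervals overlap
# are merged. Illustrative only; real integration needs many more checks.

def merge_overlapping_clusters(clusters):
    """Return merged clusters, sorted by genomic start position."""
    merged = []
    for start, end, members in sorted(clusters):
        if merged and start <= merged[-1][1]:      # overlaps previous cluster
            prev_start, prev_end, prev_members = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end),
                          prev_members | members)
        else:
            merged.append((start, end, set(members)))
    return merged

result = merge_overlapping_clusters([
    (100, 200, {"Hs.1"}),       # e.g., a UniGene cluster
    (150, 300, {"stack_7"}),    # e.g., an overlapping STACKdb cluster
    (500, 600, {"Hs.2"}),
])
# result == [(100, 300, {"Hs.1", "stack_7"}), (500, 600, {"Hs.2"})]
```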
I’m not aware of any publicly available software that does a complete job of this, although partial solutions certainly exist. The development of such a method would be a significant scientific accomplishment.
This highlights an obvious, but often overlooked, aspect of scientific integration: namely, that it is a scientific activity, and not “just programming.”
The final phase is to use your integrated database to answer the specific questions posed by your users. Your users may only care about a handful of the genes in your integrated database. For those few genes, they may want to see very specific information, or to browse the data using their favorite graphical viewers. It is important to make sure that your integration scheme can support these common modes of access.
In a lot of cases, it makes sense to start with the users’ questions and work backward. It’s a lot easier to integrate data for 30 genes than for 30,000, and your users may not be pleased to learn that you’ve just spent six months integrating data for 29,970 genes they don’t care about. The flip side, of course, is that if you brute-force the answer for only today’s genes of interest, you may have to do it all over again tomorrow. Balancing these two factors is perhaps the greatest challenge in managing a bioinformatics effort.
Stein’s Seven Commandments
Lincoln Stein, bioinformatics guru extraordinaire, has enunciated a code of conduct for database providers aimed at simplifying data integration. His code focuses primarily on the technical and conceptual phases, which are the ones most amenable to quick fixes. You can read his words in Nature, or see the movie (really, an online PowerPoint presentation) from his keynote address at this year’s O’Reilly bioinformatics conference.
Stein offers seven commandments:
1. Make it easy for programmers to access your Web pages from scripts. This takes only a little more work than creating human-only pages.
2. Once programmers start to access your Web pages, don’t change the format without good reason and without advance notice. Even small changes can break the integration programs written by your colleagues.
3. Provide your data in multiple formats since different programmers will have different skills and needs. Always provide a text format since everyone can handle that.
4. Make the whole dataset available for batch download, typically via FTP.
5. Use existing formats when possible. Creating your own format may be cool, but it only creates extra work for you and everyone who uses your database.
6. If it is absolutely necessary to create a new format, make it as simple as possible.
7. Support a true query language, rather than forcing users to point and click their way to the information they need.
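From the consumer's side, commandments 1, 3, and 4 add up to something very simple: a programmer should be able to pull your batch download into a script with a few lines of code. Here is what that looks like for a tab-delimited text dump; the column layout is hypothetical:

```python
# What commandments 1, 3, and 4 buy the consumer: a batch download in a
# plain tab-delimited text format can be parsed by anyone, in any language.
# The column names and values below are hypothetical.

import csv
import io

def read_dump(text):
    """Parse a tab-delimited dump whose first line names the columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

dump = "accession\tgene\tlength\nU12345\tBRCA1\t5592\n"
records = read_dump(dump)
# records[0]["gene"] == "BRCA1"
```

Contrast this with screen-scraping an HTML page designed for humans only, which takes far more code and breaks whenever the page layout changes (hence commandment 2).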
Even better, in Stein’s view, is to exploit an emerging web technology called web services. Web services use standard data formats, such as XML, and standard access protocols, such as SOAP, so that programs can access equivalent data and computations in a common manner. The technology also provides a directory service that programs can use to find services of interest on the Web.
While these commandments don’t solve all data integration problems, they would certainly ease the pain. I can think of a few commandments I would add — such as, Test your database to make it more challenging for the IT Guy to find errors!
The Tower of Data lies at the heart of bioinformatics, and filling the cracks with tar is what we do. We don’t have the option of giving up like the tower builders of old, since that would doom the entire omic edifice. Much of the hardship is of our own making and could be avoided if data providers would obey Lincoln Stein’s seven commandments.
Many software vendors have seen our pain and are offering salvation through clever products. Some of these may help and are certainly worth a look, but I suspect we’ll be shoveling tar for some time to come.
Perhaps a Thesaurus Would Help?
A common problem that recurs across the phases is the need to identify biologically identical or equivalent entities.
The most basic form of the problem is that some data providers employ multiple identifiers for the same data — for example, NCBI’s use of ‘gi’ numbers in parallel with accession numbers. A related problem is the tendency of data providers to invent their own accession numbers for data they modify slightly and redistribute, such as the ‘NM’ accession numbers that NCBI assigns to RefSeq entries. There are good reasons for these practices, but they create obvious confusion.
Closely related is the practice of assigning separate identifiers for a biological entity, such as a clone, and data derived from the entity, such as its sequence. Obviously a clone and a sequence are different things, but we often treat them the same from an analytical standpoint.
A harder version of the problem comes from the inherent redundancy of many biological databases. EST and cDNA sequence databases are a clear example: It is natural and unavoidable that a single gene will give rise to numerous sequences, each with its own accession number. This can be especially confusing when integrating microarray data, since the probes on an array are often identified by sequence accession numbers and different arrays use different probes for a given gene. The net effect is that the same gene will have different accession numbers on different arrays.
To sensibly integrate such data, you have to translate accession numbers to genes, perhaps via UniGene. This is not a perfect translation, since the connection between a sequence and its gene may not be clear cut: different sequences may reflect different splice variants, or may come from different parts of the gene and therefore survive sample preparation at different rates, and so forth. This is an example of the subtlety in deciding when two biological entities are “equivalent.”
The practical solution to the identifiers problem is straightforward, though a pain in the butt. You need to maintain a synonym table that lists identifiers that you deem to be equivalent for the problem at hand, and force your software to consult this table whenever it sees an identifier. This solution can be implemented with much less pain using object-oriented methods, since the lookup can be done automatically.
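The synonym table can be a very small piece of code; the hard part is curating its contents. Here is a sketch with the lookup hidden behind a class, so that calling code never touches raw identifiers directly. The identifiers in the example are invented mappings in the style of NCBI accession and gi numbers:

```python
# A sketch of the synonym-table solution: every known identifier maps to a
# single canonical one, and software consults the table on every lookup.
# The example identifiers and their grouping are invented for illustration.

class SynonymTable:
    """Maps every known identifier to one canonical identifier."""

    def __init__(self):
        self._canonical = {}

    def add_synonyms(self, canonical, *synonyms):
        """Declare a canonical identifier and its equivalent synonyms."""
        self._canonical[canonical] = canonical
        for s in synonyms:
            self._canonical[s] = canonical

    def resolve(self, identifier):
        """Return the canonical form, or the input itself if unknown."""
        return self._canonical.get(identifier, identifier)

table = SynonymTable()
table.add_synonyms("NM_000001", "U00001", "gi:1000001")  # invented mapping
# table.resolve("gi:1000001") == "NM_000001"
```

Wrapping the lookup in a class like this is the object-oriented trick mentioned above: if every accessor calls `resolve` internally, the rest of the software never has to know that synonyms exist.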
Tar, Tar, Everywhere Tar
There are a lot of data integration products on the market, and more to come. Here are some highlights of the major players and a few of the minor ones.
SRS from Lion Bioscience is the clear market leader. It focuses on technical integration and basically seeks to present a convenient flat file view of as many databases as possible, with links between databases that users can follow. It features a powerful language, called Icarus, for accomplishing technical integration — it’s even better than Perl for pulling information out of text files. But it’s non-standard, and many programmers find it hard to master. SRS can also work with XML files, relational databases, and other data sources.
DiscoveryLink from IBM is the clear hype leader. No disrespect intended, but few companies can match IBM’s marketing prowess. Big Blue emphasizes that DiscoveryLink is a “solution,” not a “product,” comprising several software elements plus services. It provides a means for programmers to write wrappers that convert external data sources into virtual tables. Users can then combine data from multiple sources by writing queries in the standard relational query language, SQL.
The product, er, I mean solution, comes with wrappers for comma-delimited text files (but not tab-delimited files), Excel files, BLAST, and Documentum. Additional wrappers must be written in the C++ programming language, which seems an odd choice in this market. If you want to use Perl or something else, IBM recommends that you further wrap your code in SOAP, which can then be connected to C++.
Entigen offers a similar product called ADAAPT that sits underneath its BioNavigator collection of tools.
GeneticXchange takes a slightly different approach in its discoveryHub product. Like DiscoveryLink, the basic idea is to create wrappers around external data sources. A key difference, though, is that discoveryHub doesn’t force you to convert all your data into a flat relational form, but rather provides a query language that handles hierarchical structures. This is an important difference since so much of biology is intrinsically hierarchical. The product comes with more than 60 wrappers and the company offers a Wrapper Generator Kit to add more.
Acero’s data warehouse product Genomics Knowledge Platform takes a very different approach. It focuses squarely on conceptual integration and provides a very detailed object-oriented model containing approximately 300 classes of data, storing the integrated data in Oracle. The product also provides a means of firing off and managing big compute jobs, like large BLAST runs. The object model was originally developed for Incyte and is part of the software tools Incyte provided to its customers. My guess is that this will be a great product for folks who like Acero’s object model, and torture for those who hate it.
Tower Builders’ Notes
Keith Allen, Data Integration for Function Discovery
Lincoln Stein, Bioinformatics — Building a Nation from a Land of City States
Acero/Genomics Knowledge Platform
US National Center for Biotechnology Information/UniGene
South African National Bioinformatics Institute/STACKdb