Teaching your old database new tricks could ease your troubles
My guess is that it will be quite a while before we have computers as smart as our comic hero Dataslave. But even with current technology, databases can be a lot smarter than the inert lumps they are today. To make this so, we have to spend more effort programming biological concepts into the database, rather than just filling its head with biological facts.
The trick is finding the right database technology and then setting a good balance between conceptual learning and rote.
A database is more than a storehouse of facts. We generally expect even simple databases to process information in ways that make it more useful: checking for errors when data are entered and sorting and querying data based on their content. To perform these functions, the database must be programmed with some understanding of what the data mean.
Let’s look at a simple, everyday example: an address-list database, such as you might find in a typical e-mail program. Each entry in the database contains the name, address, phone number, and e-mail address of a person, place, or organization.
Here’s a sample entry:
Name: George W. Bush, Jr.
Address: The White House,
1600 Pennsylvania Avenue, N.W.,
Washington, DC 20500
E-mail: [email protected]
As a person well versed in the ways of the Western world, I’m sure you have no trouble understanding what this entry means. You know that we’re talking about a person whose first name is George and last name is Bush. This person lives or works at a place called the White House, which is located in the city of Washington in the pseudo-state of the District of Columbia, within zip code 20500. You can also tell that the White House is on a street called Pennsylvania Avenue, and that 1600 is the number of the place on the street; you can guess that the abbreviation N.W. after the street name means northwest, but unless you are proficient in DC geography, you may not know exactly what this means. From the phone number, you can see that 202 is the area code. And if you’re up on phone number-ology, you know that 202 is the correct area code for DC and some government offices in the immediate environs. You can parse the e-mail address into the username and domain, and from the .gov suffix, you can tell that this guy works for the government.
This simple example illustrates that even something as commonplace as an address list embodies a great deal of specialized knowledge.
From Kindergarten to Grad School
When building a database, you can choose to program in as much or as little knowledge as you want, depending on how capable you want the database to be. If you want the database to sort people by last name, you must give it the ability to parse names into their component parts. If you want the database to check that addresses are valid, print delivery barcodes, or map addresses via services like MapQuest, you must give it the ability to parse addresses into parts.
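To make the name-parsing point concrete, here is a minimal sketch in Python. It assumes names follow the pattern "First [Middle] Last[, Suffix]"; the suffix list and the names `ParsedName`, `parse_name`, and `sort_by_last_name` are illustrative inventions, and real names would need many more rules.

```python
from dataclasses import dataclass
from typing import List

# Assumed, deliberately short list of suffixes; a real system needs more.
SUFFIXES = {"Jr.", "Sr.", "II", "III"}


@dataclass
class ParsedName:
    first: str
    middle: str
    last: str
    suffix: str


def parse_name(raw: str) -> ParsedName:
    """Split 'First [Middle ...] Last[, Suffix]' into its parts."""
    main, _, tail = raw.partition(",")
    tail = tail.strip()
    suffix = tail if tail in SUFFIXES else ""
    words = main.split()
    if not words:
        raise ValueError("empty name")
    first = words[0]
    last = words[-1] if len(words) > 1 else ""
    middle = " ".join(words[1:-1])
    return ParsedName(first, middle, last, suffix)


def sort_by_last_name(names: List[str]) -> List[str]:
    """Sort raw name strings by parsed last name, then first name."""
    def key(raw: str):
        parsed = parse_name(raw)
        return (parsed.last, parsed.first)
    return sorted(names, key=key)
```

Without the parsing step, the database could only sort on the raw string, which would order people by first name.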
There’s no end to it. Many people have multiple addresses and phone numbers. Some have a different address for winter than summer, or one for weekdays and another for weekends. Sometimes a mailing address differs from the shipping or physical address. Sometimes the phone number is a direct line, while other times it’s the main number for their business or home. Some people use one e-mail address for work and another for personal correspondence.
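One way to accommodate this variety, sketched here under my own assumptions, is to store each address, phone number, or e-mail as a separate contact point tagged with how it is used. The class names, the usage labels, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContactPoint:
    kind: str   # "address", "phone", or "email"
    usage: str  # hypothetical labels such as "work", "personal", "shipping"
    value: str


@dataclass
class Contact:
    name: str
    points: List[ContactPoint] = field(default_factory=list)

    def lookup(self, kind: str, usage: str) -> Optional[str]:
        """Return the first value matching kind and usage, or None."""
        for point in self.points:
            if point.kind == kind and point.usage == usage:
                return point.value
        return None


# Made-up example data
pat = Contact("Pat Example", [
    ContactPoint("email", "work", "pat@work.example"),
    ContactPoint("email", "personal", "pat@home.example"),
    ContactPoint("address", "shipping", "12 Dock Street"),
])
```

Seasonal or weekday addresses could be handled the same way by adding more usage labels or a date range to each contact point.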
The more of this knowledge you program into the database, the more the system can do. While we rarely see such smarts in address lists aimed at casual users, products intended for sales reps, fundraisers, and other contact-info-collecting pros often have this degree of expertise and more.
Doing a Bio Post-Doc
Biological data are far more complex. Let’s think about the knowledge you might put into a database of human genes.
Each gene has one or more transcripts, and each transcript has a gene model made up of a list of exons. The evidence supporting a transcript may come from full-length cDNA sequences, ESTs, computer predictions, or some combination.
You may know some of the regulatory sites controlling the gene’s expression. Some may be close to the gene and directly modulate the transcriptional machinery, while others may be more distant and affect phenomena like chromatin remodeling. Some sites may regulate all splice forms, while others may control splicing or otherwise affect specific splice forms.
You may know something about the gene’s function from direct experimental evidence, extrapolation from a close homolog in another species, or computational prediction. You may be aware of SNPs or other variations in a transcript or regulatory site, and you may have information about how they impact the gene’s function.
Each sequence feature has a position in the genome. As there are multiple assemblies of the genome vying for our attention, the database may have to store a position for each feature in each assembly.
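A small part of this knowledge might be sketched as follows. The class names, the 1-based inclusive coordinate convention, and the assembly names are assumptions made for illustration; regulatory sites, function, and variation are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Location:
    chromosome: str
    start: int  # 1-based, inclusive (an assumed convention)
    end: int


@dataclass
class Exon:
    # One location per genome assembly, keyed by assembly name
    locations: Dict[str, Location]


@dataclass
class Transcript:
    name: str
    exons: List[Exon]
    evidence: List[str]  # e.g. "cDNA", "EST", "prediction"

    def spliced_length(self, assembly: str) -> int:
        """Total exon length of this transcript in one assembly."""
        total = 0
        for exon in self.exons:
            loc = exon.locations[assembly]
            total += loc.end - loc.start + 1
        return total


@dataclass
class Gene:
    symbol: str
    transcripts: List[Transcript] = field(default_factory=list)


# Made-up gene with one two-exon transcript placed on two assemblies
exon1 = Exon({"assemblyA": Location("chr1", 100, 199),
              "assemblyB": Location("chr1", 1100, 1199)})
exon2 = Exon({"assemblyA": Location("chr1", 300, 349),
              "assemblyB": Location("chr1", 1300, 1349)})
gene = Gene("GENE1", [Transcript("GENE1-a", [exon1, exon2], ["cDNA", "EST"])])
```

Even this toy version already forces decisions about coordinates, evidence types, and how features relate to assemblies.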
There’s no end to it. Any biological concept pertaining to genes and genomes could potentially be added to the database. Scientists discover new biological concepts all the time, so you can never really be done.
When should you stop teaching your database new tricks? There’s no simple answer. What you’re really doing when you program knowledge into a database is building a model of reality. Like modeling in other areas of engineering and science, this is a difficult balancing act in which you have to reconcile the conflicting demands of realism and cost.
A smart database needs a good teacher. Bear in mind that a computer has no innate intelligence and no common sense. To educate this dumb beast, you have to break the material into tiny pieces, organize it in a precise, logical structure, then break it up again, and program all this into the computer with unwavering accuracy. For any reasonable subset of biology, you’ll end up with thousands of informational and organizational pieces. This is in addition to the data themselves. It’s painstaking, time-consuming work.
The only way to make this practical is to employ powerful data-modeling technology (see sidebar). Today, people typically build biological databases using relational database systems, like Oracle or MySQL. They are fine products, but they’re not great for knowledge-rich databases.
Sadly, there are no mainstream commercial products intended for smart databases. There is, however, a lot of technology available from the computer-science research world, including some hot work on ontologies, which bioinformaticists are starting to use (see table).
I’m not sure what will prove to be the winning solution. But I know we can do better by reaching out to our computer science colleagues and adapting their masterful database technologies.
What Codd Created: A History of Data Models
First, an introduction to some computer-science jargon: A database design defines the kinds of data that can be stored in a database, any associated knowledge, and how it is all organized. A data model is a formalism for expressing a database design.
The modern history of database management begins with the invention of the relational data model by E.F. (Ted) Codd in 1970. Codd’s idea was simple. He proposed that data be organized in tables, with each row representing an entry or record. A database as a whole could contain many tables, and complex structures could be constructed by letting rows refer to each other using values called keys.
In our address list example, we might store people and their names in one table, places and addresses in a second, and we could link persons to places by storing the person’s key in the place’s row.
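Here is a minimal sketch of that two-table design using Python's built-in sqlite3 module. The table and column names are my own, and the key values are arbitrary.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE place (
    place_id  INTEGER PRIMARY KEY,
    address   TEXT NOT NULL,
    person_id INTEGER REFERENCES person(person_id)  -- key linking place to person
);
""")

conn.execute("INSERT INTO person VALUES (1, 'George W. Bush')")
conn.execute(
    "INSERT INTO place VALUES (10, '1600 Pennsylvania Avenue, N.W., Washington, DC 20500', 1)"
)

# Follow the key from place back to person
rows = conn.execute(
    "SELECT person.name, place.address "
    "FROM person JOIN place ON place.person_id = person.person_id"
).fetchall()
```

The join is where the key earns its keep: neither table mentions the other's contents, only the shared key value.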
What made this idea so brilliant — beyond its elegant simplicity — was Codd’s observation that these tables could be treated as relations in mathematical logic. This allowed him to exploit the machinery of formal logic to devise powerful, yet efficient, query languages for accessing a database.
Codd’s idea quickly took hold. All the well-known database packages today — Oracle, MySQL, and others — have been built on Codd’s seminal work of 30 years ago.
After digesting Codd’s creation for a few years, the database research community recognized a significant weakness in the work: the model is essentially free of semantics. It offers precious little help in representing what the data mean. This led to a research track that continues to add more semantics to Codd’s model.
The first big breakthrough along this path was the publication of the entity-relationship data model by Peter Chen in 1976. Chen proposed that data be classified as entities or relationships. In his words, “An entity is a ‘thing’ which can be distinctly identified. A relationship is an association among entities.” The model also includes attributes, which are simple properties of entities or relationships.
In our address list example, we might model persons and places as entities, and connect persons to places through a relationship; name would be an attribute of person, address would be an attribute of place. And we might treat phone number as an attribute of the relationship to capture the fact that people’s phone numbers usually change when they move.
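The same design can be sketched with plain Python classes. This is only an illustration of the idea, not Chen's notation; the class names, the `move` helper, and the sample phone numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:      # entity
    name: str      # attribute of the entity


@dataclass(frozen=True)
class Place:       # entity
    address: str   # attribute of the entity


@dataclass(frozen=True)
class LocatedAt:   # relationship between a Person and a Place
    person: Person
    place: Place
    phone: str     # attribute of the relationship, not of either entity


def move(link: LocatedAt, new_place: Place, new_phone: str) -> LocatedAt:
    """Moving creates a new relationship, and the phone changes with it."""
    return LocatedAt(link.person, new_place, new_phone)


# Made-up example data
pat = Person("Pat Example")
old_link = LocatedAt(pat, Place("12 Dock Street"), "555-0100")
new_link = move(old_link, Place("34 Hill Road"), "555-0199")
```

Because the phone number lives on the relationship, the person entity is untouched by the move.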
The next big step was the object-oriented model, which evolved through the work of many investigators in the mid-1980s to early ’90s. This approach is orthogonal to Chen’s model and is centered around two main ideas. First, it organizes data into class/subclass hierarchies. And second, it allows — indeed encourages — the commingling of procedural and structural facets of the database.
For example, we might create a class for persons, and subclasses for particular kinds of persons: men vs. women, girls vs. boys. In addition, the procedure for parsing a name into its components might be associated with the various classes and subclasses. This is a handy way to accommodate variation in the procedure, for example to permit the title Mr. for men, Ms. for women, and Miss and Master for girls and boys.
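A rough sketch of this arrangement in Python follows. The class layout and the `parse_name` method are my own illustration; each subclass only overrides the list of titles it accepts, while the parsing procedure is inherited.

```python
class Person:
    titles = ()  # subclasses list the titles they accept

    def __init__(self, raw_name: str):
        self.title, self.name = self.parse_name(raw_name)

    @classmethod
    def parse_name(cls, raw: str):
        """Strip a leading title if it is valid for this class."""
        first, _, rest = raw.partition(" ")
        if first in cls.titles:
            return first, rest
        return "", raw


class Man(Person):
    titles = ("Mr.",)


class Woman(Person):
    titles = ("Ms.",)


class Girl(Person):
    titles = ("Miss",)


class Boy(Person):
    titles = ("Master",)
```

For instance, `Man("Mr. John Smith")` yields the title `Mr.` and the name `John Smith`, while `Woman("Mr. John Smith")` finds no valid title and keeps the whole string as the name.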
The current hot idea is ontologies. The term comes from philosophy, where it names the branch of metaphysics that studies what kinds of things exist.
Ontologies in the computer science sense combine entity-relationship and object-oriented thinking. Data are organized into classes and subclasses, as well as relationships and sub-relationships. In the most sophisticated versions of this idea, data are placed in the correct subclasses or sub-relationships based on logical properties of the data.
For example, we could define the man subclass as containing all persons whose sex is male. Then as data are entered into the database, the system would check the sex field and place the person in the correct subclass automatically.
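The automatic placement can be sketched as a table of subclass definitions, each a logical test on the record. This is a toy stand-in for a real ontology reasoner; the `woman` definition is my own addition by analogy, and the names `SUBCLASS_DEFINITIONS` and `classify` are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Each subclass is defined by a logical property of the data
SUBCLASS_DEFINITIONS: Dict[str, Callable[[Record], bool]] = {
    "man": lambda person: person.get("sex") == "male",
    "woman": lambda person: person.get("sex") == "female",  # assumed by analogy
}


def classify(person: Record) -> List[str]:
    """Return every subclass whose definition the record satisfies."""
    return [name for name, test in SUBCLASS_DEFINITIONS.items() if test(person)]
```

A record with no sex field satisfies neither definition, so it stays in the general person class.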
Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory, led a bioinformatics marketing team for Compaq Computer, and has been consulting ever since. He is currently a free agent in Seattle. Send your comments to Nat at [email protected]
SOME BIOLOGICAL ONTOLOGY WEB SITES
|Ontology|Authors|Description|Web site|
|---|---|---|---|
|Bio-Ontologies|R. Stevens|Bio-Ontologies meeting| |
|Gene Ontology|Gene Ontology Consortium|Controlled vocabulary for gene function, biological process, and cellular component|www.geneontology.org|
|Interaction Ontology|P. Karp, S. Paley|Metabolic reactions| |
|RiboWeb|R. Altman, R. Chen, R. Felciano|Structure of ribosome| |
|TAMBIS|A. Brass, C. Goble, N. Paton, R. Stevens|Ontology-based data integration| |

|Paper|Author|Publication|
|---|---|---|
|A Relational Model of Data for Large Shared Data Banks|E.F. Codd|Communications of the ACM, Vol. 13, No. 6, June 1970|
|The Entity-Relationship Model-Toward a Unified View of Data|P. Chen|ACM Transactions on Database Systems, Vol. 1, No. 1, March 1976|