NAT GOODMAN proposes an integrated data repository. If someone would build it, he bets scientists would come.
Biological databases, like the Web at large, are suffering from a disease of wealth. There are bushels of useful data out there, but it can be hard to find what you need. Often the data are so spread out over multiple databases and pages that, even after you find what you’re looking for, it can take a lot of work to pull it all together.
This is a big-money problem for the Web in general, and you can bet that some hotshot startup is working on cool software to help consumers with online shopping. That software will help scientists, too, but we can do even better by exploiting the special characteristics of biological data and research.
It would be enormously helpful to have a gene-centric integrated database that pulls together the important information about genes, transcripts, and proteins. Imagine a database where you could type in the query “human caspase 1” and get back all the important data about this gene, from sequence to literature. What molecular biologist wouldn’t come out to play on something like this?
DREAM TEAM DATABASE
Picture what this database might look like: The core entities would be individual genes in specific organisms, for example, the caspase 1 gene in human. Linked to it would be various categories of experimental and computational data, including sequences, domains, structures, expression patterns, and small molecule affinities. Also linked to the entry would be the literature and other textual descriptions of gene function, plus “practical” information such as available reagents.
Zooming in on the sequence data linked to the gene, you’d see genomic, transcriptional, and protein sequences linked together. Each transcript would be linked “down” to the genomic sequence from which it is transcribed, and “up” to the protein sequence into which it is translated. The transcriptional portion of the database would contain clusters of overlapping ESTs and other cDNA sequences; it would also identify the unique full-length (or “long”) transcripts that can be determined from the dataset.
The genome portion of the database would identify the genomic location of the gene, its exon/intron structure, promoters and other regulatory sites (to the extent these are known), and SNPs and other polymorphisms in the region.
The protein portion of the database would have an entry for each isoform, which would in turn be linked to the domains, functional sites, secondary structure motifs, and other features found in its sequence.
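To make the picture concrete, here is a minimal sketch of how the gene, genome, transcript, and protein entries might hang together. The class and field names are my own inventions for illustration, and the sample values are placeholders, not real data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GenomicLocus:
    chromosome: str
    start: int
    end: int
    exons: List[Tuple[int, int]] = field(default_factory=list)  # (start, end) pairs


@dataclass
class Protein:
    isoform_id: str
    sequence: str
    domains: List[str] = field(default_factory=list)


@dataclass
class Transcript:
    transcript_id: str
    sequence: str
    locus: Optional[GenomicLocus] = None  # "down" link to the genomic sequence
    protein: Optional[Protein] = None     # "up" link to the translated protein


@dataclass
class Gene:
    name: str
    organism: str
    locus: Optional[GenomicLocus] = None
    transcripts: List[Transcript] = field(default_factory=list)

    def proteins(self) -> List[Protein]:
        """Follow the "up" link from each transcript to its protein isoform."""
        return [t.protein for t in self.transcripts if t.protein is not None]


# Placeholder entry: coordinates, identifiers, and sequences are made up.
locus = GenomicLocus(chromosome="chrN", start=100, end=900,
                     exons=[(100, 300), (600, 900)])
isoform = Protein(isoform_id="isoform-1", sequence="<protein sequence>",
                  domains=["<domain name>"])
transcript = Transcript(transcript_id="transcript-1", sequence="<cDNA sequence>",
                        locus=locus, protein=isoform)
gene = Gene(name="caspase 1", organism="human", locus=locus,
            transcripts=[transcript])
```

Because each transcript carries both links, anything attached to a transcript or a protein can always be traced back to the gene that owns it.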
HOW TO KEEP THE BALL IN PLAY
Other data would be linked directly to the appropriate elements of this data network and indirectly to the gene itself. For example, gene expression data generated at the transcriptional level — such as through gene chip experiments — would be linked directly to the transcripts involved, while expression data generated by proteomic methods would be linked directly to the corresponding proteins.
Having integrated the data for each gene, the next step would be to link homologous genes in the same and different organisms. The database would directly link each gene to its close homologs and indirectly to more distant ones. For example, the database would directly link human caspase 1 to mouse and rat caspase 1, and to human caspases 2-12, but only indirectly to the rodent caspases 2-12.
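One way to think about direct and indirect links is as hops in a graph of genes. The sketch below is only an illustration under that assumption: the function names are hypothetical, and the handful of links is a toy subset of the caspase example, not a real homology table.

```python
from collections import deque


def add_link(links, a, b):
    """Record a direct (close) homology link in both directions."""
    links.setdefault(a, set()).add(b)
    links.setdefault(b, set()).add(a)


def homologs_within(links, start, max_hops):
    """Breadth-first search: map each reachable gene to its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        if hops[gene] == max_hops:
            continue  # do not expand past the requested range
        for neighbor in links.get(gene, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[gene] + 1
                queue.append(neighbor)
    del hops[start]
    return hops


links = {}
add_link(links, "human caspase 1", "mouse caspase 1")
add_link(links, "human caspase 1", "rat caspase 1")
add_link(links, "human caspase 1", "human caspase 2")
add_link(links, "human caspase 2", "mouse caspase 2")

direct = homologs_within(links, "human caspase 1", max_hops=1)
extended = homologs_within(links, "human caspase 1", max_hops=2)
```

Here `direct` holds the three genes one link away from human caspase 1, while mouse caspase 2 appears only in `extended`, reached by way of human caspase 2.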
It would be very useful to go beyond close homologs by using protein family and super-family information, for example, to connect the mammalian caspases to homologs in more distant species, and to connect the caspases to other proteases.
The database would also need an “identification layer” that understands how to find a gene given its name, sequence, or other identifying information. This layer should be smart enough to recognize alternate names for genes, alternate spellings, and common misspellings.
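A bare-bones version of such a layer for gene names could combine a table of alternate names with fuzzy matching for misspellings. This is a sketch, not a design: the alias table is a tiny hand-written example, the `identify` function and the 0.8 similarity cutoff are my own assumptions, and a real system would also need to handle lookup by sequence.

```python
import difflib

# Tiny illustrative alias table, keyed by normalized name.
ALIASES = {
    "caspase 1": "CASP1",
    "casp1": "CASP1",
    "interleukin 1 beta converting enzyme": "CASP1",
}


def normalize(name):
    """Lowercase, treat hyphens as spaces, and collapse extra whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def identify(name, aliases=ALIASES, cutoff=0.8):
    """Return the gene symbol for a name, tolerating small misspellings."""
    key = normalize(name)
    if key in aliases:
        return aliases[key]
    # Fall back to the closest known name, if any is similar enough.
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    return aliases[close[0]] if close else None


exact = identify("Caspase-1")
typo = identify("caspse 1")
unknown = identify("hexokinase")
```

The first two calls resolve to the same symbol, one by exact lookup after normalization and one by fuzzy match, while the third finds nothing close enough and returns `None`.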
To use the database, you would enter identifying information to specify the genes you’re interested in, indicate the kinds of data you want to retrieve, and provide a taxonomic range to tell the system how far to go in collecting data about related genes. The system would do the rest, creating, in effect, a customized database for your genes of interest.
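In code, a query of this kind boils down to three filters. The sketch below assumes a simple in-memory store keyed by organism and gene name, and it stands in for the taxonomic range with an explicit set of organisms; the `Query` class, the `run_query` function, and all the records are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Query:
    genes: FrozenSet[str]      # genes of interest, already identified
    kinds: FrozenSet[str]      # kinds of data to retrieve
    organisms: FrozenSet[str]  # stand-in for the taxonomic range


def run_query(store, query):
    """Build a customized mini-database holding only the requested data."""
    result = {}
    for (organism, gene), record in store.items():
        if gene in query.genes and organism in query.organisms:
            result[(organism, gene)] = {
                kind: value for kind, value in record.items()
                if kind in query.kinds
            }
    return result


# Placeholder records, not real data.
store = {
    ("human", "caspase 1"): {"sequence": "<placeholder>",
                             "literature": ["<placeholder citation>"]},
    ("mouse", "caspase 1"): {"sequence": "<placeholder>",
                             "expression": "<placeholder>"},
    ("rat", "caspase 1"): {"sequence": "<placeholder>"},
}

query = Query(genes=frozenset({"caspase 1"}),
              kinds=frozenset({"sequence"}),
              organisms=frozenset({"human", "mouse"}))
custom_db = run_query(store, query)
```

With this query, `custom_db` keeps the human and mouse entries, trimmed to their sequence data, and leaves the rat entry out because it falls outside the requested range.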
A FEW PINCH HITTERS
No one has yet built this complete scheme, but workable solutions exist for many parts of it.
The US National Center for Biotechnology Information has developed a series of new interrelated databases that speak to this point. LocusLink provides a single point of access for information about genes and other genetic loci in human, mouse, rat, zebrafish, and fruit fly. RefSeq contains semi-curated reference sequences for human, mouse, and rat genes. HomoloGene links homologous genes across human, mouse, rat, and zebrafish.
The European Bioinformatics Institute has long operated SRS, which provides a unified query interface to more than 125 databases and links related entries across them. EBI recently introduced InterPro, an integrated database of functional sites and domains in proteins. It combines data from Pfam, PRINTS, PROSITE, and ProDom.
Of three well-known, publicly accessible databases of transcriptional sequences, NCBI’s UniGene has emerged as the gold standard. In contrast to UniGene, the other two (Gene Index, developed by The Institute for Genomic Research, and STACK, developed by the South African National Bioinformatics Institute) identify unique full-length or long transcripts in each cluster. This is a critical step in linking transcriptional data to genomic and protein sequences.
As for public annotated human genome data, EBI’s Ensembl is still at an early stage of development and is poorly integrated with other databases. As far as I can tell, there is no way to query Ensembl by gene name. I’ve had good luck querying the database by sequence; this seems to find reasonable gene models and even connects the results to entries in the established sequence databases. It will take just a little more programming by the Ensembl team to get queries working in the way needed for integration.
PLAYING THE FIELD
I find myself using all these databases while doing electronic research on a gene. If I know the gene name, I go first to NCBI’s LocusLink as this gives easy access to reference DNA and protein sequences (via RefSeq), transcript-clusters (via UniGene), textual descriptions (via OMIM), literature citations (via PubMed), and more.
Next, I go to EBI where I use SRS to get curated protein sequences from SWISSPROT (or better, SWALL, which includes sequences waiting to be curated); then I ask SRS to link the retrieved protein sequence with InterPro, and—voila!—this gives me all the known functional sites and domains in the sequence.
I then take the reference DNA sequence from NCBI and use it to query TIGR’s Gene Index and SANBI’s STACK database to get my best look at transcriptional data. Finally, I use the reference DNA sequence to query Ensembl for genomic sequence.
If all I have is a sequence, typically from an EST, I first query UniGene to see if I’m working with a known gene. If not, I go to Gene Index and STACK to get long transcripts that I can then use to query InterPro for functional information and Ensembl for genomic information. The Baylor College of Medicine Search Launcher is also a good resource in the sequence-only situation.
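For readers who like their routines spelled out, the two routes I follow can be written down as a small planning function. This only lists the steps in order; it does not contact any of the databases. The function name and flags are hypothetical, and the branch for a sequence that UniGene does recognize is my own assumption, since in that case I simply carry on as if I had started with the gene name.

```python
def lookup_plan(have_gene_name, known_in_unigene=False):
    """Return the ordered list of databases to visit for one gene."""
    by_name = [
        "LocusLink: reference sequences, transcript clusters, descriptions, citations",
        "SRS at EBI: curated protein sequence from SWISSPROT or SWALL",
        "InterPro via SRS: known functional sites and domains",
        "Gene Index and STACK: transcriptional data, queried with the reference DNA sequence",
        "Ensembl: genomic sequence, queried with the reference DNA sequence",
    ]
    if have_gene_name:
        return by_name

    plan = ["UniGene: check whether the sequence belongs to a known gene"]
    if known_in_unigene:
        # Assumption: once the gene is identified, continue with the
        # name-based route.
        return plan + by_name
    return plan + [
        "Gene Index and STACK: retrieve long transcripts",
        "InterPro: functional information, queried with the long transcripts",
        "Ensembl: genomic information, queried with the long transcripts",
    ]


name_route = lookup_plan(have_gene_name=True)
sequence_route = lookup_plan(have_gene_name=False)
```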
There are about 40 databases of protein families that provide highly curated information on homologous genes. PROSITE and Hovergen are also good sources for homology information, as is HomoloGene for the organisms it covers.
WHO WILL BE THE MAJOR LEAGUE PLAYERS?
With the human genome complete, we can treat the universe of human genes as a finite collection. This means gene-centric data integration is no longer a terribly hard problem. But it is a lot of work and can only be tackled by people with lots of money. The large data vendors, notably Incyte and Celera, have the money, interest, and expertise to take it on, as do some of the Web portal companies, such as DoubleTwist and Compugen. NCBI and EBI have the interest and expertise but may not have the funds.
One or more commercial firms will likely develop an integrated, gene-centric database of the sort I’ve outlined. A public offering from NCBI or EBI would be nice, but not likely unless the academic community rallies behind them. The smart commercial firms will do their best to head off such an initiative by making their products attractive to academics. This might be the outcome that lets us all play ball.
THE BULLPEN: WEBSITES WITH POTENTIAL
BCM Search Launcher http://www.hgsc.bcm.tmc.edu/SearchLauncher/