NAR's Online Database Collection Approaches the 1,000 Mark; Proof that Good Databases Never Die?

Nucleic Acids Research has published its 13th annual issue dedicated to molecular biology databases, along with a supplement that's becoming an important resource in its own right: an online compilation of databases with summaries of their content and direct links to their homepages.

The current database issue includes 162 open-access papers, which represents only 19 percent of the 858 databases in the freely available online supplement, called the Molecular Biology Database Collection (available at http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D3/DC1). At its current rate of growth, the collection is poised to hit the 1,000 mark next year.

In a paper describing the collection, Michael Galperin of the National Center for Biotechnology Information noted the "remarkable resilience" of the resources in the collection. Out of 719 databases featured in last year's list, only two are no longer being maintained "because their authors graduated, retired or changed focus," and one has shifted to restricted access. In addition, three databases that were "considered dead" last year -- ABCdb, EID, and KDBI -- "have now been resurrected," he wrote.
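The counts quoted above (162 papers, 858 databases this year, 719 last year) can be checked with a few lines of arithmetic. The sketch below is only illustrative: the variable names are made up for this example, and the projection assumes that next year's net increase simply equals this year's, which is a simplification rather than anything Galperin or NAR stated.

```python
# Figures quoted in the article.
papers_in_issue = 162   # open-access papers in the current database issue
databases_2006 = 858    # databases in the current online collection
databases_2005 = 719    # databases featured in last year's list

# Share of the collection that is covered by papers in the issue.
share = papers_in_issue / databases_2006

# Net change in the size of the collection over the past year.
net_growth = databases_2006 - databases_2005

# Assumption (not from the article): next year's net increase
# matches this year's.
projected_2007 = databases_2006 + net_growth

print(f"Share of collection published in the issue: {share:.1%}")
print(f"Net growth since last year: {net_growth}")
print(f"Projected size next year: {projected_2007}")
```

Under that assumption the collection would sit just short of the 1,000 mark next year, in line with the article's description of it as approaching that figure.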

"Databases are born, they grow, and at some time many of them become senile," Galperin told BioInform, "but the fact is that very few databases actually die."


Galperin said that he and the NAR editors initially thought that the databases in the collection would drop out at the same rate at which new ones were added, "but that doesn't happen … because disk space is cheap, and because there are cable modems and so on so that people have the Internet at home, so people are building them not only at work but also in their spare time."

In addition, he said, other databases are "like the phoenix. Sometimes they die and are reborn under new names, sometimes under the same management," but in a new home.

NAR began publishing the database issue in 1994, and the online supplement was first launched in 1999 as a means of organizing the growing list of published resources into broad categories. Galperin has produced the collection for the last three years, and has implemented a classification system that now includes 64 categories and subcategories (see table, below, for a list of these categories and their growth since 2004, along with a chart highlighting the growth of some key categories since 2001).

Galperin also began issuing accession numbers for each of the databases in the resource, which are used to keep track of those databases that are "resurrected" or undergo a change of name or host organization, and can also be used to access updated summaries on the NAR website.

But even as the resource gains database-like functionality as a means of navigating a growing set of bioinformatics databases, Galperin described the collection as "a reward mechanism" for database developers and administrators and a "gateway" for users. "It's not primarily a database in its own right," he told BioInform.

"The online supplement was originally just to keep track of these databases" published in the issue, he said. The collection soon grew to include links to resources that were not published in the issue due to space considerations -- an obvious challenge for a printed journal, but Galperin noted that even the online version of the journal "can hold only so many of them."

The databases in the collection are not subject to the same "harsh" peer review as those in the issue itself, but Galperin stressed that "this is not just a list of all databases. There are hundreds, maybe thousands, [more] and our work is as a gatekeeper."


He noted that the collection walks the line between two extremes: an extensive collection of everything available, and a highly selective list. "This list is something in between," he said. Data warehouses are off-limits, for example, as are databases that essentially "repackage" information from other publicly available resources.

This policy has its drawbacks. "We have not always been completely successful in weeding out repackaging sites, but at least we're trying to do that," Galperin said. In addition, "some people get extremely annoyed" when they put a lot of time and effort into creating a "nice interface" for repackaged data and it is not accepted into the collection.

Galperin said that several categories are growing more quickly than others. RNA databases, for example, "have exploded," and there is also solid growth in databases related to immunology, plants, model organisms, and protein-protein interactions.

Going forward, Galperin said that he expects the recently announced human cancer genome project [BioInform 12-19-05] to lead to many new repositories for cancer-related data. More generally, he said he expects the collection to continue growing by more than 100 databases per year.

Aside from the low cost of storage and ready Internet access, "everybody knows a little bit of programming now, so anybody can basically build a database from scratch," he said. "The question is, do you have any good ideas, and do you have any data to show how you can be of interest to the community?"

-- Bernadette Toner

Growth in Database Categories and Subcategories, 2004-2006

| Category | 2004 | 2005 | 2006 |
|---|---|---|---|
| 1. Nucleotide Sequence Databases | | | |
| 1.1. International Nucleotide Sequence Database Collaboration | 3 | 3 | 3 |
| 1.2. DNA sequences: genes, motifs and regulatory sites | | | |
| 1.2.1. Coding and non-coding DNA | 15 | 17 | 20 |
| 1.2.2. Gene structure, introns and exons, splice sites | 12 | 13 | 16 |
| 1.2.3. Transcriptional regulator sites and transcription factors | 18 | 20 | 28 |
| 2. RNA sequence and structure | 30 | 34 | 46 |
| 3. Protein sequence databases | | | |
| 3.1. General sequence databases | 8 | 11 | 12 |
| 3.2. Protein properties | 2 | 4 | 8 |
| 3.3. Protein localization and targeting | 6 | 9 | 11 |
| 3.4. Protein sequence motifs and active sites | 10 | 14 | 16 |
| 3.5. Protein domain databases; protein classification | 19 | 20 | 25 |
| 3.6. Databases of individual protein families | 49 | 47 | 48 |
| 4. Structure Databases | | | |
| 4.1. Small molecules | 5 | 8 | 8 |
| 4.2. Carbohydrates | 5 | 6 | 6 |
| 4.3. Nucleic acid structure | 4 | 4 | 4 |
| 4.4. Protein structure | 44 | 46 | 54 |
| 5. Genomics Databases (non-human) | | | |
| 5.1. Genome annotation terms, ontologies and nomenclature | 9 | 8 | 8 |
| 5.1.1. Taxonomy and Identification | 5 | 6 | 6 |
| 5.2. General genomics databases | 21 | 26 | 36 |
| 5.3. Organism-specific databases | | | |
| 5.3.1. Viruses | 3 | 10 | 13 |
| 5.3.2. Prokaryotes [a] | | | |
| 5.3.2.1. Escherichia coli | 11 | 12 | 13 |
| 5.3.2.2. Bacillus subtilis | 3 | 3 | 3 |
| 5.3.2.3. Other prokaryotes | 7 | 11 | 14 |
| 5.3.3. Unicellular eukaryotes | 8 | 10 | 12 |
| 5.3.4. Fungi | | | |
| 5.3.4.1. Yeasts | 11 | 15 | 18 |
| 5.3.4.2. Other fungi | 5 | 5 | 7 |
| 5.3.5. Invertebrates | | | |
| 5.3.5.1. Caenorhabditis elegans | 6 | 6 | 7 |
| 5.3.5.2. Drosophila melanogaster | 6 | 9 | 11 |
| 5.3.5.3. Other invertebrates | 4 | 10 | 12 |
| 6. Metabolic Enzymes and Pathways; Signaling Pathways [b] | | | |
| 6.1. Enzymes and Enzyme Nomenclature | 4 | 7 | 8 |
| 6.2. Metabolic Pathways | 5 | 5 | 6 |
| 6.3. Molecular interactions and signaling pathways | 15 | 24 | 31 |
| 7. Human and other Vertebrate Genomes | | | |
| 7.1. Model organisms, comparative genomics | 22 | 30 | 33 |
| 7.2. Human genome databases, maps and viewers | 21 | 27 | 27 |
| 7.3. Human proteins | 6 | 7 | 8 |
| 8. Human Genes and Diseases | | | |
| 8.1. General Databases | 9 | 9 | 9 |
| 8.2. Human Mutations Databases | | | |
| 8.2.1. General polymorphism databases | 12 | 17 | 19 |
| 8.2.2. Cancer | 11 | 14 | 16 |
| 8.2.3. Gene-, system- or disease-specific | 26 | 29 | 33 |
| 9. Microarray Data and other Gene Expression Databases | 31 | 42 | 50 |
| 10. Proteomics Resources | 5 | 7 | 11 |
| 11. Other Molecular Biology Databases | | | |
| 11.1. Drugs and drug design | 7 | 9 | 13 |
| 11.2. Probes | 5 | 6 | 6 |
| 11.3. Unclassified databases | 2 | 2 | 3 |
| 12. Organelle Databases | 8 | 19 | 22 |
| 13. Plant Databases | | | |
| 13.1. General plant databases | 10 | 16 | 18 |
| 13.2. Arabidopsis thaliana | 7 | 16 | 17 |
| 13.3. Rice | 8 | 10 | 13 |
| 13.4. Other plants | 4 | 6 | 10 |
| 14. Immunological Databases | 0 | 20 | 21 |

[a] Only two counts (3 and 4) are listed for this row, with no indication of which years they belong to.
[b] Only one count (1) is listed for this row, with no indication of the year.

Source: Nucleic Acids Research Molecular Biology Database Issues, 2004-2006