Getting ready to celebrate its 25th birthday in October, GenBank has come a long way. The nucleic acid sequence database was established by NIH in 1982 and grew out of the Los Alamos Sequence Database developed by Walter Goad and others at the Los Alamos National Lab. This was at about the same time that the EMBL Data Library was created. Many of us have seen the graphs showing GenBank’s exponential growth, doubling about every 18 months. Once published as a physical book (two hardcover volumes, for example, in 1984 with a whopping 3,424 sequences containing 2.8 million bases), GenBank began exchanges with EMBL and DDBJ so all three could provide the same sequences. Because GenBank has a monopoly on monster sequence repositories in the United States, every American molecular biologist learns to use it (and its host, NCBI) sooner or later. It’s much easier than in the early days, when the data were distributed on reels of magnetic tape or on large floppy discs, causing numerous problems in timeliness and accessibility.
For sure, GenBank has been a success. Given an accession number, we’re able to quickly get a sequence and associated annotation information from the GenBank-format file. This annotation-rich format even had a brush with trendy popular culture when a nucleotide file appeared in an episode of Cowboy Bebop, a Japanese anime series about space-age bounty hunters. If the bounty hunters had searched by gene name, however, they’d probably have found it confusing and time-consuming to wade through the results. Fortunately, other databases like Entrez Gene have stepped in to help, making it clear that GenBank is a great repository but doesn’t try to make sense of all the repetition and links between related nucleotide sequences.
Has it grown up yet as a database? This is hard to answer, especially since it’s not really even a database — at least not in the usual relational sense. We like the GenBank file format because we can parse it with BioPerl or by hand to get the annotation we want, but effectively searching through a transcriptome’s worth of annotation is harder to do. We’re very glad that translated sequence and CDS coordinates are included in many GenBank files, but we’d be even happier to be able to get information about exon structure and genetic variation. Even though the file format is computer-friendly, human interpretation of the features listed in GenBank format isn’t so easy (although NCBI’s Graph display is a start). Is there any chance of getting a relational database version of GenBank that we could just query with something like MySQL? We built one for Entrez Gene but aren’t quite ready to tackle all of GenBank.
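To make the relational wish concrete, here is a minimal sketch of the idea in Python: pull the feature table out of one flat-file record and load it into a table we can query by gene name. Everything in it is an assumption for illustration. The record is made up (TOY00001 is not a real entry), the function names are ours, SQLite stands in for MySQL because it ships with Python, and the parser ignores multi-line qualifiers and complex locations that real GenBank files contain.

```python
import re
import sqlite3

# Made-up record in GenBank flat-file layout (not a real entry).
RECORD = """\
LOCUS       TOY00001                  60 bp    mRNA    linear   ROD 01-JAN-2007
DEFINITION  Hypothetical toy transcript, complete cds.
ACCESSION   TOY00001
VERSION     TOY00001.2
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Mus musculus"
     gene            1..60
                     /gene="Toy1"
     CDS             4..33
                     /gene="Toy1"
                     /product="toy protein"
ORIGIN
//
"""


def parse_features(text):
    """Return (version, list of feature dicts) for one flat-file record."""
    version, features, in_table = None, [], False
    for line in text.splitlines():
        if line.startswith("VERSION"):
            version = line.split()[1]
        elif line.startswith("FEATURES"):
            in_table = True
        elif line.startswith("ORIGIN"):
            in_table = False
        elif in_table:
            # A feature line has its key after exactly five spaces.
            new = re.match(r"^ {5}(\S+)\s+(\S+)$", line)
            # A qualifier line is indented further and starts with "/".
            qual = re.match(r'^\s+/(\w+)="?([^"]*)"?$', line)
            if new:
                features.append(
                    {"kind": new.group(1), "location": new.group(2), "gene": None}
                )
            elif qual and features and qual.group(1) == "gene":
                features[-1]["gene"] = qual.group(2)
    return version, features


def load(db, version, features):
    """Store one record's features as rows in a single table."""
    db.execute(
        "CREATE TABLE feature (version TEXT, kind TEXT, location TEXT, gene TEXT)"
    )
    db.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [(version, f["kind"], f["location"], f["gene"]) for f in features],
    )


db = sqlite3.connect(":memory:")
version, features = parse_features(RECORD)
load(db, version, features)
rows = db.execute(
    "SELECT version, location FROM feature WHERE kind = 'CDS' AND gene = ?",
    ("Toy1",),
).fetchall()
print(rows)
```

With the features in a table, "which records have a CDS for this gene, and where?" becomes a one-line query instead of a custom parsing script, which is the appeal of a relational version.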
For downloading single records, GenBank has the friendly Entrez interface, and for medium-sized datasets, as long as we already have a list of accession numbers, Batch Entrez works great. When we try to download larger selected sets of GenBank sequences, however, our browsers time out, so we have to look for another solution. We got NCBI’s E-Utilities to work for these large sequence sets (like all ESTs of a species), but getting something like all mouse 5’ ESTs from 2006 is still tricky.
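For readers who haven't tried E-Utilities, the usual pattern for large sets is to run one search that is stored on NCBI's history server and then fetch the hits in batches. The sketch below only builds the request URLs and makes no network calls. The helper names, the batch size, and the example search term are our assumptions; in particular, the field tags in the term should be checked against NCBI's documentation, and the term makes no attempt to select 5’ reads, which is the tricky part.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """URL for a search whose hits are kept on NCBI's history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_urls(db, web_env, query_key, total, batch_size=500):
    """Yield one efetch URL per batch of the stored hits."""
    for start in range(0, total, batch_size):
        params = {
            "db": db,
            "WebEnv": web_env,
            "query_key": query_key,
            "retstart": start,
            "retmax": batch_size,
            "rettype": "fasta",
            "retmode": "text",
        }
        yield f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Example term (an assumption): mouse EST-division records published in 2006.
TERM = '"Mus musculus"[Organism] AND gbdiv_est[PROP] AND 2006[PDAT]'
print(esearch_url("nuccore", TERM))
```

The esearch response supplies the hit count, WebEnv, and query_key that go into efetch_urls. Fetching in batches of a few hundred records, with a pause between requests, is what keeps large downloads from timing out the way a browser does.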
How has GenBank learned from its mistakes? Currently the database is full of evidence of the perils of over-confident gene characterization. It makes us work a bit when we come across a sequence containing the “complete cds” of a gene and then another one with an even longer coding region. Do these sequences refer to different splice variants, or is the shorter one just wrong? Even more curious are sequences that cannot be aligned to the reference genome. It is possible that both are correct, and everything will become clear as we learn more about genome variation, but what should GenBank do for now? Officially submitters themselves should make any needed revisions or updates, but what if they don’t? This is an issue for all publicly submitted data, so perhaps GenBank is doing about as well as its cousins. Except for the early days when GenBank curators did the sequence-gathering, it’s a member of the community of researchers, not GenBank, who contributed each sequence in the first place.
If GenBank’s purpose is to include all sequences, then it should include all published sequences, even if they are subsequently shown to be incomplete or wrong. The “accession.version” nomenclature (like NM_153253.28) works well to keep this sequence history available when desired, while maintaining the same ID. We do need to rely on other databases like gene-centric Entrez Gene and transcript-centric UniGene to sensibly organize the contents of GenBank and provide summaries and selected high-confidence sequences. With the daily updates, we appreciate that we’re always sure to have the most current data. Unlike in the floppy disc days, when a new set was mailed every three months, if our newly cloned gene isn’t in GenBank now, we can be pretty sure that we have something novel.
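Because the version is just a numeric suffix after the last dot, handling the “accession.version” scheme in our own scripts is simple. Here is a small sketch in Python; the function names are ours, not part of any NCBI tool, and it assumes every identifier carries an explicit version.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def latest_versions(identifiers):
    """Keep only the highest version seen for each accession."""
    latest = {}
    for identifier in identifiers:
        accession, version = split_accession(identifier)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return {acc: f"{acc}.{ver}" for acc, ver in latest.items()}


print(split_accession("NM_153253.28"))  # ('NM_153253', 28)
```

Comparing versions as integers rather than as text matters once a record passes version 9, since the string "10" sorts before "9".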
Does GenBank play well with others? GenBank sequences link to a wealth of other NCBI resources, but it seems to have so many siblings that it doesn’t make much effort to connect with other resources such as those at EBI. Other databases link to NCBI, but the connections are mostly unidirectional. We’d especially like to see more connections (at least via other NCBI databases) with well-curated protein resources (like UniProt) to balance out the nucleic acid emphasis of GenBank, and with Ensembl, a database of somewhat independent gene sets.
What could other molecular biology databases learn from their cousin? Perhaps our favorite thing about GenBank (along with EMBL and DDBJ) is that everything’s there, available, and in a similar format. We took this for granted for a while until the arrival of public microarray data. Even with MIAME standards, we still have to search for our desired array data in several repositories, project websites, and journals’ supplementary information. Even if we can find the study we want, we have to build custom parsers to deal with different data formats and missing annotation (like oligo sequences and processing details). For sure microarray repositories could learn from GenBank, or maybe it’s simply that the community has to agree on and implement the details of how to organize and present all the data in a one-stop shop like GenBank.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.