The next release of GenBank, EMBL, and DDBJ — expected before the end of June — will do away with a 350-kilobase limit on the sequence length of records that has been in place since 1995. The limit was imposed at a time when the emergence of megabase-scale sequences threatened to break existing bioinformatics software tools. But now, with whole-genome shotgun sequencing commonplace, the International Nucleotide Sequence Database collaborators formally declared the limit obsolete last May, and gave the community a year to prepare for the change [BioInform 06-20-03].
So what does the impending removal of the limit mean for bioinformatics developers writing software for these records? Initially, not much, according to Mark Cavanaugh of NCBI’s GenBank division. “It’s not like things are going to radically change as of June, or that we have any plans to be creating huge, hundreds of megabase records at the drop of a hat,” he said.
One area likely to change in the relatively short term, however, is records for bacterial genomes. As new bacterial genomes are submitted to GenBank, they will no longer be broken into 350-kb pieces as they were in the past, he said. Older bacterial genomes that were previously split into pieces will eventually be replaced by single records, but NCBI has not set a formal schedule for this effort. “We do have an update to one of the E. coli genomes that we’re holding off on until June,” Cavanaugh said, “because it is a case where we’re going to recombine, and essentially replace, all the little pieces with a continuous sequence.”
As of the end of the month, the only limit to the length of a database sequence record will be the “natural” structure of an organism’s genome, whether it’s a chromosome, a plasmid, or a circular bacterial genome. The 350-kb limit was an “artificial” — and somewhat arbitrary — cutoff, Cavanaugh said. “It just doesn’t match the realities of current technologies and techniques.”
That’s not to say that the limit hasn’t served a useful purpose. For some model organisms, such as Drosophila, Cavanaugh said the research community prefers the 350-kb chunks. “Those who are submitting that data to us have actually expressed the fact that they like smaller pieces,” he said. “Although they submit the data to us in a chromosome-based way, for some of the downstream work that they do with the records upon retrieval of the data from NCBI, they actually don’t want to see it as a chromosome-sized object — some of their software may not be capable of handling it.”
So even though NCBI could begin distributing Drosophila sequence records in chromosome-sized units by June, it isn’t likely to, Cavanaugh said. “We would say okay, this makes sense for the community, it makes sense for the submitter, it makes sense for their colleagues to remain with the current practice of smaller chunks.”
Currently, the Contig (or CON) division of GenBank contains “virtual records” for each organism that define the method for assembling longer sequences out of the smaller pieces. Each of these records bears its own accession number. “So, effectively, what happens is that a CON division record [will] become a regular record of megabases in size,” Cavanaugh said. “Instead of instructions of how to piece together things, it is just the sequence itself.” The smaller pieces will still be accessible, but the CON record will replace them so that they won’t be considered “live” records within the database. Systems that link directly to smaller pieces via older accession version numbers will still be able to access those records, but a term-based query via Entrez would pull up only the single CON record.
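For developers who have not worked with CON records, the following Python sketch illustrates the idea of a “virtual record”: a set of instructions that names smaller pieces and says how to join them. It is only an illustration under stated assumptions, not NCBI’s implementation. The instruction syntax is a simplified form of the join(...) statements found in CON records, and the function names and the accessions PIECE_A.1 and PIECE_B.1 are hypothetical.

```python
import re

# Simplified, assumed element syntax for the assembly instructions:
#   ACCESSION.VERSION:start..end   a 1-based, inclusive slice of a piece
#   complement(...)                the same slice, reverse-complemented
#   gap(N)                         a gap of N unknown bases
_PIECE = re.compile(r"([A-Za-z0-9_]+\.\d+):(\d+)\.\.(\d+)")
_GAP = re.compile(r"gap\((\d+)\)")
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assemble(instructions, pieces):
    """Build one continuous sequence from join(...) instructions.

    `pieces` maps accession.version strings to the sequences of the
    smaller records (hypothetical in-memory stand-in for a database).
    """
    text = instructions.strip()
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected instructions of the form join(...)")
    body = text[len("join("):-1]

    parts = []
    for element in body.split(","):
        element = element.strip()

        gap = _GAP.fullmatch(element)
        if gap:
            # Represent a gap as a run of N characters.
            parts.append("N" * int(gap.group(1)))
            continue

        reverse = element.startswith("complement(") and element.endswith(")")
        if reverse:
            element = element[len("complement("):-1]

        match = _PIECE.fullmatch(element)
        if match is None:
            raise ValueError(f"unrecognized element: {element!r}")

        accession = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        # Convert 1-based inclusive coordinates to a Python slice.
        segment = pieces[accession][start - 1:end]
        parts.append(reverse_complement(segment) if reverse else segment)

    return "".join(parts)


if __name__ == "__main__":
    pieces = {"PIECE_A.1": "AACCGGTT", "PIECE_B.1": "AAAACCCC"}
    contig = "join(PIECE_A.1:1..4,gap(3),complement(PIECE_B.1:5..8))"
    print(assemble(contig, pieces))
```

Run as a script, the example prints AACCNNNGGGG. Once a CON record carries the sequence itself, as Cavanaugh describes, a consumer would read that sequence directly and skip this assembly step.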
Cavanaugh said it’s likely that most software developers have already updated their tools to accommodate the change. After all, he noted, “We knew that almost any software package out there has to be already handling megabase data. They have to be because of the human genome project data that’s been around for years now, and that’s always been allowed to exceed that limit.” He said that NCBI chose the one-year deadline “as a very conservative choice in our opinion, because the majority of the players out there already could handle it.”
As for users running Blast on NCBI’s servers or locally, the removal of the limit is not expected to have any impact. “The Blast databases will not be altered by this change and hence will not have negative impact on Blast configurations,” said Scott McGinnis, a member of the Blast team at NCBI.