Blast is like water — always there, ready to use. Want to know something about a sequence? Just open the Blast faucet, and before you know it your cup will be overflowing with alignments.
Let’s go behind the scenes and see what it takes to keep Blast, and all the other NCBI services, flowing so smoothly. As we’ll see, it’s a sophisticated operation that rivals the best of what you’ll find in the real world of commercial webbery.
The US National Center for Biotechnology Information, known to all as NCBI, is a unit of the National Library of Medicine, which in turn is part of the National Institutes of Health. NCBI was established through an act of Congress in November 1988. It is located on the main NIH campus in Bethesda, Md., occupying several floors of a modern office tower.
NCBI is organized into three branches. The Information Resources Branch, led by Dennis Benson, has day-to-day responsibility for keeping the data flowing. Jim Ostell’s Information Engineering Branch is responsible for data content — keeping the data clean and pure. The Computational Biology Branch, headed by David Landsman, does research. NCBI’s overall director is David Lipman, known for his early work on FASTA and Blast. To get the inside story, I spoke with Benson, Lipman, and Ostell.
NCBI is connected by large-diameter water mains (155 megabit per second) to both the Internet and Internet 2. A pair of four-way SGI machines serve as web front-ends, and are the main point of contact when you visit the site. These operate in tandem so that if one fails, the other will pick up the slack.
Blast searches are pumped through a network of approximately 25 computers containing a total of about 200 CPUs. Most of the computers are eight-way multiprocessor Intel boxes running the Solaris operating system. There are also a few Sun machines mixed in. The Blast stream is managed by separate computers that queue up searches waiting to be run, and that send work to machines as they become free.
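The dispatch pattern described here (a central queue of pending searches, with work handed to whichever machine frees up first) can be sketched in a few lines. This is a toy illustration only; none of these names correspond to NCBI's actual software, and threads stand in for separate machines:

```python
import queue
import threading

# A central queue of pending searches; each "machine" (worker thread)
# pulls the next job as soon as it becomes free.
search_queue = queue.Queue()

def worker(machine_id, results):
    while True:
        job = search_queue.get()
        if job is None:          # sentinel: no more work for this machine
            search_queue.task_done()
            break
        # In the real system this would run a Blast search; here we
        # just record which machine handled which query.
        results.append((machine_id, job))
        search_queue.task_done()

def dispatch(jobs, n_machines=4):
    results = []
    threads = [threading.Thread(target=worker, args=(i, results))
               for i in range(n_machines)]
    for t in threads:
        t.start()
    for job in jobs:
        search_queue.put(job)
    for _ in threads:            # one sentinel per machine
        search_queue.put(None)
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    done = dispatch([f"query-{n}" for n in range(10)])
    print(len(done))             # all 10 searches handled
```

The virtue of the pull model is automatic load balancing: a fast or idle machine simply takes more jobs, with no scheduler needing to guess which box is free.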
All text searching — whether of GenBank, PubMed, or other NCBI databases — is handled by Entrez. Text searches gush through a network of six Sun Enterprise-class machines (containing up to 12 processors each), augmented by a few smaller Intels. This network houses the text indices used by the search software, but not the actual documents, which swim on separate database servers. The database servers are presently a group of four Sun 420 four-way machines running Sybase.
When you click the “Go” button on the Entrez search form, the software consults the indices located on the search network and produces the summary pages you see next. If you drill down to look at a specific document, Entrez ships the request to the database servers to get the details.
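That two-step flow (indices and summaries on the search tier, full records on the database tier) can be mimicked with a toy two-tier store. All data here is invented for illustration:

```python
# Search tier: holds only the term index and short summaries.
index_tier = {
    "kinase": ["doc1", "doc3"],
}
summaries = {
    "doc1": "Human kinase mRNA, partial cds",
    "doc3": "Mouse kinase gene, complete cds",
}

# Document tier: holds the full records, consulted only on drill-down.
document_tier = {
    "doc1": {"title": "Human kinase mRNA, partial cds", "sequence": "ACGT..."},
    "doc3": {"title": "Mouse kinase gene, complete cds", "sequence": "GGCC..."},
}

def search(term):
    """First step: consult only the search tier to build summary pages."""
    return [(doc_id, summaries[doc_id]) for doc_id in index_tier.get(term, [])]

def fetch(doc_id):
    """Drill-down: only now hit the document tier for the full record."""
    return document_tier[doc_id]

hits = search("kinase")
print(hits[0])                   # ('doc1', 'Human kinase mRNA, partial cds')
print(fetch("doc1")["title"])
```

Splitting the tiers this way keeps the cheap, frequent operation (summary listing) off the heavily loaded database servers.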
Data inflow also demands a lot of juice. When new sequences arrive at NCBI, they percolate through an automated process that does quality checking and then links the new data to what’s already there. The big job is Blasting each new sequence against the current database to create the sequence neighbor links you see in Entrez. Links between the sequence and PubMed are also established at this time. This process runs on a network of 28 servers, containing a total of 122 CPUs.
NCBI’s research folks have their own large system with about 170 Intel and 20 Sun CPUs, spread out over some 40 machines. The research team uses this for its own purposes as well as to produce database material for public use, including NCBI’s assembly of the human genome.
NCBI’s use of Solaris on its Intel boxes, rather than Linux, may seem a bit eccentric. The explanation is that at the time NCBI ramped up its use of Intel servers, Solaris was a better choice for the eight-way multiprocessors used for Blast searches. This rationale is less compelling now, and NCBI expects to use Linux for most new projects.
One new project that will affect many of us is a redesign of Blast to run on a cluster of smaller, two-way, Intel Linux machines. At present, each Blast search runs on a single computer. The search can use all eight processors of that computer, but it cannot spill over onto another computer even if another machine is completely idle. The new version will be able to split each query over any number of computers, which should improve performance when searching the database with a large sequence, especially when the system is not too busy. Performance should be really good in the middle of the night.
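The split-and-merge idea behind the redesign can be sketched roughly as follows, with naive substring matching standing in for a real Blast search and threads standing in for separate machines (a toy, not NCBI's code):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk, query):
    """Return (sequence_id, position) hits for the query in one chunk."""
    hits = []
    for seq_id, seq in chunk:
        pos = seq.find(query)
        if pos != -1:
            hits.append((seq_id, pos))
    return hits

def parallel_search(database, query, n_machines=4):
    # Carve the database into roughly equal chunks, one per machine.
    chunks = [database[i::n_machines] for i in range(n_machines)]
    with ThreadPoolExecutor(max_workers=n_machines) as pool:
        partial = pool.map(scan_chunk, chunks, [query] * n_machines)
    # Merge the per-machine hit lists into one result.
    return sorted(h for hits in partial for h in hits)

db = [("seq1", "ACGTACGT"), ("seq2", "TTTTGGGG"), ("seq3", "GGACGTTT")]
print(parallel_search(db, "ACGT"))   # [('seq1', 0), ('seq3', 2)]
```

The merge step is why the scheme helps most when the system is lightly loaded: an idle cluster can throw every chunk at a single big query at once.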
Keep That Data Rolling
The NCBI website handles about 25 million hits per day, representing about 10 million page views, from about 240,000 unique users. Over the course of a month, NCBI reports 2.2 million unique users. Usage is growing rapidly, up from about 150,000 daily visitors two years ago and 190,000 last year.
These numbers, like all measurements of web usage, should be taken with a grain of salt. The art of web measurements is rather arcane. A “hit” is anything that causes a file or page to be sent out from the web server; this number gets inflated by image files and other strange things that a normal user might regard as being part of a single page. The “page view” number excludes some of these extra bits, and is a better measure of popularity from a user’s perspective.
It is especially hard to measure “unique users,” which is meant to be the number of different people who access the website in a given period of time. NCBI, like many websites, uses the concept of unique IP address as a surrogate for user, with full awareness that the equivalence is approximate at best. An IP address is the Internet identifier for a computer, and tells the web server where to send the result. A single person can give rise to multiple IP addresses if, for example, he accesses the web from several different computers. Conversely, multiple people can share a single IP address if, as is common practice nowadays, their institution funnels all web requests through a firewall or proxy server.
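To see how the counts diverge, here is a minimal sketch of how hit and unique-IP tallies are typically pulled from a server log; the sample lines are made up, following the common convention of putting the client address in the first field:

```python
# Invented log lines in a Common-Log-Format-like layout:
# client address first, then timestamp, request, status, and bytes.
log_lines = [
    '10.0.0.1 - - [01/Apr/2002:10:00:00] "GET /entrez HTTP/1.0" 200 512',
    '10.0.0.2 - - [01/Apr/2002:10:00:01] "GET /blast HTTP/1.0" 200 1024',
    '10.0.0.1 - - [01/Apr/2002:10:00:02] "GET /entrez HTTP/1.0" 200 256',
]

def unique_ips(lines):
    """Distinct client addresses: the usual surrogate for unique users."""
    return {line.split()[0] for line in lines}

print(len(log_lines))            # 3 hits ...
print(len(unique_ips(log_lines)))  # ... but only 2 "unique users"
```

One proxy server funneling a whole campus would collapse thousands of people into a single line of that set, which is exactly why the surrogate undercounts as often as it overcounts.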
In the commercial web world, where eyeballs translate into advertising dollars, companies have switched from counting IP addresses to collecting actual usage data from real users, using sampling in the style of the Nielsen TV ratings. One major web rating company, Jupiter Media Metrix, has a table on its website of the top 50 websites measured by the number of unique visitors during a one-month period. In March 2002, AOL topped the list with about 92 million unique visitors. eBay and Amazon were at numbers eight and nine with about 29 million visitors. The New York Times and FortuneCity Global Community were numbers 49 and 50 with about 8.5 million visitors.
NCBI’s self-reported figure of 2.2 million users misses the top 50 list by only a factor of four. The site ranks in the top five US government websites, below agencies like the Internal Revenue Service, NASA, and the Library of Congress.
Another way to measure NCBI’s flow rate is by the number of searches performed on the site. This is 60,000 to 70,000 Blast searches and about 1.2 million Entrez searches per day. The Entrez number surprised me. About a million of these are PubMed searches, which drives home the fact that NCBI is a lot more than just an aquarium for captured sequences.
Another amazing statistic is the amount of data that spews out of the site each day — now almost a terabyte. Most of the outflow is FTP traffic and reflects daily (or nightly) data downloads by companies and such. Talk about drinking from a fire hose!
The exponential growth of GenBank is familiar to all. As of April 2002, the database contained about 20 billion bases in 17 million entries. I calculate the current growth rate to be about 75 percent annually for number of bases and 40 percent for entries. This leads to a projected 33 billion bases and 24 million entries by year’s end. To put this growth in context: at the end of 1995, GenBank contained 400 million bases and 555,000 entries. This seemed like a lot at the time, but is only two to three percent of its current size.
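For readers who want to check such figures, the arithmetic is simple compound growth. A back-of-the-envelope sketch using the column's own GenBank numbers:

```python
def annual_growth_rate(size_then, size_now, years):
    """Implied fixed annual growth rate between two snapshots
    (0.75 means 75 percent per year)."""
    return (size_now / size_then) ** (1 / years) - 1

def project(current, annual_rate, years):
    """Size after `years` at a fixed annual growth rate."""
    return current * (1 + annual_rate) ** years

# End of 1995: ~400 million bases; April 2002 (~6.3 years later): ~20 billion.
rate = annual_growth_rate(400e6, 20e9, 6.3)
print(f"{rate:.0%}")   # average annual growth over the whole span
```

Averaged over the whole span the rate comes out somewhat higher than the 75 percent quoted for current growth; the two measure different windows of GenBank's history.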
NCBI is a real bioinformatics success story. Like running water, it’s there every day, serving everyone — commercial and academic alike — and doing it pretty well.
The size of the operation is pretty amazing. I count more than 500 CPUs in NCBI’s various compute farms. They serve millions of accesses each day, including more than a million Entrez searches and 60,000 to 70,000 Blast searches, from almost a quarter of a million unique users. And they distribute almost a terabyte of data each day for people to use on their own sites. That certainly qualifies as big bioinformatics in my book.
What NCBI does may not be glamorous, but it’s vital. And it works.
RefSeq: The Purified, Bottled GenBank?
As GenBank grows, it is becoming more redundant. This trend threatens to overwhelm the novel sequences, making Blast searches of the complete database useless for detecting anything beyond the closest homologs.
NCBI’s solution to this problem is RefSeq, which seeks to provide “reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins.” The idea is to identify one good sequence for each molecule, and describe all others as variations from that reference. There’s no pretense that the reference sequence is the only correct one or even a “normal” one (whatever “normal” might mean). Indeed, natural variation virtually guarantees that numerous correct, normal sequences will exist for any given protein, transcript, gene, or genome.
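The "one reference, everything else as variation" idea can be pictured with a toy data structure: a single reference string plus per-variant difference lists. All sequences here are invented:

```python
# One reference sequence per molecule; each observed sequence is
# stored only as its differences from the reference.
reference = "ACGTACGTAC"

# Each variant: a list of (position, ref_base, observed_base) substitutions.
variants = {
    "isolate-1": [(3, "T", "C")],
    "isolate-2": [(0, "A", "G"), (7, "T", "A")],
}

def reconstruct(ref, diffs):
    """Rebuild a full variant sequence from the reference plus its diffs."""
    seq = list(ref)
    for pos, ref_base, obs_base in diffs:
        assert seq[pos] == ref_base   # sanity-check the diff against the reference
        seq[pos] = obs_base
    return "".join(seq)

print(reconstruct(reference, variants["isolate-1"]))   # ACGCACGTAC
```

Besides saving space, the representation makes the biology explicit: each variant is described by exactly where, and how, it departs from the standard.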
Reference sequences, or rather the database entries containing these sequences, also serve as a convenient thumbtack for collecting biological knowledge about each molecule. Where better to put the functional description of a gene or protein than in the reference entry for that molecule?
The challenge, as I’ve discussed in previous columns, is curatorial. It takes an expert to leaf through the numerous sequences for a given gene and figure out which ones are correct, which are alternative splice forms, which are natural variants, and so on. And, of course, only an expert can provide an authoritative account of what is known about a gene’s function, what is unknown, and what is in dispute. And it goes without saying that all of this must be kept up to date, especially in fields that are rapidly changing.
NCBI is trying several curation strategies in the hopes of finding something that works. It is collaborating with the genome informatics teams of several organisms to directly import curated genomes from those groups. Examples include yeast (from Saccharomyces Genome Database), worm (from Sanger Centre and Washington University), and Arabidopsis (from TIGR and collaborators). It has begun a collaboration with the Mouse Genome Informatics group at the Jackson Laboratory to curate mouse genes. However, for most human genes, the curation is being done by internal NCBI staff, who do a good job in general but cannot possibly be experts in all areas of biology.
NCBI is also starting an effort to have NLM data editors associate publications with genes in the normal course of entering publications into MEDLINE. These associations are the GeneRIFs (References into Function) you see on some LocusLink pages. NCBI is also providing a web interface that lets anyone suggest a new RIF; I have no idea how it curates these.
Yet another strategy is to link online books and book chapters with RefSeq entries. This ties in with NCBI’s Bookshelf project, which provides free, online access to a growing number of biomedical texts.
All of these are good ideas, and every little bit helps. But I’d like to see a more focused effort to make authoritative annotation a central element of NCBI’s databases.