HINXTON, UK--With new sequencing equipment, computer hardware, and information storage facilities installed, the Sanger Centre here is on track to meet the accelerated deadlines of the Human Genome Project, said Phil Butcher, Sanger's head of information technology.
Originally, Sanger's ten-year role in the Human Genome Project was to sequence one-sixth of the genome by 2005. But that changed about a year ago when advances in sequencing technology and competition from Celera Genomics pushed the project's sequencing centers to move faster. This meant that all the sequencing centers--Baylor College of Medicine, Sanger, Washington University, and the Whitehead Institute--had to increase their sequencing and computing capacities. With the acceleration, Sanger also doubled its research commitment, agreeing to sequence a third of the human genome's 3.5 billion base pairs by 2003.
As Sanger was finishing a five-year plan to boost its bioinformatics systems, it had to bring in new equipment to be able to complete its part of the rough draft of the genome, expected in February or March next year. With the shorter timeframe, the Wellcome Trust provided more funding to Sanger, which "enabled us to purchase additional sequencing machines as well as increase the storage capacity and the number of compute engines to cope with the increased sequence production," Butcher told BioInform. "Dramatic increases in storage were made from 2 terabytes to around 4.5 Tb. This is continuing to increase and will be at least 6 Tb in the very near future. We have incorporated near-lining strategies--a further 6 Tb of nearline storage--to create a much larger virtual storage capacity."
Currently, the center uses approximately 200 sequencing machines, employing a mix of Amersham Pharmacia Biotech's MegaBace devices and an assortment of PE ABI Prism 373s, 377s, and the latest 3700 models, which were bought recently, Butcher said.
He described the changes to the Human Genome Project as just the sort of thing Sanger was prepared for once it had implemented its scalable information technology foundation. Besides scalability, Butcher based the systems design on adaptability and resiliency.
In addition to adjusting to alterations in the human genome effort, Sanger has some 20 pathogen genome sequencing projects--including the organisms that cause bubonic plague, malaria, and tuberculosis--as well as a newly announced Cancer Genome Project, all of which could require increased bioinformatics capacity as they progress. Changes such as these "can mean that we need to do whole-genome assemblies or that we need to increase our CPU or our disk storage capacities fairly significantly and usually fairly quickly," he commented.
When Butcher joined the center six years ago, it was using a range of machines from Silicon Graphics, Sun Microsystems, and Digital Equipment--which was later acquired by Compaq Computer. As the center's technology plan took shape, Unix-based Digital (now Compaq) Alpha machines became Sanger's standard. This soon evolved into a clustered compute farm of 160 Alpha workstations that handled reduction of gel files from the DNA sequencers as well as assembly and finishing of genome sequences. Four AlphaServer 1200 systems managed data storage, and a small PC cluster served as a Blast farm to support public search access to the genomic databases over the web.
As the genome mapping work proceeded, however, gel file data from the DNA sequencers poured in and external Blast search requests overwhelmed the public access service. To rectify this, a 12-CPU AlphaServer 8400 symmetric multiprocessing system running Compaq's Tru64 Unix operating system was installed, along with 48 Compaq Deskpro PCs, providing a twelve-fold increase in Blast performance and reducing response times to 5-7 seconds. Sanger now processes 2,000 hits a day on its Blast web service.
Sixteen AlphaServer DS20 dual-processor systems, each with 4 gigabytes of memory, were added to the collection of compute servers, and the center's pathogen research program has its own 4-CPU AlphaServer 4100 system. All told, Sanger's bioinformatics computing resources now consist of 250 Alpha systems running Tru64 Unix software.
Wanting to move to a high-speed network, the center decided on a two-tier data network. The first tier is an asynchronous transfer mode network, which has been the backbone of Sanger's network since it was implemented three years ago. Such a network was needed to handle traffic from servers, desktop systems, storage, and sequencing machines. The second tier is an Ethernet network that links other desktop machines such as X-terminals, network computers, Macintosh systems, and PCs. In total, the data network now supports approximately 800 machines and over 450 users.
On the software side, Sanger's application strategy has not led to compatibility problems among the Human Genome Project centers, said Butcher, because most of its code is written in Perl, which runs on almost any software platform.
Communication between the sequencing centers is good, said Butcher, especially between Washington University and Sanger, which discuss methods and techniques, among other topics. "On the scientific side, there's been a lot of visits. St. Louis people come here and Sanger people visit St. Louis almost on an annual basis," he added.
Nat Goodman, director of Compaq's bioinformatics solutions center, said that Sanger's system is one of the largest with which Compaq has been involved. Celera's setup is the largest, and Compaq's experience meeting Sanger's requirements "was no doubt a factor in our success at Celera," Goodman said. Like Sanger, Celera relies heavily on Compaq equipment.
Goodman expects the technological push driven by the genome projects to continue as the next big effort--sequencing the mouse genome--gets underway. Celera may provide further motivation when it finishes Drosophila, he added.