A collaboration between IBM and the US Department of Energy to link the DOE’s largest unclassified supercomputer into the nationwide DOE Science Grid two years ahead of schedule is good news for genomics researchers hungry for extra computational power and storage capacity.
“The emphasis of the Science Grid is on data sharing and transparent access to multiple users on multiple platforms,” said Dan Rokhsar, head of computational genomics at the DOE’s Joint Genome Institute. “It will be like having a petabyte of storage attached locally to every system throughout the Grid, and will hopefully eliminate the problems associated with keeping track of and managing large data sets.”
The JGI is familiar with large data sets: Rokhsar estimated the institute currently generates around a gigabase of sequence per month, with an associated increase in storage demand by a factor of one hundred. But while the JGI could simply scale up its NetApp RAID system (currently more than 10 terabytes) to contend with this growth in data, “there is great value in centralizing these archives through the Science Grid in a way that makes them transparently available to other researchers,” said Rokhsar. “Add to that our growing efforts in functional genomics, which will generate tens of thousands of images, and a high-performance, centrally managed data center becomes very attractive,” he added.
IBM Plugs In
While the DOE has been tinkering with distributed collaboration and data-handling technology for the past decade, the DOE Science Grid project had its official beginnings just over a year ago, with the ultimate goal of a 10-teraflop DOE-wide computing and data infrastructure slated for early 2004. But IBM’s decision to aid the DOE’s National Energy Research Scientific Computing Center (NERSC) in bringing its share of the resources online will give the entire project a leg up. The collaboration will add the third most powerful computer on earth (according to top500.org) to the grid by year’s end. With a peak performance of five teraflops, the NERSC’s 3,328-processor IBM SP Power3 supercomputer will provide half the Science Grid’s planned computational power in one shot.
In addition to the large supercomputer system, IBM servers support a 1.3 petabyte archival storage system at the NERSC as well as a 160-processor Netfinity cluster computer system. All three of these IBM systems are expected to be on the grid by the end of the year. Under the terms of the collaboration, IBM will make its operating system software compatible with Globus and other grid software, and NERSC researchers will move the software into service at the site.
IBM, which is also a collaborator on the National Science Foundation’s 13.6-teraflop “TeraGrid” [BioInform 08-27-01] and the North Carolina BioGrid [BioInform 11-19-01], has singled out grid computing as one of its primary short-term focus areas. Carol Kovac, general manager of IBM Life Sciences, told BioInform recently that the company sees the technology as “a very important next wave.”
While work done at the NERSC could eventually be integrated into IBM’s commercial products, no money is changing hands in the Science Grid collaboration, said Jon Bashor, a NERSC spokesman.
Around 2,100 government and university researchers currently have access to the NERSC systems. Bashor estimated that life sciences research currently accounts for around four percent of this work, but “that’s growing.”
Rokhsar said the JGI is already “making heavy use” of the NERSC IBM supercomputer for analyses of the pufferfish and sea squirt genomes. JGI is working with the NERSC and IBM to install a range of bioinformatics tools on NERSC architectures, including the JGI’s whole-genome assembler, JAZZ.
Bringing Genomes to Life
The accelerated schedule for the project is also welcome news for another DOE genomics program. The department’s newly launched Genomes to Life initiative [BioInform 06-25-01] is only now working out the exact structure of its computing activities, said Gary Johnson of the DOE’s Office of Advanced Scientific Computing Research. But with program goals strikingly in line with those of the Science Grid itself (“making biological data of all sorts more accessible to biology researchers and accessible in ways that are useful and natural to them,” according to Johnson), early access to the Grid’s resources will give the project’s infrastructure a jump start.
“In all of computing, the game is to remove bottlenecks,” said Johnson. “If the grid is available it puts the onus on us to put the tools in place for biologists earlier.”
— BT