The National Science Foundation is partnering with Google and IBM to give the academic community access to a large-scale computing cluster in order to help researchers improve ways to process many terabytes of data in parallel.
In a “dear colleague” letter sent to the research community this week to announce the initiative, the NSF said that it is designed to address the challenges of “data-intensive” computing, in which “the sheer volume of data is the dominant performance parameter.”
The NSF hopes to issue a solicitation for the program, called the Cluster Exploratory, or CluE, by the end of March, though agency officials noted that the timeline is still being ironed out. The agency expects to support between 10 and 15 research projects in the first year of the program, and to expand that number in later years.
The program, which is open to other vendors, could appeal to a range of scientific computing disciplines, including bioinformatics, James French, a program director in NSF’s Computer and Information Science and Engineering Directorate, told BioInform.
“We definitely imagine a role for bioinformatics investigators in this,” he said, though he stressed that “it depends on the nature of the problem.”
While some bioinformatics proposals are likely to be “suitable for this environment,” French said that many bioinformatics problems “are compute intensive over [relatively] small amounts of data,” which would not be appropriate for the program.
“What we would be mostly interested in out of the bioinformatics problems that are available are those [where] the analysis being conducted is over what might otherwise be an intractable amount of data,” he said.
In the data-intensive computing paradigm, storage and computation are co-located in order to avoid the data-transfer challenges of moving terabytes of data around a network.
In one sense, “it’s pretty much standard commodity cluster computing,” French said. “What’s different is how much data is available or can be made available … It’s better to think of this cluster as a large-scale storage device that has computation. You wouldn’t want to put a process on a thousand processors and have it go across the Internet somewhere to get the data.”
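French's point about data locality can be illustrated with a back-of-envelope calculation. All figures below are hypothetical assumptions for illustration, not specifications of the CluE cluster: a 100 TB dataset, a 1 Gbps wide-area link, and 1,000 nodes each reading a local shard from a disk sustaining 60 MB/sec.

```python
# Back-of-envelope sketch of why co-locating storage and computation matters.
# All figures are illustrative assumptions, not specs of the CluE cluster.

DATASET_TB = 100       # hypothetical data-intensive workload
NETWORK_GBPS = 1       # assumed wide-area link, gigabits/sec
DISK_MBPS = 60         # assumed per-disk sequential read rate, MB/sec
NODES = 1000           # cluster nodes, each reading its own local shard

# Option 1: pull the whole dataset across the network to the compute.
dataset_bits = DATASET_TB * 1e12 * 8
network_hours = dataset_bits / (NETWORK_GBPS * 1e9) / 3600

# Option 2: every node reads its local slice of the data in parallel.
dataset_mb = DATASET_TB * 1e6
local_hours = dataset_mb / (DISK_MBPS * NODES) / 3600

print(f"Pulling {DATASET_TB} TB over the network: ~{network_hours:.0f} hours")
print(f"Reading it from {NODES} local disks in parallel: ~{local_hours:.1f} hours")
```

Under these assumed numbers, shipping the data takes on the order of days while reading it in place takes under an hour, which is the gap the co-location design is meant to close.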
French said that the storage capacity of the available cluster, which will include more than 1,000 processors, is expected to be on the order of hundreds of terabytes.
The agreement builds upon a partnership that Google and IBM forged last fall to create a cluster of several hundred Google machines and IBM servers to support undergraduate curriculum development in the area of “internet-scale” computing. To date, six universities have joined the initiative, which gives students access to the cluster via the Internet in order to test parallel programming projects.
French said that NSF approached the companies soon afterwards to discuss extending the initiative to include researchers. He said that the agreement is not exclusive, and that the agency would welcome similar partnerships with other large firms, such as Amazon, Yahoo!, and Microsoft, which either offer or are planning to offer similar on-demand computing services, often referred to as “cloud” computing.
While the details of the computational capacity available to participants in the CluE program are still being worked out, it is expected that the Google-IBM cluster will contain more than 1,000 processors connected to several terabytes of memory and several hundred terabytes of storage. The system will run Linux and Apache’s Hadoop software, a large-scale distributed computing platform. IBM is contributing its Tivoli software for management, monitoring, and dynamic resource provisioning.
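Hadoop implements the MapReduce programming model: a mapper emits key-value pairs from each input record, the framework sorts and groups them by key (the “shuffle”), and a reducer aggregates each group. The sketch below simulates those three phases locally in plain Python; counting k-mers in short DNA reads is an invented stand-in for a data-parallel bioinformatics job, and the input sequences are toy data.

```python
# Minimal local simulation of the MapReduce model that Hadoop implements.
# The k-mer counting task and the toy reads are illustrative assumptions.
from itertools import groupby

def mapper(record):
    """Map phase: emit (k-mer, 1) for every 3-mer in a DNA sequence."""
    for i in range(len(record) - 2):
        yield record[i:i + 3], 1

def reducer(key, values):
    """Reduce phase: sum the counts emitted for one k-mer."""
    return key, sum(values)

reads = ["GATTACA", "TACAGAT"]          # toy input, as if split across nodes
mapped = [kv for read in reads for kv in mapper(read)]
mapped.sort()                           # the "shuffle": group pairs by key
counts = dict(reducer(k, (v for _, v in group))
              for k, group in groupby(mapped, key=lambda kv: kv[0]))
print(counts)
```

On a real Hadoop cluster the mapper and reducer run on many nodes at once, each over its local block of the input, which is what makes the model a fit for the data volumes CluE targets.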
The NSF said in its letter that CluE has three primary challenges: “to use existing tools and to develop new programming abstractions for such a ‘computer’ to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications.”
Several bioinformatics developers who have begun writing programs for cloud computing environments welcomed the initiative.
Jong Youl Choi, a computer science graduate student at Indiana University, Bloomington, has developed a prototype protein sequence analysis program called V-Lab-Protein that runs on Amazon’s Elastic Compute Cloud, known as EC2.
He said that when he was developing the system, EC2 was the only such platform available, but the entry of other players such as Google and IBM will now give users “more choices,” which should ensure competitive pricing for these services.
Michael Cariaso, senior scientific consultant for the BioTeam, has been using Amazon’s EC2 as a development platform for bioinformatics projects and has also released a service called RunBlast that allows users to run a range of bioinformatics jobs on Amazon’s clusters without writing their own code.
Cariaso said that the cloud approach offers a low-cost supercomputing option for research groups that may not have the budget to invest in an in-house cluster, or for developers who want to be able to access supercomputing power from any location.
As an example, he said he recently performed a large-scale analysis of James Watson’s and Craig Venter’s genomes using the Amazon cloud. “It required downloading hundreds of megabytes of data and running Blast for a couple days, and I couldn’t have done it on my own laptop, and inside the cloud it was not just possible, but cheap to do,” he said.
Amazon’s EC2 pricing varies based on the memory, storage, and computational power required for a job. It starts at $0.10 per hour for a small “instance,” which comprises 1.7 GB of memory, 160 GB of storage, and one EC2 compute unit – the equivalent of a 1.0-1.2 GHz 2007 Opteron processor. Cariaso said that for a typical bioinformatics job, the total cost is generally under $10.
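The arithmetic behind Cariaso's under-$10 figure is straightforward. The sketch below uses only the $0.10-per-hour small-instance rate quoted above; the instance count and runtime in the example are hypothetical.

```python
# Rough EC2 cost sketch. Only the $0.10/hour small-instance rate comes from
# the article; the instance count and runtime below are hypothetical.
SMALL_INSTANCE_RATE = 0.10   # USD per instance-hour for a small instance

def job_cost(instances, hours, rate=SMALL_INSTANCE_RATE):
    """Total cost of running `instances` instances for `hours` each."""
    return instances * hours * rate

# e.g. a hypothetical two-day Blast run on two small instances:
print(f"${job_cost(2, 48):.2f}")   # 2 instances x 48 hours x $0.10 = $9.60
```

Even a multi-day run on a handful of small instances stays in the single-digit-dollar range, consistent with the typical job cost Cariaso cites.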
Google, Yahoo!, and Microsoft have not yet disclosed pricing for their cloud-computing services.
Cariaso said the NSF initiative should benefit computer science students because it will help them test parallel code even if they don’t have access to a supercomputer at their own institution.
Yet despite the potential benefits of cloud computing, Cariaso noted that it’s not the right answer for every job, especially those that require transferring large data files.
“It’s obviously much faster to transfer files locally than it is to transfer them across the net, so I have to pay attention to how big are the data files that I’m working with, and how I get them into the cloud in the first place,” he said. “For anything cloud based, you need to be sensitive to the data size and where the data comes from initially.”