Researchers with overwhelming amounts of sequencing data and no ready access to large computational resources must still ship hard drives through the mail to facilitate off-site data analysis or share their data with collaborators. The Internet bandwidth available to most researchers is not large enough to make uploading and downloading genomic datasets convenient. And while cloud computing advocates have made considerable noise about the cloud as an alternative to mail-based data transfer, it still depends on existing bandwidth.
To address this problem, and provide the genomics community with a bioinformatics solution for the transfer and analysis of large amounts of data, a team from Indiana University has established the National Center for Genome Analysis Support. NCGAS was created with a $1.5 million grant from the National Science Foundation, and is intended to provide researchers with access to Indiana's supercomputers, including the Mason system — a Linux-based, 16-server supercomputer with 500 gigabytes of RAM per node.
But for NCGAS to become a national center, its director, William Barnett, says that it needs more than powerful hardware and all the genome analysis software under the sun — what it really needs is an infrastructure that provides access to those resources. With such an infrastructure in place, researchers will have a seamless way to move sequencing data from their labs to the computational resources at NCGAS.
"We see that just having the software and machine doesn't solve the problem. What we really need as a national center is to have a national infrastructure, and that has to be network-based," Barnett says. "We already had our high-throughput disk system — we had this wide-area network, parallel-file technology that we could mount across the country, because we saw this need for being able to remotely access sequence data so we could start to do things like assembly and alignment on them."
Now that the funding for NCGAS is in place, Barnett says that his focus has shifted to piecing together a national cyberinfrastructure capable of supporting genomic analysis.
Last November at the SC11 high-performance computing conference, a team from IU's Global Research Network Operations Center ran a demonstration of how such a national infrastructure might work. The team connected two parallel file systems, one at each end of a 100-gigabit-per-second network link that spanned the distance from the university's Bloomington campus to the conference in Seattle. The technicians used networking infrastructure developed and supported by Internet2, a consortium of academic and government research groups attempting to build a 100 Gbps national Internet backbone for the research community.
"For Indiana University's SC11 research sandbox demo, we implemented a biological application to simulate a sequence alignment and SNP-identification pipeline," says IU research associate Le-Shin Wu. "The goal is to demonstrate that, with a 100 Gbps network connection available between computing nodes at Seattle and a remote storage file system at IU, we are able to conduct a data-intensive pipeline without repetitive data file movement."
Once the connection was established, the demonstration team executed a three-step analysis workflow through which data was transferred from the show floor in Seattle to the Mason cluster at NCGAS. The workflow included data pre-processing to evaluate and improve the quality of input sequence, sequence alignment to a reference genome, and SNP detection to identify new polymorphisms in the alignment.
"We use the hg18 human genome with a data size of about 3.5 gigabytes as the reference dataset, and the testing input sequence SRR040810 has a data size of about 3.7 GB. Both the reference genome data and the input file containing the reads reside on a file system [called] Data Capacitor at IU," Wu says. Data Capacitor is a 40 Gbps high-speed and high-bandwidth storage system with 427 terabytes of available storage connected to all Indiana campuses as well as other sites across the US.
"The computing nodes and the remote site file system are connected with a 100 Gbps link, and all of the output data files generated during the course of the simulation are stored on the remote file system as well," Wu adds.
Over the course of the simulation, the total size of data transferred between the computing nodes at Seattle and the remote file system at IU through the 100 Gbps link was roughly 84 GB.
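To put that figure in perspective, the short sketch below estimates how long 84 GB would take to move at different line rates. It is a back-of-the-envelope illustration only: it assumes decimal gigabytes, a fully utilized link, and no protocol, latency, or disk overhead, and the 100 Mbps entry is a hypothetical lab connection included for comparison rather than a figure from the demonstration. The function and constant names are likewise illustrative. In the demo itself the data moved incrementally as the pipeline ran, not as a single bulk copy.

```python
def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Ideal time to move `size_gigabytes` over a link running at `link_gbps`.

    Assumes decimal units (1 GB = 8 gigabits) and a perfectly utilized link;
    real transfers are slower because of protocol overhead and disk speed.
    """
    gigabits = size_gigabytes * 8
    return gigabits / link_gbps


DEMO_TOTAL_GB = 84.0  # total data moved during the SC11 simulation

# Link speeds in gigabits per second; the last entry is an assumed
# 100 Mbps connection, not a number taken from the demonstration.
LINKS_GBPS = {
    "100 Gbps research link": 100.0,
    "10 Gbps research link": 10.0,
    "100 Mbps lab connection (assumed)": 0.1,
}

for label, gbps in LINKS_GBPS.items():
    seconds = transfer_seconds(DEMO_TOTAL_GB, gbps)
    print(f"{label}: {seconds:,.1f} s ({seconds / 60:,.1f} min)")
```

Under these idealized assumptions, the full 84 GB corresponds to under seven seconds of line time at 100 Gbps and roughly a minute at 10 Gbps, while the assumed 100 Mbps connection would need close to two hours.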
A key component of the long-distance genomic analysis workflow, called Genomics and Data in Motion, is OpenFlow. OpenFlow is a technology pioneered at Indiana that enables so-called software-defined networking, a network architecture that can be controlled remotely with software. It runs as firmware-level software on routers and network switches, allowing researchers to reserve dedicated bandwidth within a network and then use that bandwidth on the fly for large data transfers.
The NCGAS team is currently working on getting the proof of concept demonstrated at SC11 up and running. Barnett says it will take some time to get the moving parts working properly, adding that the solution will work over either a 100 Gbps or a 10 Gbps link, so it will not depend on Internet2 completing its 100 Gbps infrastructure.
Barnett is quick to point out that he understands why some researchers might be skeptical about the feasibility of transferring multiple genome datasets over a network for offsite analysis. But with the right infrastructure, he says, this could be a reality. "I think a lot of people working in genomics have just considered these genome datasets to be too big, and with the rate that they're growing, this is only going to be a problem solved by FedExing hard drives," Barnett says. "But we're saying that if you have that combo of infrastructure (the disk systems, the ability to mount them remotely, the bandwidth, and parallel file systems you can mount remotely) you can take a pretty good shot at solving this challenge."
Barnett and his colleagues involved in the Genomics and Data in Motion project are looking to partner with sequencing centers and researchers in order to establish a connection to their system, and iron out the technological problems, to improve the service. "That means reaching out to sequencing centers and saying, 'We can connect with our file system. Can we connect that with your sequencing center?'" Barnett says. "We can just mount them on our systems and do the assembly. We're starting to look at developing those partnerships to support the workflows that we want to put in place to accelerate the bioinformatics part that needs to happen once people have sequences."
In the near term, they are also aiming to develop a process through which people can fire up an allocation on not only the IU system, but elsewhere as well. Indiana is partnering with the Texas Advanced Computing Center and the San Diego Supercomputer Center to provide a national-scale resource to get the necessary infrastructure up and running, and then to develop workflows that will utilize parallel disk systems to support that. "We're hoping to develop this online national infrastructure to help chip away at this [problem] and make it a little bit more transparent and a little bit easier for people to access these resources and build online workflows, submit their sequences, and get their data back in a lot less painful way than they are doing it right now," Barnett says.