The Genome Sequencing Center at Washington University in St. Louis is building an $11 million, 16,000-square-foot facility designed to handle massive amounts of data from its growing fleet of next-generation sequencing systems.
The data center construction, which broke ground this summer and is slated for May completion, is part of a larger-scale building project on the Wash U campus that is overseen by its BioMed 21 program. The program also includes the building of a 240,000-square-foot research space in its new BJC Institute as well as 15,000 square feet of space added to the previously established Center for Genome Sciences building.
David Dooling, assistant director of the GSC, told BioInform this week that the final build-out is slated to support 120 racks for data storage.
Wash U is building the freestanding data facility on Newstead Avenue, across the street from its Genome Sequencing Center. Right now, the GSC has a 1,200-CPU cluster that supports 130 AB3730xl Sanger sequencers, as well as a new SOLiD system from Applied Biosystems and three 454 Life Sciences FLX machines, a number Dooling said would likely jump to five in time for the opening.
Dooling said the center currently has 500 terabytes (500,000 gigabytes) of disk space. He said he envisions that infrastructure increasing rapidly over time.
Dooling said the sequencers “typically only have enough storage to accommodate one or two runs. The data must be transferred off the sequencer computer to perform further analysis.”
The AB3730xl generates “on the order of a few MB of data per run,” Dooling said. “Next-gen sequencers generate between a few hundred gigabytes to several terabytes of data per run.”
Richard Wilson, director of the GSC, said in a statement, “As we adopt the next generation of DNA sequencers, we will increase the amount of data we generate by several thousand times per day.” He added that the new center will provide additional storage “and more efficient data processing required by advanced sequencing technologies,” and is expected to meet the center’s computing needs “for the next several years.”
In addition to technology, the new data center will require additional staffing, Dooling said.
"We will probably be hiring a couple more system administrators to help with the management of that. We have a long experience of building scalable infrastructure, particularly scalable in the number of man hours that are required to maintain it. We have a fairly systematized infrastructure that leverages a lot of commonalities so we can maximize the efficiency of our IT staff through standardization and things of that nature," he said.
Dooling said there are currently "around 80 or 90" informatics staffers at the GSC, including himself, supported by about a dozen IT staff, a group he defined as user support, database administrators, and system administrators.
Dooling said that to support the "maximum amount of data" generated by the GSC’s next-gen sequencers, the new data center will include additional CPUs, blades, storage, and networking capabilities.
Final numbers will be determined according to how “everything fleshes out,” Dooling said, adding that the center is currently installing about 620 nodes. Two-thirds of the center, when finished, will comprise “both highly dense disk, fiber-channel disk, SATA disk and the supporting infrastructure, backups and network to make it all work,” Dooling said.
The push toward next-gen sequencing is posing similar informatics challenges elsewhere. “All users of massively parallel shotgun sequencing technologies are faced with a new bottleneck in bioinformatics,” said Stephen Kingsmore, president of the National Center for Genome Resources, via e-mail. “This places significant demands on compute infrastructure to base call and align or assemble millions of reads, on RAM to visualize and analyze the results, and on storage capacity.”
NCGR said this week that it plans to officially launch the New Mexico Genome Sequencing Center, which will house two Illumina Genome Analyzers, on Nov. 19.
While acknowledging that Wash U's GSC is “much larger than the center that we're debuting,” Kingsmore said that the new technologies present many new challenges regardless of scale.
For example, he noted that many researchers doing large-scale sequencing projects “are dramatically different from a few years ago. Instead of being dedicated 'genomicists' who create community resources, they are physicians and scientists who wish to use genome sequencing to test hypotheses.” These users, he said, require “a very different, more user-friendly bioinformatic interface.”
Of the challenges that lie ahead for Wash U and other genome centers, Kingsmore said, "While the cost of sequence generation is declining rapidly, making larger and larger projects possible, the cost of bioinformatics is not. Sequencing project budgets are becoming dominated by bioinformatics costs."
He added that Wash U is not the first to recognize "that this is not just an incremental change and will require major change."
However, he noted that adding new storage and processing capacity may not be enough in the long run. “Soon no one will be able to continue to scale their compute infrastructure with sequence generation; we need to develop more efficient algorithms and file formats and to rethink how we process and store data,” Kingsmore said.
He added, “One pretty obvious sea change is to delete the raw data when basecalling, alignment, QC have been completed. We're thinking about the actual read format to see how we can store the needed information much more effectively from a machine language standpoint.”
Then Kingsmore quipped, “We haven't had a Eureka moment yet.”