The National Human Genome Research Institute this week awarded the University of California, Santa Cruz’s Center for Biomolecular Science and Engineering a four-year, $5 million grant to establish a data coordination center to collect, store, manage, and display all data from the full-scale version of ENCODE. The $80 million expansion moves ENCODE beyond its pilot phase, which studied just 1 percent of the genome.
Jim Kent, David Haussler, and colleagues at UCSC are scaling up the UCSC Genome Browser to handle enormous amounts of data from the next phase of the Encyclopedia of DNA Elements project, an effort to identify and understand the biologically functional elements of the human genome.
Moving from 1 percent to the complete human genome would seem to present formidable data-management challenges — the pilot phase of the project generated more than 200 data sets and analyzed more than 600 million data points — but the Haussler-Kent team has had plenty of experience with high-profile genomics projects. Kent originally developed the UCSC Genome Browser to house data for the Human Genome Project, and the system also served as the repository for sequence data in ENCODE’s pilot phase.
Kent told BioInform that the center will use the funding to grow its CPU capacity from 500 new and 1,000 “older” units to a total of 1,700, and said that his group will also develop new software tools for the UCSC Genome Browser to help researchers quickly find the data they need.
From Pilot to Production
NHGRI launched the ENCODE research consortium in September 2003. The project comprises three phases: a pilot project phase, a technology development phase, and a production phase. The pilot phase wrapped up earlier this year, culminating in a paper in the June issue of Nature and 28 companion papers in the June issue of Genome Research. The technology development phase ran in tandem with the pilot phase and will continue into the production phase.
During the pilot phase, the UCSC team housed sequence data for the project and tested and compared existing methods to analyze genome sequence data. Despite this experience, however, the group still needed to pass NHGRI’s formal RFA process before winning the award for the full-scale project, according to Peter Good, extramural research director for genome informatics at NHGRI.
Good said that several other proposals went through peer review before UCSC was formally selected, though he was not at liberty to disclose the names of the other applicants.
One key difference between the pilot phase and the production phase from an informatics perspective is that “we expect Jim to do a lot more tracking of the data that’s produced,” Good said.
That will involve coordinating data coming from around a dozen different labs, and looking to see if all the files “contain all the information we expect,” Good said. “We expect any metadata associated with this data — how experiments were done, any computational methods done, et cetera, [to be deposited along with the data] — and we expect him to come up with ways of storing that.”
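The kind of submission checking Good describes — confirming that files from each lab arrive with the expected experimental and computational metadata — can be illustrated with a short sketch. The field names here (lab, assay, protocol, analysis) are hypothetical examples, not the DCC’s actual schema:

```python
# Illustrative sketch of metadata validation for incoming data submissions.
# REQUIRED_FIELDS is a hypothetical set of metadata the center might demand
# alongside every uploaded data file.

REQUIRED_FIELDS = {"lab", "assay", "protocol", "analysis"}

def missing_metadata(submission: dict) -> set:
    """Return the required metadata fields absent from a lab's submission."""
    return REQUIRED_FIELDS - submission.keys()

# A lab's upload, described as a dict of metadata accompanying the data file.
upload = {"lab": "ExampleLab", "assay": "ChIP-chip", "protocol": "v2"}
print(sorted(missing_metadata(upload)))  # fields the lab still needs to supply
```

In practice, a check like this would run on each file deposited to the center’s upload interface, flagging incomplete submissions back to the contributing lab before the data is accepted for display.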
Following the pilot project, Good said, NHGRI realized “we needed better control over data … We needed more of a data coordination center, and not just someone who is going to house it and display it.”
According to UCSC’s grant abstract, participation in the pilot phase of ENCODE enabled the team to establish “cost-effective, high-throughput approaches for incorporating and displaying ENCODE data, and we have developed an effective interface to the consortium for uploading data files and methods documentation to our FTP site.”
The abstract outlined several goals for modifying the UCSC Genome Browser to meet the requirements of the ENCODE DCC, including coordinating with ENCODE data providers to collect, store, and manage sequence-based functional element data and related metadata; providing a long-term storage system for ENCODE data; and providing the research community with freely available tools to access, search, and analyze the ENCODE data.
Kent told BioInform this week that scaling the resource up one-hundred-fold isn’t necessarily a daunting task. He said that computers and their ability to handle increased data load aren’t as important as other factors.
“The quantity is some challenge, but we’ve already done that with a lot of whole-genome data sets,” said Kent. “The big challenge is not so much the scale of the data, but the diversity of the data.”
He said “it’s easier on the computational side than on a lot of sides, I think, [because] once you write a software program, [scaling it up] is easier … whether it’s 10 things or a thousand things … that’s one of the beauties of computers: they are so good at doing the same thing over and over.”
He added that one challenge could also be “to make it so that it’s easy and convenient to find the bits of data that are relevant to a particular researcher, [such as] that 10 terabytes [of information the researcher needs or] the one web page that will really make or break their paper.”
Cluster Muscles the Effort
To support this effort, Kent’s group has armed the center with a significant hardware and software boost.
At the heart of the DCC is a computer cluster and data-storage facility that comprises two large machine rooms. The group’s current CPU cluster is from Rackable Systems and Kent said the team is very happy with it, though other vendors “are under consideration” as the group increases its CPU capacity.
Kent acknowledged that CPUs are “relatively power-hungry,” which will make power consumption a primary consideration.
To manage the cluster, the UCSC team is using Parasol, a batch-scheduling system that Kent developed, as well as Ganglia, an open-source distributed monitoring system.
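The appeal of a batch scheduler for work like this is that one command template fans out across many input files — the “same thing over and over” that Kent describes. Setting aside Parasol’s actual job-list syntax, a minimal sketch of generating one job per input file might look like this (the program name and file paths are illustrative assumptions, not the DCC’s real pipeline):

```python
# Sketch of batch-job generation: one scheduler job per input file.
# "processTrack" and the .out naming convention are hypothetical.

def make_job_list(input_files, program="processTrack"):
    """Build a plain-text job list, one command line per input file."""
    return [f"{program} {f} {f}.out" for f in input_files]

jobs = make_job_list(["chr1.bed", "chr2.bed"])
for job in jobs:
    print(job)
```

A scheduler then dispatches each line to an idle cluster node, so processing a thousand files costs no more programming effort than processing ten.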
Asked what he hopes to achieve with the project, Kent said “Mostly, we want to just make the data accessible to everyone, to make the interfaces fast and intuitive and make sure that insomuch as possible, it’s very easy to … not only see the data, but see how it was gotten and what it needs.”
Kent added that “the biggest resource we use in creating the system is people. It’s a mixture of programming effort, documentation effort, and … collaborative effort.”
The production stage “is also [about] making sure that it looks good to our users and is also understandable to our users,” Kent said.