The Southwest Foundation for Biomedical Research plans to more than double its compute cluster and increase its storage capacity tenfold as it begins to migrate several of its large-scale population genetics studies to next-generation sequencing platforms.
SFBR's Department of Genetics was recently awarded a $2 million grant from the National Center for Research Resources to expand the cluster in its AT&T Genomics Computing Center from 3,000 processors to more than 8,000 processors. The center's storage capacity, meantime, will grow from 50 terabytes to around 500 terabytes.
The SFBR genetics group specializes in gene localization and identification studies involving very large extended pedigrees — a task that is already very computationally intensive. Now, said John Blangero, a statistical geneticist at SFBR and director of the computing center, "we're really gearing up for the next big phase, which is going to be complete genome sequence data on our large epidemiological-sized studies."
Blangero said that the center currently has two Illumina Genome Analyzer IIx systems in place and has ordered a HiSeq 2000 instrument that will be installed in July. "We're really going to take a big jump in storage, which is really necessary these days with high-throughput sequencing," he said.
For now, the SFBR researchers are conducting exome sequencing on samples that they've already collected for linkage analysis and genome-wide association studies.
The plan is to perform whole-genome sequencing on the same samples in the future, but that will require the price of sequencing to fall a bit further. "We really need the whole genome cost to come down to a thousand, two thousand dollars before we launch whole-genome sequencing," Blangero told BioInform.
"We'll move there as quickly as possible, though, because we definitely are of the opinion that that's where all the fruit lay — in these many rare variations that will accumulate in different families," he said. "Every family will be a unique experiment that might lead us to a gene that we didn't know about."
SFBR, based in San Antonio, Texas, specializes in very large families. One study, the Jiri Helminth Project, involves 1,000 individuals from 50 families in the Jiri region of Nepal and is investigating genetic susceptibility to infection by parasitic worms. The foundation's flagship study, the San Antonio Family Heart Study, has followed approximately 2,000 people from around 40 families for more than 20 years.
"These large families are of course extremely valuable now that the whole tide of the field is turning towards rare variation versus common variation," Blangero said. "Even if something is family-specific, we can end up capturing 20, 40, 100 copies of a rare variant, which gives us the power to test whether or not they have an effect on the disease phenotypes that we're looking at."
As a result, he noted, "it's a very different sort of computing than what you'll find in most genome centers."
While many genome research groups have large compute clusters, they primarily focus on "piecing sequencing together," Blangero said. "But we're really a statistical genetics group. Our money comes primarily from relating sequence variation to human phenotypes, and particularly complex diseases."
Blangero and colleagues have developed a suite of software tools for statistical genetics, including Pedsys, a database system developed specifically for managing genetic, pedigree, and demographic data; SOLAR (Sequential Oligogenic Linkage Analysis Routines), a suite of algorithms for linkage and quantitative genetic analysis; and Pedigree/Draw, a genealogy visualization tool that is commercially available from Jurel Software.
Blangero said that he and his colleagues have already parallelized their software tools for the current cluster. They all scale "pretty much linearly," he said, so he expects a proportionate speedup when the cluster grows from 3,000 to 8,000 processors.
"The biggest set of variants we're currently working with is a million, and that's going to go up to 10, 12, 15, 20, 50 million as we go to sequencing," he said, so the expanded capacity "will help us get through that in a reasonable amount of time."
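If the tools do scale linearly as Blangero describes, the expected speedup from the expansion is simply the ratio of processor counts. A minimal sketch of that back-of-the-envelope arithmetic — the processor counts are from the article, but the baseline runtime is a hypothetical figure for illustration:

```python
# Back-of-the-envelope: under ideal linear scaling, runtime on a fixed
# workload shrinks by the ratio of processor counts.
# Processor counts are from the article; the 24-hour baseline is hypothetical.

old_procs = 3_000
new_procs = 8_000
speedup = new_procs / old_procs  # ~2.67x under linear scaling

old_hours = 24  # hypothetical runtime for one analysis on the old cluster
new_hours = old_hours / speedup
print(f"Speedup: {speedup:.2f}x; a {old_hours}-hour job drops to ~{new_hours:.1f} hours")
```

In practice the speedup on any real workload will fall somewhat short of this ideal, since no parallel code scales perfectly.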
As for the hardware the foundation will use for the expanded cluster, Blangero said that the center has decided on Opteron chips, but is still weighing whether to use four-core or six-core processors.
"It's going to come down to the final price," he said, though he noted that "it kind of looks like we'll go with the [six-core processors] right now."
One advantage of the six-core processors, he said, is power efficiency. "We have a relatively finite amount of power that we're working with here, and [as for] cooling, we are in Texas, after all."
He added that the center should decide on the final architecture within the next month. SFBR has hired M&A Technology, an IT firm based in Carrollton, Texas, to install the cluster.
Blangero noted that SFBR's focus on family-based studies offers several advantages when it comes to analyzing sequencing data.
"It's actually very efficient for sequencing because if you're sequencing unrelateds you basically have to sequence everybody, but in families you get an efficiency because you can impute sequence," he said. "So I can usually get maybe two free sequences for every sequence that I do in a family, but it has to then be really put together and statistically imputed."
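Blangero's "two free sequences for every sequence" estimate implies a simple multiplier on effective sample size. A rough sketch of that arithmetic — his two-for-one ratio is from the article, but the sample count below is hypothetical:

```python
# Illustrative arithmetic for pedigree-based sequence imputation:
# Blangero estimates roughly two statistically imputed genomes for each
# genome actually sequenced within a family. The sample count is hypothetical.

sequenced = 650            # individuals actually sequenced (hypothetical)
imputed_per_sequenced = 2  # Blangero's rough estimate from the article
effective = sequenced * (1 + imputed_per_sequenced)
print(f"{sequenced} sequenced -> ~{effective} effective genomes")
```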
In particular, he said, the team has a "huge advantage" when it comes to rare variation. He noted that it's currently estimated that every person has 200 new variants, which means that there could be more than a trillion possible variants across the world's population of seven billion people.
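The trillion-variant figure follows directly from the two numbers cited. A quick check of the arithmetic:

```python
# Checking the article's estimate: ~200 new (de novo) variants per person,
# multiplied across a world population of seven billion people.

new_variants_per_person = 200
world_population = 7_000_000_000
total = new_variants_per_person * world_population
print(f"~{total:.1e} possible variants")  # comes to 1.4 trillion
```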
"The potential number of rare variants out there is enormous, and if you're only looking at unrelateds, you just see a single copy no matter what, so by definition it's a private variant," he said. "But if you happen to have the founder of a larger pedigree, then getting 20, 30 copies isn't that hard. And that's what lets you do the test in the end."
Blangero noted that working with large families is "critical for this swing of the pendulum from common variation that we all focused on in [genome-wide association studies] to the next phase, this deep sequencing variation that's likely to [reveal] most of the meat, and the meat will be in the rare variation."
There are challenges associated with moving these studies toward sequencing, however. For one thing, Blangero said, "when we start to get these hits we'll go back to swamping our functional laboratories again."
In addition, he noted that human genetics now faces "a serious power problem" based on the amount of data that is available. "Basically every variant that you find is a hypothesis, so you get this huge multiple testing problem," he said. "I think we'll be doing a lot of looking for intelligent prior information that we'll basically dredge bioinformatically [in order] to place prior hypotheses on likely functionality of these variants that we'll then use to kind of filter the number of tests."
He added, however, that it's "still early days for that because we're not very good at bioinformatically predicting the function of a typical sequence variant. We're pretty good at coding variation, but we're not very good at regulatory variation yet."
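The payoff of the filtering strategy Blangero describes can be seen in a standard multiple-testing correction: cutting the number of tests relaxes the per-test significance threshold. A sketch using a simple Bonferroni correction — the variant counts are hypothetical, and this is an illustration of the general idea, not SFBR's actual pipeline:

```python
# How prior-based filtering eases the multiple-testing burden.
# Under Bonferroni correction the per-test threshold is alpha / (number of
# tests), so fewer tests means a less punishing threshold per variant.
# All counts below are hypothetical, not SFBR's actual figures.

alpha = 0.05
all_variants = 20_000_000    # e.g. every variant from whole-genome sequencing
filtered_variants = 100_000  # variants passing a functional-prior filter

threshold_all = alpha / all_variants
threshold_filtered = alpha / filtered_variants
print(f"Unfiltered threshold: {threshold_all:.1e}")
print(f"Filtered threshold:   {threshold_filtered:.1e}")
```

With these illustrative numbers, filtering to one in 200 variants makes the significance threshold 200 times easier to reach, which is exactly the power gain Blangero is after.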