GWAS Studies Drive SNP Analysis Onto a Growing Bioinformatics Computational Grid

Several recent announcements indicate that distributed computing is gaining status as a valuable high-performance computing option in the bioinformatics community.
 
This week, the University of California, San Diego, announced that it is coordinating an international bioinformatics grid project that will link computational resources from six different countries in an effort to find drugs to treat avian flu.
 
That news followed last week’s announcement that SNP-analysis software firm Golden Helix has enabled its HelixTree and PBAT software packages to run on United Devices’ Grid MP platform — the first SNP-analysis software to run on a grid architecture, according to the companies.
 
Lastly, United Devices announced last week that it has signed on Bristol-Myers Squibb as its seventh top-10 pharmaceutical customer.
 
These agreements underscore the fact that many emerging bioinformatics applications require so much processing power that the grid approach — which harnesses available cycles from a distributed network of clusters, desktop grids, and supercomputers that can range anywhere from a handful of CPUs in a single office to hundreds of thousands of processors spread around the globe — is often the best available option, and sometimes the only one.
 
Grid-Enabled GWAS
 
In the case of Golden Helix, the company has found that the rise of genome-wide association studies has dramatically increased its customers’ computational requirements.
 
“What’s driving the change is the industry’s movement toward whole-genome association studies,” Josh Forsythe, director of marketing for Golden Helix, told BioInform this week.
 
“The size of the data sets involved with this type of analysis is becoming extremely large,” he said, noting that Affymetrix recently introduced an array with nearly one million SNPs, and Illumina plans to launch one with over a million SNPs later this year. “Multiply this by the thousand or more samples in a given study, and you can expect incredibly intense computations,” he said.
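To put that multiplication in concrete terms, the short sketch below counts genotype calls for two study sizes quoted in this article. The function name is illustrative only, and the calculation assumes one genotype call per SNP per sample; it says nothing about how any vendor's software stores or processes the data.

```python
def genotype_calls(snps: int, samples: int) -> int:
    """Number of genotype calls, assuming one call per SNP per sample."""
    return snps * samples


# A one-million-SNP array run on 1,000 samples (figures quoted by Forsythe).
print(f"{genotype_calls(1_000_000, 1_000):,}")  # 1,000,000,000

# The 500K-SNP screen on 1,500 samples described by Torrey Pines.
print(f"{genotype_calls(500_000, 1_500):,}")  # 750,000,000
```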
 
Forsythe said that the integration of the company’s software with the UD platform was driven by the requirements of an undisclosed customer that was running whole-genome analyses on a single processor, where some calculations took nearly a month.
 
“They already had a large UD grid when they acquired one of our whole genome-related products,” he said. “When they and some of our other customers told us of the amount of time it was taking to run more detailed analyses, grid-enabling the tool became a logical decision.”
 
Forsythe estimated that around 75 percent of the company’s customers are involved with some variety of genome-wide association studies. On average, he said, customers running the software in a grid-enabled environment see nearly a 50-percent reduction in processing time for every doubling of processors used: A job that takes an hour with one processor would take around a half hour with two processors, 15 minutes with four processors, and so on, he said.
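The arithmetic Forsythe describes corresponds to ideal linear scaling, in which runtime is the single-processor time divided by the number of processors. The sketch below reproduces his example under that assumption. The function name is hypothetical, and since he cites "nearly" a 50-percent reduction per doubling, real jobs would be expected to run somewhat slower than these idealized figures.

```python
def ideal_runtime_minutes(single_cpu_minutes: float, processors: int) -> float:
    """Runtime under ideal linear scaling: doubling the processors halves the time."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return single_cpu_minutes / processors


# A job that takes one hour on a single processor.
for n in (1, 2, 4):
    print(f"{n} processor(s): {ideal_runtime_minutes(60, n):g} minutes")
```

Under this idealized model the loop prints 60, 30, and 15 minutes, matching the hour, half hour, and 15 minutes in the example.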
 
In one case, Forsythe said, “a major pharmaceutical company [that] is relatively new to grid computing for genetic analysis” is seeing a 90-percent total reduction in processing times “with far greater reductions possible as they optimize their job/processor utilization.”
 
Another Golden Helix customer, Torrey Pines Therapeutics, is running the Golden Helix software on Condor, a freely available grid computing platform.
 
“We were completing a 500K SNP genome screen on a 1,500-sample data set, and we discovered that one computation would take weeks on a single CPU,” Kari Ohlsen, director of bioinformatics at Torrey Pines, told BioInform via e-mail. “We were concerned that a single system could crash during that time, and knew that we might well have to do more than one computation.”
 
The company is currently running the software on a 12-CPU Condor-enabled system and is able to complete the 500K calculation in around two days, Ohlsen said.
 
“Running the calculation on roughly 10 times the number of CPUs cuts the compute time 10-fold,” she said. “This made this a doable calculation for us using existing systems.”
 
Battling an Epidemic
 
Making the undoable doable appears to be a key selling point of grid computing. Arthur Olson, a professor in the department of molecular biology at the Scripps Research Institute, told BioInform that what grid computing “really does is it makes something feasible” that might not be possible otherwise.
 
Olson’s lab developed the virtual screening program AutoDock, which will run on the international avian flu grid to help identify inhibitors for the neuraminidase enzyme of the virus. AutoDock is currently the basis for another project called FightAIDS@Home (http://fightaidsathome.scripps.edu/), which runs on the IBM-supported World Community Grid, an international network of around 500,000 processors.
 

For that project, Olson said, “What would have taken on the order of five years on the cluster that we have here at Scripps — which has around 2,000 processors — took on the order of five months on the World Community Grid.”
 
In the case of avian flu, the problem is even more time-critical, he said. “If a computation is going to take five years, an epidemic could come and go in the meantime.”
 
The avian flu project will run on the Pacific Rim Application and Grid Middleware Assembly, an international computational network that has been under development since 2002. The network, known as PRAGMA, currently includes 26 sites in 14 countries with a total of 726 CPUs, more than half a terabyte of memory, and 13.2 terabytes of online storage.
 
This week, the US Army’s Telemedicine and Advanced Technology Research Center awarded UCSD $350,000 to coordinate an effort to adapt PRAGMA to the bioinformatics challenges of avian flu — specifically, structure-based annotation for different subtypes of the avian flu virus, as well as virtual screening to identify antiviral compounds.
 
Olson said that his group plans to take advantage of the processing power of the international grid to integrate molecular dynamics with AutoDock’s ligand-docking capabilities in order to account for the flexibility of the target protein — an approach called relaxed-complex docking.
 
Instead of being represented as a static structure, the protein “is represented by an ensemble of structures that are derived from a molecular dynamics simulation,” Olson said. “The idea is that some small molecules or potential drugs could in fact catch the target in a particular configuration that otherwise you might not have seen.”
 
However, he noted, this additional calculation requires an enormous amount of computational power “because you’re not only looking at potentially thousands of small molecules interacting with a static target, but you’re looking at thousands of small molecules interacting with an ensemble of potential configurations of the target itself, so it’s kind of a moving target.”
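A rough way to see why the workload grows is to count docking runs: against a static target there is one run per small molecule, while against an ensemble there is one run per pairing of molecule and target configuration. The sketch below enumerates those pairings. It is an illustration only, with hypothetical names and toy sizes, and it does not call AutoDock or reflect how Olson's group schedules its jobs.

```python
from itertools import product


def docking_jobs(ligands, receptor_snapshots):
    """Pair every ligand with every receptor snapshot; each pair is one independent run."""
    return product(ligands, receptor_snapshots)


def job_count(n_ligands: int, n_snapshots: int) -> int:
    """Total docking runs for an ensemble; a static target is the case n_snapshots == 1."""
    return n_ligands * n_snapshots


# Toy sizes chosen for illustration; the article speaks of thousands of molecules.
ligands = [f"ligand_{i}" for i in range(3)]
snapshots = [f"snapshot_{j}" for j in range(4)]

jobs = list(docking_jobs(ligands, snapshots))
print(len(jobs))  # 12 runs, versus 3 against a single static structure
```

Because each pairing can be evaluated independently of the others, this kind of workload lends itself to being spread across many processors.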
 
Peter Arzberger, director of life science initiatives at UCSD and principal investigator on the PRAGMA/TATRC project, said that the avian flu initiative should serve as an effective test case for an international grid infrastructure.
 
“Science is a global activity,” he said, noting that the goal of grid computing is to create a truly on-demand computational resource for scientists. “It shouldn’t matter where you sit — you should be able to run jobs, get data, and perform your analysis,” he said.
 
Arzberger said that while there are a number of technical challenges in creating a truly interoperable international grid, some of the biggest barriers are economic, political, and social. For one thing, he said, “there is no international single source of funding,” so there is no top-down guidance regarding design goals as there has been for national grid efforts like the National Science Foundation’s TeraGrid in the United States.
 
In addition, he said, the current political climate has made it difficult for some of the PRAGMA participants from China to secure visas in order to attend international meetings.
 
“The biggest challenge,” he said, “is getting us all together at the same place at the same time.”
