NEW YORK (GenomeWeb) – GenePattern, the web-based portal for genomic analysis, is moving onto a supercomputing cluster at the Indiana University National Center for Genome Analysis and Support (NCGAS). The move is being made with an eye towards eventually relocating away from the Broad Institute and expanding to other supercomputing resources around the country.
Last week, the researchers and institutions behind GenePattern announced that the software platform is now available on the Mason supercomputer cluster at NCGAS. The program provides hundreds of bioinformatics analysis methods and visualization tools to researchers with little experience in computer programming.
Michael Reich, assistant director of bioinformatics at the University of California, San Diego School of Medicine and a GenePattern veteran, said that the Mason cluster greatly increases the computing power and storage available to users. The center boasts 288 dual-CPU nodes running a total of 4,608 cores and 3.5 petabytes of storage.
NCGAS would help accommodate increasingly compute-intensive analyses, increasingly large datasets, and a growing number of researchers doing this type of work, he said.
In addition to getting the web-based interface to run on a bigger engine, the GenePattern and IU teams have collaborated to parallelize one of the more frequently used algorithms and prepare the software for new kinds of analyses not even dreamt up yet.
NCGAS is the first of several high-performance computing installations that will host GenePattern on resources in the Extreme Science and Engineering Discovery Environment (XSEDE), a nationwide cyberinfrastructure network, with more to come soon.
Reich noted that for the foreseeable future, the Broad will continue to host a deployment of GenePattern.
GenePattern was first introduced to the public in 2004 and has gained over 50,000 users worldwide, Reich said. Accounts are free and open to the general research community with few restrictions on who can join. Initially working out of the Broad Institute, the GenePattern team led by Reich and computational biologist Jill Mesirov moved to UCSD in June of last year.
The program is designed to provide a "friendly interface" for many types of genomic analysis, such as gene expression, single nucleotide polymorphisms, copy number variations, proteomics, network analysis, clustering, and classification. Those are being joined by newer methods, some based on information-theoretic and Bayesian approaches.
But new technologies such as short-read sequencing, along with the combination of multi-omics data types, are increasing the compute power and storage required for analysis, Reich said. "Recently we looked around and saw that at IU, specifically the NCGAS, had the existing infrastructure to support these bioinformatics tools."
The IU team said that the Mason cluster running GenePattern has 10 times as many nodes as the GenePattern cluster at the Broad, sporting 512 GB of memory per node, compared to the 32 GB per node that the Broad has for the service. And soon IU will move GenePattern to a new cluster with even more capacity and even faster processors.
Founded in 2011 with a $1.4 million grant from the National Science Foundation and renewed last year with a $627,854 award, the NCGAS has the goal of making the nuts and bolts of bioinformatics all but invisible. "What we let cancer scientists do is focus on cancer science," Craig Stewart, executive director of IU's Pervasive Technology Institute, said.
The technical team at IU handles programming to let bioinformatics software take full advantage of the center's computing resources. "GenePattern has been one of those cases where we can step in and provide more computational muscle as well as work on the underlying algorithms to make them more efficient," NCGAS manager Tom Doak said. His facility also supports genomic analyses in other fields, such as ecology and more basic genomics. Many researchers are obtaining these large genomic data sets for the first time and don't know what to do with them. His job is to help them, and GenePattern offers a straightforward way to do that.
One of the specific challenges the collaboration has worked on is optimizing the algorithms to make full use of the available hardware. Different analyses place different demands on a machine. "In assembling a transcriptome, what you need is a machine with a lot of memory," Reich said. "But if you're trying to identify complementary genomic aberrations in different types of cancer, you'd want a parallel architecture, where individual nodes don't need as much memory." Simply moving such large volumes of data around requires special attention.
One example of a new algorithm implemented in GenePattern with the help of IU is Revealer, an iterative approach to uncovering context-dependent complementarity of genomic alterations published in April in Nature Biotechnology.
"Diseases such as cancer are caused by more than one genomic alteration and many times they're complementary or mutually exclusive," Reich said. The Revealer algorithm uses a metric called mutual information to find a collection of mutually exclusive genomic alterations.
"You can connect those to outcomes," he said, allowing a scientist to ask a question like "How dependent is a tumor's activity on certain chromosomal aberrations?"
"One benefit [of Revealer] is your data don't have to all be the same type," Reich said. The analysis can incorporate many types of genomic alterations. But that's computationally intensive. "It's not practical to run on a laptop. You would need a supercompute resource to do the iterative approach this represented," he said.
Revealer is already available in GenePattern and the team is working on implementing other new algorithms, like one to decompose gene expression transcriptional signatures into component parts. "In the same way that an algorithm can decompose a facial image into eyes, nose, and hair, we're trying to decompose cancer samples into their component cellular states to determine how those states may affect a tumor's drug sensitivity," Reich said.
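Reich did not name the method behind this decomposition. One well-known technique that produces parts-based decompositions of this kind is non-negative matrix factorization (NMF), which approximates a non-negative data matrix as the product of two smaller non-negative matrices. The sketch below assumes NMF purely for illustration; the matrix `v`, the rank, and the helper names are hypothetical, and the updates are the standard multiplicative rules for squared error.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor v (genes x samples) into w (genes x rank) and h (rank x samples)
    using multiplicative updates; all entries stay non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


# Invented expression matrix: 4 genes x 3 samples, built from two "states".
v = [
    [2.0, 0.0, 1.0],
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
    [0.0, 3.0, 1.0],
]
w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Here the columns of `w` play the role of component states and the columns of `h` say how strongly each sample expresses each state. Real expression matrices span thousands of genes and samples, so a production run would use an optimized numerical library rather than nested lists.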
The teams are also prepping GenePattern for a new generation of analyses that will combine multi-omics data types.
"Many of the analyses, they're very diverse and require very different compute dependencies and even different versions of the same [coding] language," Reich said. In addition to supporting all the existing methods available in GenePattern, the teams wanted the system to be able to grow along with the field as new methods are developed.
This work is resulting in a system that is general enough to be run on other supercomputing resources. Because the XSEDE network also includes the Pittsburgh Supercomputing Center (PSC) at Carnegie Mellon, the Texas Advanced Computing Center, and the San Diego Supercomputer Center at UCSD, what's implemented at IU can be implemented elsewhere as well.
Doak, the NCGAS manager, said that his team has begun working with the PSC to implement GenePattern on the Bridges cluster.
"We envision a GenePattern 'gateway' that has access to all of the resources of those sites and the different types of compute requirements they support," Reich said.