Pittsburgh Supercomputing Center
This week, the National Institutes of Health’s National Center for Research Resources awarded the Pittsburgh Supercomputing Center $8.5 million to renew its program in biomedical supercomputing for five years.
PSC’s biomedical supercomputing program was established in 1987 and renamed the National Resource for Biomedical Supercomputing, or NRBSC, last year.
The renewal supports NRBSC’s research in three core areas: cellular modeling, large-scale volumetric visualization and analysis, and computational structural biology.
BioInform spoke to Joel Stiles, a senior scientist at PSC, after the award was announced this week to get a better idea of how the NIH award will help advance the NRBSC’s goals.
Can you provide an overview of the NRBSC’s current activities and what projects this award will support?
Historically, I would say that the high-performance computing crossover into biology has been in two areas. One is in molecular structure analysis and prediction — protein folding, protein structure types of issues. That’s been ongoing for quite a number of years, and it still remains a very computationally demanding area, where bigger and bigger computers are still needed.
Another general area historically has been in bioinformatics, which covers a lot of ground, but the particular part of it that I’m talking about is genome analysis, genome sequencing. At this stage of the game, that has evolved more to analysis of sequence and comparison of sequences — across different species, across different subpopulations within a species, these kinds of things.
So that’s in very quick, broad strokes, the historical backdrop. What’s happening today is that it’s now being expanded into areas that focus on new types of data acquisition on very large scales, which in many cases use biomedical imaging technologies — CAT scan data, MRI data, electron microscope data, all kinds of different light microscopy datasets.
These are all being collected more and more frequently and at ever higher resolution. So the kinds of data that we’re now getting for molecular information are being coupled to visual information with all kinds of different modalities at very large scale.
Once you start getting that combination, then you’re in a position to start thinking very seriously about moving into modeling and simulation of biological systems with the real spatial design of the biological system coming from the imaging, and the biochemical aspects of the system coming from the bioinformatics and biochemistry and high-throughput data collection.
So we’re sort of at a crossover point now where the pieces are coming together in such a way that it will allow us to push forward in modeling and simulation, which will, we hope, dramatically influence our ability to understand how biological systems work in real life, and why they go wrong in some cases, and of course, how to fix things when they go wrong.
In our particular case, this new award was coincident with a kind of redefinition of our own group, so at this point we focus on three areas: We have one focus area on spatially realistic modeling and simulation of cells; and then we have another that is focused on this issue of very large-scale imaging data acquisition, analysis, and visualization; and then we have another, which is the pre-existing focus on molecular structure at the protein level in particular, and several subcategories of that, for example ion channels and enzymes.
And then there’s still a connection to bioinformatics approaches in the protein structure modeling and prediction, because much of what goes on there still makes heavy use of pre-existing protein structure information from databases in order to basically obtain starting points for analysis of unknowns.
What are some of your specific goals? It seems like there would be a lot of challenges in all three of these areas.
In each of these areas, we’re intimately involved with software development, as well as applications. So a large part of our goal is to continue to develop the software packages that go inside each of these three focus areas, because at present a large impediment to what we’ve been talking about, especially modeling and simulation, is a lack of simulation software.
In order to do spatially realistic cell modeling, to use that label, you have to have some way to go through an elaborate pipeline at the moment of starting with some kind of imaging data, pulling out from that imaging data the bits and pieces of it that are relevant to the biological question in mind, building a model from those bits and pieces in 3D detail, and then populating that model with molecules and chemical reactions and what not. And again, you’d want to have some bioinformatics database sources feeding you that information, if possible. And then ultimately you want to run some kind of simulation of that system to see if you understand how it works, and to try and predict things that can be tested in a laboratory.
At present, each one of those steps that I just went through takes a lot of energy because the software is still being actively developed and there are a lot of difficult problems in designing software to do all those steps effectively, let alone having it created in such a way that your non-computer-expert scientist can use it.
So a large part of the challenge right now is just to get these different pieces of elaborate software to do what we need them to do, to get them to work together efficiently and quickly, and of course to make it easy for others to do so as well.
In our center we’re trying to do all that, plus then collaborate with experimentalists in a variety of different settings and actually apply it at the same time. And that’s how we keep a real-world check on how useful it is and how usable it is outside of our own hands.
How many people do you have working on these projects?
There are about 15 or so within the context of this new award.
How close would you say you are to the goal of having a cell model in which you can run in silico experiments?
We can already do that, and we do it, but it’s at the scale of pieces of cells, not an entire cell. So we do that now, and other people who use our software do that now for pieces of cells. To go to a complete cell model in three-dimensional detail is a very big challenge, and then to go beyond that to very detailed tissue modeling, where you’re dealing with populations of cells and circulatory systems and all types of other things running through them, is yet another level of gigantic expansion of the problem.
I think that to get a real handle on quantitative models of whole cells, let alone populations of cells, we’re talking about at least petascale computing. So the NSF’s current competition for a petascale computing award becomes very relevant to this aspect of biology. In fact, at a recent NSF workshop this was one of the major areas of discussion, [that one of the] driving problems for petascale computing in biology is cell modeling, and going then from cell modeling to tissue modeling in three-dimensional detail.
So the [petascale] hardware will come online in the space of five years or so. In most people’s minds, the biggest impediment is the software development. The difficulties and challenges of software development are far beyond the difficulties and challenges of building the next generation of big machines. And it’s very difficult to predict how long that will take. I don’t really see it on a five-year timescale, but I certainly see it on a 20-year timescale.
One analogy that I like to make for people to help put these things in perspective is to think about weather prediction, and what weather prediction was like 20 years ago. Twenty years ago, we’d turn on our TV set and watch the evening news and we’d see the weather forecast for the next day and we would mostly laugh at it.
Nowadays, we take for granted turning on our TV set or looking on the web and seeing a five-day forecast, or even a two-week forecast, and the vast majority of time it’s pretty accurate. And if we have something like a large storm system coming up now, the question is no longer will it strike land, but exactly where will it strike land, at what time, and with what force? And how long will it stay there?
This is a gigantic transformation in capabilities that’s happened over the past 20 to 30 years. Now these are slow transitions and you don’t notice them while they’re happening, but now you watch your nightly weather report on the news and you see these three-dimensional models of the globe, and you see all the clouds and the weather systems moving across, and you see the replays of the real-time Doppler radar and so on. You have a very concrete example of a synthesis of large-scale data collection, supercomputer modeling, and predictions that have a huge impact on all of our lives every day. And this has happened in the space of 20 to 30 years.
Now if you think about what we’ve been talking about in biology, large-scale data acquisition, large-scale high-resolution imaging modalities in biomedicine, a growing set of hardware tools, and people working on software tools to go with all of that aimed at modeling and simulation, then if you look 20 or 30 years down the line, you can really think, ‘Wow, there’s a real chance that at that point in time each of us will have our own genome mapped, we’ll know a whole lot about the expression of different protein variations in each of us individually, and we’ll have imaging and simulation tools at our disposal that will help us actually predict our own medical needs and how to care for ourselves on an individualized basis.’
There are some other cellular modeling and simulation projects like E-Cell and the Virtual Cell at the University of Connecticut. What sets your project apart from these, or would you say they’re all moving toward the same goal?
I’d say we’re moving toward the same goal, but what sets us apart in the cell modeling area is the kinds of algorithms that we use. Our work is very complementary to theirs in that way. All of our modeling is based on a program called MCell, which uses stochastic simulation of diffusion and chemical reactions. So when we model a piece of a cell, we’re actually tracking all of the individual molecules in space interacting with each other in some way.
These other packages that you mentioned, either in total or for the most part, are based on what you would call continuous methods, rather than stochastic methods. So they model the behaviors of populations of molecules using mathematical equations, rather than tracking individual molecules through space and time.
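[Editor's illustration: the contrast Stiles draws between stochastic and continuous methods can be sketched with a toy example. The code below is not MCell's algorithm or API; it is a minimal sketch assuming a single first-order decay reaction and one-dimensional Brownian motion, and all function names and parameters are hypothetical. The stochastic version follows every molecule individually, while the continuous version describes the whole population with one equation.]

```python
import math
import random


def stochastic_decay(n0, k, dt, steps, diff_coeff=1.0, seed=0):
    """Particle-based sketch: track every molecule individually.

    Each time step, a molecule either reacts (is removed) with
    probability 1 - exp(-k*dt) or takes a random 1D diffusion step.
    Returns the number of surviving molecules after each step.
    """
    rng = random.Random(seed)
    p_react = 1.0 - math.exp(-k * dt)
    sigma = math.sqrt(2.0 * diff_coeff * dt)  # std. dev. of a Brownian step
    positions = [0.0] * n0
    counts = [n0]
    for _ in range(steps):
        survivors = []
        for x in positions:
            if rng.random() < p_react:
                continue  # this molecule reacted and disappears
            survivors.append(x + rng.gauss(0.0, sigma))
        positions = survivors
        counts.append(len(positions))
    return counts


def continuous_decay(n0, k, dt, steps):
    """Population-based sketch: N(t) = N0 * exp(-k*t), no individual molecules."""
    return [n0 * math.exp(-k * dt * i) for i in range(steps + 1)]
```

[The cost of the stochastic version grows with the number of molecules tracked, whereas the continuous version costs the same regardless of population size. For large populations the two agree closely on average; the individual-molecule approach matters most when molecule counts are small or spatial detail is important.]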
So that’s a huge difference computationally.
Yes. That’s why it will grow into petascale computing.
In terms of the hardware, are there unique architecture requirements for a project like this, or is it just a matter of more and more compute nodes?
There are some differences. Because we’re doing stochastic simulations and we oftentimes are looking at phenomena that, like real biology, are things firing at different places in space at different times, two things can come up. One is, you need access to memory in kind of random ways, and you also may have big problems with one piece of a computer doing a whole bunch of work at one point in time and then later sort of sitting there with nothing to do. And yet another part of the computer is doing a whole bunch of work. So it makes it harder to apportion the work efficiently across a big computer. Other types of projects, in some cases at least, don’t have that kind of load imbalance problem, so they’re more easily able to use a big machine efficiently.
So load balancing, memory access, and the amount of memory available to each processor are things that we’re interested in and in some other cases are not as big an issue.
Here at the PSC we have a variety of different types of machines, and some of them fall into large shared memory classes and others are the traditionally really big massively parallel machines with distributed memory. So that sort of, to some extent, dictates who runs what on which machine.
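[Editor's illustration: the load-imbalance problem described in this answer can be shown with a small sketch. It assumes a hypothetical list of per-region work estimates and compares a fixed spatial split against a simple greedy assignment; neither is claimed to be what NRBSC's software does, and a real stochastic simulation would also have to rebalance as activity moves over time.]

```python
def static_block_loads(work, n_workers):
    """Give each worker a contiguous block of spatial regions."""
    size = -(-len(work) // n_workers)  # ceiling division
    loads = [0.0] * n_workers
    for idx, w in enumerate(work):
        loads[idx // size] += w
    return loads


def greedy_loads(work, n_workers):
    """Assign the largest remaining region to the least-loaded worker."""
    loads = [0.0] * n_workers
    for w in sorted(work, reverse=True):
        loads[loads.index(min(loads))] += w
    return loads


def efficiency(loads):
    """Mean load divided by maximum load; 1.0 means perfectly balanced."""
    return (sum(loads) / len(loads)) / max(loads)


# Hypothetical workload: twelve quiet regions and four very active ones
# that happen to sit next to each other in space.
work = [1] * 12 + [20] * 4
static_eff = efficiency(static_block_loads(work, 4))
greedy_eff = efficiency(greedy_loads(work, 4))
```

[With the fixed spatial split, one worker receives all four active regions while the other three mostly wait, so the machine is used at well under half its capacity. The greedy assignment spreads the active regions out and keeps all workers equally busy in this toy case.]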
Do you have any dedicated computational resources for biomedical supercomputing, or is it all shared?
We do have a [group] of smaller machines that were funded directly through a previous part of this same award, and they are dedicated to biomedical projects.
Is there anything else that you think is worth adding about the goals of this project?
Our main goal through this upcoming award period is to increasingly be able to integrate these different software pieces to allow us to create models of cells and run simulations on them much more efficiently at large scales, starting from real imaging data sets, and also then educate the rest of the scientific community on how to do this kind of thing efficiently, and make the tools available for them to do that.
I would say it’s really an overriding aim of synthesis and integration of these different tools for this next period.