After PNNL Scores SC’08 Victory, Proteomics Tool Development Continues


Deborah Gracio
Computational & Statistical Analytics Division
Pacific Northwest National Laboratory

At the recent Supercomputing ’08 High Performance Computing Analytics Challenge, a team of Pacific Northwest National Laboratory researchers was awarded “Best Overall” for an entry comprising multiple tools:

          • ScalaBLAST, an open source distributed implementation of BLAST;
          • SHOT, a new algorithm developed at PNNL;
          • Starlight, a visualization tool; and
          • MeDICi, middleware for data-intensive computing developed at PNNL and available for download.

Victor Markowitz, who heads Lawrence Berkeley National Laboratory’s Data Management and Technology Center and was previously chief scientific officer of Gene Logic, told BioInform in an e-mail that he and his colleagues use ScalaBLAST for content updates of IMG, the Integrated Microbial Genomes data resource hosted by the Department of Energy’s Joint Genome Institute.

“We have been using ScalaBLAST for very large computations regarding pairwise similarities of millions of genes across thousands of genomes and tens of metagenomes. Such computations, which would take weeks to months on medium-sized clusters running BLAST, take 1-2 days on PNNL's large-scale cluster running ScalaBLAST, which has been optimized to run on such clusters,” he said, adding that it “scales very well, but the effect is especially felt on large clusters such as that at PNNL.”

Chris Oehmen
Senior Research Scientist,
Computational Biology & Bioinformatics 
Pacific Northwest National Laboratory

The trend in the community is to provide smaller labs access to such large infrastructures, since they indeed have difficulties coping with large computations, he said. He added that scientists may not be familiar with ScalaBLAST since it is removed from their day-to-day work, but that their microbial genome analysis at IMG “is supported by computations carried out by ScalaBLAST.”

BioInform caught up with PNNL researchers Deborah Gracio and Chris Oehmen to speak about the victory and future plans.

Gracio directs PNNL’s Computational & Statistical Analytics Division. Her work has included R&D of integrated computational environments for biodefense, computational biology, computational chemistry, and atmospheric modeling.

Chris Oehmen is a senior research scientist in computational biology and bioinformatics at PNNL who works on high-performance computing applications for bioinformatics and computational biology.

He leads the ScalaBLAST project, works to find methods that optimize memory management for high-throughput bioinformatics applications, and explores implementations of support vector machines for binary classification in data-intensive problems.

The following text is an edited version of that conversation.

What happens now that you have won the competition?

Chris Oehmen: It was very exciting to win the challenge. It was the work of a lot of people and different agencies.

Can you talk a bit about the challenge, which involved 10 species of a bacterium called Shewanella and a search through 42,000 proteins? Where did the data come from?

CO: Shewanella is a microbe that can interact with heavy metals like the kind you would find at waste sites. The potential to help deal with legacy DOE waste sites is part of our mission. … The data we used came from public repositories. These were not special datasets only generated for us.

Deborah Gracio: PNNL has been involved in Shewanella research along with other national labs for many, many years. There is a Shewanella Federation that is funded by the Department of Energy to understand this organism and its ability to help in cleanup of nuclear waste sites.

Did the motivation for this challenge come from the fact that you at PNNL need tools to do this kind of work?

CO: The Shewanella challenge is that there are so many strains; there are way more than 10 species of Shewanella. We used it because we have access to people here who understand the biological details. What we were trying to do is show that we can drive analysis at the scale of multiple genomes. And that is the thing that is interesting way beyond the questions that focus on Shewanella.

DG: We showed that these tools can leverage this kind of data, but it also shows what they can do for different types of problem spaces, even beyond biology.

Many tools are developed at large genome sequencing centers; are you sequencing on a large scale?

CO: We are not a genome sequencing facility, but we do have a large proteomics facility here so there is a lot of interest in understanding all aspects of proteins. I have a project with some of the large genome centers to help with large-scale protein analysis. PNNL is starting to gain more exposure on the protein side.

What gave your platform the competitive edge: the time, or the accuracy in rounding up proteins?

CO: Most of the other challenges at the Supercomputing Conference are very well-defined. In the bandwidth challenge, for example, the problem is to drive as many bits as you can through a pipe for some useful purpose. Everyone knows upfront what the metric is going to be: the number of bits that get moved per unit of time.

In the analytics challenge, the problem statement is much more flexible, so what they are looking for are applications that combine visual analysis tools and high performance computing algorithms or implementations. The set of finalists is solving very different problems from one another with different tools all across the board.

What helped us is that the tools we were focusing on scaled to hundreds or thousands of processors, showing the tools were highly scalable. Our process was iterative; you can use visual tools to see results from a large computational task.

We took the output of high performance computing we were looking at with the visual tool and used that to drive a second round of high performance computing. You are letting someone get a cursory look at the data so they can come up with a potential hypothesis.

They select a subset of data to hand it off to the supercomputer again, but this second time it only looks at the fraction of data that they care about and does a more detailed analysis. The visual tool is driving the supercomputing.
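The coarse-then-detailed loop Oehmen describes can be sketched in a few lines. Everything here (function names, the placeholder scoring) is hypothetical and stands in for the PNNL pipeline, not its actual code:

```python
def coarse_pass(proteins):
    """First HPC round: a cheap score for every protein in the full set."""
    return {p: len(p) % 10 for p in proteins}  # placeholder metric only


def analyst_selects(scores, threshold):
    """Stands in for the visual tool: the analyst picks a subset of interest."""
    return [p for p, s in scores.items() if s >= threshold]


def detailed_pass(subset):
    """Second HPC round: expensive analysis on only the selected fraction."""
    return {p: f"detailed-result({p})" for p in subset}


# Cursory look at everything, then detailed work on the chosen fraction.
proteins = ["protA", "protB1", "protC22"]
scores = coarse_pass(proteins)
subset = analyst_selects(scores, threshold=6)
results = detailed_pass(subset)
```

The point of the structure is that the expensive second pass touches only the data the analyst cared about, so the visual tool is effectively driving the supercomputing.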

Does this set of tools that you use already exist, ready to go as a workflow, sort of like Taverna?

DG: Taverna is more related to the workflow component, which is what MeDICi provides. We have done some tests employing Taverna instead of the BPEL workflow engine that we are using in MeDICi, so we could use either workflow engine very easily; we can swap them both out. It gives us flexibility as we deploy these technologies to different client spaces.

The advantage of the [XML-based business process engine language] BPEL engine is that, while it is an open source product, it has solid commercial backing. When we build software, and because MeDICi is being developed for lots of different kinds of uses, we want to give the people who receive this software the option of having a commercial backer.
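The engine-swapping Gracio describes is essentially programming against an interface: the pipeline code targets one abstraction, and a BPEL-backed or Taverna-backed engine can be plugged in behind it. A minimal sketch, with hypothetical class and method names rather than MeDICi's actual API:

```python
from abc import ABC, abstractmethod


class WorkflowEngine(ABC):
    """Abstract engine interface the pipeline is written against."""

    @abstractmethod
    def run(self, steps):
        """Execute a list of named steps and return their results."""


class BPELEngine(WorkflowEngine):
    def run(self, steps):
        # Stand-in for dispatching each step to a BPEL engine.
        return [f"bpel:{name}" for name in steps]


class TavernaEngine(WorkflowEngine):
    def run(self, steps):
        # Stand-in for dispatching each step to Taverna.
        return [f"taverna:{name}" for name in steps]


def execute_pipeline(engine: WorkflowEngine, steps):
    # The pipeline code never changes when the engine is swapped.
    return engine.run(steps)
```

Swapping engines then means changing a single constructor call, which is the flexibility described for deploying to different client spaces.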

MeDICi was important in winning the competition, so although middleware is supposedly that boring part of software no one talks about, please explain how it gave you an edge?

DG: It’s like the plumbing in your house. MeDICi was funded by internal research dollars, so we have to go through export release and copyright release with the Department of Energy. That paperwork is filed so that we can release the source code. Right now people can download and use MeDICi, but they won’t have access to the source code.

Is there anything specific about MeDICi that works well for bioinformatics?

DG: We’re using MeDICi for cybersecurity applications, for subsurface research [which includes studying contaminants]. We have an application in homeland security and some applications we can’t discuss because they are in the intelligence community. There is one application in climate research.

There’s no secret sauce in MeDICi; the real piece is in the applications you are running that you are using MeDICi to schedule and tie together.

The secret sauce is in the applications that Chris and his team have developed, like ScalaBLAST and SHOT.

CO: ScalaBLAST and SHOT are ways of analyzing proteins from large sets … bioinformatics in the traditional sense. We started the way I imagine a lot of people do. You have sequence data, you have some idea of the things you want to look for but you don’t have a hard and fast hypothesis in hand yet. It’s easy to get overwhelmed. You look at 40,000 proteins and say ‘Good Lord, now what do I do?’

One of our team members, Lee Ann McCue, is a microbiologist. We all got into a room and said, ‘OK Lee Ann, here’s an enormous data set. Is there a way to figure out what a good hypothesis is from this data set?’ rather than starting off looking for something and then big surprise you found what you were looking for. Can we find anything interesting by looking at the links between proteins across all these different species?

She is thinking in terms of what a microbiologist is looking for: processes that the different cells could be involved in, what the difference is between what certain species can or can’t do. Those are high-level questions. You don’t know which proteins you are looking for; you are looking for a pattern. We started by exploring what these patterns were.

Aren’t there software tools out there for that pattern hunt?

CO: There may be. Our intent was not to build something that could be shrink-wrapped and then compared to all other existing methods. We were more interested in: can we use the high performance computing that we have, and algorithms that will efficiently take advantage of it, and drive that with the visual tool? The goal is that the new rate-determining step is the analyst’s time staring at the patterns and thinking about stuff, not waiting for a computational task to be done or data files to get moved.

DG: Or that there is so much data that you can’t explore it visually.

Are your tools just meant for labs with high compute power levels?

CO: The tools are meant to be scalable on a variety of platforms. They were written to use MPI, which sort of implies that you have more than one processor. But that doesn’t say anything about the scale; it doesn’t mean you need 10,000 processors in order to take advantage of it.

If you have a cluster that has got 30 nodes in it, you can expect to get a 30-times-faster answer with these tools. One of the challenges we have seen is that it is sometimes difficult to get people who are comfortable with certain kinds of tools to even use high performance computing. They have to take six months out of their research plan to understand how to use the scheduler, how to port the code, and sometimes that is too much to ask.
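The near-linear speedup Oehmen describes ("30 nodes, a 30-times-faster answer") comes from partitioning independent queries across workers. ScalaBLAST itself does this with MPI across cluster nodes; the following stdlib-only sketch, with hypothetical function names and a thread pool standing in for MPI ranks, just illustrates the data-parallel pattern:

```python
from concurrent.futures import ThreadPoolExecutor


def partition(queries, n_workers):
    """Split the query list into n_workers roughly equal interleaved slices."""
    return [queries[i::n_workers] for i in range(n_workers)]


def search_slice(query_slice):
    # Placeholder for running a BLAST-style search over one slice.
    return [f"hit({q})" for q in query_slice]


def parallel_search(queries, n_workers=4):
    slices = partition(queries, n_workers)
    # Each worker handles its slice independently; no cross-talk needed,
    # which is why the work scales nearly linearly with worker count.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = pool.map(search_slice, slices)
    return [hit for chunk in chunks for hit in chunk]
```

Because the slices share no state, adding workers divides the wall-clock time until per-worker overhead dominates.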

That’s what’s nice about MeDICi. We launched these jobs on a hundred-something cores from my laptop from the show floor at Supercomputing during the presentation that we gave for the challenge. We were driving computing that was done here at the lab. We plugged these things in such a way that I don’t have to learn how to use the scheduler, move my code around, or even know where the files are.

As long as we connect the pipeline correctly with MeDICi, it handles those problems for you. That is one of the problems people with clusters face: now that I have the fire-power, how do I get people to use it?

DG: All of these applications could even run on your desktop. It is really based on the problem you have and the kinds of applications you want to put in there. Let’s say you have three systems; you can launch different parts of the problem to different systems using MeDICi because it helps you partition your application.

CO: In our case, we determined what number of processors we needed to do the calculation by how long a person was willing to wait for that answer. Once you got beyond a few hundred processors, we were able to turn around reasonable data sets in just a few minutes. For other problems it might be many [more] processors or different number of minutes.

In the challenge, we had another server driving the visual tool. So part of the MeDICi pipeline understood how to get data from the cluster where the computing was being done to the server where the visual tool needed to see things and how to mediate that handshake back and forth.

DG: MeDICi is extremely configurable for programmers and scientists. One of the things currently being added is a visual programming environment so you can plug applications from a toolbox together visually.

At the heart of your system is SHOT, a new algorithm. How did you evaluate it?

CO: What makes SHOT special is that it is much more sensitive. Doing sequence similarity alone to identify homologous proteins is fine when there is large similarity in those sequences. But some sequences might come from a common ancestor and still result in proteins that do the same basic thing, while the sequences have diverged such that you can’t recognize them as similar from regular sequence analysis.

This algorithm increases the sensitivity of finding those homologous proteins and lets you start to find more distant homologs than you can find using other methods, because it doesn’t only rely on sequence similarity. It uses some statistical classification techniques applied in novel ways.
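The article does not describe SHOT's internals beyond "statistical classification techniques" (Oehmen's work on support vector machines is mentioned earlier). As an illustration only of combining sequence-derived features with a learned binary classifier, here is a tiny perceptron over amino-acid composition; the features, classifier, and names are all hypothetical stand-ins, not SHOT's method:

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Amino-acid composition vector: frequency of each residue type."""
    return [seq.count(a) / len(seq) for a in ALPHABET]


def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Fit a linear decision boundary; labels are -1 or +1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, seq):
    """Classify a sequence as +1 or -1 from its composition features."""
    x = composition(seq)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

The idea this sketch gestures at is the one in the passage above: a classifier trained on features beyond raw alignment score can separate classes of proteins that alignment-based similarity alone would miss.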

SHOT is very new and in the bioinformatics community it takes a very long time for widespread acceptance. I imagine it will take some time before a lot of people use it. This challenge helps because we can show the value that it brings to help find what might not have been found otherwise.

DG: One of the other important things is that SHOT has been run on high performance computing systems. Statistical techniques have traditionally not been adapted to high performance computing systems, so it was not possible to do this many comparisons in the past.

So what’s next for these tools?

DG: We are looking at how to go through a technology maturation process for these technologies. Once the research funding is over, we have a very difficult time continuing the maintenance and support as we put them out. That is one of the reasons we like to put them out in the open community, so that the community can continue to add to the value of the product.

The other path is to commercialize them. That is probably not the path we would take with most of these technologies, especially not with MeDICi, because we see that as one of the key elements of continuing to build these types of products and programs with our collaborators.

Starlight [which was commercialized last year] was developed as an information analysis and visualization tool for the intelligence community. It is a very broad-based technology that can look at different types of data.

[For this project] the team just had to make sure they got the data into the XML-input format that Starlight could read. Starlight has not traditionally been used for biology. Most people in the biology community will have never heard of Starlight.
