The use of graphics processing units to accelerate code has firmly taken hold in the bioinformatics community, but GPU programming standards like Nvidia's CUDA or the OpenCL framework require a degree of expertise that most bioinformatics researchers don't have.
But a new standard has emerged that promises to make it easier to port bioinformatics code to GPUs. OpenACC, released in late 2011, departs from "low-level" application programming interfaces by relying on a compiler to do the heavy lifting. The new standard, developed by Nvidia, Cray, the Portland Group, and CAPS enterprise, takes a so-called "directive-based" approach to accelerator computing that allows the programmer to provide high-level "hints" that tell the compiler to accelerate specific bits of code.
With the directive-based approach, "essentially you are directing the compiler," says Sumit Gupta, senior director of Nvidia's Tesla group. "All you do is write a few of these hints in your application. You don't modify the application, but instead you tell the compiler, 'Hey, look at this next loop and try to parallelize it.' Then the compiler actually does the work of accelerating that loop with the GPU."
Gupta says that OpenACC is gaining ground in the scientific computing community because it is much quicker and easier to use than low-level APIs like CUDA and OpenCL.
The standard grew out of a realization at Nvidia and elsewhere that current accelerated programming methods aren't for everyone. "You have a class of developers who want the maximum performance or who want to have the maximum control over the processor," Gupta says, "but there was also this category of programmers who mostly are domain scientists and they've written an application and all they want is to see how fast it can run."
The bioinformatics community is a perfect example of the latter case, he says. "If you're a genome scientist and you're developing an application, that's a tool. That's not what you do. What you do is do scientific research. So you just want the application to run faster."
Several life science informatics teams are finding that this is indeed the case.
Bharat Medasani from the University of Texas at San Antonio has used OpenACC to accelerate code that simulates proteins in solvents, and he reports a three- to four-fold speedup over the CPU version "in roughly a day's worth of work."
The aim of the work, he says, is to use the code for Monte Carlo simulations of the protein energy landscape. "Monte Carlo simulations allow us to explore thousands of variations of potential molecular structures of proteins and ultimately tell us which structure is likely the right one," he says. "The problem is that these simulations with CPUs typically take months."
He estimates that a four-fold speedup with OpenACC would allow him to reduce a three-month simulation to less than a month.
According to Medasani, the main advantage of OpenACC compared to CUDA is the ease of programming. Using OpenACC, "I could focus on the problem and code it in C/Fortran and then put directives on top of the code to get decent speedup," he says. "With CUDA, you have to continually focus on the GPU hardware and code algorithms in addition to the main problem."
The new standard is also easy to learn, he says. "With OpenACC, I need to read a few documents and I am good to go," Medasani says. "With CUDA, I need to understand the GPU hardware thoroughly and learn the CUDA framework, which takes significant lead time."
In addition, he notes that most academic developers aren't full-time programmers. Rather, they typically need to accelerate some code only every few months, a timeframe in which it is easy to forget the intricacies of CUDA. "With OpenACC, there is less to remember and to relearn," he says.
Other groups are seeing far more dramatic acceleration with OpenACC. Researchers at China's Shanghai Jiao Tong University recently used OpenACC to achieve a 16-fold speedup of DNADist, part of the Phylip phylogeny package, for researchers at Roche.
James Lin, the vice director of the HPC Center at Shanghai Jiao Tong University, recalls that a friend of his, Steve Pan, who is a project director at Roche Pharma Global Informatics, was looking for a way to accelerate the program, which the group uses to compare DNA sequences for drug discovery. It was taking the Roche researchers up to a week for some DNADist jobs, "and their cluster was already overburdened," he says.
The Shanghai Jiao Tong team developed several parallel versions of DNADist, including an OpenMP implementation on a four-core CPU as well as CUDA and OpenACC implementations on Nvidia GPUs. The OpenACC version came close to the 17-fold speedup the group saw with CUDA but took much less time to develop, he says.
The OpenACC version is easier for end users as well, Lin says. The Roche researchers "can read and even modify the OpenACC version of the source code, which is almost similar to the original code," while the CUDA version "seems totally different" to them, he says.
Lin predicts that OpenACC will eventually "be more popular than CUDA and OpenCL" and is the "best option for bioinformatics researchers for programming on Nvidia GPUs and other many-core devices."
He adds that the 16-fold acceleration of DNADist with OpenACC is actually at the low end of the scale for potential GPU acceleration of bioinformatics programs, which could reach as much as 100-fold if the original code were running as a single thread on a CPU.
Nvidia's Gupta stresses that this level of acceleration depends on the state of the original code. If it was never optimized for a CPU, "any parallelizing compiler would be able to accelerate that application quite well." In such cases, "we've seen people get 50 times speedup using OpenACC."
On the other hand, an application that has been fully optimized for a CPU might see only a two-fold or three-fold acceleration.
While OpenACC makes it easier for developers to accelerate their code, the end result isn't as dramatic as it could be with low-level APIs like CUDA and OpenCL.
Researchers at the Center for Computing and Communication at Germany's RWTH Aachen University have found that code accelerated with OpenACC generally runs at a fraction of the speed of the same code implemented with low-level APIs. In a study presented at the Euro-Par 2012 conference in August, they reported that an application for simulating gear cutting in the auto industry reached about 80 percent of the performance of its OpenCL counterpart when implemented with OpenACC. An OpenACC-accelerated program for neuromagnetic brain imaging, however, delivered only about 40 percent of the performance of the OpenCL version.
Nvidia's Gupta says that OpenACC can get as fast as CUDA "in some cases," particularly when implemented by an expert developer who specializes in programming to the processor.
"When a compiler is trying to do it, it may or may not get as good as a well-trained developer," he acknowledges.
Christian Terboven, a member of the Aachen team that compared OpenACC with OpenCL, says that the directive-based approach limits the ability to exploit "all the possible tuning opportunities" with GPUs. "That's always a drawback if you're using an abstraction like directives," he says.
In addition, the community-driven standard doesn't support some new hardware features that are available in vendor-specific programming models. Terboven's colleague, Sandra Wienke, notes that "CUDA will always support the latest features, but OpenACC will take some time to include it into the standard."
OpenACC is actually a stopgap intended to help developers accelerate their code until the broader OpenMP API specification for parallel programming incorporates accelerated programming. The vendors behind OpenACC are also on an OpenMP subcommittee tasked with developing a version of that standard for accelerators, but it hasn't been determined yet whether that capability will find its way into OpenMP 4.0, which should be ready for public comment before the end of the year.
In the meantime, OpenACC works well as a standalone API, Terboven says, and is a good option "in case an [accelerator] standard doesn't find its way into OpenMP 4.0."
Terboven adds that anyone looking to adopt GPU computing — regardless of the API they use — should "first understand their goals" since many developers "go crazy porting something to a GPU that won't solve their problem."
Gupta's advice for any developer new to GPUs, meanwhile, is to "start with OpenACC" and then determine whether additional acceleration is necessary.