Making the Most of Multicore

It's no secret that multicore processing power, despite all the marketing and tradeshow hoopla, arrived long before the software was ready for it. Vendors and academic developers alike are making significant efforts to devise solutions for easily parallelizing code so that it can take full advantage of many-core processors.

Parallel processing itself is nothing new, at least to those steeped in the finer points of high-performance computing, but its availability to the masses in the form of multicore desktop computers is a recent development. What's most important to life science researchers is the promise of potent processing power that this new generation of quad and octo-core desktop computers holds. Given the flood of data facing researchers thanks to high-throughput screening and other technologies, being able to crunch data in a parallel fashion on a desktop is certainly an attractive alternative to waiting in a queue for time on a cluster or grid. So it makes sense that efforts to parallelize software and programming languages for life sciences have concentrated on staples of the toolbox.

To this end, software engineer Gonzalo Vera and his colleagues at the Groningen Bioinformatics Centre have recently released R/parallel, a parallel add-on package for R that promises to take full advantage of multicore processing capabilities for the widely used statistical analysis toolkit. Vera and his team say that implementation is as simple as adding a few lines of code from the add-on module to an existing R script, and that small change can greatly accelerate daily analysis tasks.

Previous options for parallelizing R using message passing interface and parallel virtual machine frameworks, while effective, have proven challenging even for biologists with above-average computer savvy. While these solutions offer a level of abstraction so users don't have to think "in parallel" when coding their R scripts, they still depend on clusters that are enabled with external parallelization frameworks — and having a knowledgeable IT person around is probably required as well. Vera and his collaborators say that for researchers using R without those two factors in the mix, a desktop solution is a viable alternative.

The R/parallel team demonstrated that after inserting the few lines of code contained in the R/parallel add-on module, a gene expression job processing 37,685 traits from 73 individuals running on a quad-core processor takes about one hour to complete, compared to four hours running serially.
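To get a sense of what such a per-trait job looks like when spread across cores, here is a minimal sketch written with base R's parallel package and mclapply rather than R/parallel's own interface. The data, the analyze_trait function, and the simple association test are hypothetical stand-ins for illustration, not the team's actual analysis:

    # Illustrative sketch only: a per-trait loop spread across cores with base R's
    # 'parallel' package (mclapply); R/parallel's own interface differs. The data
    # and the association test below are hypothetical stand-ins.
    library(parallel)

    set.seed(1)
    n_traits      <- 37685                            # traits, as in the example above
    n_individuals <- 73                               # individuals
    expr_data     <- matrix(rnorm(n_traits * n_individuals),
                            nrow = n_traits)          # hypothetical expression matrix
    genotype      <- rbinom(n_individuals, 1, 0.5)    # hypothetical marker genotypes

    # Each trait is analyzed independently, so iterations can run on separate cores.
    analyze_trait <- function(i) {
      fit <- lm(expr_data[i, ] ~ genotype)            # simple per-trait association test
      summary(fit)$coefficients[2, 4]                 # p-value for the genotype term
    }

    # Serial:   results <- lapply(seq_len(n_traits), analyze_trait)
    # Parallel: the same loop spread over the machine's cores (Unix-alikes only)
    results <- mclapply(seq_len(n_traits), analyze_trait, mc.cores = detectCores())

Because each trait is tested on its own, the loop has no cross-iteration dependencies, which is exactly the structure that lets a quad-core desktop cut the runtime roughly in proportion to the number of cores.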

Ritsert Jansen, a professor of bioinformatics at the University of Groningen in the Netherlands, regularly works with high-throughput technologies that provide data on every gene, protein, and metabolite, which his group might then analyze across 40,000 traits in 200 individuals. Jansen knows that an ordinary desktop PC has been useless for such a job, but he is not sold on the usual high-performance computing options either. So with the boost in desktop power thanks to multicore processors, Jansen says that R/parallel is a feasible and welcome alternative to wading into the world of grid, supercomputer, or cluster management. "Normally in R this would take far too long. What we tend to do at this moment is to use our large computer cluster, something like 1,500 machines," Jansen says. "But what we see and what we also hear is that these grid technologies or cluster technologies are fine, but you are quite quickly in a queue — there are so many other people who also want to make use of the resources."

Beyond the speedups gained from parallel computing, there is an efficiency argument for getting full use out of multicore processors: a single multicore desktop can match a more traditional parallel environment on some R jobs. The R/parallel developers cite an example where one quad-core machine took roughly the same amount of time as 16 machines in a distributed environment with an equal amount of memory. "Because we are able to use the multicore processor, we've removed the need to send all the data through the network. So with [a] four-core machine, we get the same performance, same processing time, using the same amount of data," Vera says. "The next problem is what we do with a terabyte of information and how we move it around the processors, but that's another story."

A major issue in developing or adopting a parallelized version of any software tool is debugging. To address it, R/parallel lets users run their code both sequentially and in parallel without any further tweaking, so they can compare the results of the parallelized calculation against the same calculation run serially.

Parallelizing R poses slightly different challenges than porting other codes to multicore processors, because academic researchers expect to share what they develop with the program and have others reproduce it. The same serial fallback helps with code sharing between researchers: if the R/parallel module is not installed on another user's computer, the code still functions properly, only in a serial manner. "Users of R like to retain full control of what they are developing, and they don't want to have something hidden. Also, they like to add additional functionality," says Vera. "One of the main objectives of the users of R is to be able to reuse and share the new developments, so that's what we are trying to do with R/parallel — to remove the barrier and make it easy to share … the versions of any algorithm or methods implemented with R."
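One way to express that kind of graceful fallback in plain R is sketched below, again using the base parallel package rather than R/parallel's actual mechanism; the run_traits wrapper and the core count are made up for illustration:

    # Illustrative pattern, not R/parallel's actual mechanism: run a loop in
    # parallel when the 'parallel' package is usable, and fall back to a plain
    # serial loop otherwise, so a shared script runs on any collaborator's machine.
    run_traits <- function(indices, fun, cores = 2L) {
      if (requireNamespace("parallel", quietly = TRUE) &&
          .Platform$OS.type == "unix") {
        parallel::mclapply(indices, fun, mc.cores = cores)
      } else {
        lapply(indices, fun)                 # serial fallback: same results, just slower
      }
    }

    # Debugging aid in the same spirit: check that parallel and serial runs agree
    # before trusting any speedup (analyze_trait as sketched earlier).
    # all.equal(run_traits(1:100, analyze_trait, cores = 4),
    #           lapply(1:100, analyze_trait))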

Other offerings

Commercial vendors trying to peddle their own versions of a parallel R platform are very tuned in to the fact that average users really don't know what kind of power they could be getting out of their desktop if their software were parallelized. "It's true that many biologists are not yet aware of the potential of parallelizing tools such as R. That can partly be attributed to the perception that high-performance computing is a complex activity which requires a computer science background," says Colin Magee, vice president of sales and marketing at Revolution Computing, a parallel software maker. Revolution offers a parallel version of R based on open-source R, and the company says its customer base of biologists has responded well to it. Magee says he sees steadily growing interest from life science users about ways to further exploit their own multicore PCs.

Interactive Supercomputing is another cutting-edge parallel software vendor that specializes in parallelized R, albeit with a focus more on moving jobs from running serially on the desktop over to a cluster. But the company maintains that the challenge involved in spreading the gospel of parallelization is the same no matter what the hardware. "It is true that the life sciences community is less aware of different types of computational tools and the benefits [that parallel computing] can provide," says David Rich, vice president of marketing at Interactive Supercomputing. "This is especially true for the smaller life sciences organization with few or no computational experts."

Despite vendors' efforts to keep complexity behind the scenes, biologists may still have to get their hands dirty to some extent if they want to see real results. "The current generation of [parallelization] tools requires the user to at least understand their code and be able to indicate where parallelism can be applied, [so] if they can do that, then benefit can be achieved," says Rich. "Furthermore, if some members of the organization can prepare portions of a code for parallelism, then other users will benefit even if they are not aware of how that benefit is achieved."

Thinking in parallel

Because so many bioinformatics jobs have loops that can be run in parallel, the real challenge for developers is to build a flexible tool that allows biologists to ignore what's under the hood. "There are various commercial implementations that allow for parallel computing for a very specific task, but if you want something else to be parallelized, you have to dig deep into the code," Jansen says. "There may be problems that need these hard-core parallel computing tricks, but many of the tasks that we work with in today's high-throughput biology can so easily be parallelized."

Vera and his colleagues believe that it should be trivial for average biologists to find out which parts of their code can be parallelized. But according to Thomas Mailund, a research associate professor at the Bioinformatics Research Centre at the University of Aarhus, parallel programming is still anything but trivial — even for seasoned programmers. "The problem is that our brains just find it hard to think about parallelism," Mailund says. "We have to explicitly program how the communication and synchronization rules should be, and our brains are pretty bad at reasoning this out. … There are guidelines you can follow to avoid these problems, but even very experienced people find it hard and make mistakes."

The trick for the user is to figure out which parts of the code need to be parallelized. But Mailund believes that, much as it is notoriously hard to predict where the bottlenecks in a sequential program will be and how to get around them, it is equally if not more difficult to predict where parallelization will give code a boost.
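A schematic illustration of the distinction, with hypothetical data: the first loop below has fully independent iterations and parallelizes trivially, while the second carries a dependency from one iteration to the next and cannot simply be split across cores:

    # Schematic illustration with hypothetical data.
    scores <- rnorm(1000)

    # Embarrassingly parallel: each iteration depends only on its own element,
    # so the iterations could be handed to separate cores in any order.
    m <- mean(scores); s <- sd(scores)
    normalized <- lapply(seq_along(scores), function(i) (scores[i] - m) / s)

    # Loop-carried dependency: iteration i needs the result of iteration i - 1,
    # so these iterations cannot simply be run at the same time.
    running_total <- numeric(length(scores))
    running_total[1] <- scores[1]
    for (i in 2:length(scores)) {
      running_total[i] <- running_total[i - 1] + scores[i]
    }

Spotting which of these patterns a given analysis follows is exactly the judgment call that, according to Mailund, trips up even experienced programmers.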

According to Mailund, what makes R/parallel worth the attention of non-computer scientists is that the questions and quandaries usually posed by parallel coding are hidden from the user, which is arguably the selling point of any parallelization solution. "If you are not experienced in parallel programming, and your interests are in, say, biology and not computer science, you will probably find dealing with [parallel software] issues a pain in the neck," he says. "You just want the computer to crunch your numbers with the equations you give, and really you shouldn't have to worry about how it does it."

Vera and his colleagues say that despite the efforts of commercial and academic developers, solutions to parallelize your code automatically are still a ways off. These tools require a fair amount of working knowledge about the code one is using and about where parallelization can accelerate it. Vera even warns against such tinkering for the average biologist. "Most users should not do anything to automatically parallelize their codes, because the technology has not evolved enough to do this in an optimal way," he says. "There are a lot of cases where this automatic parallelization doesn't work well. That's why we just opt for a minimal thing … just point to what section you want to parallelize."

But the lack of an automatic solution shouldn't slow you down, according to Vera, who believes that most bench biologists are unaware of how powerful current multicore desktops are. "Starting with quad-cores and what's coming, it's more than enough power for analyzing problems that we have in genomics right now from a processing point of view," he says.
