Researchers at the Max Planck Institute of Biochemistry’s Molecular Structural Biology lab in Martinsried, Germany, have recently released an updated version of a software package called the TOM Toolbox, which analyzes the three-dimensional structures of macromolecular protein complexes imaged with a new type of microscope technology.
For the new release, available here, the MPI researchers used a suite of tools from the MathWorks, including Matlab, the Parallel Computing Toolbox, and Matlab Distributed Computing Server, to parallelize the software and speed up the image processing by up to 20-fold.
The first version of the TOM Toolbox was completed in 2004 and published in the Journal of Structural Biology.
The software, which serves as a high-throughput pipeline for image acquisition, processing, and results visualization, needed an update because the MPI lab began evaluating a new prototype cryo-electron microscope, called Titan Krios, which has more components and achieves higher resolution than previous electron microscopy tools.
The Titan Krios is manufactured by Dutch optics firm FEI Company, and the MPI structural biology lab worked with the firm to develop the instrument, Andreas Korinek, a PhD student at MPI, told BioInform.
One focus of the MPI group, led by principal investigator Wolfgang Baumeister, is determining the three-dimensional structures of proteins that are difficult to crystallize. The group’s goal is to study these difficult proteins as close as possible to their in vivo conditions. To do so, Baumeister and his colleagues employ a labor-intensive method, a kind of electron microscopy called cryo-electron tomography, that was developed at MPI.
Moving to the Cluster
“The proteins that we study in the microscope are interesting to us because they have resisted attempts to be crystallized,” Korinek said. The technique captures two-dimensional images with a fixed beam of electrons as the sample is physically tilted a little at a time; the resulting series of 2D images is then assembled into a 3D representation.
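The reconstruction principle behind the technique can be illustrated with a toy example. The sketch below is not TOM Toolbox code; it is a hypothetical Python/NumPy illustration in which a 2D "specimen" and 1D projections stand in for the real 3D volume and 2D tilt images. Each tilted projection is smeared back through the volume, and the specimen's features reinforce where the backprojected rays intersect.

```python
import numpy as np

def rotate_nn(img, angle_deg):
    """Nearest-neighbor rotation about the image center (pure NumPy)."""
    n = img.shape[0]
    c = (n - 1) / 2.0
    th = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:n, 0:n]
    # Inverse-rotate the output coordinates into the input frame.
    xr = np.cos(th) * (xs - c) + np.sin(th) * (ys - c) + c
    yr = -np.sin(th) * (xs - c) + np.cos(th) * (ys - c) + c
    xi = np.clip(np.rint(xr).astype(int), 0, n - 1)
    yi = np.clip(np.rint(yr).astype(int), 0, n - 1)
    return img[yi, xi]

# A toy 2-D "specimen": one small bright blob off-center.
n = 65
specimen = np.zeros((n, n))
specimen[19:22, 39:42] = 1.0

# Simulate the tilt series: rotate the specimen (standing in for the
# physical tilt) and sum along the beam axis to get a 1-D projection.
angles = np.arange(-60, 61, 3)   # a tilt range of about +/-60 degrees
tilts = [rotate_nn(specimen, a).sum(axis=0) for a in angles]

# Unfiltered backprojection: smear each projection back across the
# reconstruction at its tilt angle and accumulate.
recon = np.zeros((n, n))
for a, p in zip(angles, tilts):
    recon += rotate_nn(np.tile(p, (n, 1)), -a)
```

The backprojected rays all cross at the blob's position, so the reconstruction peaks there; real packages sharpen the result with weighting (filtering) before backprojection, which this sketch omits.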
When researchers began using electron microscopy to study protein complexes in the 1970s, they had many software packages to choose from, Korinek said. “They were more or less easy to use, [but] they all had different file formats, with differing ways of calculating coordinates and angles,” he said.
In 2001, the group decided to create its own analysis package that would make it easier for biologists to work with different file formats.
They wanted to assure the software could import and export as many file formats as possible, Korinek said. “The goal was to allow us to make sure we could do everything ourselves: data acquisition, image processing, and visualizing results,” he said. “It’s a complete pipeline.”
The MPI researchers chose Matlab as their development platform “because it has all kinds of toolboxes, for image processing, for example, giving you a bunch of methods to process images so you don’t need to program those modules from scratch,” he said.
A few years ago MathWorks began offering Distributed Computing Toolbox, which the MPI scientists adopted for their research. “Image processing takes a long time if you only run it on a desktop computer; that could add years to research,” he said. So they began using the toolbox to distribute tasks to a cluster.
The MPI lab has a 64-node Linux cluster and can also access an off-site data center that has an IBM BlueGene with up to 8,000 CPUs.
“The toolbox was helpful because it made it easier to parallelize the process,” he said. Other tools, which use C, for example, take a long time to learn, he said. “But with this package for distributed computing, it only takes me a half an hour to explain to a student how to solve simple problems and get them calculated more quickly, without them needing intense programming skills,” Korinek said.
For example, if a scientist wants to perform the same analysis step repetitively on all images in a series, that requires a for-loop, which executes the same code over and over. “To create a distributed computing version of that command [with Matlab], all you need to do is write a parfor-loop instead of a for-loop,” he said. “Normally that kind of thing is not as simple.”
Korinek said that this parallelization capability enabled his team to speed up the image processing by 20-fold.
Processing the Proteasome
The MPI team is particularly interested in studying the 26S proteasome, which acts as “the cell’s recycling center,” Korinek said. “If there is a stray protein lying around in the cell, maybe because environmental conditions have changed or because the protein has been somehow damaged and it is no longer needed, the cell tags it with ubiquitin, which slates it for processing in the proteasome.”
Image processing for the proteasome, which involves a data set of some 50,000 particle images, takes three weeks on 10 computers, he said. First, a sample is loaded onto the microscope and the beam scans over it, generating images of single particles. “Then you have to find the particles on these images, cut them out either manually or in an automated fashion.”
That process might deliver a set of 30,000 to 50,000 images that require pattern recognition to find the particles, followed by 3D-model construction, in which the 2D images are iteratively fitted onto the emerging model. “You have to do that around 40 or 50 times before you get a true model,” Korinek said.
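The iterate-until-convergence step can be sketched with a toy one-dimensional analog in Python/NumPy. This is a hypothetical stand-in, not TOM Toolbox code: noisy, randomly shifted copies of an unknown motif (standing in for the particle images) are repeatedly aligned to a reference and averaged, and the reference sharpens with each round.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D "particle images" that are randomly shifted,
# noisy copies of an unknown motif.
motif = np.zeros(64)
motif[28:36] = 1.0
true_shifts = rng.integers(-8, 9, size=200)
images = [np.roll(motif, s) + rng.normal(0, 0.2, 64) for s in true_shifts]

# Start from a crude reference (the raw, unaligned average), then
# iterate: align every image to the reference, average, refine.
reference = np.mean(images, axis=0)
for _ in range(10):              # the real pipeline runs ~40-50 rounds
    aligned = []
    for img in images:
        # Circular cross-correlation via FFT finds the best shift.
        corr = np.fft.ifft(np.fft.fft(reference) *
                           np.conj(np.fft.fft(img))).real
        aligned.append(np.roll(img, int(np.argmax(corr))))
    reference = np.mean(aligned, axis=0)
```

Averaging the aligned copies suppresses the noise, and the sharper reference in turn improves the next round of alignment, which is why the real workflow repeats the fit dozens of times before the model stabilizes.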
The scientists are using Matlab for many tasks in this workflow: the microscope is programmed to position the sample the way the scientists need, to focus, to filter noise from the signal, and to take the data off the instrument, process it, and visualize it. “There are tools for single particles in various academic labs but nothing that is adaptable to our needs, so we wrote our own,” Korinek said.
More Advanced than ‘Typical’
Kristen Zanella, manager for biotech and pharmaceutical industry marketing at the MathWorks, told BioInform that the company found the application of its tools to structural biology “very interesting.” Korinek “really had a nice start-to-finish workflow that we like to see,” she said, including data acquisition, processing, parallelizing the process, and building a graphical user interface.
Structural biology is not a newcomer to Matlab, “but we are seeing the match being made a bit more,” she said. “I think the data has gotten so computationally intensive that parallel computing has …become a nice solution for that.”
The solution at Max Planck is a more advanced application “than might be typical” in the life science research market, she said. “A lot of customers aren’t to the point where they have built an entire toolbox, with an interface, and parallelized it,” she said.
As Zanella explained, MathWorks tries to meet the needs of the entire “gradient of user type that we face in the life sciences.” Scientists who don’t need intense data analysis may not require a tool like Matlab, she said, but “tool-builders” like Korinek do.
Zanella said that the company has begun offering “hands-on workshops” to reach out to less advanced users who need some data analysis and visualization. “They’ve warmed up quite nicely to that,” she said.
Although those seminars have been offered over the last few years, of late there has been “a ramp-up,” she said.
“There has been a lot of activity in that space,” Zanella said. “A lot of it is because of the data,” but it also stems from a need to automate data analysis and visualization to obtain efficiencies in research.