By John S. MacNeil
Representing DNA as a series of bases (adenine, cytosine, guanine, and thymine) seems logical enough to most people. As a PhD student at the Technical University in Braunschweig analyzing imaging data from microarray experiments, Gerhard Kauer was certainly no exception. But Kauer, then 36, got to thinking one day (while in the bathtub, he says) that there might be a better approach to studying genomes: Why not try converting DNA sequence, long strings such as ACTGAGA, into a format more compatible with signal processing? After all, manipulating signals is a tried-and-true discipline, with applications in pattern recognition ranging from cell-phone signal processing to digital image analysis.
But Kauer couldn’t just randomly assign numbers to the four constituent bases of DNA, such as A=1, C=2, and so forth. As he writes in a paper published late last year in the journal Bioinformatics, he needed numbers that correspond to a physical property of the bases for the signal to hold any significance. So he looked up the energies required to melt adjoining bases, known as the enthalpic melting energy, and converted strings of letters into a signal based on that data.
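The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not Kauer's published procedure: the article does not say which enthalpy table he used, so the values below are assumed, approximate nearest-neighbor stacking enthalpies (in kcal/mol) of the kind found in the DNA thermodynamics literature, and the names NN_ENTHALPY and enthalpy_signal are hypothetical.

```python
# Assumed, approximate nearest-neighbor stacking enthalpies in kcal/mol.
# Illustrative only: the table Kauer actually used is not given in the article.
# Each dinucleotide shares a value with its reverse complement (e.g. CA and TG).
NN_ENTHALPY = {
    "AA": -7.9, "TT": -7.9,
    "AT": -7.2, "TA": -7.2,
    "CA": -8.5, "TG": -8.5,
    "GT": -8.4, "AC": -8.4,
    "CT": -7.8, "AG": -7.8,
    "GA": -8.2, "TC": -8.2,
    "CG": -10.6, "GC": -9.8,
    "GG": -8.0, "CC": -8.0,
}


def enthalpy_signal(sequence):
    """Return one enthalpy value for each pair of adjoining bases."""
    seq = sequence.upper()
    return [NN_ENTHALPY[seq[i:i + 2]] for i in range(len(seq) - 1)]


# The seven-base example from the text yields six values, one per adjoining pair.
signal = enthalpy_signal("ACTGAGA")
print(signal)
```

Because each value depends on two neighboring bases, a sequence of N bases gives a signal of N - 1 samples. A simple A=1, C=2 coding would carry no such physical meaning.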
In his initial experiments using microarray data, Kauer was surprised at how smoothly sequence data could be represented by this type of signal. He even used his analysis to find heat shock motifs that had previously gone undetected. “It was so easy to look at the DNA,” he says. “There were very long sections of harmonic sequence ... Nature was giving us the signal, ‘You’re right, person, please continue!’”
Kauer sees many applications for this kind of DNA analysis. In addition to improving homology searches, Kauer says his approach should also be useful for more easily assembling sequence fragments generated in whole-genome shotgun sequencing, and for detecting the parts of the genome that bind proteins, or even analyzing protein sequences themselves.
In addition, using a physical property to represent DNA sequence allows researchers to compare the genomes of various organisms not just on the basis of sequence homology, but also on the basis of analogy. This means that even if two organisms’ DNA sequences differ, comparing their genomes on the basis of a physical property should still enable researchers to detect similarities in function. A shark and a dolphin, for example, may have taken diverging evolutionary paths that led to major differences in their genomes, but when their DNA sequences are converted to a signal based on enthalpy data (or some other physical property), similarities in how their genomes function are more easily recognizable.
At root, the reason analyzing DNA sequence in the form of a signal can be more insightful is that signal theory is much more advanced as a discipline than the algorithms typically used for character-based genome analysis, such as FASTA and BLAST. These character-based algorithms are not only limited in their ability to delve into more complicated physical phenomena, but can also take much longer to perform tasks than comparable algorithms that operate in the signal domain, Kauer says. Representing the physics and chemistry underlying the structure of macromolecules like DNA is much easier with signal theory, he says, “and you have all these nice little algorithms available like speech recognition and image and pattern recognition.”
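To give a sense of what comparing sequences in the signal domain might look like, here is a minimal sketch that slides a short query signal along a longer one and scores each position by Pearson correlation. It is an assumption for illustration, not the method from Kauer's paper; the function names and the toy numbers are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_offset(query, target):
    """Return (offset, score) of the target window that best matches query."""
    width = len(query)
    scores = [pearson(query, target[i:i + width])
              for i in range(len(target) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy signals: the query resembles target[3:6] without being identical to it.
target = [-7.9, -7.2, -8.5, -10.6, -9.8, -8.0, -7.2, -7.9]
query = [-10.5, -9.9, -8.1]
offset, score = best_offset(query, target)
print(offset, round(score, 3))
```

The match is found even though no value in the query equals the corresponding value in the target, which is the kind of similarity by analogy that an exact character comparison would miss.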
Pressing the Hardware Accelerator
Hardware also plays a significant role in the kinds of signal analyses that Kauer’s group can accomplish. While there are many methods for processing signals, Kauer chose to apply Fourier transformation to his DNA data, a scheme that converts the data into a sum of sine and cosine functions, a format that is faster to manipulate mathematically. Transforming data in this manner can be computationally intensive, and Kauer’s group has employed commercially available PCI cards to make fast Fourier transform (FFT) routines even faster.
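For readers who want to see what such a transform does, the following is a minimal pure-Python sketch of the textbook radix-2 FFT. It stands in for, and is far slower than, the accelerated routines on the PCI cards; the toy input values are made up for illustration.

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A made-up signal that alternates between two values (period of two samples).
toy = [-8.0, -10.0] * 4
magnitudes = [abs(c) for c in fft(toy)]
print([round(m, 6) for m in magnitudes])
```

For this input, all of the energy lands in two bins: bin 0, which reflects the average level, and bin 4, which reflects the two-sample alternation. A regular pattern in a sequence shows up the same way, as a peak at the corresponding frequency.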
In his initial work, Kauer relied on Cheetah PCI cards from Catalina Research, now a subsidiary of defense contractor DRS Technologies. The cards were capable of 4.774 gigaflops when performing a 64K complex FFT, but required Kauer to use Sun Sparc workstations, a relatively expensive option, he says.
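The quoted throughput can be turned into a rough time per transform. The sketch below assumes the conventional estimate of 5 N log2 N floating-point operations for a complex FFT of length N; the article does not say how the vendor counted operations, so the result is only an order-of-magnitude figure.

```python
import math


def fft_seconds(n_points, gigaflops):
    """Estimated time for one complex FFT, assuming 5 * N * log2(N) operations."""
    operations = 5 * n_points * math.log2(n_points)
    return operations / (gigaflops * 1e9)


n = 64 * 1024  # a "64K" transform, taken here to mean 65,536 points
seconds = fft_seconds(n, 4.774)
print(f"{seconds * 1e3:.2f} ms per transform")  # about 1.1 ms under this assumption
```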
Currently, Kauer’s group, now at the University of Applied Sciences in Emden, Germany, is working with DoubleBW Systems, a subsidiary of Eonic based in Delft, the Netherlands. Eonic not only offers a faster FFT accelerator at a lower price that even works with regular PCs, he says, but is also designing a system that will allow Kauer to store the entire human genome in a huge RAM, a configuration that will eliminate the bottlenecks associated with accessing genome data stored in a hard drive. “We are looking forward to the time when we will analyze the whole genome in a blink of the eye,” Kauer says.
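A back-of-the-envelope calculation suggests why holding a genome in memory calls for a "huge" RAM. The figures below are assumptions, not numbers from Kauer or Eonic: a haploid human genome of roughly 3.1 billion bases, stored either as packed letters (2 bits per base) or as a signal with one 32-bit value per sample.

```python
GENOME_BASES = 3_100_000_000  # assumed approximate haploid human genome length


def gigabytes(n_items, bits_per_item):
    """Storage needed for n_items at bits_per_item each, in decimal gigabytes."""
    return n_items * bits_per_item / 8 / 1e9


packed_letters = gigabytes(GENOME_BASES, 2)     # four-letter alphabet, 2 bits each
float32_signal = gigabytes(GENOME_BASES, 32)    # one 32-bit sample per position
print(f"{packed_letters:.3f} GB packed, {float32_signal:.1f} GB as a 32-bit signal")
```

Under these assumptions, the signal representation needs about 16 times the memory of the packed letters.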
Kauer completed his move from Braunschweig to Emden in March, and has $1.7 million from German funding agencies to pursue his work. In addition, Kauer says his new university is home to many researchers with significant expertise in signal processing, giving him many opportunities to collaborate with researchers outside of biology. “There are a lot of cool experts joining this idea now, and they are really enthusiastic,” Kauer says. “The group I am able to involve in these new technologies has grown, and they are really experts on these things, while I am just a biologist.”
Ride Your Own Wavelet
Currently Kauer is expanding into wavelet analysis, a more powerful approach to analyzing signals than FFT. Working with FFT requires the user to keep a large amount of data on the computer and deal with several tricky algorithms that complicate optimizing the speed of the analysis, Kauer says, while wavelet analysis is much faster and more elegant.
Essentially, wavelet analysis, otherwise known as wavelet bootstrapping or wavestrapping, allows researchers to transform DNA data into a more nuanced signal than in FFT. “With wavelet analysis you are able to jump on the signal and just make a ride, like surfing on a wave,” Kauer says. Hardware issues are minimal, he adds, because his new Eonic PC cards can be reprogrammed for wavelet analysis.
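As a small illustration of what a wavelet transform does to a signal, here is one level of the Haar transform, the simplest wavelet. The article does not say which wavelets Kauer's group uses, so the choice of Haar and the toy input are assumptions made for the sketch.

```python
import math


def haar_step(signal):
    """One level of the Haar wavelet transform for an even-length signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / scale for a, b in pairs]  # smoothed, half-length signal
    detail = [(a - b) / scale for a, b in pairs]  # local change within each pair
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original signal from one level of Haar coefficients."""
    scale = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / scale, (a - d) / scale])
    return out


# A made-up signal that is flat except for a dip in its last sample.
toy = [-8.0, -8.0, -8.0, -10.0]
approx, detail = haar_step(toy)
print(detail)
```

The first detail coefficient is zero because the first two samples are equal, while the second is nonzero because the signal changes there. Unlike a Fourier coefficient, each wavelet coefficient says where in the signal a feature occurs, not just how often.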
But equally high on the agenda is the desire to make this kind of analysis more easily accessible to researchers involved in studying genome data. In the near term, Kauer says, while the hardware remains somewhat expensive, colleagues at the university are developing Java programs to create a network interface for outside researchers to use his system over the Web. In the long run, if the cost of hardware continues to decline, there’s no reason why other groups couldn’t build their own systems for using signal theory to analyze biological data.
So far, however, Kauer and his former advisor Helmut Blöcker at Braunschweig are among the very few who have taken up this type of approach to analyzing gene and protein data. “Because the methodology I developed was so unique (but it’s too arrogant to say unique; I was the first one who did it), 99 percent of the people I talked to told me, ‘I don’t understand, go away, leave me alone,’” Kauer says. “‘Ouch’ was the loudest shout!”