There's more than one way to look at a DNA sequence, and it doesn't have to involve an endless string of letters. North Carolina State University's David Cox has created what he's termed a "symbolic scatter plot" that lets people see patterns within a DNA sequence with the naked eye. One of his main findings was that, compared with Boston University's Tandem Repeats Finder (TRF), his technique found more interesting patterns and, in some cases, repeats that TRF missed. Cox will present his research this summer at the 2009 International Conference on Bioinformatics and Computational Biology in Las Vegas.
Tandem repeats are significant in the molecular pathology of many diseases, including Huntington's disease, and Cox's technique may help researchers identify the small causative changes in DNA patterns more effectively than computational sequence analysis tools like TRF or Blast. "My thesis is that even though we have good software for finding those patterns, it's still unmatched when compared to the human visual system," he says.
Cox, who is a graduate student working on his PhD in computer science, says he developed the scatter plot with a visualization tool in mind. "Most of the bioinformatics algorithms take a statistical approach and analyze DNA sequences statistically, looking for matches that are, from a statistical perspective, not random," he says. "What I wanted to do was to actually be able to see the matches in some fashion, and in looking at the techniques that were currently available, I didn't find any that were particularly good at it."
His technique starts out much like Blast, he says, in that it takes the sequence at hand and breaks it up into small words. Whereas Blast computationally searches a database for matches to those words, his method simply maps them. In his case the words are 3-mers, each of which is one of the 64 possible combinations of three nucleotides. Each 3-mer is plotted as a point on the scatter plot: it is assigned a number from zero through 63, which serves as the y-coordinate, while the x-coordinate is the position at which the 3-mer appears in the genetic sequence. Cox designed the symbolic scatter plot so that 3-mers that code for the same amino acid are adjacent to one another.
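The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cox's implementation: the article does not say how the 64 rows are ordered beyond keeping synonymous codons together, nor whether the 3-mers overlap, so the alphabetical ordering by amino acid and the `step` parameter here are hypothetical choices, as are all function and variable names. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: order rows by amino acid letter, then by codon, so that
# 3-mers coding for the same amino acid get adjacent y-values (0..63).
ORDERED_CODONS = sorted(CODON_TO_AA, key=lambda c: (CODON_TO_AA[c], c))
Y_INDEX = {codon: y for y, codon in enumerate(ORDERED_CODONS)}


def scatter_points(sequence, step=1):
    """Return (x, y) points for a DNA sequence.

    x is the order in which a 3-mer appears; y is its row (0..63).
    step=1 uses overlapping 3-mers, step=3 uses consecutive triplets
    (hypothetical parameter; the article does not specify which).
    Words containing anything other than A, C, G, T are skipped,
    leaving a gap at that x position.
    """
    seq = sequence.upper()
    points = []
    for x, start in enumerate(range(0, len(seq) - 2, step)):
        y = Y_INDEX.get(seq[start:start + 3])
        if y is not None:
            points.append((x, y))
    return points


if __name__ == "__main__":
    # A short CAG repeat, the kind of tandem repeat linked to Huntington's disease.
    for x, y in scatter_points("CAGCAGCAGCAG", step=3):
        print(x, y, ORDERED_CODONS[y])
```

Passing the resulting points to any plotting library gives a picture in the spirit of the one described. In this sketch, a tandem repeat read in triplets lands on a single row, which is the sort of pattern the eye picks out easily.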
What struck him when he first produced plots of the human genome were the variable and interesting patterns he saw. "What I'm doing now is to look at those patterns and try to understand a little bit as to what they mean," Cox says. "Are they important biologically and, if so, what do they mean?"
He says that the scatter plot is basically a research tool and won't replace currently available software, simply because there are limits to what visualization can do. "For example, if you're comparing two sequences from two distantly related organisms, then the nucleotides might not match at all and in that case, you have to rely on some sort of assumptions about how frequently the nucleotides mutate and how frequently insertions and deletions occur in order to come to some conclusions," Cox says. "Visualization is not going to help that."