Searching for DNA sequences that don’t exist in nature might sound like an academic exercise for a rainy day, but sometimes finding things that aren’t there can have rewards all its own. Putting a new twist on comparative genomics, Greg Hampikian from Boise State University has developed a program called SeqCount to search through GenBank and identify short sequences that have never been reported. SeqCount looks for what Hampikian calls “primes,” the smallest sequences not reported in any species groups, and “nullomers,” sequences that never occur in one species but that exist in others.
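To make the idea concrete, here is a minimal sketch of how one might look for the shortest unreported sequences in a small collection of DNA strings. It is not SeqCount itself: the function names `absent_kmers` and `shortest_absent` are hypothetical, the input is a plain list of strings rather than GenBank records, and it ignores reverse complements and ambiguous bases, which a real search would likely need to handle.

```python
from itertools import product

DNA = "ACGT"

def absent_kmers(sequences, k):
    """Return, in sorted order, every length-k DNA string found in none of the sequences."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence and record each k-mer.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing anything other than A, C, G, T (e.g. N).
            if set(kmer) <= set(DNA):
                seen.add(kmer)
    # Enumerate all 4**k candidates and keep those never observed.
    candidates = ("".join(p) for p in product(DNA, repeat=k))
    return [kmer for kmer in candidates if kmer not in seen]

def shortest_absent(sequences, max_k=12):
    """Return (k, missing) for the smallest k that has at least one absent k-mer."""
    for k in range(1, max_k + 1):
        missing = absent_kmers(sequences, k)
        if missing:
            return k, missing
    return None, []

if __name__ == "__main__":
    k, missing = shortest_absent(["ACGTTGCA", "GGATCCAA"])
    print(k, len(missing))
```

Enumerating all 4^k candidates is only practical for short lengths, which fits the article's focus on the smallest absent sequences.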
“As it is now, when people want to tag something artificially with DNA, they basically just make up a sequence and put it into GenBank and make sure they don’t have any hits,” Hampikian says. “We’re saying there’s a more rational approach to this: let’s look at sequences that haven’t been reported and try and find out if there’s any reason for that.”
Hampikian says that while some sequences are not reported simply by chance, many more are not reported because they are selected against. He believes that there are some sequences that may be totally incompatible with any life form and that such “dangerous” sequences could play a big part in the search for potential drug targets. He likens this idea to the small peptides, generally around 20 amino acids long, found on the skin of frogs that act as natural antimicrobials, killing a range of fungal and bacterial species. “If you have sequences that are intolerable for some pathogenic bacteria but that are common in humans because we are so far evolutionarily apart, then those can be used as drugs against the bacteria,” he says.
The Department of Defense has awarded a $1 million grant to fund Hampikian’s search for nullomer or prime sequences that could serve as tags to safeguard both voluntary DNA samples and samples used in criminal investigations.
These short sequences also have a very practical use when it comes to database management and mirror database agreement. “For example, if there’s two Lyme disease vector organisms’ genomes out there, and two databases which claim to be the same, you look for the small sequences that don’t exist and make sure that those are the same,” Hampikian says. “It’s a very easy algorithm to apply.”
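The mirror check Hampikian describes could be sketched roughly as follows: compute the set of short sequences missing from each copy and compare the two sets. The names `absent_set` and `absent_diff` are hypothetical, each database is assumed to be a simple list of DNA strings, and matching absent sets is only a quick consistency signal, not proof that two copies are identical.

```python
from itertools import product

def absent_set(sequences, k):
    """Return the set of length-k DNA strings that appear in none of the sequences."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    every = {"".join(p) for p in product("ACGT", repeat=k)}
    return every - seen

def absent_diff(db_a, db_b, k):
    """Return (only_a, only_b): k-mers absent from just one of the two copies."""
    missing_a = absent_set(db_a, k)
    missing_b = absent_set(db_b, k)
    return sorted(missing_a - missing_b), sorted(missing_b - missing_a)

if __name__ == "__main__":
    primary = ["ACGT"]
    mirror = ["ACGA"]  # hypothetical mirror with one differing base
    only_primary, only_mirror = absent_diff(primary, mirror, 2)
    # Two empty lists mean the copies agree at this k-mer length.
    print(only_primary, only_mirror)
```

Any k-mer that is missing from one copy but present in the other points to a region where the two databases have diverged.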