A Taiwan lab’s “genome positioning system” saves time for SNP mappers
by Aaron J. Sender
“Blast is too slow,” says Ming-Jing Hwang, explaining what drove his lab at the Academia Sinica in Taipei to develop a new SNP-mapping approach.
Identifying a SNP is one thing: the SNP Consortium has dumped millions of them into public databases. But pinpointing each within the genome is something else. To locate SNPs in the genome, researchers commonly align each variant base and its flanking sequences, one at a time, with one extremely long sequence: the human genome.
“With the alignment method — that is what the NCBI does — you have to use Blast. Most laboratories would not be able to do it, because it takes too much time and computing resources,” says Hwang. Finding a match for each SNP-containing fragment, typically several hundred bases long, within a multibillion-base genome can take several minutes. That might not sound like a lot of time. But multiply that by millions of SNPs, and you’ve got yourself a months- or even years-long project. Of course an accelerator such as those sold by Paracel or Time Logic can speed things up dramatically, but at a six-figure cost few small labs can afford.
Hwang, however, has found a way to sharply cut both time and cost. Using a new approach called GPS, for genome positioning system, that avoids sequence alignment completely, his lab mapped more than 1.6 million SNPs in 20 hours, using four 1-GHz desktop PCs. “And basically every lab can do that,” he says.
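To put those figures side by side, here is a back-of-envelope calculation in Python. The SNP count, the 20 hours, and the four PCs come from the article; the two-minute Blast figure is only an assumed stand-in for "several minutes," and the single-machine setup is likewise an assumption.

```python
# Figures quoted in the article.
snps = 1_600_000          # "more than 1.6 million SNPs"
gps_hours = 20            # wall-clock time for the GPS run
gps_pcs = 4               # four 1-GHz desktop PCs

# GPS cost per SNP, in seconds of one PC's time.
gps_pc_seconds = gps_hours * gps_pcs * 3600
gps_seconds_per_snp = gps_pc_seconds / snps

# ASSUMPTION: "several minutes" per Blast search is taken as 2 minutes,
# run serially on a single machine. The article gives no exact figure.
assumed_blast_minutes = 2
blast_hours = snps * assumed_blast_minutes / 60
blast_years = blast_hours / (24 * 365)

print(f"GPS: about {gps_seconds_per_snp:.2f} PC-seconds per SNP")
print(f"Blast (assumed): about {blast_years:.1f} years on one machine")
```

Under that assumption the alignment route lands in the multiyear range the article describes, while GPS spends a fraction of a second per SNP.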
Here’s how it works. Instead of searching the entire genome database, Hwang searches only sequences that appear in the genome once. To do this, lab member Edward Shih digitized the entire genome into binary code. A becomes 11 and its complement T becomes 00. Similarly, C becomes 10 and G, 01. “Computation is faster if you encode it in binary digits,” says Hwang.
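The two-bit code can be sketched in a few lines of Python. The mapping is the one quoted above; the function names encode and complement are hypothetical, and packing a sequence into a single integer is an illustrative choice, not a description of Shih's actual program.

```python
# Two-bit codes as given in the article: A=11, T=00, C=10, G=01.
CODE = {"A": 0b11, "T": 0b00, "C": 0b10, "G": 0b01}


def encode(seq: str) -> int:
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def complement(value: int, length: int) -> int:
    """Complement every base by flipping all 2 * length bits."""
    mask = (1 << (2 * length)) - 1
    return ~value & mask


packed = encode("ACGT")
print(format(packed, "08b"))                 # 11100100
print(format(complement(packed, 4), "08b"))  # 00011011, i.e. TGCA
```

One consequence of this particular mapping is that each base and its complement are bitwise opposites, so complementing a whole sequence is a single bit-flip.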
Next, the researchers scan through the code to find all the unique 15-base-pair sequences. Hwang calls these sequences, which appear exactly once in the genome, UniMarkers (UMs), and his approach the UM method. The 162 million UniMarkers are then searched against the 1.6 million SNP-containing fragments. “We search all the SNP sequences and see what markers they have,” says Hwang.
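The idea of a UniMarker can be illustrated on a toy sequence. The sketch below is a simplification under stated assumptions: it works on plain strings rather than the binary code, looks at one strand only, and holds every count in memory, which would not scale to a real genome. The name unique_kmers and the toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmers(genome: str, k: int = 15) -> dict[str, int]:
    """Return {k-mer: start position} for k-mers that occur exactly once."""
    last_start = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(last_start))
    return {
        genome[i:i + k]: i
        for i in range(last_start)
        if counts[genome[i:i + k]] == 1
    }


# Toy example with k=4 so the result is easy to check by eye.
toy_genome = "ACGTACGTTTGACCA"
markers = unique_kmers(toy_genome, k=4)

print("ACGT" in markers)   # False: ACGT occurs twice, so it is not a marker
print(markers["TTGA"])     # 8
```

Sequences that repeat, such as ACGT here, simply never make it into the marker table.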
Ideally, a single UM match would be enough to locate a SNP in the genome. In practice, because of sequencing errors and variations in the genome, a SNP sequence may contain multiple UMs that map to different parts of the genome. But because there are so many more markers than SNPs, most SNP sequences contain multiple markers that tend to cluster around a single genomic region. “But if you just have a single marker it may not point to the right position, because the marker may be noise,” says Hwang. As the genome moves from draft to complete, these ambiguities should begin to fall away.
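One way to picture the clustering step is as a vote. The article does not say how Hwang's lab scores clusters, so everything in this Python sketch is an assumption: the locate function, the min_hits threshold of two, the toy marker table, and the rule that each marker hit votes for the fragment start position it implies.

```python
from collections import Counter
from typing import Optional


def locate(fragment: str, markers: dict[str, int], k: int,
           min_hits: int = 2) -> Optional[int]:
    """Return the best-supported genome start for fragment, or None."""
    votes = Counter()
    for offset in range(len(fragment) - k + 1):
        pos = markers.get(fragment[offset:offset + k])
        if pos is not None:
            # A marker at genome position pos, found at this offset in the
            # fragment, implies the fragment starts at pos - offset.
            votes[pos - offset] += 1
    if not votes:
        return None
    start, hits = votes.most_common(1)[0]
    # A lone hit may be noise, so require at least min_hits agreeing markers.
    return start if hits >= min_hits else None


# Hypothetical marker table (k=4). "GACC" stands in for a noisy marker
# that points to the wrong part of the genome.
toy_markers = {"CGTT": 5, "GTTT": 6, "TTTG": 7, "TTGA": 8, "GACC": 100}

print(locate("CGTTTGACC", toy_markers, k=4))  # 5
print(locate("GACC", toy_markers, k=4))       # None
```

In the first call four markers agree on position 5 and outvote the stray hit; in the second, a single marker is not trusted on its own.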
Hwang and his colleagues were able to uniquely assign 81.4 percent of the SNPs in the dbSNP database with the UM method. Blast, on the other hand, was able to unambiguously position only 75.7 percent.
The UM approach has other advantages as well. “For other methods you first have to use RepeatMasker to mask repeat elements before you do analysis,” says Hwang. “With the UniMarkers the repetitive elements will be filtered out automatically.”
Now Hwang is working to extend the method to mapping ESTs and then cross-referencing them back to the SNPs. “The EST, because it comes from an independent source, can validate the SNP data and vice versa,” says Hwang. “And we can then identify the haplotypes in the EST clones.”
Blast certainly still has its uses. “But for the purposes of mapping, UniMarkers are definitely much faster,” says Hwang. “And you can do it with PCs.”