TIGR’s recent sequencing of two isolates of Bacillus anthracis may not have provided any easy answers about the source of the anthrax scare that occurred in the US last fall, but it did provide some important insights into the role of bioinformatics in comparative genomics, according to Mihai Pop, a bioinformatics scientist at TIGR who worked on the project.
The comparison, which was the first whole-genome approach to a forensic investigation, was able to pinpoint four key differences between the Porton isolate, which had already been sequenced at TIGR to a nearly complete stage, and the closely related “Florida” isolate used in the letter attacks. Without a statistical quality value determination method developed by TIGR’s bioinformatics group, said Pop, detecting two SNPs and two short insertion/deletions in a genome of 5.2 million base pairs would have been like finding a needle in a haystack.
TIGR has an error rate for finished genomes of around one in 88,000 nucleotides, which, across two B. anthracis genomes of 5.2 million base pairs each, would have yielded around 120 apparent differences from sequencing error alone. Pop explained that the researchers were able to differentiate between errors and actual polymorphisms by improving upon the quality values provided by the base-calling programs Phred and Paracel’s TraceTuner. Once the two genomes were aligned using MUMmer, each apparent polymorphism was assigned a quality value score of its own based on the probability that both consensus base calls were correct. TIGR wrote a series of Perl scripts to automate this process and compute the probability of each polymorphism, winnowing down hundreds of possible differences to the final four discussed in the group’s Science paper.
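The arithmetic behind the error estimate, and the general idea of scoring a candidate polymorphism from two consensus quality values, can be sketched in a few lines of Python. This is an illustration only, not TIGR's Perl code: it assumes Phred-style quality values (Q = -10 log10 of the error probability) and treats the two consensus calls as independent, and the function names and the threshold of 40 are hypothetical.

```python
import math


def expected_errors(genome_length, bases_per_error, genomes=2):
    """Expected number of erroneous bases across `genomes` sequences."""
    return genomes * genome_length / bases_per_error


def phred_to_error_prob(q):
    """Convert a Phred-style quality value to an error probability."""
    return 10 ** (-q / 10)


def polymorphism_quality(q_a, q_b):
    """Phred-style score for a difference between two consensus calls.

    Assumes the two calls are independent, so the difference is real
    only if both calls are correct.
    """
    p_a = phred_to_error_prob(q_a)
    p_b = phred_to_error_prob(q_b)
    # P(at least one call wrong) = 1 - (1 - p_a)(1 - p_b), expanded to
    # avoid floating-point cancellation when both values are tiny.
    p_wrong = p_a + p_b - p_a * p_b
    return -10 * math.log10(p_wrong)


def filter_candidates(candidates, min_quality=40):
    """Keep candidate differences scoring at or above a hypothetical cutoff.

    `candidates` is a list of (position, q_a, q_b) tuples.
    """
    return [
        (pos, polymorphism_quality(q_a, q_b))
        for pos, q_a, q_b in candidates
        if polymorphism_quality(q_a, q_b) >= min_quality
    ]


if __name__ == "__main__":
    # 5.2 Mbp genomes at one error per 88,000 bases, two genomes.
    print(round(expected_errors(5_200_000, 88_000)))
    # Made-up candidates: (position, quality in genome A, quality in genome B).
    print(filter_candidates([(1200, 60, 55), (3400, 25, 50), (9100, 45, 45)]))
```

Because the combined score can never exceed the weaker of the two quality values, a difference supported by a low-quality call in either genome is ranked as a likely sequencing error.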
Pop estimated that assessing the quality values manually would have taken around a half hour per apparent SNP.
One caveat, Pop pointed out, is that the Florida strain, at an average of 8-fold coverage, was not sequenced to the same level of completion as the Porton strain, which had an average of 11-fold coverage. “So in the unlikely case that there’s a chunk of several hundred base pairs of DNA missing from the Florida strain, we wouldn’t be able to see it unless we finished the genome,” Pop said. However, he noted, having the Porton genome finished to a reasonable degree did play a key role in the work and should provide further evidence of the necessity of funding sequencing through the finishing stage for certain reference organisms.
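To give a rough sense of why the coverage gap matters, the sketch below applies the idealized Lander-Waterman (Poisson) model, in which the chance that a given base is covered by no read at c-fold coverage is e^(-c). This is a textbook approximation that assumes reads land uniformly at random. It is not part of TIGR's analysis, and real shotgun data deviate from it.

```python
import math


def expected_uncovered_bases(genome_length, coverage):
    """Expected number of bases with zero read coverage.

    Uses the idealized Poisson model: P(uncovered) = exp(-coverage).
    """
    return genome_length * math.exp(-coverage)


if __name__ == "__main__":
    genome_length = 5_200_000  # approximate B. anthracis genome size
    for label, coverage in (("Florida", 8), ("Porton", 11)):
        missing = expected_uncovered_bases(genome_length, coverage)
        print(f"{label}: {coverage}-fold coverage, ~{missing:.0f} uncovered bases expected")
```

Under these assumptions, the expected number of unsequenced bases at 8-fold coverage is on the order of a couple of thousand, versus fewer than a hundred at 11-fold, which fits Pop's point that a short missing stretch could go unnoticed in the less complete genome.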
In addition, Pop said, TIGR’s quality value assessment could also be used to judge the quality of single genomes.
“Very few people understand the basic idea of shotgun sequencing,” he noted. “They understand the concept … but there’s a disconnect between the biologists who just want to read the sequence and the fact that the sequence actually is composed of multiple reads that are overlapping at a particular place and that can give you good information as to what the quality of the region is.”
TIGR provided the quality measures with its release of the sequence of the Florida isolate and plans to do the same for all future low-coverage projects, Pop said. TIGR is also in the process of sequencing at least 14 other strains and isolates of B. anthracis over the next year or so, and, “We’ll try and get better tools to find all polymorphisms between strains that are more distantly related than the strains we have to look at right now,” said Pop. A map of polymorphisms is planned, as is a broad database of polymorphisms in the genomes of other important pathogens.