Researchers from Brown University have developed a method that they say can generate more accurate haplotype assemblies for genome-wide and whole-exome studies than current methods.
Derek Aguiar and Sorin Istrail, a doctoral student and a professor of computational and mathematical sciences, respectively, presented their method, dubbed HapCompass, in a poster at the Intelligent Systems for Molecular Biology conference held in Long Beach, Calif., last week.
They’ve also published a paper about HapCompass in the June issue of the Journal of Computational Biology.
The authors note that haplotype assembly is distinct from haplotype phasing. While haplotype phasing relies on the concept of linkage disequilibrium to infer haplotypes in a population using the genotypes of a number of individuals, haplotype assembly uses SNP data to build haplotypes for a single individual from a set of sequence reads.
These assemblies are generally built by first mapping sequence reads to a reference genome and then translating the reads into haplotype fragments that contain only polymorphic SNPs. "Because DNA sequence reads originate from a haploid chromosome, the alleles spanned by a read are assumed to exist on the same haplotype," the authors explain.
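To make that concrete, here is a minimal sketch, not taken from HapCompass itself, of how an aligned read might be reduced to a haplotype fragment; the positions, sequence, and function name are hypothetical:

```python
# Illustrative sketch (not HapCompass's code): reduce an aligned read to a
# haplotype fragment, i.e., its alleles at known heterozygous SNP positions.
# The positions and read layout here are hypothetical.

def read_to_fragment(read_start, read_seq, snp_positions):
    """Return {snp_position: allele} for SNPs covered by the read."""
    fragment = {}
    for pos in snp_positions:
        offset = pos - read_start
        if 0 <= offset < len(read_seq):
            fragment[pos] = read_seq[offset]
    return fragment

# A read starting at reference position 100 covering SNPs at 102 and 107:
print(read_to_fragment(100, "ACGTAGGTCA", [102, 107, 250]))
# {102: 'G', 107: 'T'}  -- both alleles are assumed to lie on one haplotype
```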
Haplotype phase information is important for genetic association studies, linkage disequilibrium, and reconstructing phylogenies and pedigrees, but limitations of current sequencing technologies such as error rates and insert sizes “affect how well you can haplotype assemble,” Aguiar explained to BioInform.
For example, “the 1000 Genomes Project has high coverage on two trios and in both cases they have sequence reads from three different technologies but the insert sizes are all more or less the same … so when you have a region of the genome that is very homozygous, you don’t have a sequence read to breach the homozygous gap [and] so you can’t connect these phased blocks,” he explained. “You can phase many independent blocks across the genome but you can’t get this one large haplotype phasing of the entire chromosome that you would have liked.”
While a "considerable" number of algorithms have been developed for haplotype assembly, most were developed to handle data from Sanger sequencers and are “unrealistic” for handling output from high-throughput instruments and third-generation sequencing platforms, the authors wrote.
HapCompass is actually two algorithms that operate on compass graphs to generate haplotype assemblies. Nodes in the graph represent SNPs and edges between nodes represent sequence fragments that cover co-occurring SNP alleles in a haplotype, the authors explain.
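A hedged sketch of that structure, our illustration rather than the authors' implementation, might build the edge list like this (all names and data are hypothetical):

```python
# Sketch of the graph described above: SNPs become nodes, and each fragment
# covering two SNPs contributes edge evidence recording which alleles it
# observed together on one haplotype.

from collections import defaultdict
from itertools import combinations

def build_compass_graph(fragments):
    """fragments: list of {snp_position: allele} dicts.
    Returns {(snp_a, snp_b): [(allele_a, allele_b), ...]}."""
    edges = defaultdict(list)
    for frag in fragments:
        for a, b in combinations(sorted(frag), 2):
            edges[(a, b)].append((frag[a], frag[b]))
    return edges

fragments = [{102: "G", 107: "T"}, {102: "G", 107: "T"}, {107: "C", 115: "A"}]
for edge, evidence in build_compass_graph(fragments).items():
    print(edge, evidence)
# (102, 107) [('G', 'T'), ('G', 'T')]
# (107, 115) [('C', 'A')]
```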
The team compared the performance of HapCompass with two haplotype assembly algorithms developed for use with NGS data: the read-backed haplotype phasing algorithm available in the Broad Institute’s Genome Analysis Toolkit, and HapCut, developed by Vikas Bansal at the University of California, San Diego.
For this comparison, the researchers developed two new metrics that account for differences among the tools’ error models in order to “best capture the more accurate solution independent of the error models employed.”
First, the researchers used a metric dubbed the fragment mapping phase relationship, or FMPR, that is “analogous” to metrics used to assess genome assemblies, Aguiar explained to BioInform.
“When you map your reads back to your assembly, you want as [many] reads mapping back as possible. This is the same idea. You want as much phase relationship as defined by your fragments to map as well,” he explained.
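One plausible reading of an FMPR-style calculation, sketched here for illustration rather than reproduced from the paper, checks each phase relationship a fragment asserts against the assembled haplotypes:

```python
# Hedged sketch of an FMPR-style computation (our reading of the metric, not
# the paper's exact definition): for every pair of alleles a fragment asserts
# are on the same haplotype, check whether the assembled haplotypes place that
# pair together, and report the fraction of pairs that fail.

from itertools import combinations

def fmpr(fragments, hap1, hap2):
    """fragments: list of {snp_position: allele}; hap1/hap2: {snp_position: allele}."""
    errors = total = 0
    for frag in fragments:
        for a, b in combinations(sorted(frag), 2):
            total += 1
            consistent = any(
                hap.get(a) == frag[a] and hap.get(b) == frag[b]
                for hap in (hap1, hap2)
            )
            if not consistent:
                errors += 1
    return errors / total if total else 0.0

hap1 = {102: "G", 107: "T", 115: "A"}
hap2 = {102: "A", 107: "C", 115: "G"}
print(fmpr([{102: "G", 107: "T"}, {102: "G", 107: "C"}], hap1, hap2))  # 0.5
```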
Previously, a metric known as the “haplotype switch error,” defined in the paper as “the number of switches in haplotype orientation required to reproduce the correct phasing,” had been used to evaluate haplotype phasing algorithms, the researchers explain in the JCB article.
However, haplotype assembly algorithms “operate on much different data and assumptions” than haplotype phasing algorithms and require different metrics.
For example, “phase relationships are inferred often from long distance mate pair reads [and] the switch error metric does not accurately capture these relationships,” the paper states.
Furthermore, “if two haplotype assemblies do not produce the same amount of blocks of haplotypes or otherwise do not agree on where to commit to a particular phasing, then the switch error becomes biased towards those algorithms that phase less SNPs,” the paper states.
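For reference, the switch error itself is straightforward to compute; this short sketch follows the standard formulation rather than any code from the paper:

```python
# Standard switch-error computation (illustrative, not from the paper): walk
# the assembled haplotype against the truth and count how many times the
# phase orientation has to flip.

def switch_errors(assembled, truth):
    """assembled/truth: strings of 0/1 alleles at the same het sites."""
    switches = 0
    orientation = None  # True if currently matching truth, False if flipped
    for a, t in zip(assembled, truth):
        same = (a == t)
        if orientation is not None and same != orientation:
            switches += 1
        orientation = same
    return switches

print(switch_errors("00110", "00000"))  # 2: flips before site 3, back before site 5
```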
Aguiar said that the FMPR metric is useful for comparing output from different algorithms when the true haplotypes from the data are unknown.
The second evaluation metric described in the paper, dubbed Boolean fragment mapping, or BFM, counts “the percentage of fragments that map to the resolved haplotypes with at least one error,” the authors wrote.
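Again as an illustration rather than the paper's exact definition, a BFM-style count might look like this:

```python
# Hedged sketch of a BFM-style count (our interpretation): a fragment is
# counted once if it cannot be placed on either resolved haplotype without
# at least one allele mismatch.

def bfm(fragments, hap1, hap2):
    """Fraction of fragments with >=1 mismatch against both haplotypes."""
    def mismatches(frag, hap):
        return sum(hap.get(pos) != allele for pos, allele in frag.items())

    bad = sum(
        1 for frag in fragments
        if mismatches(frag, hap1) > 0 and mismatches(frag, hap2) > 0
    )
    return bad / len(fragments) if fragments else 0.0

hap1 = {102: "G", 107: "T"}
hap2 = {102: "A", 107: "C"}
print(bfm([{102: "G", 107: "T"}, {102: "G", 107: "C"}], hap1, hap2))  # 0.5
```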
When all three haplotype assembly programs were compared using these two metrics, HapCompass proved “significantly more accurate,” achieving smaller FMPR and BFM values than both GATK and HapCut, on sequence data from chromosome 22 of an individual from the 1000 Genomes Project as well as on 10 million simulated reads, according to the paper.
Furthermore, HapCompass was more accurate than both GATK and HapCut when used on 1000 Genomes data that had been supplemented with simulated Illumina reads, the researchers wrote.
The authors noted that HapCompass was also more accurate than the other two algorithms even when using the haplotype switch error metric.
Aguiar and Istrail acknowledged that the approach "does not yet consider quality scores of sequence reads or SNP calls," but noted that this information could be incorporated into the approach via weights on the edges of the graph.
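The article does not spell out the weighting scheme; one simple, hypothetical possibility would be to down-weight an edge by the phred-derived probability that its supporting base calls are correct:

```python
# Hypothetical weighting scheme (not specified by the authors): scale edge
# evidence by the chance that all of its supporting base calls are correct.

def phred_to_prob(q):
    """Probability the base call is correct, from its phred score."""
    return 1.0 - 10 ** (-q / 10.0)

def edge_weight(base_qualities):
    """Weight an edge observation by the product of its base-call accuracies."""
    w = 1.0
    for q in base_qualities:
        w *= phred_to_prob(q)
    return w

print(edge_weight([30, 30]))  # ~0.998: two Q30 calls supporting one edge
```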
In an e-mail to BioInform, Kiran Garimella, a doctoral student at Oxford University who worked previously with the GATK team and has applied its haplotype algorithm to a variety of datasets, said that the Brown team’s software comparison and the metrics they used were “fair.”
“I do take their point that the switch error rate metric was designed for statistical phasing algorithms,” he said. “That metric doesn't have much to say about two disjoint regions that one ‘could’ have phased together into a single big region, [for example] if one were to have paired-end NGS data linking the two regions. It seems like FMPR can model this, and I like that additional insight into the accuracy of single individual haplotyping.”
Garimella did have two criticisms of the study. The first has to do with the data, which came from a deeply sequenced individual in the second pilot phase of the 1000 Genomes Project. “I don't doubt that one can do very well phasing a single individual given deep coverage, but I think we'll see many NGS studies adopting the [1000 Genomes Project] Pilot 1 approach” where samples are sequenced at low coverage.
As a result, “some indication as to how well their algorithm performs as coverage is reduced, rather than increased, would be helpful,” he said.
Secondly, the algorithm “only accepts uncompressed text-based SAM files, rather than the compressed binary BAM counterpart,” he said. “Even a small sequencing study of 10 people will produce terabytes of data, and converting all of those optimized-for-high-throughput-algorithmic-access BAM files to uncompressed-slow-access SAM would be tough on storage systems.”
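For groups that hit this limitation, a conversion along the following lines with pysam, with hypothetical filenames, would produce the uncompressed SAM input the tool expects, though at the storage cost Garimella describes:

```python
# Convert a compressed BAM file to uncompressed SAM with pysam; the
# filenames are hypothetical.

import pysam

with pysam.AlignmentFile("study.bam", "rb") as bam, \
        pysam.AlignmentFile("study.sam", "wh", template=bam) as sam:
    for read in bam:  # copy every alignment record, header included ("wh")
        sam.write(read)
```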
Overall, “I think there are a lot of people who'd like to take the next step of looking at haplotypes in their NGS data, and it seems like Aguiar and Istrail have a really great model for doing so,” he said.