NEW YORK (GenomeWeb News) – There is a lack of agreement between the results generated by different alignment tools that compare and contrast multiple genome sequences, according to a recent paper in the early, online edition of Nature Biotechnology.
Martin Tompa, a computer science and engineering researcher at the University of Washington, who is also affiliated with the university's department of genome sciences, and doctoral student Xiaoyu Chen compared four sequence alignment tools using 554 million bases of sequence data from 28 vertebrate genomes. Rather than producing consistent results, the pair found a lack of conformity between alignments generated from the same vertebrate dataset, particularly when comparing species that are not closely related or looking at non-protein-coding parts of the genome.
"We discovered that there's a disturbingly low level of agreement between genome alignments produced by different tools," Tompa said in a statement. "What this should suggest to biologists is that they should be very cautious about trusting these alignments in their entirety."
The data used for the new comparison stemmed from research done by groups working within the ENCODE consortium's Multi-Species Sequence Analysis team. That study, which appeared in Genome Research in 2007, involved aligning one percent of the human genome with genome sequences for 27 other vertebrates.
For the current paper, Tompa and Chen delved into the details of this 554 million base pair alignment, looking at the agreement, coverage, and accuracy of the sequence alignment tools used in the ENCODE study: Threaded Blockset Aligner (TBA), Multiple Limited Area Global Alignment of Nucleotides (MLAGAN), Mavid, and Pecan.
"What makes these alignments an unprecedented test bed for comparisons is that four expert teams used four different methods to align the same 28 vertebrates sequences," Chen and Tompa wrote.
Unlike the initial analyses, though, the pair assessed all of the aligned vertebrate sequence data rather than homing in on mammalian data.
Unexpectedly, the researchers found low agreement between the alignments, especially for untranslated regions, introns, and intergenic sequences.
In general, they found lower agreement, coverage, and accuracy with increasing species distance from humans, though agreement was low even when comparing alignments of human and mouse sequences.
"Such low levels of agreement indicate that constructing a reliable whole-genome multiple sequence alignment remains a significant challenge," the duo noted, "particularly for non-coding regions and distantly related species."
On the whole, the pair's analyses, which relied on the statistical method StatSigMA-w, suggest that the European Bioinformatics Institute tool Pecan provided the most accurate results of the four methods tested.
Based on the alignment differences and accuracy deficits detected in the new paper, Chen and Tompa argued that researchers need to take a critical look at sequence alignment tools and should be particularly vigilant about double-checking alignments involving sequences from distantly related species or non-coding regions of the genome. In the long term, they said, evaluating alignments in this fashion may help to improve the alignment approaches.
"I think we're all interested in having a better understanding of which methods work the best and how to make them better," Tompa said in a statement.