Researchers who rely on genome browsers are accustomed to viewing their genome of interest aligned against the genomes of many other species. But how accurate are those alignments?
In a recent study that set out to address that question, Martin Tompa and Xiaoyu Chen of the University of Washington's departments of computer science and engineering and genome sciences assessed how well four different tools aligned 28 vertebrate genomes.
In a paper describing the study published last week in Nature Biotechnology, Tompa and Chen explain that they used data provided by the Encyclopedia of DNA Elements Multi-Species Sequence Analysis team, which aligned 1 percent of the human genome to 27 other vertebrate genomes with four alignment tools: TBA (Threaded Blockset Aligner), Mavid (Multi-Avid), MLAGAN (Multi-Limited Area Global Alignment of Nucleotides), and Pecan.
Tompa and Chen described these alignments as "an unprecedented test bed for comparison" because "four expert teams used four different alignment methods to align the same 28 vertebrate sequences, spanning 554 Mbp of sequence in total."
To assess the alignments, Tompa and Chen looked at the overall agreement among the four alignments, as well as each alignment's coverage and accuracy. To gauge accuracy, they used StatSigMA-w, a method developed at the University of Washington that flags "suspicious" alignments.
The authors concluded that there was a "disturbing lack of agreement" among the alignments, a discrepancy that was not limited to species distant from human but extended even to the well-characterized mouse genome.
Overall, the study found that Pecan, developed by the European Bioinformatics Institute, was the best-performing multiple alignment tool: it aligned as much of the human genome to other species as any of the other tools, and its matches were consistently more accurate, especially between more distantly related species.
BioInform spoke to Tompa this week about the study and its implications for end-users as well as developers of alignment tools. The following is an edited version of the interview.
Given that multiple genome alignment is so critical for this field, it seems odd that there hasn't been much work developing methods to assess the accuracy of the alignments. Can you discuss why there has been a dearth of these assessment tools until now?
Doing an assessment is very difficult. The basic problem is that what you'd like to know in assessing these alignments is how close are they to the true alignment, and the problem is there's no way that we can know what the true alignment is. The true alignment is the alignment that would align two characters, say from human and mouse, if they both evolved from the same ancestral character in the most recent ancestor of human and mouse. Now of course we don't know what the genome sequence is for the most recent ancestor and we don't know how that sequence evolved in the time since human and mouse diverged. So this ground truth that we would love to have — the ground truth of what's the correct alignment — we just don't have. And that makes evaluating these alignments not only very difficult, but it means that the assessment tools are subject to the sorts of criticism that the alignment tools are subject to — namely someone can ask, 'How do you know you did the assessment right?' just like we're asking, 'How do you know you did the alignment right?'
So how did you address this problem? Your paper notes that some groups have used simulated data so that they have a known true alignment, but you didn't rely on simulated data for this study.
Using simulation is one way to get around that problem, because if you simulate evolution then you do know what the ground truth is. If you say, 'I'm going to start from some random sequence and I'm going to simulate evolution,' then you can track the mutations that have happened during your simulation, so you do know what the correct alignment is, and now you can test that correct alignment against predicted alignments. That has the wonderful advantage that you know what the truth is.
But the disadvantage of simulation is that we don't understand evolution terribly well, so whatever model of evolution you build into the simulation is subject to the criticism of, 'Well, that's a simple mathematical model. Evolution really isn't that simple.' And in particular, one of the big challenges in these genomic alignments is that there have been very large-scale rearrangements of genomes over the course of evolution. A big piece of, say, mouse chromosome 3 might have broken off some time in evolution and gotten attached to chromosome 17. So you get these big chunks of rearrangement and we really don't understand how those happened. We don't understand them well enough to be able to model them well in a simulation.
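The ground-truth idea Tompa describes can be made concrete with a toy sketch. This is not one of the simulators used in the field, and it deliberately models only substitutions and short indels, omitting exactly the large-scale rearrangements he says are hard to capture: evolve a copy of an ancestral sequence while recording which descendant position each surviving ancestral base ends up at, and that record is the true alignment a tool should recover.

```python
import random

random.seed(0)

def simulate_descendant(ancestor, sub_rate=0.1, del_rate=0.02, ins_rate=0.02):
    """Evolve a copy of `ancestor`, recording the true alignment.

    Returns (descendant, true_pairs), where true_pairs maps each surviving
    ancestral position to its position in the descendant. Toy model only:
    substitutions and single-base indels, no rearrangements.
    """
    bases = "ACGT"
    descendant = []
    true_pairs = {}
    for i, base in enumerate(ancestor):
        if random.random() < del_rate:      # ancestral base deleted
            continue
        if random.random() < ins_rate:      # insertion before this base
            descendant.append(random.choice(bases))
        if random.random() < sub_rate:      # substitution to a different base
            base = random.choice(bases.replace(base, ""))
        true_pairs[i] = len(descendant)     # where this ancestral base lands
        descendant.append(base)
    return "".join(descendant), true_pairs

ancestor = "".join(random.choice("ACGT") for _ in range(60))
descendant, truth = simulate_descendant(ancestor)

# Any predicted alignment of ancestor vs. descendant can now be scored
# against `truth`: the fraction of its aligned pairs that appear in truth.
print(len(truth), "of", len(ancestor), "ancestral bases survive")
```

Because every mutation is tracked as it happens, the correct alignment is known by construction, which is the advantage of simulation that Tompa notes; the simplicity of the mutation model is its weakness.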
You developed a tool called StatSigMA-w as an alternative. What approach does that take to assessing these alignments?
What StatSigMA-w does is ask the question, 'In a particular region of the alignment, are there one or more sequences that really don't belong there — sequences that are aligned in a certain place but really don't fit into the alignment at that position?' And the way it tests the hypothesis that a sequence belongs is to ask, 'Does it fit into the alignment there any better than a randomly chosen sequence from that species would fit in?'
Let's take mouse again as an example. For every region of the alignment, we ask, 'Is the mouse sequence in this portion of the alignment aligned to the other sequences any better than a randomly chosen mouse sequence would be?' If the answer is no, meaning a random mouse sequence would align just as well as the given sequence, then StatSigMA-w says that the mouse sequence looks suspicious in that region. We don't call it misaligned, because we don't know that it's misaligned; we just call it a suspicious region for mouse.
StatSigMA was actually developed with a former PhD student of mine, Amol Prakash. This was a big part of his PhD thesis a few years ago. [The tool was described in a paper published in Genome Biology in 2007 — Ed.]
So how do you know that StatSigMA-w works better than simulation-based approaches or other evaluation methods?
Well, I can't say with complete confidence that this definitely works better than other assessment methods. It's certainly a different tool, a different approach than the other methods, so it can be used together with other assessment methods to gain more evidence about how alignments are performing.
We have over the last few years accumulated a fair amount of evidence that StatSigMA is doing a good job of assessing these alignments, and in this new paper there is one particular new piece of evidence that we got from the current data set. What we ask is, when StatSigMA-w identifies a region as suspicious, let's say for mouse again, does one of the other alignments align that particular region better than the suspicious alignment? We actually compared alignment scores between suspicious alignments and non-suspicious alignments, and discovered that in general the non-suspicious alignments have a much higher alignment score than the suspicious alignment, so this gives us additional confidence in the calls that StatSigMA is making of suspicious regions.
In addition to the accuracy measured by StatSigMA-w, your study looked at the alignment coverage for these different algorithms. Are there cases where researchers using these tools might prefer higher coverage but lower accuracy, or vice versa?
Ideally what you want is high coverage and high accuracy. High coverage would mean that you're aligning as much sequence as you can, and high accuracy means you're aligning it well. I think that you would rarely want to sacrifice accuracy for higher coverage because what that would mean is that you're aligning more sequence, but you're probably not aligning it correctly, so there's no advantage at all to just aligning extra sequence but getting it wrong.
I think users of these alignments would rather go the other way — they'd rather say, 'If I can sacrifice some of the coverage and get higher accuracy, I'd prefer that because I really want to be able to trust the alignment that I'm getting out.'
Were there any surprises in the results in terms of how well the four methods performed?
There were some surprises. One thing we did in the assessment was to simply compare how well the alignments agreed with each other, and one of the surprises that comes out of this is that these four alignment methods that we assessed actually didn't agree with each other nearly as often as I would have hoped.
For example, in mouse again, it seems that any alignment that we looked at agreed with some other alignment on only 50 percent of the mouse characters that were aligned to human. This seemed disturbingly low. If you look at two different alignments and they're aligning 50 percent of the mouse characters differently from each other, then you really have to wonder, 'Which of these two alignments can I trust?' So that was a surprise, and a very disturbing surprise, I think, for users of these alignments.
The other thing that did surprise me was that there are a lot of different dimensions on which we tried to compare these different alignments. We tried to compare their coverage, we tried to compare their accuracy, we compared them within genes and in regions that were far away from genes, and we compared them in each of 22 species. What I would have expected to come out is that there's no clear consensus of which of these four alignments is the best one. It came as a pleasant surprise that one of the alignment methods — Pecan — actually seems as though it's the best alignment in almost all of these categories. It has coverage that's as good as any of the alignment methods in nearly all of the species, and its accuracy as measured by StatSigMA-w is at least as good as the others, and usually quite a bit better than the others, and that's particularly true in the distant species. So when you're looking at fishes and chickens aligned to human, Pecan seems to be much more accurate than the other alignment methods.
Do you have a sense of why that's the case? Is there a different approach that Pecan takes to alignment compared to the others?
That's something I really don't know the answer to, but I think it's an interesting question and I'm hoping to look into that and I'm hoping also that the developers of these alignment methods will look into it and ask why is it that Pecan seems to be doing better in these distant species.
One simple thing I can say is that Pecan is the newest of these four alignment methods and has the advantage of having been built on the shoulders of these other alignments.
What should developers of alignment tools take home from this study?
I think that one of the things StatSigMA-w provides for developers is a complete listing of suspiciously aligned regions. What I'm hoping is that the developers of alignment tools will be interested in taking a look at the list of suspicious regions we produced in this assessment, which is available in the paper's supplementary data files. By looking at those regions, and thinking about their algorithm and how their alignment was done, they will be able to discover, 'Ah, I see what we've done that has led to a large number of these suspicious alignments, and here's how we can improve the algorithm to avoid making those bad alignments.'
What about users of these tools? Is there anything they can or should do to ensure they're getting the best alignment?
I think it's very important that users of the alignments be given, in addition to the alignments themselves, some alignment quality measure, and something like StatSigMA-w could be used to provide that measure. If a user has a particular region of interest, let's say a particular gene that they're interested in, then before trusting the alignment of that gene it would be nice if they also had a quality measure that said, 'Yes, this portion of the alignment looks trustworthy,' or, 'No, the mouse sequence looks like it's suspiciously aligned in this region, so maybe you shouldn't trust the alignment.' I think it would be great if the providers of these alignments provided such a quality measure.
So it would be like a Phred score or something like that?
Exactly. It would be the analog of a Phred score, except for alignment quality rather than sequence quality.
Do you know whether any of the browsers are planning on making that information available?
I think the [University of California, Santa Cruz] Genome Browser folks are quite interested in doing that, and they've expressed some interest in incorporating our StatSigMA-w scores as tracks in the UCSC Browser.
What's next for you? Are you going to continue to look into this or will you be moving on to other things?
I'm still interested in understanding this a little bit further, pursuing it a bit more. I will also be moving on to other things as well. This has been sort of a digression for me. I got involved in this a number of years ago because of concerns about alignment quality, but really my longer-term interests are in doing the comparative sequence analysis itself, rather than evaluating alignment methods, so I'd like to get back to some of those questions of comparative genomics.
I guess this study will help that.
That's the reason I got involved in this. Before doing the comparative genomics, I wanted to be sure that you can trust the alignments it's based on.
This study only looked at the ENCODE regions, which cover about 1 percent of the genome. Would it be of value to assess whole-genome alignments as well, or would that be too difficult?
It's definitely possible to do, and I think that would be a very valuable thing to do. The reason we did the ENCODE 1 percent alignments was because it was a unique opportunity to compare four different alignment methods on exactly the same input sequences. And that data isn't available on a whole-genome scale. For a whole-genome scale we really only have two different alignments that I know of that are available for the mammals, and they're really done in a bit of an incomparable way so it wouldn't make sense to compare those against each other. On the other hand, evaluating either one using StatSigMA-w independently I think would be a very useful thing to do for end users.