Nat Goodman charts a genomics project just for the bioinformatics crowd: the comparisome
We IT guys and gals need an ome of our own. An ome where computation is the foundation, and not just the shingles. An ome where the furnishings are comfy for bioinformaticians, and aren’t designed for wet biologists. An ome built using timber hewn from existing data, rather than from piles of new material.
The ome I have in mind is the comparisome, by which I mean the omic study of whole-genome comparisons. The idea is to systematically compute all meaningful comparisons among the many whole genomes that are now in hand and produce a dataset that can be mined for biological results.
Comparisomics will require a Home Depot’s worth of computing power and will cost hundreds of millions of dollars. But the value will be immense, and, hey, it wouldn’t be an ome if it were cheap.
I don’t claim to know the best blueprint for our ome sweet ome, but here’s a strawman to get the ball rolling.
As a thought experiment, let’s imagine comparing the entire human genome to the entire mouse genome using the familiar Smith-Waterman algorithm. The cost of doing a Smith-Waterman calculation is proportional to the product of the lengths of the two sequences being compared.
Taking the human and mouse genomes as 3.5 billion bases each, the cost comes to approximately 10^19 Smith-Waterman steps. This is a big number. And, of course, you have to reverse complement one of the genomes and do the calculation a second time, a detail I’ll usually ignore as it only increases the cost by a factor of two.
But computers are fast and programmers are clever. A program called MPSRCH from Aneda, formerly Edinburgh Biocomputing Systems, can do 1 billion (10^9) Smith-Waterman steps per second on a Compaq Alpha workstation. So, the whole job would take roughly 1.2 x 10^19 steps / 10^9 steps per second = 1.2 x 10^10 seconds, which comes to just under 390 years. I agree that this is too long for any of us to wait, but of course we can cut the time by buying more computers and doing the work in parallel.
With 10,000 Alphas, the job would take about two weeks, fast enough for even the most impatient ome buyer. With Alphas at $4,000 a pop, this comes to $40 million in hardware. Then there’s overhead, of course, and remember to multiply by two for the reverse complement. As I said, omics ain’t cheap.
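For readers who want to check the arithmetic, here is the whole Plan A estimate in a few lines of Python. The constants are the figures quoted above; the year and week conversions are approximate, and none of this is any vendor's actual benchmark code.

```python
# Plan A, back of the envelope. All constants come from the text:
# 3.5 billion bases per genome, 10^9 Smith-Waterman steps per second
# for MPSRCH on one Alpha workstation.
GENOME_LEN = 3.5e9
STEPS_PER_SEC = 1e9
SECONDS_PER_YEAR = 3.15e7
SECONDS_PER_WEEK = 86_400 * 7

steps = GENOME_LEN * GENOME_LEN                    # ~1.2 x 10^19 steps
years_one_alpha = steps / STEPS_PER_SEC / SECONDS_PER_YEAR
weeks_10k_alphas = steps / (STEPS_PER_SEC * 10_000) / SECONDS_PER_WEEK

print(f"{steps:.1e} steps")
print(f"{years_one_alpha:.0f} years on one Alpha")      # just under 390
print(f"{weeks_10k_alphas:.1f} weeks on 10,000 Alphas") # about two
```

Remember to double everything for the reverse complement.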
Specialized hardware can do the job even faster. A Paracel system can do somewhere between 30 billion and 50 billion Smith-Waterman steps per second. This gets the job done in between eight and 13 years. A TimeLogic array can do 7.5 billion Smith-Watermans per second; each TimeLogic node can have three arrays for a total of 22.5 billion steps per second, which does the job in just over 17 years. Again, buy more hardware and you can do it as fast as you want.
So it’s expensive, but feasible. Why not do it?
The major fault with Plan A is that the computation I’ve sketched doesn’t really answer the right question for the comparisome. The problem is that the alignments computed by Smith-Waterman, though called “local,” aren’t local enough. Every alignment contains a core of maximum similarity surrounded on both sides by shoulders of less similarity. If a second strong match is nearby, the downward shoulder of the first can connect with the upward shoulder of the second, linking the two cores into the same alignment. This may be what you want when aligning sequences, but for the comparisome we’d prefer to represent each match separately.
A better approach is to divide each genome into a series of short, overlapping “words” and compute the similarity of each pair of words. To be concrete, imagine that we divide each genome into words of length 10. The first word would consist of bases one through 10 of chromosome 1, the next would be bases two through 11 of chromosome 1, and so forth until we reach the end of chromosome Y. We then compare each pair of words, yielding a Taj Mahal-sized dataset of similarity scores. Since there are about 3.5 billion overlapping words in each genome, the output will be a matrix of 3.5 billion by 3.5 billion scores, for a total of about 10^19 cells.
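The decomposition itself is trivial to express. Here is a toy sketch in Python; the function name `words` and the 13-base example sequence are mine, standing in for a real chromosome.

```python
def words(sequence: str, k: int = 10):
    """Yield every overlapping k-base word, one starting at each position."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

seq = "ACGTACGTACGTA"          # 13-base toy stand-in for a chromosome
ws = list(words(seq))
print(len(ws), ws[0], ws[1])   # a 13-base sequence yields 4 overlapping 10-mers
```

A sequence of length n yields n - 9 overlapping 10-base words, which is how two 3.5-billion-base genomes produce a 3.5 billion by 3.5 billion score matrix.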
Before worrying about all the practical problems of constructing and storing such a large structure, let’s dream about what we could do if we had it. One easy thing we can do is display a dot plot — in fact, what we’ve computed is precisely the data that sits behind a dot plot. Discovery Biosciences has a product called jot that does something like this, though on a smaller scale. But we can do a lot more than just display dots.
It’s important to note that no single high score is interesting in and of itself, since lots of high scores will occur by chance. In fact, every 10-base word is expected to appear about 3,500 times in the genome, and is expected to be 50 percent or more identical to about 10 percent of the words in the genome. Rather than paying attention to individual scores, we need to look for regions of the matrix that contain multiple high scores.
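Those expectations are easy to sanity-check if we model genome bases as independent and uniform over A, C, G, and T — a simplification, but good enough for order-of-magnitude purposes. The identity calculation below counts exact per-position matches only, with no gaps:

```python
from math import comb

K = 10
# Expected occurrences of any fixed 10-base word in a 3.5-billion-base
# genome, assuming i.i.d. uniform bases.
expected_occurrences = 3.5e9 / 4 ** K

# P(a random 10-mer matches a fixed 10-mer in at least 5 of 10 positions),
# i.e. is 50 percent or more identical, ignoring gaps.
p_half_identical = sum(
    comb(K, k) * 0.25 ** k * 0.75 ** (K - k) for k in range(5, K + 1)
)
print(f"{expected_occurrences:.0f}")   # a bit over 3,300 occurrences
print(f"{p_half_identical:.3f}")       # about 0.08, roughly the 10 percent above
```

The exact figures, about 3,300 occurrences and an 8 percent chance, are consistent with the rough numbers quoted in the text.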
Diagonals are perhaps most interesting. A series of high scores along a diagonal indicates that the sequences corresponding to the rows and columns of the diagonal match. For example, a diagonal series of 91 high scores (one for each overlapping 10-base word in a 100-base stretch) indicates that a 100-base sequence from one genome matches 100 bases from the other. A series of good scores along a reverse diagonal suggests a strong inverted match. A series of good scores along a near diagonal is evidence for a strong match with some gaps. A series of good scores along a broken diagonal — meaning a stretch that’s diagonal for a while, then jumps, then becomes diagonal again — indicates a strong match broken by large gaps, which is exactly what a conserved gene should look like. Short matches that appear within gaps or just outside the boundaries of longer matches, though boring in isolation, might represent short exons or conserved regulatory sites.
A string of high scores along a row or column suggests a tandem repeat. An excess of high scores along a row or column indicates an interspersed repeat or low complexity region. In fact, in such cases, we’d expect several successive rows or columns to be high scorers.
Interspersed repeats — even short, rare ones that are hard to find with conventional methods — should stand out like sore thumbs. Consider a repeat that’s 100 bases long and appears 1,000 times in the genome. Think about the 91 columns that represent one occurrence of this repeat. The rows beneath those columns will contain 1,000 diagonal matches representing the 1,000 occurrences of the repeat. Hard to miss.
So, there are plenty of treasures to be discovered in this ome. The problem that remains is to figure out a way to build it.
The obvious way to compute the comparisome is to take each 10-base word from one genome and compare it to all the words in the other genome using Smith-Waterman. Simplistically, this requires 100 times more computation than was needed for Plan A, because we have to do a 10x10 Smith-Waterman for each pair of words in the two genomes, and would push the cost from around $50 million to $5 billion. Fortunately, many shortcuts are possible.
The big advance is to notice that there are only 4^10, or approximately 1 million, possible 10-base words, which is about 3,500 times smaller than the number of words in the genome. We can save bundles of work by first comparing all possible words, and then seeing where each word appears in the genome. The second part is computationally trivial and poses no difficulty.
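Counting and enumerating the possible words is a one-liner apiece. The sketch below uses Python's itertools.product purely as an illustration of how small this space is:

```python
from itertools import product

K = 10
n_possible = 4 ** K              # 1,048,576 distinct 10-base words
ratio = 3.5e9 / n_possible       # how much smaller than the genome's word list

# Enumerating every possible word, in lexicographic order:
all_words = ("".join(w) for w in product("ACGT", repeat=K))
first = next(all_words)

print(n_possible, f"{ratio:.0f}x smaller", first)
```

The exact ratio comes out near 3,300, in the same ballpark as the rough factor of 3,500 quoted above.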
Another big advantage of this strategy is that the hard part — comparing all possible 10-base words to each other — can be done once, and applied to all genomes. This starts to feel very omic: do the job once and use it forever.
We can gain further speed by noticing that there’s no point in allowing many gaps in the Smith-Waterman calculations. Limiting the number of gaps to two reduces the cost of each 10 x 10 comparison to 44 steps.
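To see where 44 comes from: allowing at most two gaps confines the dynamic-programming matrix to the band of cells within two positions of the main diagonal, and for a 10x10 matrix that band holds exactly 44 cells. A brute-force count (my own sketch, not any particular vendor's implementation):

```python
def banded_cells(n: int, g: int) -> int:
    """Cells of an n x n alignment matrix within g of the main diagonal,
    i.e. the work needed when at most g gaps are allowed."""
    return sum(1 for i in range(n) for j in range(n) if abs(i - j) <= g)

print(banded_cells(10, 2))   # 44, as claimed
print(banded_cells(15, 2))   # 69, the cost for 15-base words
```

In closed form, the count is (2g+1)n - g(g+1).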
With these improvements, the cost comes down to about 5 x 10^13 Smith-Waterman steps. This is a piddling amount of work. To get a ballpark estimate of how long this would take, assume that the Smith-Waterman implementations in Plan A can do 10x10 problems at full speed — this is unrealistic, but I suspect the algorithms can be modified to achieve this, or even better. Given this assumption, Paracel could do the job in about 16 minutes, a TimeLogic node would need about 36 minutes, and MPSRCH would take about 13 hours on one Alpha. More a shack than an ome.
Since the problem is so easy with 10-base words, let’s see what happens as we increase the word size. The advantage of using longer words is that individual high scores are more significant. The disadvantage, from a functional standpoint, is that we weaken our ability to detect short matches, such as short conserved exons or regulatory sites. We can circumvent this problem if we’re willing to store more data by modifying the algorithm to emit scores for “prefix” matches, meaning matches that start at position one in both words, but end in the middle. This doesn’t slow the algorithm very much.
The bigger problem is performance. For every unit increase in the word size, we increase the number of pairs by a factor of 16, and we increase the cost of comparing each pair by five Smith-Waterman steps.
If we increase the word size to 15, the number of pairs grows to about 10^18, and the cost of each comparison is 69 Smith-Waterman steps, for a total of about 10^20 steps. Under the same assumptions as before, this would take about 50 years on a Paracel, 112 years per TimeLogic node, and more than 2,500 years using MPSRCH on one Alpha workstation. Allowing three gaps pushes the time up to 68 years, 151 years, and 3,400 years respectively.
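The scaling is easy to reproduce. The sketch below assumes a banded comparison costing (2g+1)w - g(g+1) steps per pair of w-base words with at most g gaps, which reproduces the 44- and 69-step figures used above; the 5 x 10^10 steps-per-second rate is the upper Paracel figure.

```python
SECONDS_PER_YEAR = 3.15e7
PARACEL_STEPS_PER_SEC = 5e10   # upper figure quoted for a Paracel system

def total_steps(w: int, g: int) -> float:
    """Steps to compare all ordered pairs of possible w-base words,
    allowing at most g gaps per comparison (banded Smith-Waterman)."""
    pairs = (4.0 ** w) ** 2                        # 4^w possible words, squared
    cells_per_pair = (2 * g + 1) * w - g * (g + 1) # banded matrix size
    return pairs * cells_per_pair

for w, g in [(10, 2), (15, 2), (15, 3)]:
    steps = total_steps(w, g)
    years = steps / PARACEL_STEPS_PER_SEC / SECONDS_PER_YEAR
    print(f"{w}-base words, {g} gaps: {steps:.1e} steps, {years:,.1f} Paracel-years")
```

Each unit increase in word size multiplies the pair count by 16 and adds one band-width of cells per comparison, which is why 15-base words cross the line from shack to ome.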
Finally, a computation big enough to be called ome.
This ome will not be easy to build. There are some really hard problems, such as developing a storage system for the immense dataset that will be produced. We may even need to develop special hardware to do the Smith-Watermans. But it’s all doable.
A couple hundred million bucks and a couple of years is all it will take. Any volunteers?
Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory, led a bioinformatics marketing team for Compaq Computer, and has been consulting ever since. He is currently a free agent in Seattle. Send your comments to Nat at [email protected]