Martin Gollery says dynamic algorithms are a good alternative
Martin Gollery is director of bioinformatics research at TimeLogic in Incline Village, Nev. When he's not pounding the exhibit floor at bioinformatics meetings, he's pounding the dirt on his mountain bike around Lake Tahoe, or relaxing at the private TimeLogic beach.
Listening to a conference speaker a couple of months ago, I had one of those "Hey, wait a minute" experiences. The speaker claimed that he had to use crystallography to find the structure of a protein because there was no similarity between its sequence and any known protein. He had arrived at this conclusion because there were no significant BLAST hits between the sequence and the NR database. I waited for further proof that this was truly a novel protein, but none was forthcoming: no Smith-Waterman analysis, no InterPro search, no FrameSearch.
What is it that ties people to a certain analysis method, despite the continual progress in bioinformatics? It's amazing what some people will pay to acquire data, only to ignore some of the best ways to find all the diamonds in the mine. Analysis methods should be well thought out to start with, then constantly reviewed and reworked. The analytical pipeline should be reviewed from a big-picture standpoint, and then each step should be checked for weaknesses. Remember: what works with a one-gigabyte database may no longer work when the database hits 10 gigs.
Why are you using BLAST in the first place? Dynamic algorithms, such as Smith-Waterman and its derivatives, are more sensitive than BLAST. FrameSearch is an improvement on the Smith-Waterman algorithm that better handles the inevitable errors in the data. FASTA is a heuristic method, like BLAST, but more sensitive. Think of it as sitting partway between BLAST and Smith-Waterman in both sensitivity and speed.
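For readers who have not looked under the hood, here is a minimal sketch of the dynamic programming idea behind Smith-Waterman. It is only an illustration: the function name and the match, mismatch, and gap values are assumptions chosen for readability, it uses a simple linear gap penalty, and a production tool would use a substitution matrix and affine gaps.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Toy local alignment with a linear gap penalty (illustrative values)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill every cell: this exhaustive pass is why the method is sensitive
    # and also why it is slow (time grows with len(a) * len(b)).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Unlike a heuristic search, nothing here is skipped: every pair of positions is scored, so a weak but real local match cannot fall through a word-seeding step.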
Perhaps a profile method might be best for your needs. These methods, implemented in packages such as SAM and HMMER, represent data from an entire family of sequences, rather than just one. The Pfam database contains over 2700 HMMs representing many protein domains. Why not check to see if your sequence matches one of these? TIGRFAMs, Pfam-frag, and Pfam-Pro are other HMM databases that can provide a lot of information about your freshly assembled genome.
One of the reasons that BLAST is so popular is inertia. People have used BLAST for years, and they are resistant to change. But there is more to it than that.
See, dynamic algorithms are slow. If you have a large amount of analysis to do by a certain deadline, the extra hits you get from a dynamic algorithm will not offset the fact that you didn't get through all of the data in time. Combining several of these methods in a pipeline can really bog down your discovery if you do it wrong. Speeding it up, usually with a supercomputer, server farm, or accelerator, requires a system engineering analysis. This is another step that gets shorted all too often as people go with whatever is trendy, but right now let's stay focused on the process analysis.
Too many people are glossing over the details, even when those details are important. The choice of scoring matrix, for example, affects the scores and therefore the E-values. Some people seem to think that BLOSUM62 should be used for everything! If you are looking for distant homologies, give OPTIMA a try. Searching transmembrane proteins? Use PHAT. Besides being useful and effective, it has a cool name.
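To see how much the matrix matters, consider a toy sketch. The two matrices below are invented for illustration (they are not BLOSUM62, OPTIMA, or PHAT values), as are the function names and the five-letter alphabet; the only point is that the same pair of sequences can score very differently depending on how substitutions are weighted.

```python
HYDROPHOBIC = set("ILV")
ALPHABET = "ILVKD"  # tiny illustrative alphabet, not the full 20 residues


def make_matrix(similar_score):
    """Build a toy substitution matrix (hypothetical values).

    Identical residues score +2, swaps among I/L/V score `similar_score`,
    and every other substitution scores -2.
    """
    matrix = {}
    for x in ALPHABET:
        for y in ALPHABET:
            if x == y:
                matrix[x, y] = 2
            elif x in HYDROPHOBIC and y in HYDROPHOBIC:
                matrix[x, y] = similar_score
            else:
                matrix[x, y] = -2
    return matrix


def ungapped_score(a, b, matrix):
    """Sum the substitution scores of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[x, y] for x, y in zip(a, b))


strict = make_matrix(-2)   # treats I/L/V swaps like any other mismatch
lenient = make_matrix(1)   # rewards conservative I/L/V swaps

query, hit = "LIVKL", "ILLKV"
print("strict :", ungapped_score(query, hit, strict))
print("lenient:", ungapped_score(query, hit, lenient))
```

Under the strict matrix this pair looks like noise; under the lenient one it looks like a hit. Real matrices differ in subtler ways, but the effect on which hits clear your significance cutoff is the same in kind.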
There is nothing wrong with crystallography. We need more structures to be solved. But don't assume that just because there are no BLAST hits, there is no similarity. The time you save may be your own.
Opposite Strand is a forum for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]