Stuck in a BLAST Rut

Martin Gollery says dynamic algorithms are a good alternative

Martin Gollery is director of bioinformatics research at TimeLogic in Incline Village, Nev. When he’s not pounding the exhibit floor at bioinformatics meetings, he’s pounding the dirt on his mountain bike around Lake Tahoe, or relaxing at the private TimeLogic beach.

Listening to a conference speaker a couple of months ago, I had one of those “Hey, wait a minute” experiences. The speaker was claiming that he had to use crystallography to find the structure of a protein because there was no similarity between the sequence and any known proteins. He had arrived at this conclusion because there were no significant BLAST hits between the sequence and the NR database. I waited for further proof that this was truly a novel protein, but none was forthcoming: no Smith-Waterman analysis, no InterPro search, no FrameSearch.

What is it that ties people to a certain analysis method, despite the continual progress in bioinformatics? It’s amazing what some people will pay to acquire data, only to ignore some of the best ways to find all the diamonds in the mine. Analysis methods should be well thought out to start with, then constantly reviewed and reworked. The analytical pipeline should be reviewed from a big-picture standpoint, and then each step should be checked for weaknesses. Remember: what works with a one-gigabyte database may no longer work when the database hits 10 gigabytes.

Why are you using BLAST in the first place? Dynamic programming algorithms, such as Smith-Waterman and its derivatives, are more sensitive than BLAST. FrameSearch is an improvement to the Smith-Waterman algorithm that better handles the inevitable errors in the data. FASTA is a heuristic method, like BLAST, but more sensitive. Think of it as sitting partway between BLAST and Smith-Waterman in both sensitivity and speed.
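To make the contrast concrete, here is a minimal sketch of the Smith-Waterman recurrence. It is an illustration only: the function name, the simple match/mismatch scores, and the linear gap penalty are assumptions chosen for brevity, not the parameters of any particular tool. Real protein searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not defaults of any tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor at zero is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the table is filled, so the work grows with the product of the two sequence lengths. A heuristic skips most of that table, which is where both its speed and its lost sensitivity come from.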

Perhaps a profile method might be best for your needs. These methods, implemented in tools such as SAM and HMMER, model an entire family of sequences rather than just one. The PFAM database contains over 2,700 HMMs representing many protein domains. Why not check to see if your sequence matches one of these? TIGR-fams, PFAM-frag, and PFAM-Pro are other HMM databases that can provide a lot of information about your freshly assembled genome.
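The idea of scoring against a whole family can be sketched with a position-specific scoring matrix. To be clear, this is a simpler stand-in for a profile HMM, not what SAM or HMMER actually compute: it has no insert or delete states, it assumes a gap-free alignment of equal-length sequences, and it assumes a uniform background distribution. The function names and the toy family are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def profile_score(profile, query):
    """Sum the column scores for a query of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, query))


if __name__ == "__main__":
    family = ["ACDE", "ACDF", "ACEE", "GCDE"]  # hypothetical toy family
    profile = build_profile(family)
    print(profile_score(profile, "ACDE"), profile_score(profile, "WWWW"))
```

A query that matches the conserved columns scores well even if it is not especially close to any single member, which is the advantage a profile has over a one-against-one comparison.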

One of the reasons that BLAST is so popular is inertia. People have used BLAST for years, and they are resistant to change. But there is more to it than that.

See, dynamic algorithms are slow. If you have a large amount of analysis to do by a certain deadline, the extra hits you get from a dynamic algorithm will not offset the fact that you didn’t get through all of the data in time. Combining several of these in a pipeline can really bog down your discovery if you do it wrong. Speeding things up, usually with a supercomputer, server farm, or accelerator, requires a systems engineering analysis. This is another step that gets shortchanged all too often as people go with whatever is trendy, but right now let’s stay focused on the process analysis.

Too many people are glossing over the details, even when those details are important. The scoring matrix that is chosen, for example, affects the scores and therefore the E-values. Some people seem to think that BLOSUM62 should be used for everything! If you are looking for distant homologies, give OPTIMA a try. Searching transmembrane proteins? Use PHAT. Besides being useful and effective, it has a cool name.
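A small sketch shows how much the matrix matters. The two scoring functions below are made-up toys, not BLOSUM62, OPTIMA, or PHAT, and the `lam` and `k` arguments in the E-value formula are placeholders: in practice those statistical parameters are estimated separately for each matrix and gap-penalty combination. The E-value expression itself is the standard form E = K·m·n·e^(−λS).

```python
import math

HYDROPHOBIC = set("AILMFVW")  # rough grouping, assumed for illustration


def strict_matrix(x, y):
    """Toy matrix: reward identities only."""
    return 4 if x == y else -2


def permissive_matrix(x, y):
    """Toy matrix: also give partial credit for hydrophobic swaps."""
    if x == y:
        return 4
    if x in HYDROPHOBIC and y in HYDROPHOBIC:
        return 1
    return -2


def ungapped_score(a, b, matrix):
    """Score two equal-length sequences position by position."""
    return sum(matrix(x, y) for x, y in zip(a, b))


def e_value(score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    a, b = "LIVMA", "IVLMA"
    for name, matrix in (("strict", strict_matrix), ("permissive", permissive_matrix)):
        s = ungapped_score(a, b, matrix)
        # lam and k are placeholder values, not fitted parameters.
        print(name, s, e_value(s, len(a), 1_000_000, lam=0.3, k=0.1))
```

The same pair of peptides goes from a marginal score to a respectable one simply because the second matrix recognizes conservative substitutions, and the E-value moves with it.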

There is nothing wrong with crystallography. We need more structures to be solved. But don’t assume that just because there are no BLAST hits, there is no similarity. The time you save may be your own.

Opposite Strand is a forum for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]
