Stuck in a BLAST Rut

Martin Gollery says dynamic algorithms are a good alternative

Martin Gollery is director of bioinformatics research at TimeLogic in Incline Village, Nev. When he’s not pounding the exhibit floor at bioinformatics meetings, he’s pounding the dirt on his mountain bike around Lake Tahoe or relaxing at the private TimeLogic beach.

Listening to a conference speaker a couple of months ago, I had one of those “Hey, wait a minute” experiences. The speaker claimed that he had to use crystallography to find the structure of a protein because its sequence showed no similarity to any known protein. He had arrived at this conclusion because there were no significant BLAST hits between the sequence and the NR database. I waited for further proof that this was truly a novel protein, but none was forthcoming: no Smith-Waterman analysis, no InterPro search, no FrameSearch.

What is it that ties people to a certain analysis method, despite the continual progress in bioinformatics? It’s amazing what some people will pay to acquire data, only to ignore some of the best ways to find all the diamonds in the mine. Analysis methods should be well thought out to start with, then constantly reviewed and reworked. The analytical pipeline should be reviewed from a big-picture standpoint, and then each step should be checked for weaknesses. Remember: what works with a one-gigabyte database may no longer work when the database hits 10 gigabytes.

Why are you using BLAST in the first place? Dynamic programming algorithms, such as Smith-Waterman and its derivatives, are more sensitive than BLAST. FrameSearch is an improvement to the Smith-Waterman algorithm that better handles the inevitable errors in the data. FASTA is a heuristic method, like BLAST, but more sensitive. Think of it as sitting partway between BLAST and Smith-Waterman in both sensitivity and speed.
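To make the comparison concrete, here is a minimal sketch of the Smith-Waterman recurrence. It is an illustration only, not the code of any production tool: the function name and the match, mismatch, and gap values are assumptions chosen for readability, and a real protein search would use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (simple match/mismatch
    scores and a linear gap penalty), not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # Flooring at 0 is what makes the alignment local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Every cell of the matrix is filled in, so no candidate alignment is skipped. That exhaustiveness is where the sensitivity comes from, and it is also why the run time grows with the product of the two sequence lengths.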

Perhaps a profile method would be best for your needs. These methods, implemented in tools such as SAM and HMMer, represent data from an entire family of sequences, rather than just one. The PFAM database contains more than 2,700 HMMs representing many protein domains. Why not check to see if your sequence matches one of these? TIGR-fams, PFAM-frag, and PFAM-Pro are other HMM databases that can provide a lot of information about your freshly assembled genome.

One of the reasons that BLAST is so popular is inertia. People have used BLAST for years, and they are resistant to change. But there is more to it than that.

See, dynamic algorithms are slow. If you have a large amount of analysis to do by a certain deadline, the extra hits you get from a dynamic algorithm will not offset the fact that you didn’t get through all of the data in time. Combining several of these in a pipeline can really bog down your discovery if you do it wrong. Speeding it up, usually with a supercomputer, server farm, or accelerator, requires a systems engineering analysis. This is another step that gets shortchanged all too often as people go with whatever is trendy, but right now let’s stay focused on the process analysis.

Too many people are glossing over the details, even when those details are important. The scoring matrix that is chosen, for example, affects the scores and therefore the E-values. Some people seem to think that BLOSUM62 should be used for everything! If you are looking for distant homologies, give OPTIMA a try. Searching transmembrane proteins? Use PHAT. Besides being useful and effective, it has a cool name.
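The link between score and E-value can be sketched with the Karlin-Altschul relation that BLAST statistics are built on, E = K·m·n·e^(−λS). In the sketch below the function name is hypothetical, and the caller has to supply λ and K, which depend on the scoring matrix and gap penalties in use; no real parameter values are given here.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score.

    Implements E = K * m * n * exp(-lambda * S). The values of lam and k
    are tied to the chosen scoring system and must be supplied by the
    caller; this function is an illustrative sketch, not BLAST's own code.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

Two things follow from the formula. Switching matrices changes both the raw score and the λ and K used to interpret it, so E-values from different matrices are not interchangeable. And for a fixed score the E-value grows in direct proportion to database size, so a hit that looks significant against a small database can slip below the threshold as the database grows.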

There is nothing wrong with crystallography. We need more structures to be solved. But don’t assume that just because there are no BLAST hits, there is no similarity. The time you save may be your own.

Opposite Strand is a forum for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]
