Catch a Rising Star


IT (Informatics Talent) Guy scouts out the next bioinformatics leaders


Inspired by this month’s All-Stars awards, I decided to get a jump-start on next year’s contest by seeking out the hottest bioinformatics software talent now. I put up a shingle and opened Nat’s Talent Agency. There are loads of great ideas out there, but I winnowed the list down to four acts that I think might explode in the next 12 months.

None is really new. They’ve all been playing off-off-Broadway and the back streets of Vegas since before Ed Sullivan. But this might just be the year they hit the big time. You know how it is — overnight success after years of hard work.

Knowledge Mining

A promising emerging act is knowledge mining, a song and dance troupe that’s been around so long and changed its name so many times that it’s hard to keep track of all the rooms it’s played. It started as information retrieval (boring), then text mining, then literature mining. The new name has the most zing.

The goal is to search the literature better through software that “understands” more of the content. Things like gene and protein names, biological functions and processes, diseases and physiology, anatomy, drugs and compounds, assays, and more. It’s like PubMed on steroids.

This scene is getting so hot that this year’s big data-mining competition, the KDD Cup 2002, which covers all industries, focused on biomedicine. And even more to the point, the winner (in one category) was da-da … Celera in partnership with a commercial knowledge mining company, ClearForest.

The idea is to be able to answer queries like, “Find all references that discuss compounds that affect acetylation for treatment of neurodegenerative disorders.” Or for the more molecular folks, “Find all references that discuss molecules that affect the acetylation of transcription factors.”

I haven’t been able to see any of these products live and can’t judge how well they work on real problems. The nice people at BioWisdom did a Web-based demo for me, but I could only poke around in areas they had scripted. The Cellomics website offers a free two-week trial, but they came through with the required password too late.

If it works, knowledge mining will be a mega star — the Tom Cruise of bioinformatics.

Pathway Modeling

Pathway modeling is my second pick for stardom. It’s a technology with obvious appeal. Biologists invariably draw pathway diagrams to illustrate any biological process with more than one step. The challenge is to pirouette beyond informal pathway diagrams to formal models that represent biological processes in a precise mathematical or computational form.

The most obvious reason to do formal pathway modeling is to simulate the dynamic behavior of a biological process of interest. This storyline has been kicking around for eons, but it’s never been very practical because it requires detailed information on reaction rates and such, which is simply not available for most processes of interest. A newer twist on the idea is network inference, in which you start with a partial model and try to figure out the missing bits by comparing simulated results to experimental data. There is some hope that this will reduce the need for detailed information about each reaction.
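To make the reaction-rate problem concrete, here is a minimal sketch of what simulating a pathway's dynamic behavior involves. It assumes a hypothetical linear pathway (a fixed source feeding a chain of intermediates) with first-order kinetics and made-up rate constants, and it uses plain forward Euler integration. It is not the method of any package mentioned in this article.

```python
def simulate_linear_pathway(k, source=1.0, dt=0.001, t_end=50.0):
    """Integrate source -> X1 -> X2 -> ... -> sink with forward Euler.

    k[0] is the (hypothetical) rate constant for source -> X1, and
    k[i] for i >= 1 is the rate constant of the step that drains X_i.
    Returns the concentrations of the intermediates at time t_end.
    """
    x = [0.0] * (len(k) - 1)  # intermediates start empty
    for _ in range(int(t_end / dt)):
        # Flow into each intermediate: from the source, or from its predecessor.
        inflow = [k[0] * source] + [k[i] * x[i - 1] for i in range(1, len(k) - 1)]
        # Flow out of each intermediate through the next step.
        outflow = [k[i + 1] * x[i] for i in range(len(x))]
        x = [xi + dt * (fin - fout) for xi, fin, fout in zip(x, inflow, outflow)]
    return x


# Illustrative rate constants only; real values are exactly what is usually missing.
print(simulate_linear_pathway([1.0, 2.0, 0.5]))
```

With these numbers the two intermediates settle at their steady-state values, k[0]*source/k[1] and k[0]*source/k[2]. Every one of those constants had to be supplied by hand, which is the practical obstacle described above.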

Pathway models can also play a knowledge management role by organizing information about biological processes in a form that is accessible and intuitive to researchers. This theme was expressed years ago by Kurt Kohn, one of the patriarchs of the field, but seems to have been dropped. I’d like to see this genre revived: it could be the first starring role for the technology.

There are a lot of academic codes available and a few commercial ones. I tried one of each: the Jarnac/JDesigner suite from Herbert Sauro at Caltech, and VisualCell from Gene Network Sciences (with which my institution has a partnership). Both worked well, but are intended for different kinds of pathways. Jarnac/JDesigner is aimed at metabolic pathways, while VisualCell’s strength is in regulatory pathways.

Jarnac/JDesigner offers both a textual and a graphical language, which I find a plus (see sidebar). I mainly used the text language. It comes with a built-in simulator and simple graphics for plotting the simulation results. It was a lot of fun and very easy to vary the efficiency of steps in a metabolic pathway and see the effects (which were generally small, as predicted by theory). Jarnac/JDesigner is a great teaching tool even if it has no place in your research.

VisualCell is purely graphical. Not surprisingly, the language is complicated — it has to be in order to precisely model real pathways. But with the help of the experts at the company, I was able to learn the language in a day, and use it the next day to create a small but realistic model of a disease process. My learning curve was shortened by the abstract modeling language.

High-performance Sequence Analysis

A new generation of sequence rockers is hitting the charts, hoping to unseat the reigning platinum record holder, BLAST. Most promise blazing speed with no loss of sensitivity, although one act (MPSRCH) is going for the sensitivity market at somewhat lower speed.

In this context, sensitivity refers to the ability of an algorithm to find distant matches, i.e., sequences in the database that are only vaguely similar to the query sequence. The flip side — specificity — refers to the number of false matches an algorithm reports. This is not usually a concern, since the algorithms assign a score to each reported match, and the user is free to ignore matches with scores that are too low. As the algorithms get more sophisticated, specificity will probably become more of an issue.

Community Software

Our final contestant — community software — is the sentimental, feel-good favorite. Community software is the step beyond open source in which programmers from many places join together to create software of value to all. It’s the global village in action.

Two big community efforts are underway: BioPerl and MGED (Microarray Gene Expression Data). BioPerl is further along and has already debuted its software. The MGED people are working furiously and hopefully will raise the curtain soon.

BioPerl is a colossal production with 450 Perl modules focused on sequence-related issues. There is code to read and write the major sequence formats, create indexed sequence files, and work with pairwise and multiple sequence alignments. It provides wrappers for many popular programs including BLAST, HMMER, Sim4, and others, though some important tools (e.g., FASTA) are oddly absent. There is also software to create graphical displays of annotated sequences.

The internals are a tour de force of Perl programming. It’s a veritable how-to guide for advanced Perl programmers, reflecting the extraordinary software skills of the developers.

Time will tell whether the BioPerl troupe can hang together and even expand. I predict a deluge of new programmers wanting to add their favorite software to the show. The BioPerl stage managers will have to decide whether to let the newcomers on stage, and if so, how to maintain quality, or to shoo them away and become a closed shop.

Having seen all the contestants, here are my final predictions for the technologies that will be shining brightly in the not too distant future: My heart says community software. My wistful eye says knowledge mining. My scientific hopes say pathway modeling. And my pragmatic side says high-performance sequence analysis.


Knowledge Mining: Can it work?

The hard part of doing literature searches is going back in time and reinterpreting old results in light of new data or theories.

For example, in answering the first question in the main text, the system should report that valproate — a histone deacetylase inhibitor — was tried on Huntington’s disease patients in a case report published in 2000. What makes this tricky is that valproate wasn’t known to be involved with acetylation when the article was published, and no terms related to acetylation appear in the paper. Moreover, the subsequent paper that connects valproate to acetylation doesn’t actually mention the drug by this name, but rather talks about valproic acid, which is the active form.

The case report describes profound improvements in HD symptoms. An exciting connection. But don’t get too excited.

If the system were really smart, it would temper its enthusiasm by noting that valproate was given in combination with another drug: the authors were mainly interested in the other drug, and they never followed up the valproate angle.
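The valproate example boils down to two steps a plain keyword search cannot take: mapping drug synonyms to one concept, and linking a compound to a process through a different paper than the one that mentions the disease. The following is a toy sketch under loud assumptions: the two "papers" are hand-made stand-ins for the case report and the later mechanism paper, the term sets are invented for illustration, and none of this reflects how any commercial product works.

```python
# Hypothetical synonym table: every name maps to one canonical concept.
SYNONYMS = {"valproate": "valproic acid"}

# Hypothetical list of terms the system recognizes as compounds.
COMPOUNDS = {"valproic acid"}

# Toy stand-ins for indexed papers; the term sets are invented for illustration.
PAPERS = {
    "case report (2000)": {"valproate", "huntington's disease"},
    "later mechanism paper": {"valproic acid", "acetylation"},
}


def normalize(term):
    """Map a term to its canonical concept name."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def keyword_search(*terms):
    """Naive search: a paper matches only if it literally contains every term."""
    return sorted(title for title, found in PAPERS.items()
                  if all(t in found for t in terms))


def compounds_linked_to(process):
    """Compounds that share a paper with the given process."""
    linked = set()
    for found in PAPERS.values():
        concepts = {normalize(t) for t in found}
        if process in concepts:
            linked |= concepts & COMPOUNDS
    return linked


def find_references(process, disease):
    """Papers about the disease that mention any compound linked to the process."""
    compounds = compounds_linked_to(process)
    hits = []
    for title, found in PAPERS.items():
        concepts = {normalize(t) for t in found}
        if disease in concepts and compounds & concepts:
            hits.append(title)
    return sorted(hits)


print(keyword_search("acetylation", "huntington's disease"))
print(find_references("acetylation", "huntington's disease"))
```

The keyword search comes back empty, while the two-step lookup surfaces the case report. Tempering that hit with caveats, such as the combination therapy, is well beyond a sketch like this.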

In answering the second question, the system would have to tap dance around a rapidly changing area of science. Until recently, people thought that the main way acetylation affected transcription was by changing the acetylation level of histones. Histones are the proteins around which DNA is wrapped to form a compact, three-dimensional structure. The old theory was that acetylation caused histones to loosen their grip on the DNA and allow transcription factors to sneak in and do their job. The cognoscenti now believe that this is only one effect, and that the acetylation status of transcription factors is important, too. The net effect is that many papers that talk about histone acetylation have to be re-interpreted in this new light.

— NG

Pathway Modeling: Graphics vs. Text

In the pathway modeling field there’s a big emphasis on graphical representation of models. This emphasis is understandable given that biologists are alleged to think in pictures.

I find this exasperating, because it can be fiendishly hard to describe a complex process in pictures. There are many things that are just plain easier to say in text.

Here’s an example.

1) Protein kinase A (PKA) activates the transcription factor CREB by phosphorylating the serine at position 133.

2) When activated, CREB can join a three-molecule complex, consisting of itself, either CREB binding protein (CBP) or a related protein p300, and TAFII130.

3) TAFII130 in turn can bind the TFIID subunit of the basal transcriptional complex, which includes the TATA binding protein (TBP). The details of this interaction are not known.

4) When the transcriptional complex is fully assembled, TBP can bind the TATA box upstream of the transcription initiation site.

5) This brings RNA polymerase II into contact with the DNA to be transcribed, and transcription can proceed.

The words describe the process clearly. To turn this into a model, one would have to recraft it using a precise computer language, which wouldn’t be too hard. I don’t see what a picture would add.
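As one illustration of what "recrafting" the five steps in a precise computer language might look like, here is a small sketch that encodes them as if-then rules and applies the rules until nothing new can be derived. The fact names and the rule format are hypothetical choices made for this example, and step 3, whose details are not known, is collapsed into a single rule. It is not the language of any package discussed here.

```python
# Each rule is (facts required, fact produced). All labels are illustrative.
RULES = [
    # 1) PKA phosphorylates CREB on serine 133.
    ({"PKA", "CREB"}, "CREB-pS133"),
    # 2) Activated CREB joins a complex with CBP or p300, plus TAFII130.
    ({"CREB-pS133", "CBP", "TAFII130"}, "CREB complex"),
    ({"CREB-pS133", "p300", "TAFII130"}, "CREB complex"),
    # 3) TAFII130 binds TFIID (details unknown, so modeled as one step).
    ({"CREB complex", "TFIID"}, "assembled transcriptional complex"),
    # 4) TBP binds the TATA box once the complex is assembled.
    ({"assembled transcriptional complex", "TATA box"}, "TBP bound to TATA box"),
    # 5) RNA polymerase II reaches the DNA and transcription proceeds.
    ({"TBP bound to TATA box", "RNA polymerase II"}, "transcription"),
]


def derive(initial_facts):
    """Apply the rules repeatedly until no new fact can be added."""
    facts = set(initial_facts)
    changed = True
    while changed:
        changed = False
        for required, produced in RULES:
            if required <= facts and produced not in facts:
                facts.add(produced)
                changed = True
    return facts


cell = {"PKA", "CREB", "p300", "TAFII130", "TFIID", "TATA box", "RNA polymerase II"}
print("transcription" in derive(cell))            # with PKA present
print("transcription" in derive(cell - {"PKA"}))  # with PKA removed
```

The "either CBP or p300" clause becomes two rules with the same product, which is the kind of thing that is easy to say in text and awkward to draw.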

There’s an apropos lesson from software engineering: diagrams are a great way to document programs, but text is the best way to write them.

— NG


High Performance Sequence Analysis

Many fast sequence search methods gain their speed from a simple trick. They start by finding short, exact matches called seeds and then extend the seeds into longer, inexact matches. The trick works because short exact matches can be found very fast.

A key parameter of such methods is the size of the initial seeds. This is the “word length” parameter you may have seen in BLAST.

One simple way to gain speed is to increase the seed length, but this reduces sensitivity. Another approach is to pre-process the database and create an index telling where any given seed exists.
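The seed-and-extend idea, the pre-built index, and the effect of seed length can all be seen in a few lines. This is a simplified sketch, not BLAST's actual algorithm: the extension here is ungapped and simply stops when a hypothetical mismatch budget runs out, whereas real tools use scoring schemes. The sequences and parameter names are made up for the example.

```python
from collections import defaultdict


def build_seed_index(database, word_length):
    """Pre-process the database: map every word to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(database) - word_length + 1):
        index[database[i:i + word_length]].append(i)
    return index


def extend(query, database, q, d, word_length, max_mismatches):
    """Grow an exact seed in both directions, without gaps, until mismatches run out."""
    budget = max_mismatches
    left_q, left_d = q, d
    while left_q > 0 and left_d > 0:
        if query[left_q - 1] != database[left_d - 1]:
            if budget == 0:
                break
            budget -= 1
        left_q -= 1
        left_d -= 1
    right_q, right_d = q + word_length, d + word_length
    while right_q < len(query) and right_d < len(database):
        if query[right_q] != database[right_d]:
            if budget == 0:
                break
            budget -= 1
        right_q += 1
        right_d += 1
    return left_q, left_d, right_q - left_q  # (query start, database start, length)


def search(query, database, word_length=4, max_mismatches=1):
    """Find exact seeds through the index, then extend each one."""
    index = build_seed_index(database, word_length)
    hits = set()
    for q in range(len(query) - word_length + 1):
        for d in index.get(query[q:q + word_length], []):
            hits.add(extend(query, database, q, d, word_length, max_mismatches))
    return hits


query, database = "ACGTTCGGA", "TTTACGTACGGATTT"
print(search(query, database, word_length=4))
print(search(query, database, word_length=6))
```

The query differs from the database at one position in the middle. With a word length of 4, seeds on either side of the mismatch are found and extended into one full-length match. With a word length of 6, every word in the query spans the mismatch, so no seed is found and the match is missed.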

A more sophisticated approach is to build a fancy data structure called a suffix tree that effectively tells where all sequences of any length exist in the database. Suffix trees are widely used in the computer field, but have seen limited use in bioinformatics because the traditional implementations consume a lot of memory: about 40 bytes for each letter in the database, which comes to 120 GB for the entire human genome. Too big to be practical. Recent improvements in the method have cut the memory requirements to 17 bytes per letter (50 GB for the human genome), and have reduced the penalty for storing the data structure on disk, which brings the method to the verge of practicality.
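A real suffix tree is too long to show here, so the sketch below swaps in a suffix array, a simpler relative that answers the same question (where does a sequence of any length occur?) by binary search over sorted suffixes. The construction is deliberately naive and only suitable for toy inputs. The memory estimate assumes a genome of about 3 billion letters, which is what the figures quoted above imply.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return every position where pattern occurs, for a pattern of any length."""
    m = len(pattern)

    def first_index(strictly_greater):
        # Binary search for the first suffix whose m-letter prefix is
        # >= pattern (or > pattern when strictly_greater is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = first_index(False), first_index(True)
    return sorted(suffix_array[start:end])


def index_size_gb(letters, bytes_per_letter):
    """Memory estimate in gigabytes (1 GB taken as 10**9 bytes)."""
    return letters * bytes_per_letter / 1e9


text = "ACGTACGGACGT"
sa = build_suffix_array(text)
print(find_all(text, sa, "ACG"), find_all(text, sa, "ACGT"))

GENOME_LETTERS = 3_000_000_000  # assumed size, consistent with the figures above
print(index_size_gb(GENOME_LETTERS, 40), index_size_gb(GENOME_LETTERS, 17))
```

At 17 bytes per letter the estimate comes to 51 GB, which the text rounds to 50.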

A different approach is to switch from exact match seeds to ones with a limited number of mismatches. This makes it possible to improve sensitivity for a given seed length at the cost of slowing down the search for initial seeds. On certain computers, notably the Cray SV vector machines, inexact matches can be found almost as fast as exact ones, making this approach very attractive. (Note that my institution has a partnership with Cray.)
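The trade-off can be sketched with a brute-force search for seeds that tolerate a limited number of mismatches. This is an illustration only: it compares every query word against every database word, which is exactly the slow part, and it says nothing about how vector hardware speeds this up. The sequences and function names are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def inexact_seeds(query, database, word_length, max_mismatches):
    """Return (query position, database position) pairs for words that
    match with at most max_mismatches differences. Brute force."""
    seeds = []
    for q in range(len(query) - word_length + 1):
        word = query[q:q + word_length]
        for d in range(len(database) - word_length + 1):
            if hamming(word, database[d:d + word_length]) <= max_mismatches:
                seeds.append((q, d))
    return seeds


query, database = "ACGTTCGGA", "TTTACGTACGGATTT"
print(inexact_seeds(query, database, word_length=6, max_mismatches=0))
print(inexact_seeds(query, database, word_length=6, max_mismatches=1))
```

With exact matching, a seed length of 6 finds nothing in this pair of sequences. Allowing one mismatch recovers seeds at the same seed length, at the price of comparing words letter by letter instead of looking them up.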

One algorithm that marches to a different drummer is MPSRCH, which has opted for sensitivity over speed. MPSRCH claims to implement the gold standard, most sensitive algorithm known, namely full Smith-Waterman dynamic programming. I tried their algorithm on their website and it’s incredibly fast — I wonder how they do it!
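For reference, the core of Smith-Waterman dynamic programming fits in a dozen lines. The sketch below computes only the best local alignment score, uses a simple linear gap penalty, and picks arbitrary scoring values for illustration. It is the textbook recurrence, not MPSRCH's implementation, whose internals are not described here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative defaults, not those of any real tool.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTTCGGA", "TTTACGTACGGATTT"))
```

Every cell of a matrix as large as the query times the database has to be filled in, which is why full Smith-Waterman is normally the slow, sensitive option.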

— NG


Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory and led a bioinformatics marketing team for Compaq Computer. He is currently a senior research scientist at the Institute for Systems Biology and an affiliate professor of bioinformatics at University of Alaska-Fairbanks. Send your comments to Nat at [email protected]



More on Knowledge Mining

Product Company Notes




CellSpace Knowledge Miner




Gene Ontology Knowledge Discovery System (GO KDS)

GeneEd & Reel Two



Celera & ClearForest

Press release

Community Software




Microarray Gene Expression Data (MGED) Society


Help on High-Performance Sequence Analysis



Web Server

Software Availability




Jim Kent, University of California at Santa Cruz


Free for academics

Used in Santa Cruz genome browser


Biomedical Engineering Center, Industrial Technology Research Institute of Taiwan






Complete Smith-Waterman

MUMmer 2



Free for academics

Based on suffix trees


Bioinformatics Solutions


Free for academics


Jim Kent, University of California at Santa Cruz


Free for academics

Pathways Packages: ACADEMIC






Brian White

Can be ordered through the ePress Project at the University of Maryland


Adam Arkin

Some software can be downloaded, but the major work, BioSpice, seems to be hidden in a private area


Igor Goryanin



BioKin Ltd.

Free for academics; free trial version for all


Masaru Tomita

Open source

Electronic Arc

Gene Selkov

Diagramming tool; apparently open source


Pedro Mendes



Herbert Sauro

Open source


Hamid Bolouri



Dennis Bray

Open source


Jim Schaff

Web server; software not available


Herbert Sauro

Open source

Pathways Packages: COMMERCIAL





Gene Network Sciences










Co-development of Physiome and the Bioengineering Institute, University of Auckland

Systems Biology Markup Language (SBML)

Part of Systems Biology Workbench (SBW), ERATO Kitano

BioPathways Consortium

Systems Biology Project

