Catch a Rising Star

IT (Informatics Talent) Guy scouts out the next bioinformatics leaders

 

Inspired by this month’s All-Stars awards, I decided to get a jump-start on next year’s contest by seeking out the hottest bioinformatics software talent now. I put up a shingle and opened Nat’s Talent Agency. There are loads of great ideas out there, but I winnowed the list down to four acts that I think might explode in the next 12 months.

None is really new. They’ve all been playing off-off-Broadway and the back streets of Vegas since before Ed Sullivan. But this might just be the year they hit the big time. You know how it is — overnight success after years of hard work.

Knowledge Mining

A promising emerging act is knowledge mining, a song and dance troupe that’s been around so long and changed its name so many times that it’s hard to keep track of all the rooms it’s played. It started as information retrieval (boring), then text mining, then literature mining. The new name has the most zing.

The goal is to search the literature better through software that “understands” more of the content. Things like gene and protein names, biological functions and processes, diseases and physiology, anatomy, drugs and compounds, assays, and more. It’s like PubMed on steroids.

This scene is getting so hot that this year’s big data-mining competition, the KDD Cup 2002, which covers all industries, focused on biomedicine. And even more to the point, the winner (in one category) was da-da … Celera in partnership with a commercial knowledge mining company, ClearForest.

The idea is to be able to answer queries like, “Find all references that discuss compounds that affect acetylation for treatment of neurodegenerative disorders.” Or for the more molecular folks, “Find all references that discuss molecules that affect the acetylation of transcription factors.”

I haven’t been able to see any of these products live and can’t judge how well they work on real problems. The nice people at BioWisdom did a Web-based demo for me, but I could only poke around in areas they had scripted. The Cellomics website offers a free two-week trial, but they came through with the required password too late.

If it works, knowledge mining will be a mega star — the Tom Cruise of bioinformatics.

Pathway Modeling

Pathway modeling is my second pick for stardom. It’s a technology with obvious appeal. Biologists invariably draw pathway diagrams to illustrate any biological process with more than one step. The challenge is to pirouette beyond informal pathway diagrams to formal models that represent biological processes in a precise mathematical or computational form.

The most obvious reason to do formal pathway modeling is to simulate the dynamic behavior of a biological process of interest. This storyline has been kicking around for eons, but it’s never been very practical because it requires detailed information on reaction rates and such, which is simply not available for most processes of interest. A newer twist on the idea is network inference, in which you start with a partial model and try to figure out the missing bits by comparing simulated results to experimental data. There is some hope that this will reduce the need for detailed information about each reaction.
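To make simulation and network inference a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: a hypothetical two-step pathway S -> I -> P, mass-action kinetics, made-up rate constants k1 and k2, and a crude forward-Euler integrator. The "experimental" data is itself simulated, and the inference step is a plain grid search rather than any method used by the packages discussed here.

```python
def simulate(k1, k2, s0=1.0, dt=0.01, steps=500):
    """Forward-Euler simulation of a hypothetical pathway S -> I -> P.

    Returns the amount of product P after each time step.
    """
    s, i, p = s0, 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        v1 = k1 * s          # rate of S -> I (mass action)
        v2 = k2 * i          # rate of I -> P (mass action)
        s -= v1 * dt
        i += (v1 - v2) * dt
        p += v2 * dt
        trajectory.append(p)
    return trajectory


def infer_k2(observed, k1, candidates):
    """Pick the candidate k2 whose simulated product curve best fits the data."""
    def squared_error(k2):
        simulated = simulate(k1, k2, steps=len(observed))
        return sum((a - b) ** 2 for a, b in zip(simulated, observed))
    return min(candidates, key=squared_error)


# Stand-in for experimental data: a simulation with a "true" k2 of 0.5.
observed = simulate(k1=1.0, k2=0.5)

# Pretend k2 is the missing bit of the model and search for it.
candidates = [n / 10 for n in range(1, 21)]   # 0.1, 0.2, ..., 2.0
best_k2 = infer_k2(observed, k1=1.0, candidates=candidates)
print(best_k2)
```

The sketch still needs k1 to be known in advance, which is exactly the kind of detailed rate information that is scarce for real pathways.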

Pathway models can also play a knowledge management role by organizing information about biological processes in a form that is accessible and intuitive to researchers. This theme was expressed years ago by Kurt Kohn, one of the patriarchs of the field, but seems to have been dropped. I’d like to see this genre revived: it could be the first starring role for the technology.

There are a lot of academic codes available and a few commercial ones. I tried one of each: the Jarnac/JDesigner suite from Herbert Sauro at Caltech, and VisualCell from Gene Network Sciences (with which my institution has a partnership). Both worked well, but are intended for different kinds of pathways. Jarnac/JDesigner is aimed at metabolic pathways, while VisualCell’s strength is in regulatory pathways.

Jarnac/JDesigner offers both a textual and a graphical language, which I find a plus (see sidebar). I mainly used the text language. It comes with a built-in simulator and simple graphics for plotting the simulation results. It was a lot of fun and very easy to vary the efficiency of steps in a metabolic pathway and see the effects (which were generally small, as predicted by theory). Jarnac/JDesigner is a great teaching tool even if it has no place in your research.

VisualCell is purely graphical. Not surprisingly, the language is complicated — it has to be in order to precisely model real pathways. But with the help of the experts at the company, I was able to learn the language in a day, and use it the next day to create a small but realistic model of a disease process. My learning curve was shortened by the abstract modeling language.

High-performance Sequence Analysis

A new generation of sequence rockers is hitting the charts, hoping to unseat the reigning platinum record holder, BLAST. Most promise blazing speed with no loss of sensitivity, although one act (MPSRCH) is going for the sensitivity market at somewhat lower speed.

In this context, sensitivity refers to the ability of an algorithm to find distant matches, i.e., sequences in the database that are only vaguely similar to the query sequence. The flip side — specificity — refers to the number of false matches an algorithm reports. This is not usually a concern, since the algorithms assign a score to each reported match, and the user is free to ignore matches with scores that are too low. As the algorithms get more sophisticated, specificity will probably become more of an issue.
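The trade-off can be illustrated with a few lines of Python. The hit list below is entirely made up, and the function name `evaluate` is hypothetical. It only shows how raising or lowering the score cutoff changes the fraction of true matches found (sensitivity) and the number of false matches reported.

```python
# Made-up search results: (score, whether the match is a true relative).
hits = [
    (95, True), (80, True), (42, True), (38, False),
    (30, True), (25, False), (12, False),
]


def evaluate(hits, threshold):
    """Return (fraction of true matches reported, number of false matches reported)."""
    reported = [(score, is_true) for score, is_true in hits if score >= threshold]
    true_total = sum(1 for _, is_true in hits if is_true)
    true_found = sum(1 for _, is_true in reported if is_true)
    false_found = len(reported) - true_found
    return true_found / true_total, false_found


print(evaluate(hits, 40))   # strict cutoff: misses the distant true match
print(evaluate(hits, 20))   # lenient cutoff: finds it, plus two false matches
```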

Community Software

Our final contestant — community software — is the sentimental, feel-good favorite. Community software is the step beyond open source in which programmers from many places join together to create software of value to all. It’s the global village in action.

Two big community efforts are underway: BioPerl and MGED (Microarray Gene Expression Data). BioPerl is further along and has already debuted its software. The MGED people are working furiously and hopefully will raise the curtain soon.

BioPerl is a colossal production with 450 Perl modules focused on sequence-related issues. There is code to read and write the major sequence formats, create indexed sequence files, and work with pairwise and multiple sequence alignments. It provides wrappers for many popular programs including BLAST, HMMER, Sim4, and others, though some important tools (e.g., FASTA) are oddly absent. There is also software to create graphical displays of annotated sequences.

The internals are a tour de force of Perl programming. It’s a veritable how-to guide for advanced Perl programmers, reflecting the extraordinary software skills of the developers.

Time will tell whether the BioPerl troupe can hang together and even expand. I predict a deluge of new programmers wanting to add their favorite software to the show. The BioPerl stage managers will have to decide whether to let the newcomers on stage, and if so, how to maintain quality, or to shoo them away and become a closed shop.

Having seen all the contestants, here are my final predictions for the technologies that will be shining brightly in the not too distant future: My heart says community software. My wistful eye says knowledge mining. My scientific hopes say pathway modeling. And my pragmatic side says high-performance sequence analysis.

 

Knowledge Mining: Can it work?

The hard part of doing literature searches is going back in time and reinterpreting old results in light of new data or theories.

For example, in answering the first question in the main text, the system should report that valproate — a histone deacetylase inhibitor — was tried on Huntington’s disease patients in a case report published in 2000. What makes this tricky is that valproate wasn’t known to be involved with acetylation when the article was published, and no terms related to acetylation appear in the paper. Moreover, the subsequent paper that connects valproate to acetylation doesn’t actually mention the drug by this name, but rather talks about valproic acid, which is the active form.

The case report describes profound improvements in HD symptoms. An exciting connection. But don’t get too excited.

If the system were really smart, it would temper its enthusiasm by noting that valproate was given in combination with another drug: the authors were mainly interested in the other drug, and they never followed up the valproate angle.
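The valproate example comes down to two mechanical steps: treating different names for the same drug as one entity, and chaining facts drawn from different papers. The Python sketch below is a toy illustration under loud assumptions: the fact triples, the relation names (`tried_in`, `inhibits`, `affects`), and the function names are all invented here, and no real product is claimed to work this way.

```python
# Hypothetical synonym table: variant name -> canonical name.
SYNONYMS = {"valproic acid": "valproate"}

# Hand-written facts, each standing in for something extracted from a paper.
facts = [
    ("valproate", "tried_in", "Huntington's disease"),     # the 2000 case report
    ("valproic acid", "inhibits", "histone deacetylase"),  # the later paper
    ("histone deacetylase", "affects", "acetylation"),     # background knowledge
]


def canonical(name, synonyms):
    """Map a name to its canonical form, if one is known."""
    return synonyms.get(name, name)


def compounds_linking(process, disease, facts, synonyms):
    """Find compounds tried in a disease that also act, directly or not, on a process."""
    normalized = [(canonical(s, synonyms), r, canonical(o, synonyms)) for s, r, o in facts]
    tried = {s for s, r, o in normalized if r == "tried_in" and o == disease}

    def acts_on(entity, seen=()):
        # Follow "inhibits"/"affects" links until the process is reached.
        for s, r, o in normalized:
            if s == entity and r in ("inhibits", "affects") and o not in seen:
                if o == process or acts_on(o, seen + (o,)):
                    return True
        return False

    return sorted(c for c in tried if acts_on(c))


with_synonyms = compounds_linking("acetylation", "Huntington's disease", facts, SYNONYMS)
without_synonyms = compounds_linking("acetylation", "Huntington's disease", facts, {})
print(with_synonyms)      # ['valproate']
print(without_synonyms)   # []
```

Without the synonym table the link is lost, which is the failure described above. The sketch also has no way to express the caveat that valproate was given in combination with another drug.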

In answering the second question, the system would have to tap dance around a rapidly changing area of science. Until recently, people thought that the main way acetylation affected transcription was by changing the acetylation level of histones. Histones are the proteins around which DNA is wrapped to form a compact, three-dimensional structure. The old theory was that acetylation caused histones to loosen their grip on the DNA and allow transcription factors to sneak in and do their job. The cognoscenti now believe that this is only one effect, and that the acetylation status of transcription factors is important, too. The net effect is that many papers that talk about histone acetylation have to be re-interpreted in this new light.

— NG

Pathway Modeling: Graphics vs. Text

In the pathway modeling field there’s a big emphasis on graphical representation of models. This emphasis is understandable given that biologists are alleged to think in pictures.

I find this exasperating, because it can be fiendishly hard to describe a complex process in pictures. There are many things that are just plain easier to say in text.

Here’s an example.

1) Protein kinase A (PKA) activates the transcription factor CREB by phosphorylating the serine at position 133.

2) When activated, CREB can join a three-molecule complex, consisting of itself, either CREB binding protein (CBP) or a related protein p300, and TAFII130.

3) TAFII130 in turn can bind the TFIID subunit of the basal transcriptional complex, which includes the TATA binding protein (TBP). The details of this interaction are not known.

4) When the transcriptional complex is fully assembled, TBP can bind the TATA box upstream of the transcription initiation site.

5) This brings RNA polymerase II into contact with the DNA to be transcribed, and transcription can proceed.

The words describe the process clearly. To turn this into a model, one would have to recraft it using a precise computer language, which wouldn’t be too hard. I don’t see what a picture would add.
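As a rough illustration of what such a recrafting might look like, here is a sketch in Python rather than in any pathway language mentioned in this column. The `Step` class, the species labels, and the `reachable` function are hypothetical, and the five steps are simplified to "these inputs together yield this output."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    inputs: tuple    # species that must all be present
    output: str      # species or state produced
    note: str = ""


# The five numbered steps, simplified to inputs -> output.
creb_pathway = [
    Step(("PKA", "CREB"), "CREB-pS133", "PKA phosphorylates serine 133"),
    Step(("CREB-pS133", "CBP or p300", "TAFII130"), "CREB complex"),
    Step(("CREB complex", "TFIID"), "transcriptional complex",
         "details of this interaction are not known"),
    Step(("transcriptional complex", "TATA box"), "TBP bound to TATA box"),
    Step(("TBP bound to TATA box", "RNA polymerase II"), "transcription"),
]


def reachable(available, steps):
    """Return every species that can be produced from the starting set."""
    known = set(available)
    changed = True
    while changed:
        changed = False
        for step in steps:
            if step.output not in known and all(i in known for i in step.inputs):
                known.add(step.output)
                changed = True
    return known


everything = {"PKA", "CREB", "CBP or p300", "TAFII130", "TFIID",
              "TATA box", "RNA polymerase II"}
print("transcription" in reachable(everything, creb_pathway))            # True
print("transcription" in reachable(everything - {"PKA"}, creb_pathway))  # False
```

Even this crude text form can answer a simple question, such as whether transcription still proceeds when PKA is removed.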

There’s an apropos lesson from software engineering: diagrams are a great way to document programs, but text is the best way to write them.

— NG

 

High Performance Sequence Analysis

Many fast sequence search methods gain their speed from a simple trick. They start by finding short, exact matches called seeds and then extend the seeds into longer, inexact matches. The trick works because short exact matches can be found very fast.

A key parameter of such methods is the size of the initial seeds. This is the “word length” parameter you may have seen in BLAST.

One simple way to gain speed is to increase the seed length, but this reduces sensitivity. Another approach is to pre-process the database and create an index telling where any given seed exists.
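The seed-and-extend idea, the word length parameter, and the pre-built index can be sketched in a few lines of Python. This is a toy under stated assumptions, not the algorithm of BLAST or any program listed here: the function names and sequences are invented, extension runs only to the right, and it simply tolerates a fixed number of mismatches.

```python
from collections import defaultdict


def build_seed_index(database, word_length):
    """Pre-process the database: map each word to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(database) - word_length + 1):
        index[database[pos:pos + word_length]].append(pos)
    return index


def extend(query, database, q_pos, d_pos, word_length, max_mismatches):
    """Grow a seed to the right, tolerating a limited number of mismatches."""
    length = word_length
    mismatches = 0
    while q_pos + length < len(query) and d_pos + length < len(database):
        if query[q_pos + length] != database[d_pos + length]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        length += 1
    return length


def search(query, database, word_length=4, max_mismatches=1):
    """Return (query position, database position, match length) for each seed hit."""
    index = build_seed_index(database, word_length)
    hits = []
    for q_pos in range(len(query) - word_length + 1):
        seed = query[q_pos:q_pos + word_length]
        for d_pos in index.get(seed, []):
            length = extend(query, database, q_pos, d_pos, word_length, max_mismatches)
            hits.append((q_pos, d_pos, length))
    return hits


database = "TTGACCGTAAGCTT"
query = "ACCGTTAGC"          # differs from the database at one letter

hits = search(query, database, word_length=4)
print(hits)                                    # [(0, 3, 9), (1, 4, 8)]
print(search(query, database, word_length=6))  # []
```

With a word length of 4 the inexact match is found. With a word length of 6 no exact seed exists, so the same match is missed. That is the sensitivity cost of longer seeds.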

A more sophisticated approach is to build a fancy data structure called a suffix tree that effectively tells where all sequences of any length exist in the database. Suffix trees are widely used in the computer field, but have seen limited use in bioinformatics because the traditional implementations consume a lot of memory: about 40 bytes for each letter in the database, which comes to 120 GB for the entire human genome. Too big to be practical. Recent improvements in the method have cut the memory requirements to 17 bytes per letter (50 GB for the human genome), and have reduced the penalty for storing the data structure on disk, which brings the method to the verge of practicality.
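The memory figures are simple multiplication. The check below assumes a genome of roughly 3 billion letters, a round number that is not stated in the text but is consistent with the totals quoted.

```python
GENOME_LETTERS = 3_000_000_000   # assumed approximate size of the human genome


def index_size_gb(bytes_per_letter, letters=GENOME_LETTERS):
    """Memory needed for the index, in gigabytes (1 GB = 10**9 bytes)."""
    return bytes_per_letter * letters / 1e9


print(index_size_gb(40))   # 120.0, the traditional implementations
print(index_size_gb(17))   # 51.0, roughly the 50 GB quoted for the improved method
```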

A different approach is to switch from exact match seeds to ones with a limited number of mismatches. This makes it possible to improve sensitivity for a given seed length at the cost of slowing down the search for initial seeds. On certain computers, notably the Cray SV vector machines, inexact matches can be found almost as fast as exact ones, making this approach very attractive. (Note that my institution has a partnership with Cray.)
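A brute-force Python sketch shows what an inexact seed buys. The function names and sequences are invented for illustration, and the scan compares the word against every window of the database, which is the slow way to do it. Nothing here reflects how a vector machine would actually implement the comparison.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def find_inexact_seeds(word, database, max_mismatches):
    """Positions where the word matches the database with at most max_mismatches."""
    w = len(word)
    return [pos for pos in range(len(database) - w + 1)
            if hamming(word, database[pos:pos + w]) <= max_mismatches]


database = "TTGACCGTAAGCTT"

print(find_inexact_seeds("CGTT", database, max_mismatches=0))   # []
print(find_inexact_seeds("CGTT", database, max_mismatches=1))   # [5]
```

For the same seed length, allowing one mismatch turns up a seed that exact matching misses, at the price of comparing every letter of every window.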

One algorithm that marches to a different drummer is MPSRCH, which has opted for sensitivity over speed. MPSRCH claims to implement the gold standard, most sensitive algorithm known, namely full Smith-Waterman dynamic programming. I tried their algorithm on their website and it’s incredibly fast — I wonder how they do it!
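For readers who have not seen it, here is a textbook-style sketch of Smith-Waterman scoring in Python. It says nothing about how MPSRCH achieves its speed. The scoring values (match 2, mismatch -1, gap -2) and the simple linear gap penalty are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))                  # 8
print(smith_waterman_score("AAAA", "TTTT"))                  # 0
print(smith_waterman_score("ACCGTTAGC", "TTGACCGTAAGCTT"))   # 15
```

Every cell of the table is filled in, so the work grows with the product of the two sequence lengths. That is why a full Smith-Waterman search is normally much slower than seed-based methods.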

— NG

 

Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory and led a bioinformatics marketing team for Compaq Computer. He is currently a senior research scientist at the Institute for Systems Biology and an affiliate professor of bioinformatics at University of Alaska-Fairbanks. Send your comments to Nat at [email protected]

 

TABLES

More on Knowledge Mining

| Product | Company | Notes | URL |
| CELL | Incellico | | http://www.incellico.com/ |
| CellSpace Knowledge Miner | Cellomics | | http://www.cellomics.com/ |
| DiscoveryInsight | BioWisdom | | http://www.biowisdom.com/ |
| Gene Ontology Knowledge Discovery System (GO KDS) | GeneEd & Reel Two | Pre-release | http://www.geneed.com/, http://www.reeltwo.com/ |
| Gene Ontology Knowledge Discovery System (GO KDS) | Ingenuity | | http://www.ingenuity.com/ |
| Gene Ontology Knowledge Discovery System (GO KDS) | Celera & ClearForest | Press release | http://www.clearforest.com/whatsnew/press_releases.asp?id=24 |

Community Software

| Organization | URL |
| BioPerl | http://www.bioperl.org/ |
| Microarray Gene Expression Data (MGED) Society | http://www.mged.org/ |

 

Help on High-Performance Sequence Analysis

| Program | Source | Web Server | Software Availability | Remarks | URL |
| BLAT | Jim Kent, University of California at Santa Cruz | Yes | Free for academics | Used in Santa Cruz genome browser | http://www.soe.ucsc.edu/~kent/ |
| FLAG | Biomedical Engineering Center, Industrial Technology Research Institute of Taiwan | Yes | | | http://flag.itri.org.tw/ |
| MPSRCH | Aneda | Demo | | Complete Smith-Waterman | http://www.anedabio.com/ |
| MUMmer 2 | TIGR | | Free for academics | Based on suffix trees | http://www.tigr.org/software/mummer/ |
| PatternHunter | Bioinformatics Solutions | Demo | Free for academics | | http://www.bioinformaticssolutions.com/ |
| WABA | Jim Kent, University of California at Santa Cruz | Yes | Free for academics | | http://www.soe.ucsc.edu/~kent/ |

Pathways Packages: ACADEMIC

| Package | Authors | Notes | URL |
| BioQuest | Brian White | Can be ordered through the ePress Project at the University of Maryland | http://omega.cc.umb.edu/~bwhite/ek.html |
| BioSpice | Adam Arkin | Some software can be downloaded, but the major work, BioSpice, seems to be hidden in a private area | http://www.lbl.gov/~aparkin |
| DBSolve | Igor Goryanin | Free | http://websites.ntl.com/~igor.goryanin |
| DynaFit | BioKin Ltd. | Free for academics; free trial version for all | http://www.biokin.com/ |
| E-Cell | Masaru Tomita | Open source | http://www.e-cell.org/ |
| Electronic Arc | Gene Selkov | Diagramming tool; apparently open source | http://home.xnet.com/~selkovjr/ElectricArc/ |
| Gepasi | Pedro Mendes | Free | http://www.gepasi.org/ |
| Jarnac/JDesigner | Herbert Sauro | Open source | http://www.cds.caltech.edu/~hsauro/ |
| NetBuilder | Hamid Bolouri | Free | http://strc.herts.ac.uk/bio/maria/NetBuilder/index.html |
| StochSim | Dennis Bray | Open source | http://www.zoo.cam.ac.uk/comp-cell/StochSim.html |
| VCell | Jim Schaff | Web server; software not available | http://www.nrcam.uchc.edu/ |
| WinScamp | Herbert Sauro | Open source | http://www.cds.caltech.edu/~hsauro/ |

Pathways Packages: COMMERCIAL

| Product | Company | URL |
| DigitalCell/VisualCell | Gene Network Sciences | http://www.gnsbiotech.com/ |
| PathwayPrism | Physiome | http://www.physiome.com/ |
| PhysioLab | Entelos | http://www.entelos.com/ |

Pathways Packages: STANDARDS AND INTEREST GROUPS

| Product | Notes | URL |
| CellML | Co-development of Physiome and the Bioengineering Institute, University of Auckland | http://www.cellml.org/ |
| Systems Biology Markup Language (SBML) | Part of Systems Biology Workbench (SBW), ERATO Kitano | http://www.cds.caltech.edu/erato/sbml/docs/ |
| BioPathways Consortium | Systems Biology Project | http://www.biopathways.org/ |

 
