I am writing to comment on Nat Goodman’s recent IT Guy article on pathway databases (“Can’t Get There From Here,” June 2003). The article combines so many unwarranted inferences with a factual error that I feel compelled to correct the record.
First let me say that I do appreciate aspects of Goodman’s column. The field of bioinformatics suffers from a dearth of fair and frank comparisons of competing methods. For some reason, the computer-science practice of including factual comparisons of related methods in virtually every research publication has been diluted in bioinformatics. Goodman’s columns combine a refreshing degree of honesty with an entertaining writing style.
But his June column makes too many mistakes. Let’s start off with the most serious problem, which is his conclusion at the end of the column that pathway databases are not dependable. On what basis is that conclusion reached? Goodman compared the treatment within four signaling pathway databases (KEGG, BioCarta, STKE, and the Woodgett site) of a single signaling pathway family: the MAPK cascade. What he finds is, “The pathways are broadly consistent, but there are many detailed differences. I have no clue which is right. Perhaps all are right under some circumstances. Given this state of affairs, it’s hard to take any site as a definitive source.”
It is bad enough that Goodman doesn’t take the time to do the homework needed to determine which, if any, of the databases have correct versions of MAPK. But even worse is that he infers from the fact that the sites do not agree on a single pathway that none of them is definitive. This is simply bad logic: the fact that the sites disagree does not mean that none of them is right, and Goodman does his readers a disservice by jumping to the conclusion that none of these databases can be trusted.
What really bothers me is a theme present in this column that also runs through all of Goodman’s other columns: generalizing from a single example to evaluate an entire resource. It is unfair — and it is bad science — to evaluate any database on the basis of a single example. Systematic global studies are the only way to draw accurate conclusions. Examples are indeed useful for illustrating what a system can do, and I urge Goodman to restrict his examples to that use.
Let us move on to the factual error. Goodman states that “Kyoto University’s Encyclopedia of Genes and Genomes [KEGG] … has more than 10,000 pathways” and that “Karp’s EcoCyc … has expanded into a more comprehensive database, BioCyc, which has 477 pathways from a variety of species.”
BioCyc is a collection of 15 pathway/genome databases; it is not a single database. Furthermore, 477 is the number of pathways in just one of the BioCyc databases: MetaCyc. The 477 pathways in MetaCyc were used as the basis for inferring pathways in the other BioCyc databases. The number 477 should be compared to the 226 “reference pathways” listed in the “Introduction to KEGG” page, since KEGG uses those 226 reference pathways to infer the presence of pathways in specific organisms. Overall, the databases in the BioCyc collection contain 1,980 pathways; that is the proper number to compare to the 10,000 pathways in KEGG.
In summary, I would be delighted to see future IT Guy columns that do not derive unwarranted generalizations from single examples, and that provide their readers with deeper insights and fewer factual errors.
Peter D. Karp, PhD
Director, Bioinformatics Research Group, SRI International
(Karp’s group maintains the BioCyc collection of databases.)