Veteran computational biologist NAT GOODMAN tries out the new online bioinformatics businesses.
This summer, while the world welcomed the human genome sequence, I paid visits to some of the new commercial bioinformatics websites that are meant to help you analyze it. I came away disappointed by a field of products stuck in pre-genomic thinking.
Bioinformatics is undergoing a phase transition as we enter the post-genomic era. Before whole genome sequences were available, it made sense for individual scientists to analyze sequences one-by-one. Today, it makes more sense for teams to analyze sequences en masse and disseminate the results in databases. Most scientists will eventually “analyze” sequences by mining the databases, falling back on traditional sequence analysis only as needed to confirm, correct, or extend database results. We’re not there yet, but it won’t be long.
Meanwhile, the bioinformatics software industry is undergoing a different kind of shift as investors try to capitalize on the growing commercial value of this technology. No one has yet succeeded in creating a large bioinformatics company, but many are trying.
The first wave of bioinformatics ventures — the enterprise-bioinformatics wave — has already crashed. Bioinformatics websites are the next swell. My bet is that this wave will crash as loudly as the first did.
I visited four sites: BioNavigator from eBioinformatics, a commercial spinoff from the respected ANGIS academic site; GeneScape from the genomics company CuraGen; DoubleTwist, the rebirth of Pangea Systems, a bioinformatics software vendor; and LabOnWeb by Compugen, a company best known for its bioinformatics hardware accelerators. The idea behind each of these online products is to offer sequence analysis services on the Web just as public sites have done for years; presumably, each will expand into other kinds of analyses over time.
This sample is representative of the commercial sites that were offering sequence analysis services as of eight weeks ago. Time and space limited the number of sites I could visit, and new sites will likely have launched by Genome Technology press time. But of the four I reviewed, I can recommend two as useful additions to public domain tools and in-house capabilities. The other two relied on unproven methods that gave the wrong answer on a simple test case.
CuraGen’s GeneScape is a marketing device and is provided gratis. The other three charge user fees—offering online tools is their main business—but offer free trial memberships, so you can check them out before plunking down real money.
I logged onto the sites incognito so that I would experience them like a typical user, without handholding from the companies’ public-relations staffs. I only used services available under each site’s free trial membership, which presented no serious obstacles as all sites except DoubleTwist allow trial members full access. I did my best to learn what each site could do based on the information presented. But I did not study the sites in excruciating detail. Features buried in the bowels of any of the sites could have slipped under my radar. (Editor’s note: at press time, DoubleTwist announced that some of the software in this article is no longer available for free trial users. It also announced improvements to its file manager and several new features.)
Assuming that security-conscious users would only use a Web service over a virtual private network and would insist upon legally effective privacy guarantees, I did not investigate the sites’ security aspects.
In addition to touring the sites, I ran a test case to see one example of how each one really works. I created a “virtual EST” by extracting 300 bases from the middle of the human cDNA sequence for a known gene and mutating about 3 percent of the bases to simulate sequencing errors. I selected the gene caspase-1, a well-studied protease implicated in programmed cell death.
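For readers who want to build a similar test case, here is a minimal Python sketch of the procedure just described. It rests on several assumptions that the text does not settle: the simulated sequencing errors are modeled as base substitutions only (no insertions or deletions), the function names make_virtual_est and percent_identity are hypothetical, and the input is a random placeholder sequence rather than the real caspase-1 cDNA.

```python
import random

def make_virtual_est(cdna, length=300, error_rate=0.03, seed=0):
    """Cut `length` bases from the middle of `cdna` and substitute about
    `error_rate` of them. Substitution-only errors are an assumption."""
    if len(cdna) < length:
        raise ValueError("cDNA is shorter than the requested EST length")
    rng = random.Random(seed)
    start = (len(cdna) - length) // 2
    fragment = list(cdna[start:start + length].upper())
    n_errors = round(length * error_rate)
    # Pick distinct positions so every simulated error is a separate mismatch.
    for pos in rng.sample(range(length), n_errors):
        # Always choose a different base, so the position really changes.
        fragment[pos] = rng.choice([b for b in "ACGT" if b != fragment[pos]])
    return "".join(fragment)

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Placeholder input: a random 1,200-base sequence, NOT the caspase-1 cDNA.
placeholder_rng = random.Random(42)
cdna = "".join(placeholder_rng.choice("ACGT") for _ in range(1200))

est = make_virtual_est(cdna)
middle = cdna[(len(cdna) - 300) // 2:(len(cdna) - 300) // 2 + 300]
print(len(est), percent_identity(est, middle))
```

With a real cDNA in place of the placeholder, the resulting string can be pasted into any of the sites as a query. Mutating 3 percent of 300 bases leaves the fragment about 97 percent identical to its source.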
For comparison purposes, I used three major public sites as gold standards: the US National Center for Biotechnology Information, the European Bioinformatics Institute, and the Baylor College of Medicine Search Launcher.
From a functional standpoint, the sites fall into two categories: BioNavigator and GeneScape in one, DoubleTwist and LabOnWeb in the other.
Pretty Accurate, or Just Pretty?
BioNavigator and GeneScape provide large collections of more or less standard sequence analysis programs operating, for the most part, on the usual public databases. The programs available at BioNavigator include GCG’s Wisconsin Package, BLAST, FASTA, ClustalW, HMMER, Genscan, Wisetools, Phylip, WHATIF, a large number of utilities (e.g., six-frame translation), and combinations of the above, which the site calls “protocols.” BioNavigator provides access to standard sequence databases such as GenBank and SwissProt, the PDB protein structure database, and a proprietary, UniGene-like database of ESTs called STACKDB that is produced using software developed by Winston Hide and colleagues at the South African National Bioinformatics Institute.
GeneScape provides many of the same programs, including BLAST, FASTA, ClustalW, and Phylip. It also provides a larger collection of motif-searching programs, including PRODOM, PRINTS, and SBASE, as well as several tools for protein structure analysis. In addition to standard sequence databases, the site offers the ability to search the public Mouse Genome Database. GeneScape refers to a Metabolic Pathways Database, but this is just the subset of SwissProt for which Enzyme Commission numbers are assigned. GeneScape also promises that a proprietary cSNP database will soon be added to its site.
One of BioNavigator’s strong points is its nice graphical interface for visualizing results. It’s not terribly flashy, but is functional and packs a lot of information into a small amount of screen space.
BioNavigator and GeneScape allow users to combine programs into more complete, customized analyses. BioNavigator’s mechanism is slick: first you conduct an analysis by hand, then you review a graphical display of the steps you performed, then you can edit the procedure and finally save it for future use. The GeneScape approach is simpler: just check off all the analyses you want, and the system runs them for you in a batch.
Both sites gave approximately the same answers to my test case as the public sites did. However, neither seemed to use NCBI’s Reference Sequence database (RefSeq)—the definitive source of correct, modern gene-sequence names. For instance, the sites told me that my sequence was similar to interleukin 1 beta convertase, an old name for caspase-1.
These two sites also provide file managers that let you store and organize data and results at the site. The file managers use a project/file paradigm reminiscent of the Mac and Windows desktops, but without the ability to create sub-folders. The lack of sub-folders is okay for small projects, but would get burdensome for projects involving more than a few sequences.
BioNavigator’s file manager has a mechanism for easily copying data and results from its website to your local computer. It’s an important feature because unless you keep copies of your results on your local computer, you run the risk of not being able to access your work if the Web or the site is down. In an extreme case you might lose your work forever, if the vendor does not safeguard it properly or (heaven forbid!) goes out of business. In addition, you will need local copies of your results to combine analyses from different sites, and to prepare figures and such for publication. This important feature is not supported by any other site.
Are these agents any smarter than Maxwell?
DoubleTwist and LabOnWeb are very different from the first two sites. Rather than providing collections of discrete programs, they offer a limited number of pre-configured, multistep procedures that perform generally useful analyses. These procedures utilize standard and proprietary tools and operate on public and proprietary databases. The approach is reminiscent of GeneQuiz, developed by Chris Sander and colleagues at the EBI where it can still be found, and GeneMine, developed by Chris Lee and others at the now defunct Molecular Applications Group.
DoubleTwist provides three such procedures called “agents.” One, the “comprehensive-analysis agent,” BLASTs your sequence against the public sequence databases as well as AlphaGene’s proprietary database of human and mouse genes. It also looks for your sequence in Myriad’s database of protein-protein interactions, and it runs BLOCKS to identify protein motifs in your sequence. The second agent BLASTs your sequence against DoubleTwist’s proprietary EST cluster database. And the third, which I was unable to test because it is not available to trial members, searches your sequence against DoubleTwist’s proprietary human genome database. DoubleTwist also offers a large number of agents that monitor the databases for new sequences similar to those you’re interested in.
LabOnWeb combines the work done by DoubleTwist’s three analysis agents into a single procedure that searches your sequence against Compugen’s proprietary EST cluster database, LEADS. If it finds a match, it “elongates” your sequence by retrieving longer transcripts that overlap it, a process dubbed “IRACE.” (Although alternative splicing is mentioned in LabOnWeb’s documentation, I saw no explanation for how the site handles the problem.) Next, the system runs the elongated sequence through a battery of standard bioinformatics tools, checks LEADS for expression data, and consults an unnamed SAGE database for possible hits. Finally, it reports the results in both detailed and summary formats.
DoubleTwist’s comprehensive agent correctly identified my virtual EST as belonging to caspase-1. It seems to have done this by parsing the gene name out of the best matching SwissProt record. The agent found the RefSeq entry for caspase-1, but did not recognize that it could use this as a definitive means for identifying my sequence.
Neither DoubleTwist’s EST agent nor LabOnWeb’s IRACE procedure discovered the full-length cDNA for caspase-1 even though it shares 97 percent identity with my virtual EST. I’m guessing that the proprietary databases searched by these procedures do not include full-length cDNAs, although I can’t imagine why the vendors would omit these very informative sequences. Both DoubleTwist’s and LabOnWeb’s procedures also found far fewer matching ESTs than were present in NCBI’s RefSeq and UniGene lists.
The file management capabilities of these sites are less sophisticated than those of the first two. You can basically store a list of sequences, each of which is linked to its analysis results. Unlike BioNavigator, these sites provide no easy way to copy results to your local computer.
I was struck by the lack of technical documentation on the proprietary databases and tools offered by these sites. These proprietary elements are key features of the sites, yet there are no literature citations, reprints of peer-reviewed publications, or even unpublished technical reports showing how, or whether, they work.
DoubleTwist posts several quasi-technical reports. One, for instance, claims that its annotation of chromosome 22 found lots of genes that were missed by the investigators who sequenced the chromosome, but doesn’t offer enough technical meat for a reader to evaluate the claim. I respect the scientists at these companies and believe they can do what they’ve claimed. But I find it hard to accept any scientific product that does not present solid evidence of its validity. This is not just a matter of academic purity since neither site consistently came up with the correct answer on my test case.
Price and Performance
Interactive performance was reasonably good for all sites when accessed over a high speed Internet connection, but painfully slow over a phone line.
Analysis performance varied across the sites. BioNavigator and GeneScape were comparable to the public sites, consuming 1-5 minutes to BLAST my test sequence against the full public database and a half-hour for FASTA.
DoubleTwist and LabOnWeb were much slower. DoubleTwist required 22 hours to run its comprehensive analysis agent on my test sequence and another 2 hours to search its EST cluster database. LabOnWeb took about 12 hours to run its comprehensive procedure on this sequence. By contrast, GeneQuiz running at the European Bioinformatics Institute public site needed only about two hours to complete its comprehensive analysis.
BioNavigator was the only site that provided clear pricing information. It charges for services in terms of “units.” You get 100 free units when you sign up. Thereafter, you can buy units online by credit card at the academic price of $0.99 per unit in small quantity, and $0.50 per unit for 1,000. To place this in context, I consumed 8.4 units running a typical series of analyses on my test sequence. This seems a reasonable cost even at the full academic price of $0.99 per unit.
GeneScape is operated by CuraGen for marketing purposes and is completely free.
DoubleTwist sells subscriptions at several service levels. Access to the site’s full capabilities requires a “gold” subscription, which costs academic users $640 per person per month and allows up to 500 analyses. The commercial rate is $900. This seems expensive unless you’re going to do a lot of analyses, since you could do about 150 analyses at BioNavigator for the same price. And, if the performance I experienced is anywhere close to typical, you would never have time to accomplish 500 analyses per month anyway.
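The arithmetic behind this comparison can be sketched in a few lines of Python. The prices, the 8.4-unit test series, and the 22-plus-2-hour turnaround are the figures reported above. The 30-day month and the idea that analyses run strictly one after another are assumptions made only for illustration, since nothing here says whether DoubleTwist lets a subscriber run jobs concurrently.

```python
# Figures reported in the article.
UNITS_PER_ANALYSIS = 8.4        # BioNavigator units used by the test series
BIONAV_PRICE_SMALL = 0.99       # academic $/unit, small quantity
BIONAV_PRICE_BULK = 0.50        # academic $/unit when buying 1,000
DT_GOLD_MONTHLY = 640.0         # DoubleTwist "gold" academic rate, $/month
DT_GOLD_ANALYSES = 500          # analyses allowed per month
DT_HOURS_PER_SEQUENCE = 22 + 2  # comprehensive agent plus EST cluster search

# Assumptions for illustration only: a 30-day month, jobs run one at a time.
HOURS_PER_MONTH = 30 * 24

# Cost of one BioNavigator test series at each academic unit price.
bionav_cost_small = UNITS_PER_ANALYSIS * BIONAV_PRICE_SMALL
bionav_cost_bulk = UNITS_PER_ANALYSIS * BIONAV_PRICE_BULK

# How many BioNavigator test series the gold fee would buy at the bulk price.
equivalent_analyses = DT_GOLD_MONTHLY / bionav_cost_bulk

# DoubleTwist cost per analysis if the full monthly allowance were used.
dt_cost_at_cap = DT_GOLD_MONTHLY / DT_GOLD_ANALYSES

# Throughput and cost per analysis if each sequence takes a full day.
serial_analyses_per_month = HOURS_PER_MONTH / DT_HOURS_PER_SEQUENCE
dt_cost_serial = DT_GOLD_MONTHLY / serial_analyses_per_month

print(f"BioNavigator, per test series: ${bionav_cost_small:.2f} "
      f"(small quantity), ${bionav_cost_bulk:.2f} (bulk)")
print(f"BioNavigator series for the gold fee: {equivalent_analyses:.0f}")
print(f"DoubleTwist per analysis at the cap: ${dt_cost_at_cap:.2f}")
print(f"DoubleTwist analyses per month, run serially: "
      f"{serial_analyses_per_month:.0f}")
print(f"DoubleTwist per analysis, run serially: ${dt_cost_serial:.2f}")
```

Under these assumptions, the gold fee buys roughly 150 BioNavigator test series at the bulk unit price, and a subscriber whose jobs each take a day would complete only a small fraction of the 500-analysis allowance.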
LabOnWeb offers academic users free comprehensive homology analysis. Academic rates for more advanced queries are $300 for 20, $600 for 50, and $1,000 for 100.
Far to go on the post-genomic path
BioNavigator and GeneScape provide convenient access to large collections of standard sequence analysis programs. BioNavigator stands out for the professional character of its site, with GeneScape close behind. BioNavigator’s pricing is reasonable and GeneScape is free. Between them, these two sites offer most of the popular programs and a good number of esoteric ones. Neither is a complete solution, but many users will find them to be useful adjuncts to the public sites and in-house capabilities.
DoubleTwist and LabOnWeb hide their sequence analysis programs inside black-box, multi-step analyses that include standard and proprietary methods. In terms of professionalism, these sites are a notch below the first two. The sites’ differences make comparison-shopping difficult, but nonetheless DoubleTwist’s pricing seems high, and LabOnWeb’s is even higher. Neither adequately defends the scientific validity of its proprietary methods, and these methods did not give the expected answers on my test case. I cannot recommend these sites until they are able to demonstrate that their methods work reliably.
Nevertheless, the main weakness of all four sites—and it’s a big one—is that they are firmly rooted in pre-genomic thinking. They are aimed at individual scientists who need to conduct de novo functional analyses of one or a few novel sequences. LabOnWeb is furthest along the post-genomic path, but it, too, has far to go. Had these sites been established five or even two years ago, they would have been exciting. But, as we enter the post-genomic era, their relevance is decaying by the day. None is likely to be a major player in the post-genomic era, unless they evolve rapidly.
BIOINFORMATICS SITES AT A GLANCE

BioNavigator
Vendor or History: eBioinformatics; spinoff from ANGIS academic site
Software: large collection of standard sequence analysis programs, including GCG
Proprietary Databases: STACKDB (EST clusters)
File Manager: organization by project
Results on Test Case: same as public sites
Performance on Test Case: under one hour
Pricing: 30-99¢/unit; commercial: 60¢-$1.99/unit (test case used 8.4 units)

GeneScape
Vendor or History: CuraGen
Software: large collection of standard sequence analysis programs
Proprietary Databases: none
File Manager: organization by project
Results on Test Case: same as public sites
Performance on Test Case: under one hour
Pricing: free

DoubleTwist
Vendor or History: formerly Pangea Systems
Software: 3 pre-configured, multi-step procedures using standard and proprietary software
Proprietary Databases: AlphaGene gene inventory; Myriad protein-protein interactions; DoubleTwist EST clusters; DoubleTwist human genome
File Manager: simple list
Results on Test Case: comprehensive analysis as expected; EST search found fewer matches than public site
Performance on Test Case: 22-24 hours
Pricing: academic/commercial rates: 100 sequences: $30/50/mo.; 200 sequences: $100/150/mo.; 500 sequences: $640/900/mo.

LabOnWeb
Vendor or History: Compugen
Software: 1 pre-configured, multi-step procedure using standard and proprietary software
Proprietary Databases: LEADS EST clusters; Genzyme Molecular Oncology’s SAGE; Genome Therapeutics’ PathoGenome
File Manager: simple list
Results on Test Case: EST search found fewer matches than public site
Performance on Test Case: 12 hours
Pricing: academic rates: 20 queries: $300/year; 50 queries: $600/year; 100 queries: $1,000/year; commercial rates: contact Compugen

THE PUBLIC AND PRIVATE PORTALS
BioNavigator http://www.bionavigator.com/
GeneScape http://curatools.curagen.com/
DoubleTwist http://www.doubletwist.com/
LabOnWeb http://www.labonweb.com/
NCBI http://www.ncbi.nlm.nih.gov/
EBI http://www.ebi.ac.uk/
BCM Search Launcher http://dot.imgen.bcm.tmc.edu:9331/index.html