There’s no such thing as too much information, according to Ross Overbeek, vice president of bioinformatics at comparative genomics company Integrated Genomics.
“There’s a lot that you can learn from a set of genomes that you can’t learn from a single one,” he said, referring to the integrated database of around 300 genomes his bioinformatics staff uses for functional annotation and metabolic reconstruction.
“The point that we made early on is that the value goes up with square of the number of genomes,” Overbeek said. “That’s why having the most carefully analyzed and integrated genomes is critical. And that’s why we think we’re way ahead. It’s solely based on the number of genomes.”
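One way to read the "square of the number of genomes" claim, offered here as an illustrative assumption rather than Overbeek's stated reasoning, is that the number of possible genome-to-genome comparisons grows roughly quadratically with the size of the collection. A minimal sketch (the function name is hypothetical and not part of ERGO):

```python
from math import comb

def pairwise_comparisons(n_genomes: int) -> int:
    """Number of distinct genome pairs available in a collection of n genomes."""
    # n choose 2 = n * (n - 1) / 2, which grows on the order of n squared.
    return comb(n_genomes, 2)

# Collection sizes chosen for illustration; 300 matches the database size cited above.
for n in (1, 10, 300):
    print(n, pairwise_comparisons(n))
```

Under this reading, a single genome offers no comparisons at all, while a 300-genome collection offers 44,850 distinct pairs.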
Overbeek’s philosophy that quantity could lead to quality in the case of functional genomics began with the idea that “it was probably easier to annotate 1000 simultaneously than to do a single one,” he said. “That’s obviously wrong in some sense,” he conceded, “but the essence of it is important — that is it offers a framework within which you can test and reject hypotheses that you just don’t get from studying a single genome in isolation.”
While Integrated Genomics has focused so far on microbial genomes, this is largely due to their wider availability, Overbeek said. The company plans to include eukaryotic genomes in its integration as they become available, and already uses those available in the public domain.
“The eukaryotes are where prokaryotes were in 1996,” he said. “Then, there was a small number of genomes and there really wasn’t that much to compare. But then as genomes poured in, everything became much easier. Your ability to predict genes, your ability to identify functions, your ability to assign functions to hypothetical genes, has gone up tremendously with the number of prokaryotic genomes sequenced. I believe the same thing will be true with eukaryotes.”
The company’s software toolkit, ERGO, was designed as a self-learning environment in which the power of analysis grows exponentially as new genomic sequences are incorporated into the system. Of the 300 integrated genomes in the system, approximately 110 to 115 are complete, Overbeek estimated. Thirty genomes are proprietary.
The 30 employees in Overbeek’s bioinformatics department include both computational staff and biologists focused on curation and annotation. He said the bulk of the annotation is done through an automated process, with a small number of hypotheses sent to the company’s wet lab for confirmation.
Overbeek considers the integrated database of diverse genomes to be the company’s key asset and the source of an almost unlimited number of business opportunities. “We’ll market it in different ways,” he said. “Whether we can do a better annotation of a person’s genome because we have a better integration to use as a tool or whether we believe it should be marketed as a standalone version, different products will emerge from that but it’s the integration as a whole that’s important.”
Integrated Genomics’ customers include Roche Vitamin, Maxygen, Genencor, Dow Chemical, Cargill, Dow AgroSciences, BASF, Archer Daniels Midland, the University of Scranton, the Department of Defense, the National Institutes of Health, and the Department of Energy.
While the majority of Integrated Genomics’ income has come from selling sequenced and annotated genomes, Overbeek said the company is ready to branch out. The ERGO suite is now available in a standalone implementation, either as a browser over the database or with an additional capability that lets users load and analyze their own genome sequences. In addition, Integrated Genomics has increased its sequencing capacity “substantially” over the last six months and is developing application projects both in the drug target area and in strain development, Overbeek said.
But the company’s primary task for the time being is keeping up with a flood of genomic sequence data that is doubling every 18 months — good news for Overbeek and his staff. “The more data the better,” he said.