CAMBRIDGE, Mass.--Having left his position as a public sector senior scientist at the Jackson Laboratory to direct Compaq Computer's new bioinformatics solutions center here, Nat Goodman acknowledged, "I am a drained brain." But the opportunity to help stem the steady stream of other talented computational biologists flowing from academia to industry is one reason Goodman said he made the move.
At Compaq, Goodman intends to step out of his traditional role as a researcher and help to amplify the importance of the bioinformatics field, "not by doing my own research so much," he said, "but by identifying important research areas and making sure they are being properly supported and advanced."
"It's a challenge to keep the field vigorous, and it's something we absolutely have to do," Goodman contended. "If the academic side of the field atrophies we won't be able to get the software created that needs to be there in order to support the biotechnology revolution."
BioInform spoke with Goodman recently about the mission of Compaq's bioinformatics center, and about the genomics community's desperate and mounting need for sophisticated software. Goodman also shared his views on what is needed to renew the supply of academic bioinformatics researchers; those comments will appear in part two of this interview in an upcoming issue of BioInform.
BioInform: Will you describe the technology crisis that exists in the genomics research world?
Goodman: In my view, the major rate-limiting factor in the industry today--in terms of its growth and effective utilization--is a shortage of software and of personnel to solve bioinformatics problems. Every time a new kind of data becomes readily available--and there are all kinds of new data becoming available all the time--there is a need to have software and expertise to exploit the new data in conjunction with all the other data that already exists. There has been a real lag in creating the software and the expertise to accomplish that.
Think about what's happened with the human genome sequence. A year ago the plan was to get the genome sequenced by 2005. That date was so far in the future that it was not a factor for most commercial customers. Then Craig Venter came along and announced his plan to sequence the genome by 2001. Then the public sector--NIH and the Wellcome Trust--came along and announced their plan to produce a first-draft sequence by early next year.
In the space of a year we've gone from a situation where the human genome was way out over the horizon, not a factor in anyone's plans, to a place where it is suddenly center stage, an absolutely critical area to work on.
Well, that isn't enough time for companies to get their people and expertise lined up to accomplish this. It really isn't enough time for the entire field to develop the expertise to exploit this new data.
That's what really drives the creation of my new organization. We're trying to amplify and leverage expertise in the industry so that we can get the right software created at the right time and help people acquire the expertise they need to exploit these new data sources that are becoming available so quickly.
BioInform: How will you do that without competing with existing bioinformatics companies?
Goodman: Collaborative relationships will be absolutely key. We will work with the software companies and academic software developers to identify software that is relevant to the problem at hand, and to make sure it is available to our customers in a form that's easy for them to use.
BioInform: Are the standards that are being created by the Object Management Group's Life Sciences Research task force going to be integral to that goal?
Goodman: Standards are very important. Whether we embrace the OMG work right now will depend on what our customers tell us about their ability to use that technology. The OMG work is not really done yet, and the CORBA technology that's used at OMG is often difficult for people to use.
BioInform: What are other possible paths?
Goodman: There are many choices. In the sequence world there is ASN.1, which the US National Center for Biotechnology Information has been using for years. It has some technical problems but is a de facto standard, so that's a possibility.
Other possibilities from the academic sector include ACEDB and Bioperl. Many software companies are selling integration frameworks that could form the basis for standards, including InforMax, NetGenics, Oxford Molecular, Pangea, and Synomics. There is also technology within Compaq, called Business Bus, that could play a role; Business Bus is already used in the pharmaceutical industry, though in manufacturing rather than in R&D.
BioInform: What are the biggest technical-problem areas right now?
Goodman: The most pressing problem is to harness the human genome sequence data that's coming out--the ability to annotate and analyze large amounts of genomic sequence. There's going to be a lot of data. The ability to extract useful information from this data will be on the top of everyone's priority list in the industry.
A key challenge will be to integrate genomic sequence data with ESTs and other gene sequence data.
The genome sequence by itself doesn't tell you where the genes are. If you just have genomic sequence, the only way you can find genes is by using software to do computational gene prediction. The state-of-the-art is better than it was a few years ago, but it's still far below the level that is needed for this to be a practical tool for real industrial purposes.
The solution is to take the ESTs and other gene sequences, lay them on top of the genomic sequence, and use that as the starting point from which gene-prediction software can grow. You start with the known gene sequences, align those with the genomic sequence, and that gives you places where there really are genes. Then you can use the software to try to extend those partial gene sequences into longer, full-gene sequences.
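The anchoring approach Goodman describes can be sketched in a few lines. This is only a toy illustration: exact substring search stands in for a real alignment tool (such as BLAST), and all sequence data and names below are invented for the example.

```python
# Toy sketch of EST anchoring: lay known gene sequences (ESTs) onto
# genomic sequence to find where genes really are, instead of relying
# on ab initio gene prediction alone. A real pipeline would use an
# aligner such as BLAST rather than exact substring matching.

def anchor_ests(genomic: str, ests: dict) -> dict:
    """Map each EST name to its (start, end) position in the genomic sequence."""
    anchors = {}
    for name, seq in ests.items():
        pos = genomic.find(seq)  # stand-in for true sequence alignment
        if pos != -1:
            anchors[name] = (pos, pos + len(seq))
    return anchors

# Invented example data:
genomic = "TTGACCATGGCTAACGGATCCTTAGCAAGTGA"
ests = {
    "est1": "ATGGCTAACG",  # hypothetical partial gene sequence
    "est2": "TTAGCAAGT",
}

anchors = anchor_ests(genomic, ests)
# Each anchor marks a region where a gene is known to be; gene-prediction
# software would then try to extend these partial matches into full gene
# models, as Goodman describes.
```

The point of the sketch is the division of labor: known gene sequences supply ground truth for where genes lie, and prediction software works outward from those anchors rather than scanning raw genomic sequence unaided.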
This leads to a very interesting business problem, which is that the most complete database of gene sequences lives in the hands of Incyte. Its database is much more complete than what is publicly available. Incyte has not been willing to let people publish comparisons of its database against the public databases. So those who have the Incyte data really understand this, but those who don't are a little bit in the dark about just how much more complete the Incyte dataset is.
We're now in a funny situation where there's this key requirement to integrate the Incyte data with the genome data. The Incyte data is private. Some of the genome data is public and some is private. And the Celera effort is producing data that will eventually become public, but will be private initially. Celera and Incyte, of course, are competitors, so they're not going to cooperate in combining their datasets. We're left in a situation where the only people who can combine those datasets are the customers who have access to both, of which there are several, or possibly a third party.
Possibly we could be mediators in making such a thing happen. If we could play a role in integrating those two datasets, the result would be extremely valuable to customers who had access to both datasets.
BioInform: Where does single-nucleotide polymorphism data fall into this picture?
Goodman: That's yet another class of important new data. Analyzing the genome is going to be the most pressing problem because it's right upon us. SNPs are right behind. Large numbers of SNPs are being generated by a variety of sources: public projects, private companies, and the SNP Consortium. The challenge will be for customers to use SNPs to do genetic mapping and analyses.
SNPs are a fabulous new resource, but the computational methods for using SNPs to identify disease-causing genes--the mathematics, statistics, and algorithmics--are still being developed. There are people working on all aspects of the problem, but there is no piece of software you could pick up today that could really do a SNP project.
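The basic statistical idea behind SNP-based disease-gene mapping can be shown with a toy case-control association test. This is not the still-developing methodology Goodman refers to, just the simplest form of the underlying reasoning; the allele counts are invented for illustration.

```python
# Toy allelic association test: compare allele counts at one SNP between
# disease cases and healthy controls using a 2x2 chi-square statistic.
# Real SNP mapping involves far more sophisticated mathematics and
# statistics, as Goodman notes; the numbers here are made up.

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical allele counts at one SNP:
#             allele A   allele a
# cases          60         40
# controls       40         60
stat = chi_square_2x2(60, 40, 40, 60)
# A large statistic suggests the SNP, or a disease gene near it on the
# chromosome, is associated with the phenotype.
```

Scaling this from one SNP to hundreds of thousands, with issues like multiple testing and population structure, is exactly the kind of problem for which no off-the-shelf software existed at the time.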
SNPs are going to be here very soon. Next year we have the human genome. By the end of next year we're going to have large numbers of SNPs. It's another area where customers don't have the software they need to be able to effectively use a resource when it becomes available.
BioInform: Are there other examples of data in this arena in need of software?
Goodman: There are many other examples. One is gene-expression data generated by DNA arrays, using technology from Affymetrix, Synteni, and others. People are working on the software, and there are some interesting preliminary programs available, but by and large there's nothing like a mature product that you can pick up and use.
Then, right downstream of gene expression is technology for doing protein expression--all this proteomics stuff you hear about. Right now, the laboratory technology for proteomics is still not terribly high-throughput. It's much smaller scale than anything else we're talking about. But that's not going to last for long. Within a year or two, reasonably high-throughput proteomics devices will be available and then the problem we've been talking about will rear its head again: how do you analyze the data?