MIAMI BEACH, Fla.--If all goes according to plan, months from now a firehose of genomic sequence data will begin spouting from the sequencers at Celera, a new company created by Craig Venter, former president of the Institute for Genomic Research, and Perkin-Elmer to sequence the human genome within three years. The project will create an unprecedented bioinformatics challenge.
Earlier this month Celera announced that it would collaborate with Paracel to develop high-resolution informatics tools. Paracel's GeneMatcher, a massively parallel genomic data analysis engine, will be installed at Celera and scientists from both companies will work to develop tools for expressed sequence tag clustering and whole-genome datamining capabilities, while leveraging technologies for sequence quality estimation and similarity searching. The collaboration is likely to be the first of several partnerships Celera will establish in the bioinformatics industry as it gears up to begin whole-genome sequencing and analysis.
Anthony Kerlavage, recently appointed director of gene discovery at Celera and former director of bioinformatics at the Institute for Genomic Research, spoke with BioInform at the Genome Sequencing and Analysis Conference (GSAC) here recently, discussing the informatics infrastructure he and others are building at Celera's facility in Rockville, Md.
BioInform: What can you tell us so far about the bioinformatics infrastructure at Celera?
Kerlavage: This is going to be a very large bioinformatics effort. We're going to be a knowledge company, which requires a lot of people specializing in informatics. Bioinformatics permeates the company. Overall, Celera has 57 employees now and we're recruiting heavily, most of all for software engineers and bioinformatics specialists.
We've broken things down into three components: data acquisition, data transformation, and data product. Each of those requires substantial teams of software engineers to build the infrastructure from the ground up. Data transformation will cover everything downstream of collecting A's, C's, T's, and G's off of these instruments, such as the assembly aspects of putting all the data onto the chromosome maps. Gene discovery, the group I'm heading up, will cover single-nucleotide polymorphisms (SNPs) and other components as well. Then all of that will be fed back into a database that goes to the data product group.
Each of these groups will have its own software engineering group.
BioInform: What's your role as director of gene discovery?
Kerlavage: I'm sort of getting back to my roots. I'm trained as a protein chemist but slipped into molecular biology and computational biology and bioinformatics. Now I'm coming full circle and relying on my training in protein chemistry. I'll have both a bioinformatics group and a laboratory group.
What we want to do is to find the golden needles in the haystack of this stream of data that will come through on a daily basis. That will require a triage of the data using some routines that we're defining now. Then we'll be taking candidates that we want to pursue into the laboratory and getting full-length cDNA sequences and characterizing them.
BioInform: What are you doing about software tools?
Kerlavage: It's going to be a mix of things. There's software that comes with the instruments. The Perkin-Elmer Applied Biosystems 3700's, 200 of which will be installed next month, come with an Oracle database on board to manage data collection. That determines the direction of building our own LIMS system. But that's not all. We need to build an entire data infrastructure. We need all three of these groups--acquisition, transformation, and data product--to talk to each other. Right now we're looking at building a system across the entire organization that's compatible for all of these components. So there will be pieces that we buy off the shelf. But in general we have to build our own data model and our own infrastructure to fit these other pieces.
BioInform: What will you buy off the shelf?
Kerlavage: We're evaluating that now. At GSAC we're meeting with a number of companies that have some promising software solutions. We may not necessarily end up using them, but they have good ideas for LIMS systems. There has been a lot of that presented in the bioinformatics talks here. We've seen quite a bit about the latest technology using Java and CORBA, developing LIMS systems around wireless remote devices. We're looking, but we haven't made any concrete decisions about whether we'll use that technology.
BioInform: In one presentation here, Mark Adams, who will head data acquisition and chromosome assembly at Celera, described the new operation as a "sequencing factory," with little need for creative thinkers on the shop floor.
Kerlavage: What he meant is that we have to have concrete protocols in place. The plan is for us to develop those in-house. We need systems that are very robust and very commercial-grade. We're going to build off the things we learned at TIGR, but everything we developed at TIGR was research-grade. We were doing a lot of stuff on the fly and proving and changing things.
Now we're taking a step back and asking what works well, what didn't work, how can we make improvements. We want to figure out how we can bullet-proof this whole system so that when we put it in place we can turn it on and let it run with the appropriate quality control/quality assurance in place so it really does operate very much like a factory. But the factory will be highly automated. In fact it may take fewer people to run that facility at Celera than currently operate the facility at TIGR because the new instruments require so much less attention.
BioInform: Where will that leave all these software engineers after you've set up the infrastructure?
Kerlavage: There are always going to be challenges. I don't think they're going to go home. That's just the first phase. One of the very first concerns we have is to get the data acquisition piece in place. When the instruments are ready, we don't want the data dumping on the floor, so to speak. We have to be able to capture it in a useful and safe manner that we can then carry forward to the next step.
I imagine a lot of our software engineers will become very well rounded. They'll be focused on developing a number of very different components throughout the process, and teams will shift around and change their focus as the system starts to develop more. Ultimately the emphasis will flow the same way as I've described the flow of information. Up front we'll emphasize data acquisition and transformation, and when those systems are very well in place then data product will be of greatest importance.
In contrast to what Mark said about the factory-like work of data acquisition, when we get into data transformation it's a little bit different. With gene discovery for example, we have to be creative about how we are going to find those needles in the haystack. There's no turnkey system that does that for you, and we have to apply a variety of different techniques to the problem. As we go through this process we'll discover new things that we need to do. We're probably going to have to develop new algorithms and new methodologies to make us more efficient.
The systems will still have to be very robust because we're going to have this firehose of data getting poured at us. We'll have to be able to respond to that very quickly and we'll have to have systems that are reliable that allow us to deal with that. But we also have to have a bit more flexibility. We'll learn a lot as we move on in the project.
BioInform: Celera has said it will sequence the Drosophila genome by next spring and human in three years. When that's done how will the company's focus shift?
Kerlavage: The sequence is just the framework that we'll be adding value onto. There will be the annotations that will go onto the sequence, and then as time goes on some things that are quite obvious. For instance, we will have both Drosophila and human genomes; comparisons between the two will be very valuable. Having the Drosophila sequence will help us to find genes in the human. So this will just be a core of the information that then is expandable in a number of different ways.
Building the framework of the human sequence is almost like something we have to get out of the way. There will be lots of downstream importance in a variety of fields, not just in the genomics community but in medicine and pharmacogenomics. There are downstream products such as SNP maps and database services. You can imagine that growing almost indefinitely because there are so many types of information one could add.
BioInform: Will you commercialize the technologies you develop to deal with all the data?
Kerlavage: Yes, quite possibly. One way we think about it is there are people in-house who are customers. The chromosome teams, gene discovery teams, and SNP teams are customers of the data. The challenge for the software engineering teams is to satisfy the customers, the first ones being those in-house. But outside customers will have the same sorts of needs as we do. If we develop one system right the first time, it can be used both in-house and by clients.
BioInform: What hardware have you installed?
Kerlavage: We'll be a Unix-based shop. We've been evaluating hardware vendors and are a couple of weeks away from making a decision about who we will go with. We're looking for more than somebody who can just deliver hardware to our doorstep. It's pretty clear that there would be an advantage to having a long-term relationship with a company that can provide the types of information services we need both in-house and for our clients. We're trying to see who out there has the soup-to-nuts expertise in-house that can help us with this problem--everything from providing a hardware infrastructure that will help us solve the hard computational problems to helping us deal with our own internal network, deliver information to clients, and establish an easy-to-use web presence.
We're looking to see if we can get a company that can give us a very integrated package of all of that IT infrastructure.
BioInform: Whichever provider you pick will get a great foothold in the genomics industry.
Kerlavage: That's the point we're trying to make to them! Our goal is to become a portal of genomic information. That will be a high-profile position not only for us but for the companies we partner with.
BioInform: TIGR has a unique culture. People there seem to be having a good time. Will you be able to carry that ambience over to Celera?
Kerlavage: A major factor in hiring at TIGR that will be carried over to Celera is finding people who can work well as team players. If you can put together a team of people who work well together, ideas sort of percolate up out of the group. That's the sort of culture that exists at TIGR. A fair number of people who were the founders at TIGR are now the founders of Celera. The mindset and culture is going to continue. It will certainly be influenced by the corporate side of things, and I think we realize that we need to change some of the ways we think, and inevitably that will have some downstream effect.
We still think of ourselves as a scientific organization. It's more the science that drives the culture at TIGR and I think the same thing will happen at Celera. It's a genuine interest in the science and the knowledge we can gain from what we are doing, as opposed to making widgets, that drives us.