ROCKVILLE, Md.--Human Genome Sciences announced this month that it will pay a fixed multimillion-dollar fee to the Rehovot, Israel-based bioinformatics company Compugen for work the companies' scientists will undertake collaboratively over a 12-month period. They plan to use Compugen's Leads computational analysis platform to produce a chromosomal map of expressed genes, a description of human gene organization and splicing variants, and a collection of some 500,000 single-nucleotide polymorphisms in expressed genes. Human Genome Sciences, which said it has characterized more than 95 percent of human genes from messenger RNA, will hold exclusive rights to commercialize the partnership's results.
The collaboration is unique for Human Genome Sciences, which relies largely on its in-house bioinformatics expertise. While the company uses bioinformatics tools from vendors including Compugen, DNAStar, GeneCodes, and Genetics Computer Group, Michael Fannon, vice-president and chief information officer, said he has not gone outside his 20-person bioinformatics staff for framework applications. "We've made a significant investment in that area ourselves, so even if someone has done something better than us, it would be easier for us to make up for any inefficiencies. We're the proof of concept. We get immediate feedback about how to tune them as these applications go out the door."
In an exclusive interview at his office here, Fannon spoke recently with BioInform about Human Genome Sciences' internal computational capabilities.
BioInform: Will you start by describing the computational infrastructure at Human Genome Sciences?
Fannon: Sequencing instruments--the PE ABI 377s and prior models--used to be controlled by Apples. So, we started with Macintoshes, but as we get into clinical and get into manufacturing, those are PC domains.
We've maintained a hybrid shop for a long time now, but starting out using the Macs as a standard platform helped us get jump-started because we didn't have to deal with a cross-platform problem. Getting the database built, and getting the connectivity and the network in place was done largely as a Mac-only solution and now we're adding PCs.
We were buying the machines as we were working the problem domain, so we didn't have to tackle that thorny technical problem of getting the same piece of software to run on both machines. We are doing that now like everybody else is, but we know we set a nice standard.
Everything we're running here is on our internal network, except that we maintain links to the public databases. If we have a reference that says our gene sequence is related to another gene, we'll jump on the net and go to the National Center for Biotechnology Information or Medline or another information service or journal we subscribe to. But the majority is hosted by our datacenter, which is pretty impressive. The computational capacity is not as big as some astrophysics and military intelligence applications, but it's not that far off.
BioInform: What instruments are you running in your sequencing production facility?
Fannon: At one point we had 58 machines. Today, with newer instruments, we have increased our sequencing capacity fivefold. We're using predominantly PE 377s and we have several of the PE 3700s.
Our current sequencing production rate is largely influenced by relationships we have with several pharmaceutical companies. They have paid for access to the database as well as ongoing capture of sequence data at a certain rate.
I'm manager of the sequencing facility as well, responsible for the crew that runs the sequencing instruments, prepares cDNA libraries, analyzes primary data at the sequence fragment level, and also manages full-length sequence projects.
BioInform: How does the sort of human sequencing you are doing differ from what the Human Genome Project is doing?
Fannon: The way I think of it is that the genome has 3 billion bases--contiguous pieces of DNA that the Human Genome Project is going to define as the reference genome. It's well known that not all of that sequence actually encodes proteins. Current estimates are that on the order of 3 percent of the 3 billion bases encode proteins.
Proteins are responsible for various biochemical functions that go on in the body. We have been most interested in proteins because we know how to convert at least a subset of them into medical products, diagnostics, or targets for drugs.
The pharmaceutical and biotech industries know how to use proteins. Knowledge of the organization of the gene on a chromosome is generally not needed to express and test the activity of the protein encoded by the gene.
Cells, for some reason that we don't understand yet, know how to read and interpret the protein-coding regions of DNA. We intercept the message that's being sent from the nucleus out to the protein factory in the cell and reverse-engineer that into a complementary DNA strand, splice it into a bacterium, and reproduce it. We actually have the message that encodes the protein and not the full DNA.
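To make the "reverse-engineer that into a complementary DNA strand" step concrete, here is a minimal Python sketch of the base-pairing logic behind first-strand cDNA. It is an illustration only, not Human Genome Sciences' software: the function name `first_strand_cdna` and the nine-base message are invented for the example, and real cDNA synthesis is an enzymatic laboratory process rather than a string operation.

```python
# Illustrative sketch only: the function name and the example message are
# hypothetical, not taken from Human Genome Sciences' systems.

# Watson-Crick pairing from an RNA template base to its DNA partner.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the complementary DNA strand for an mRNA.

    Both the input and the result are written 5'->3', so the template is
    read in reverse while each base is complemented.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


if __name__ == "__main__":
    message = "AUGGCCUAA"  # made-up message: start codon, one codon, stop codon
    print(first_strand_cdna(message))  # prints TTAGGCCAT
```

The point of the sketch is that the cDNA carries the same information as the message itself, which is why only the expressed, protein-coding portion of the genome is captured.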
BioInform: So the huge challenge of genome assembly is not something you're facing?
Fannon: It's not as challenging for us because we are less interested in the 97 percent intervening sequence than in the sequence that codes for proteins. Because proteins are drug targets, and are drugs themselves, they are the elements that we want to work with. Recombinant DNA technology enables us to take those cDNAs and splice them into mammalian cells, bacteria, or yeast, and as those cells or organisms grow, human proteins are being expressed.
It's marvelous. This is what biotechnology has been based on since the beginning.
We were very early to adopt this strategy and sequence a lot of these messages and start to gather inferences from computational methods to give us the first cut of what these might be. There's a whole series of strategies that we would use to identify the information in that set of experimental results.
Experimental verification tells us that there's a very high correlation between these sequences and those that code for proteins. So it's great to go mining in that. We've reduced the problem space 30-fold.
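The "30-fold" figure follows from the numbers Fannon quoted earlier: roughly 3 percent of 3 billion bases. A short Python check of that arithmetic is sketched below. The variable names are ours, and the inputs are the interview's round estimates, not measured values.

```python
# Back-of-the-envelope check using the round figures quoted in the interview.
GENOME_BASES = 3_000_000_000   # "the genome has 3 billion bases"
CODING_PERCENT = 3             # "on the order of 3 percent"

# Integer arithmetic avoids floating-point rounding in the base count.
coding_bases = GENOME_BASES * CODING_PERCENT // 100
fold_reduction = 100 / CODING_PERCENT

print(f"Protein-coding bases: {coding_bases:,}")
print(f"Reduction in problem space: about {fold_reduction:.1f}-fold")
```

With these inputs the coding portion comes to about 90 million bases, and the reduction works out to roughly 33-fold, which is consistent with the "30-fold" round number in the interview.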
There's overwhelming complexity involved with sequencing the genome start to finish; it's a project we've chosen not to pursue ourselves.
The other thing you find out is that there's richness in this method that is not immediately apparent. You find a lot of proteins, but you also find out where they're made. We're determining a gene-based anatomy of what proteins are being made in different parts of the body. It's giving us a microscope into an almost molecular anatomy. Anatomy has been gross features, down to cells, down to subcellular things, now we're going down to the actual proteins.
BioInform: Your department is also managing internal bioinformatics, isn't it?
Fannon: Yes. Actually, our internal research groups are the most aggressive consumers of genomic data that we've come across. This puts us in a unique position. We're simultaneously a producer--we take raw materials like primary tissue from cells and turn that into sequence information--and we do all of the downstream characterization work. We've proven to be very aggressive and very successful so far in our ability to mine this kind of asset.
BioInform: As the bioinformatics department serving Human Genome Sciences researchers, is your bigger job to manage data or to develop new analysis tools?
Fannon: We do all of that. Our model is a little different in that we didn't ever have a bioinformatics group responsible for making gene discoveries. We set ourselves up doing factory automation work. A million samples go through our sequencing facility every year. How do you collect that information off the instruments and analyze it?
We productionized the information-creation task, and then we developed tools for molecular biologists to be able to mine the data.
A lot of people say they are doing that now in their bioinformatics departments. We were the prototype. Bioinformatics was once considered to be individuals who were both skilled in computer science and manipulation of the data, and could reformat it and pull things off the internet, compile it, and run analyses.
We did that and packaged it up and turned the steering wheel over to molecular biologists, whose expertise is to interpret the data. We don't require them to come up to speed on syntax and the query languages and all that. They're working in a space that they're comfortable with.
All those tools are packaged in a way that doesn't require a high degree of computational sophistication. That has allowed us to take the bioinformatics function and distribute it out to our customer base very effectively.
BioInform: What do you mean when you call your bioinformatics platform fully integrated?
Fannon: When we take on a new technology--high-throughput sequencing, or some of the functional genomics work we're doing, or gene expression work--we basically design the systems and the experimental methods completely hand-in-hand.
In other industries in which I've worked, it's typical to have a group off to the side that has a great idea and you ask, How does it fit? Does it add value to our dataset? How can we relate samples? How do we identify and track it with our LIMS system?
All of that kind of thinking occurs extremely early in what we do. We go from concept to a very highly optimized, high capacity system in a short period of time. That is sort of a trademark of ours.
That level of integration means that we can trace back from a patent, reproduced through our high-volume patenting system, to find all the genes, all the sequences that were done on the genes, any test results that we have on those genes, the technician who ran the EST on a certain day, and on which sequencing instrument in the laboratory.
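The trace-back Fannon describes, from a patent down to the technician and instrument behind each sequence, amounts to walking a chain of linked records. The Python sketch below shows one hypothetical way such records could be linked. Every class, field, and identifier in it is an assumption made for illustration; the interview does not describe the company's actual schema or LIMS.

```python
# Hypothetical data model: none of these names come from Human Genome Sciences.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SequencingRun:
    est_id: str        # the sequence fragment (EST) produced by the run
    technician: str
    instrument: str
    run_date: date


@dataclass
class Gene:
    gene_id: str
    runs: list[SequencingRun] = field(default_factory=list)
    test_results: list[str] = field(default_factory=list)


@dataclass
class Patent:
    patent_id: str
    genes: list[Gene] = field(default_factory=list)


def trace(patent: Patent) -> list[tuple[str, str, str, str, str, str]]:
    """Flatten a patent into one row per sequencing run behind its genes."""
    return [
        (patent.patent_id, gene.gene_id, run.est_id,
         run.technician, run.instrument, run.run_date.isoformat())
        for gene in patent.genes
        for run in gene.runs
    ]


if __name__ == "__main__":
    # Placeholder records, invented for the example.
    run = SequencingRun("EST-0001", "technician-A", "instrument-07", date(1999, 1, 15))
    gene = Gene("GENE-42", runs=[run], test_results=["assay placeholder"])
    for row in trace(Patent("PATENT-1", genes=[gene])):
        print(row)
```

In a production system these links would more likely live in relational tables joined by keys, but the idea is the same: each level keeps a reference to the level below it, so nothing has to be reconstructed after the fact.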
BioInform: Is this what has been called your "lawyer in a box"?
Fannon: I don't really like that terminology--our attorneys have a lot of work to do. But we've shifted the load to more high-value work.
In the traditional model, a scientist goes to an attorney and says, I think I have a patentable invention. They have to negotiate text to put this whole thing together, and they're managing all these documents themselves.
In our case, we knew we were undertaking a very high-scale sequencing project, so we put in support for patenting that didn't require any reentry of data.
As our scientists review sequences, we're capturing what it is we need to know for patenting. Scientists make evaluations: Is this patentable or should I not consider it for some reason? They also synthesize tissue distribution, gene structure, and sequence similarity information to write a description of the gene and its possible uses.
We don't ask our scientists to worry about things like file names, directory locations, SQL query languages. They fill out a form online with various checkboxes and places to put comments. The inventor puts in words about homology, tissue distribution, what the possible utility is, and what's the rationale for why this gene might be useful. This is all collected through our network. It's collective knowledge that's being attached to other knowledge we were able to get through computational methods.
When the patent attorneys are ready to collect this set of data, it's all there in a central database. It tells us all kinds of information and it's all hotlinked: the clone, the assembly that it's in, significant matches to various databases, the length of the predicted open reading frame, whether or not there's a project that one of our scientists had expressed interest in independently.
Then the attorney has a screen to fill out things such as the registration number, the attorney of record. Then you can readily pick a set of these genes and ask the system to compose a draft of the patent text based upon all these comments from investigators. It launches Microsoft Word for you and presents you with a nicely formatted document.
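The drafting step described above, in which form entries from investigators are combined with the attorney's details into a document, can be pictured as simple template filling. The Python sketch below is a guess at the general shape of such a step, not a description of the company's system: the record fields mirror the items Fannon lists (homology, tissue distribution, utility), while the function name, layout, and sample values are invented.

```python
# Hypothetical sketch of composing a draft from stored form entries.
from dataclasses import dataclass


@dataclass
class GeneAnnotation:
    gene_id: str
    homology: str
    tissue_distribution: str
    utility: str


def compose_draft(attorney: str, registration_number: str,
                  annotations: list[GeneAnnotation]) -> str:
    """Assemble plain-text draft sections from investigators' comments."""
    lines = [
        f"Attorney of record: {attorney}",
        f"Registration number: {registration_number}",
        "",
    ]
    for note in annotations:
        lines += [
            f"Gene {note.gene_id}",
            f"  Homology: {note.homology}",
            f"  Tissue distribution: {note.tissue_distribution}",
            f"  Possible utility: {note.utility}",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values, invented for the example.
    notes = [GeneAnnotation("G1", "similar to a known growth factor",
                            "found mainly in liver libraries",
                            "candidate therapeutic protein")]
    print(compose_draft("A. Attorney", "00000", notes))
```

The design choice the interview emphasizes is that scientists enter each comment once, at review time, so the draft can be generated without anyone retyping data.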
BioInform: Will you describe the activities the company's newest facility will house?
Fannon: The building down the street is for manufacturing proteins. What we're doing is growing bacteria with human genes inserted in them. As the bacteria grow they express the human protein, which we purify. The proteins are the product.
It's a drug manufacturing facility that uses techniques similar to those used for production of insulin or growth hormone. Our products are naturally occurring human compounds that we've discovered through our research. We've demonstrated their activity in animals and now we're in the process of taking the lead candidates and putting them in clinical trials.
Perhaps the biggest thing that differentiates us from other people in the bioinformatics business is that we're a producer of raw data: primary sequencing of tissues, cell lines, cells from different diseases treated different ways. We use a lot of computational analysis to reduce our 2 million-plus human sequence fragments and the million-plus from the public domain to a set on which someone can actually do traditional biological testing.