In the wake of the announcement that the working draft of the human genome sequence is complete, BioInform spoke with four informatics and research executives from a cross section of companies to get their impressions of what it all means.
Rainer Fuchs, director of research informatics at Biogen of Cambridge, Mass.; Doug Bigwood, director of bioinformatics at Bayer in New Haven, Conn.; Andrew Lyall, chief information officer at Oxford GlycoSciences of Oxford, UK; and Mark Boguski, vice president of research and development at Rosetta Inpharmatics of Kirkland, Wash., pointed to drug target discovery and proteomics as two main beneficiaries of the added information.
Rainer Fuchs, Biogen: The sequencing of the human genome is clearly a significant scientific achievement. We’ve found the constant stream of genomic sequence data that has been produced over the years to be extremely valuable. Even a draft is extremely useful. I’m not sure that we are holding our breath for the finished sequence necessarily.
Even more important than the scientific value of this data for drug discovery is that it’s a symbol. Now, basically all possible pharmaceutical targets are known. One of the big problems has been simply finding the targets: identifying which genes and proteins could serve as targets in the first place.
By definition, you can go into the database and pull them all out. If you are interested in microbial and anti-infective research, there’s a limited universe of targets, and after the refinement of that sequence data, all these targets will be known.
The days when you could justify searching for novel genes are basically gone. What you need to do now is come up with ways of associating the relevant genes from that big database of genes with the particular disease you are interested in.
In the long run, the companies that can create the biological context for any of these genes will be the winners.
The announcement defines the baseline for our search for genetic variation. Over the next 5-10 years, that’s going to be a key activity.
I compare it to the first bacterial genome, which was a huge scientific advancement, but it really had little practical impact. The real value came from having dozens of bacterial genomes with which you could compare similarities and differences. We’re going to see the same thing with the human genome.
Having the complete human genome and not just the expressed sequences will enable new ways of searching for regulatory control regions. Some people have started already to correlate, for example, gene expression data with the existence of particular regulatory elements in genomic DNA sequence.
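The kind of correlation described here can be sketched very simply. In the toy example below, the promoter sequences, expression values, and candidate motif are all made up for illustration; the idea is just to compare expression between genes that carry a candidate regulatory element and genes that do not:

```python
# Sketch: correlate the presence of a candidate regulatory motif in
# upstream (promoter) genomic sequence with measured expression levels.
# The motif, sequences, and expression values are illustrative only.

MOTIF = "TATAAA"  # hypothetical candidate regulatory element

# gene -> (upstream genomic sequence, expression level)
genes = {
    "geneA": ("CCGTATAAAGGC", 8.2),
    "geneB": ("GGCCGGAATTCC", 1.1),
    "geneC": ("ATATAAACCGGT", 7.5),
    "geneD": ("CGCGCGATCGAT", 0.9),
}

with_motif = [expr for seq, expr in genes.values() if MOTIF in seq]
without_motif = [expr for seq, expr in genes.values() if MOTIF not in seq]

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

print(f"mean expression with motif:    {mean(with_motif):.2f}")
print(f"mean expression without motif: {mean(without_motif):.2f}")
```

A real analysis would of course use genome-wide motif scans and a proper statistical test rather than a comparison of means over four genes.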
In the past, you had companies fighting each other over the value of their sequence data. Now, you’re going to see companies fighting over the value of the biological information they can attach to sequences.
That should create a lot of new opportunities for newcomers in this field, such as bioinformatics companies like DoubleTwist that focus more on creating biological context than on the actual raw data.
Doug Bigwood, Bayer: This isn’t what we care about. It does help us in the pursuit of drug targets because there is more sequence to look at to try to pull out new genes that might be of interest to us. More sequence is better, and it doesn’t matter whether it comes from expressed sequence tags or genomic sequence.
It’s a significant milestone, but from our standpoint, it doesn’t do much for us. It makes it kind of a pain in the neck day to day to manage all this data, but that’s the only practical effect it has on us.
It’s going to take a long time to figure out what’s what. We don’t know where the genes are, whether they’re real genes or not, and what relevance they might have to the pharmaceutical industry. At least with expressed sequence tags, you know that they are expressed, that they are coding for something. That’s why people have focused on ESTs versus genomic DNA.
We still don’t have a good handle on how many genes there are. Very knowledgeable people have estimated, and still estimate, anywhere from 40,000 to 130,000 genes. It shows you how ignorant we really are and how little this really means at this point.
Celera acquired Paracel to gain access to its key technology and in-house development capabilities for analyzing these things. The only way Celera can sell its product is through the intellectual input it adds, which it clearly will not release into the public domain.
PE announced three months ago that it is going to develop a proteomics research center. Really, it’s the proteins that are important, not the DNA. The problem is that there are no good high-throughput techniques for looking at the proteome like there are for ESTs or for microarray expression experiments, because those techniques all depend on polymerase chain reaction amplification. You can’t amplify proteins in the same way. At best, we could probably see 30-40 percent of the proteins using today’s proteomics methods.
Proteomics will become a big thing. Companies that are involved in expression profiling using microarrays, such as Affymetrix, Incyte, Hyseq, and some others that are developing this technology will be important. That’s where a lot of the target validation data is going to come from.
The human genome information doesn’t make EST information obsolete, but it obviously becomes less valuable over time as more of this information becomes known and is out in the public domain.
Andrew Lyall, Oxford GlycoSciences: It’s incredibly significant for us because the human genome is the starting point for proteomics.
We can determine the sequence of proteins that we identify using mass spectrometry. The reason the human genome is so important is we can then match those proteins back to the genome and pick up all the medically important annotations available in the public domain.
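The matching step described here, from protein back to genome, can be sketched as a six-frame translation search. Everything below is illustrative: the codon table is truncated to the handful of codons in the toy example, and the genome and peptide are made up.

```python
# Sketch: locate a peptide (e.g. derived from mass spectrometry) in
# genomic DNA by translating all six reading frames and searching
# for the peptide. Codon table truncated to this toy example.

CODON = {"ATG": "M", "AAA": "K", "GCT": "A",
         "CAT": "H", "AGC": "S", "TTT": "F"}

def revcomp(dna):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(dna))

def translate(dna):
    """Translate one frame; codons outside the toy table become 'X'."""
    return "".join(CODON.get(dna[i:i+3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_hits(genome, peptide):
    """Return (strand, frame) pairs where the peptide occurs."""
    hits = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            if peptide in translate(seq[frame:]):
                hits.append((strand, frame))
    return hits

genome = "CCATGAAAGCTCC"   # contains ATG-AAA-GCT ("MKA") in + frame 2
print(six_frame_hits(genome, "MKA"))   # → [('+', 2)]
```

A production pipeline would use a translated search tool over the whole genome rather than exact string matching, but the principle of mapping peptide hits back to genomic coordinates is the same.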
Because the protein sequence is private to us and our pharmaceutical company partners, our partners get advance warning of which portions of the human genome are going to be important in the disease they are interested in.
We collect our data from material collected from patients and important disease-related experiments or clinical trials. We identify proteins in disease contexts and can use that to prioritize the human genome for the pharmaceutical industry. Most of the genome is not terribly interesting to the pharmaceutical industry, but they don’t know which bits are interesting. We’re going to tell them that.
Before the Human Genome Project, the pharmaceutical industry had a lot of low-quality targets. After the Human Genome Project, it has even more low-quality targets, which has, in effect, made the problem worse, not better. The promise of proteomics is that it will allow pharma to focus on a few high-quality ones.
The human proteome is much bigger than the genome. People are saying that the genome might have as many as 100,000 genes. The proteome might have as many as 1 million different proteins.
One gene will produce a large number of different variants of the same protein. From looking at the gene, it’s not possible to tell which of the variants is important to the disease. But if you look directly at the proteins, you can immediately tell which variant it is, something that is simply not accessible from the genomics perspective.
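The one-gene-many-variants point can be illustrated with exon combinatorics. The exon names below are placeholders, and real splicing is regulated by the cell rather than free combination, but the sketch shows why the count of possible transcripts grows so quickly:

```python
# Sketch of why one gene yields many protein variants: alternative
# splicing lets internal exons be included or skipped, so a gene with
# n optional internal exons can produce up to 2**n distinct mRNAs.
# Exon names are arbitrary placeholders.

from itertools import combinations

first, last = "EX1", "EX5"
internal = ["EX2", "EX3", "EX4"]   # optional internal exons

isoforms = []
for k in range(len(internal) + 1):
    for chosen in combinations(internal, k):
        isoforms.append([first, *chosen, last])

print(len(isoforms))        # 2**3 = 8 possible transcripts
for iso in isoforms:
    print("-".join(iso))
```

This does not even count post-translational modifications, which multiply the protein-level diversity further and are likewise invisible at the DNA level.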
As we discover the proteins that matter in major human diseases, the milestones will be related to new discoveries relevant to particular diseases.
Big advances are taking place in mass spectrometry at the moment, which will allow us to improve the throughputs and also the accuracy of sequence determination of novel proteins. Also, better sensitivity will allow us to detect the sequence of proteins at much lower levels. We’re closely involved with those developments.
Mark Boguski, Rosetta: Since the early 90s, computational biologists have grown up dealing with incomplete, inaccurate data and still found in that kind of data a gold mine of knowledge and discoveries. The difference now is that ESTs were information dense, while genomic sequence is not.
Genomic sequence is only 3-5 percent actual genes and the rest is repetitive elements and other things that you have to learn to recognize so that you can ignore them for your real analysis. We know it’s incomplete; it’s not as accurate as we’d like it to be, but we can work around those deficiencies and still extract great value from this data.
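Learning to recognize and ignore repeats is, in its simplest form, a masking step. Below is a toy sketch assuming a single made-up repeat element; real tools such as RepeatMasker match against whole libraries of repeat families rather than one exact string:

```python
# Sketch: before gene-level analysis, repetitive elements are typically
# masked (replaced with N's) so downstream analysis can ignore them.
# The repeat and the genome fragment here are made up.

import re

REPEAT = "TATATATA"  # stand-in for one known repetitive element

def mask_repeats(seq):
    """Replace exact copies of the repeat with N's."""
    return re.sub(REPEAT, "N" * len(REPEAT), seq)

genome = "ATGGCCTATATATAGGCTAA"
masked = mask_repeats(genome)
frac = masked.count("N") / len(masked)
print(masked)                                   # ATGGCCNNNNNNNNGGCTAA
print(f"{frac:.0%} of this toy sequence is repeat-derived")
```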
What we didn’t have in the early EST days that we do have now are high-throughput experimental technologies that can discover clues to function across entire genomes. What they used to call biochemistry and now call proteomics was done on a one-gene-at-a-time basis.
For single gene diseases, maybe having the protein product is enough. But for complex diseases and really understanding physiology and cell biology, we need the control mechanisms and that’s what’s missing from the EST picture. We have no idea how those genetic regulatory circuits are controlled.
Genomic sequence data, with its regulatory elements as well as the expressed gene products, is a much more powerful combination for understanding the complexity of biology, particularly when it comes to figuring out complex, multigenic diseases and traits.
Bioinformatics approaches alone have come up against a wall: experimental validation. We can do all the predictions we want, but at the end of the day, we want experimentally validated function.
When I left NCBI after 12 years, I became not a director of bioinformatics but senior vice president for R&D, which spans computational biology and high-throughput experimental biology at Rosetta, because I believe the frontier is not bioinformatics or functional genomics alone. It’s those two disciplines coming together that will lead to the advances in the future.
What Celera and the public genome project have produced is data and, to some extent, information. The challenge now for both computational and experimental biologists is to take that data and information and turn it into knowledge and insight, and to do it efficiently on a large scale.