Name: Zemin Zhang
Position: Senior scientist, Department of Bioinformatics and Computational Biology, Genentech, since 2009
Experience and Education:
Various bioinformatics positions at Genentech, 1998-2009
Postdoctoral fellow, University of California, San Francisco, 1995-1998
PhD in biochemistry and molecular biology, Penn State University, 1995
China-US Biochemistry Exchange Program, 1988-1989
BS in genetics, Nankai University, Tianjin, China, 1988
Last week, researchers at Genentech led by Senior Scientist Zemin Zhang, together with their collaborators at Complete Genomics, published an analysis of the genome of a lung cancer tumor in Nature.
For the project, Complete Genomics sequenced a primary lung tumor to 60-fold coverage and matching normal tissue to 46-fold coverage, using its proprietary combinatorial probe anchor ligation sequencing technology. The scientists discovered more than 50,000 single nucleotide variants in the tumor — an average of about 18 somatic mutations per megabase of DNA — and found that some regions of the genome are under selective pressure that limits mutations.
In Sequence spoke with Zhang last week about the study and the potential usefulness of whole-genome sequencing to find new cancer biomarkers or targets for cancer therapy. An edited version of the conversation follows.
Can you briefly discuss your work at Genentech, and how it fits into the overall company?
At Genentech, we have an internal cancer genome project. The goal is to characterize large numbers of tumor samples in order to find useful biomarkers or targets for chemotherapy. The background to this is the discovery many years ago that EGFR mutations are highly associated with how patients respond to lung cancer drug treatment. We want to find additional markers like those EGFR mutations. We have collected a large number of tumor samples of many different tumor types, and we will be using different technologies to sequence certain genomic regions, mostly protein-coding regions of a subset of genes of interest. A different lab in the same program, for example, has used mutation enrichment technology to find candidate mutations and do the follow-up by Sanger sequencing.
How did the project with Complete Genomics that you just published in Nature come about?
This particular project came about when Complete Genomics approached us with their new, very low-cost technology. We were very attracted by the potential applications of their technology — if the cost is very low, you could apply this to a much larger number of samples. This was largely a proof-of-concept work for us to see if we can actually do whole-genome sequencing. Now that we have the first case, we are quite encouraged by the quality of the data, and, in combination with the dropping cost of genome sequencing, we are thinking that perhaps, in the future, we should be more serious about whole-genome sequencing when it comes to cancer sample characterization.
What have you learned about lung cancer from this project that you did not know before? What are the most important results from this study?
The first is the sheer number of mutations we found in this sample. This is the first published study in which a lung cancer was fully sequenced from a tumor, not a cell line. We found over 50,000 mutations, and we were definitely fairly shocked by the large number. We spent a lot of time looking at technical aspects, the whole data pipeline, to make sure there was no systematic error and there was no problem with the way we analyzed the data. We also used a different technology, mainly Sequenom's MassArray genotyping technology, to independently confirm those mutations, and we found the validation rate to be very high. Based on that, from a technical point of view, we are comfortable with the findings.
The next question we asked was whether the sample we were working on is some sort of an outlier — maybe it has a mutator phenotype, and has an extremely large number of mutations. To answer that question, we compared the mutation rate in our sample with published mutation rates in other lung cancer samples from smokers and found that it is consistent.
Even though we only have one sample to work with, because we have such a large number of mutations across the genome, we were able to compare the types of mutations we found, as well as their distribution across the genome. One key finding of our analysis is that in a gene region, the mutation rate is lower than in a non-gene region. Overall, for the whole cancer genome, we observed 17.7 mutations per megabase. The number of mutations in protein-coding regions is dramatically lower than this genomic average. This is probably not surprising, but what's slightly unexpected is that, if you look at promoter regions, we also observed a much reduced mutation rate. And this has not been observed before in the somatic mutation context. In promoter regions, the rate is about 10 mutations per megabase, as opposed to 17 for the genome average. That indicates a strong negative selective pressure, or protection force, in promoter regions.
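[Editor's illustration: the per-megabase comparison Zhang describes can be sketched in a few lines of Python. This is a minimal sketch, not the pipeline used in the study. The function names and the example count and region size are hypothetical; only the rates of about 10 and 17.7 mutations per megabase come from the interview.]

```python
def mutations_per_megabase(mutation_count, region_bases):
    """Mutation density, in mutations per megabase (1 Mb = 1,000,000 bases)."""
    if region_bases <= 0:
        raise ValueError("region_bases must be positive")
    return mutation_count / (region_bases / 1_000_000)


def relative_rate(region_rate, genome_rate):
    """Ratio of a region's mutation rate to the genome-wide average.

    Values below 1.0 mean the region carries fewer mutations than average.
    """
    if genome_rate <= 0:
        raise ValueError("genome_rate must be positive")
    return region_rate / genome_rate


# Hypothetical region: 177 mutations observed across 10 Mb of sequence.
example_rate = mutations_per_megabase(177, 10_000_000)

# Rates quoted in the interview: about 10 per Mb in promoters vs. 17.7 genome-wide.
promoter_ratio = relative_rate(10.0, 17.7)
print(round(promoter_ratio, 2))  # 0.56
```

[On these figures, promoter regions carry a little over half the genome-wide mutation density.]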
In addition, we found that genes that are expressed in lung tissue also have a lower mutation rate. If you take a look at genes that are not expressed in normal lung tissues, their mutation rate is at the genomic average; it is likely that there is no selection pressure in those genomic regions at all.
This is interesting from an academic point of view, but why would that be interesting from a biomarker point of view? One important goal for us is to find useful biomarkers, or targets for anti-cancer therapy. We found that some mutations, for example in EGFR, are very useful biomarkers. We know that's a functional mutation, it's considered a driver mutation, as opposed to a passenger mutation. So you are more likely going to find a useful biomarker if you know it is a driver mutation.
We found that there is a large number of mutations in our lung cancer sample. But if you have mutations in genes that are not expressed at all, it is likely that those mutations are passenger mutations, so we can filter out lots of putative passenger mutations in a new way. When a genomic region is selected for, you are likely going to find driver mutations there. It's unlikely you are going to find a mutation in a region that is under selective pressure, but once you have a mutation, it is more likely to be a driver mutation.
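[Editor's illustration: the filtering idea described above, setting aside mutations that fall in genes not expressed in the tissue, might look like the following. This is a minimal sketch under stated assumptions, not the authors' method. The `Mutation` record, the `split_by_expression` helper, and the placeholder gene names are all hypothetical. Mutations outside any annotated gene are kept as candidates here, which is an assumption, since the interview only addresses unexpressed genes.]

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Mutation:
    chrom: str
    pos: int
    gene: Optional[str]  # None when the mutation is outside any annotated gene


def split_by_expression(
    mutations: Iterable[Mutation], expressed_genes: Set[str]
) -> Tuple[List[Mutation], List[Mutation]]:
    """Split mutations into (candidates, putative_passengers).

    A mutation is treated as a putative passenger only when it lies in an
    annotated gene that is absent from `expressed_genes`.
    """
    candidates: List[Mutation] = []
    putative_passengers: List[Mutation] = []
    for m in mutations:
        if m.gene is not None and m.gene not in expressed_genes:
            putative_passengers.append(m)
        else:
            candidates.append(m)
    return candidates, putative_passengers


# Placeholder data; gene names and coordinates are invented for illustration.
calls = [
    Mutation("chr1", 1000, "GENE_A"),   # expressed gene: kept as a candidate
    Mutation("chr2", 2000, "GENE_B"),   # unexpressed gene: putative passenger
    Mutation("chr3", 3000, None),       # intergenic: kept (assumption)
]
kept, filtered = split_by_expression(calls, expressed_genes={"GENE_A"})
```

[A filter like this only narrows the list; whether a remaining mutation is actually a driver would still need functional evidence.]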
How do your results differ from the genome of a lung cancer cell line that researchers at the Sanger Institute published last year (IS 12/22/2009)?
There are several differences between our work and theirs. Number one, they were dealing with a cell line, and we were dealing with an actual tumor sample from a patient. Many people are somewhat suspicious of cell-line-based results because cell culture may introduce new mutations. Number two is, they worked on a small-cell lung cancer sample, and we worked on non-small-cell lung cancer — these are different types of disease.
But regardless, we actually found many commonalities. The number of mutations they found was roughly 23,000, and we found 50,000, so we are in the same ballpark. With a cell line, they have a relatively pure cell population, whereas a tumor sample is a mixture of normal cells and tumor cells, and within the tumor, the cells may be heterogeneous. So we are dealing with a more mixed genetic background. Given that, the number of mutations is fairly similar.
Also, in our analysis, we observed a lower mutation rate in genes that are actively transcribed. This is something they also observed in their paper. Furthermore, in terms of the types of mutations, in both cases, we observed a very strong tobacco-induced DNA damage signature.
In terms of differences, specific pathways impacted by mutations are different. For example, in our case, we observed lots of mutations in the EGFR-RAS-RAF-MEK-ERK pathway, and this was not a prominent feature in the small-cell lung cancer.
Have you been able to interpret a lot of mutations in non-coding regions? In other words, was whole-genome sequencing worth the effort?
Mutations in promoter regions, for example, are a typical type of mutation that people would normally ignore. But in this case, we see very strong selective pressure in promoter regions that indicates that if you have mutations in those regions, there is a higher likelihood those are going to be functional; those might be driver mutations.
So when we attempt to find driver mutations, we should pay attention to regions outside of coding regions, like promoter regions, regions that may be relevant to splicing signals, microRNA binding sites, or UTR regions. This selection pressure we observed opens up a wider range of genomic regions to look for driver mutations.
So far, we have not reported any specific mutations in those regions that might be functional, but we are trying hard to get to that point.
What are you working on now?
We are working on more samples with Complete Genomics right now. They are in the process of generating whole-genome sequencing data for a larger number of samples. We are probably not in a position to comment specifically on how many we are dealing with, but the new dataset is an expansion of the lung cancer area, a mixture of cell lines and lung tumor samples.
We specifically want to look for biomarkers that might be predictive of drug response. We have a large number of cancer samples where we do have drug response data. We are performing whole-genome sequencing, along with transcriptome analysis, to see if we can find non-conventional biomarkers. Conventional biomarkers would include expression status, or somatic mutations in coding regions, or common SNPs. But with whole-genome sequencing, we obtain a lot more knowledge about genomic features that include regulatory regions, splice-site mutations, UTR mutations, translocations, or the presence or absence of viral genes. Going forward, we are expanding to other types of cancer as well.
Generally speaking, how else have you been using next-generation sequencing at Genentech to characterize cancer samples?
We do have Illumina machines here, but unlike other genomic centers where they have many sequencing machines, we have a small number. It's part of our sequencing facility, so we would use this resource for smaller projects. When it comes to, for example, ChIP-seq analysis or transcriptome sequencing, it makes more sense for us to do this ourselves. However, for whole-genome sequencing, where a lot more is involved, it makes a lot more sense to work with a place like Complete Genomics to get this done.
It's not just the sequence itself. There is a lot of informatics work that comes with this. It doesn't make economic sense for us to do this ourselves. Complete Genomics provides a service that includes not just the sequence but also part of the downstream informatics work. We still need to do quite a bit of analysis after we receive their data, but this makes it easier for us to do this.
What role do you think whole-genome sequencing will play in pharmaceutical research in the future?
The role will definitely increase a lot. As I mentioned, this project is a proof-of-concept project for us. Now that we know this can be done, at the current level of cost, it's actually possible to expand this to a much bigger scale.
We are already talking about characterizing all the cancer cell lines that we are using, because these are reagents for us, and we need to know everything about our reagents, not for the purpose of understanding biomarkers and finding targets, but to know what we are dealing with. Within the next two years, all those cancer cell lines, I believe, are likely going to be fully sequenced.
And then the next step for us, before we expand this to patients, is to find out if we can find useful biomarkers that we cannot find using traditional technology, like when you only sequence the transcriptome or the exome. That is going to depend on the second phase of our project — an expanded analysis of cancer samples where we do have drug response data that I mentioned before. If, from that process, we can find useful biomarkers that are somewhat predictive of a certain drug response, then we will become even more serious about this.
Right now, the cost of whole-genome sequencing by Complete Genomics is still roughly $10,000, and I think it's doable for a larger number of samples, but it's still too early to apply to actual patients. But if we are able to find useful information from this, we will likely go in that direction.