At A Glance
Name: Javed Khan, Principal Investigator, Oncogenomics Section, Pediatric Oncology Branch, National Cancer Institute
Education: 1989 — MS Immunology and Parasitology, University of Cambridge
1984 — BA Immunology and Parasitology, University of Cambridge
Awards: 2001 — Scholar in Training Award, American Association for Cancer Research
The National Cancer Institute unveiled a new database March 1 for normal tissue gene expression, calling it the “largest open-source database for normal tissue from human organs.”
Touted as a “dictionary of normal expression,” the database contains expression profiles for 18,927 genes, obtained from 158 tissue samples that were harvested an average of 11 hours post-mortem from males and females of different ethnic groups, ranging from 3 months to 39 years of age, according to the institute.
Likening the vastness of their dataset to the Human Genome Project, the creators of the database say they hope users can use the normal expression datasets as a baseline in their research and are offering it to users on their website [http://home.ccr.cancer.gov/oncology/oncogenomics/].
The normal tissue gene-expression database was compiled through the oncogenomics section of NCI’s Pediatric Oncology Branch and was led by principal investigator Dr. Javed Khan. A study authored by Khan that describes the various tests the database underwent prior to its release appears in the March 2005 issue of Genome Research.
BioArray News spoke with Khan this week about the significance of the new database and what kind of effect it will have on those using gene expression in their research.
I saw that you had previously released a database in 2001 that included expression profiles of small round blue cell tumors (SRBCT). How popular has the SRBCT database been?
Well at that time we didn’t really release the database as such, we released the data, but what you see on the site we’ve now just put it into the database [format] so people can search it. That’s only a few weeks old, less than a few weeks. We did that because people have been pestering me for the last year or so — since the data was published.
Yeah. Well, basically they were saying, ‘I’ve got this target, I want to target this protein, and [is it] just expressed in this cancer or not?’ Basically it’s been hard to go back and pull [the data] out and send it. The easiest thing to do is just put it out there so they can search it themselves.
So what was the impetus for creating the normal tissue database? Why did you choose to make this one available?
Well, because the technology has really advanced since then. We had a 6,000-gene array at the beginning, and now we pretty much have 18K-19K unique genes on the arrays that we’ve published, so it’s a huge dataset. It’s useful for comparing cancer genes or any disease expression group to normal expression. Basically, we felt that [it would be] a good service to the community that would be useful.
What kind of demand existed for this? Were some of the people that were pestering you about SRBCT saying that they were looking for something like this?
Well, the demand was this, because I am also actually a physician: say you have a gene to target, [you wind up asking] ‘Is this gene also expressed in kidney, liver, heart, stomach? If I target this gene, will it affect other organs or not?’ So it’s a good [control] for whatever disease data you have. So that was our motivation.
And once we published it, and there’s so much you can do with this dataset, so we felt that it’s best to just release it — let people do whatever they want to do with it, aid in whatever discovery they want to do, whether it be drug discovery or biology. I think it will be very useful for others actually. The alternative would be just to keep it [away] somewhere. It’s akin to the Human Genome Project. It’s basically having the data out there which anyone can use and do whatever they want to do — so it’s a public service for the scientific community.
How much do you estimate it cost you to make this database available?
Well, you know, it’s expensive. That is a difficult question because we run several projects at the same time. Manpower: I would say it took two man-years, around $100K. As for reagents, each sample costs about $300 to get this information, [so] around $50K. [That’s a] total of $150K spread over four years. This does not include equipment costs.
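As a rough check on the figures Khan quotes, the arithmetic can be worked through in a few lines. This is only a sketch: it assumes the reagent estimate covers just the 158 samples reported for the database, which the interview does not confirm, and the variable names are illustrative.

```python
# Figures quoted in the interview; all are approximate.
SAMPLES = 158                  # tissue samples in the released database
REAGENT_COST_PER_SAMPLE = 300  # dollars, "about $300" per sample
LABOR_COST = 100_000           # dollars, two man-years
REAGENTS_QUOTED = 50_000       # dollars, Khan's rounded reagent figure
YEARS = 4                      # period the spending was spread over

# Assumption: only the 158 database samples are counted here.
reagents_from_samples = SAMPLES * REAGENT_COST_PER_SAMPLE  # 47,400

total_quoted = LABOR_COST + REAGENTS_QUOTED  # 150,000
per_year = total_quoted / YEARS              # 37,500.0

print(f"Reagents from sample count: ${reagents_from_samples:,}")
print(f"Quoted total: ${total_quoted:,}")
print(f"Average per year: ${per_year:,.0f}")
```

Under that assumption the per-sample figure gives $47,400, which is consistent with the "around $50K" Khan cites for reagents.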
And it was paid for by the citizens of the United States?
You know these microarrays are very expensive. They are not cheap technologies and there are not many labs that can do these on a large scale, and that’s actually another reason to make it available to the public because it is a good resource that the government pays for and it should be disseminated.
How long did it take you to make it available?
2001 was when we started. And you do the experiments. You have to make sure everything works. It was a big undertaking.
I noticed you put the database through a series of different tests. What were some of the tests you used to make sure your data was good?
So we wanted to make sure that the data wasn’t just junk. [First] we used [Agilent’s Bioanalyzer] to make sure the RNA was good in the beginning. Then we looked at all the gene expression profiles. [To see if] they clustered together, we used hierarchical clustering, which tells you [if] a kidney always aligns with a kidney or a heart next to a heart, and if they plot in a similar way.
Then we looked for a bit more detail in the database, so we [asked] if we take any random set of genes, do they maintain a relationship? They did. So then we [asked], ‘what are the actual genes that are distinguishing each of these organs?’ So that gives you some idea of some of the function of the genes and the function of the organs. So this was basically a good validation of the dataset for people to use.
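The clustering check Khan describes can be illustrated with a minimal sketch. It is not the procedure used in the Genome Research study: the sample names and expression values below are invented toy data, and the choices of average linkage and a distance of 1 minus Pearson correlation are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering using distance = 1 - Pearson correlation.

    Repeatedly merges the two clusters with the smallest average
    pairwise distance until n_clusters remain.
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [frozenset([name]) for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical toy profiles: five genes per sample, two samples per organ.
profiles = {
    "kidney_1": [8.1, 2.0, 1.1, 5.0, 0.9],
    "kidney_2": [7.8, 2.3, 0.9, 5.4, 1.2],
    "heart_1": [1.0, 9.2, 3.1, 0.8, 4.9],
    "heart_2": [1.3, 8.8, 2.9, 1.1, 5.2],
    "liver_1": [2.0, 1.2, 9.1, 4.8, 0.7],
    "liver_2": [2.2, 0.9, 8.7, 5.1, 1.0],
}

clusters = average_linkage(profiles, n_clusters=3)

# The validation question: does every cluster contain a single organ?
organs_per_cluster = [{name.rsplit("_", 1)[0] for name in c} for c in clusters]
all_pure = all(len(organs) == 1 for organs in organs_per_cluster)
print("Each cluster holds one organ type:", all_pure)
```

If samples from the same organ had noisy or degraded profiles, they would tend to land in mixed clusters, which is the kind of failure this check is meant to expose.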
Did any of the results surprise you?
Well, the biggest surprise was that I expected, if you’re a kidney, that only a few hundred, maybe 200-300 genes, would define a kidney. But actually, for almost every gene the level of expression mattered. And each organ maintained a particular level of almost every single gene. To me that was surprising because it tells me there’s a control on almost every gene that’s being expressed in almost every organ, and not just the very highly expressed genes.
The next thing was that these samples were taken from many patients, of many ages, and at different times following death, and despite that, the expression profile was actually relatively consistent and stable, so that was a surprise.
You decided to run a neuroblastoma test on it as well. Why did you choose NB, and were you surprised by those results as well?
[We chose] NB because my other hat is actually as a physician, and we are planning to treat and are treating patients with neuroblastoma. And a set proportion of these patients are incurable. They have a sort of high state of the disease, so we want to find new targets to treat these patients. So that was the motivation: we wanted to study the biology of these tumors. Some of the results were expected in terms of which genes were highly expressed and which were not, and some of the others gave some new insight into these diseases.
You said that this data set is the largest to date. What are some other data sets that are available for these expression profiles?
So there are actually at least three that I know of. One is produced by a Stanford group, which has fewer samples, and one by Novartis.
I never intended to claim that what we had produced decreased the utility of the other ones; they are all useful. But I think the way we have done it, it’s actually very easy [and] hopefully user-friendly. [Users can] go in there, search for genes that [they’re] interested in and be able to explore the database fairly easily. But the other ones are very useful and should be used together, basically.
Can users access other data through your database?
What we’re going to do is combine this with a whole bunch of other cancers. What we plan to do is actually make this data compatible with Affymetrix data, because Affymetrix is one of the major platforms people use so the data should be compatible with that.
What other organs and tissues are you going to add?
We’re going to add breast, we’re probably going to add thyroid, we haven’t done lymphocytes, so we’re going to do that, and bone marrow, probably. Those are the ones we’re going to start with. What we haven’t done is things like different parts of an organ, like the cortex or the outside of the kidney compared to the insides of the kidney, or different parts of the brain. Those will usually come as tertiary [additions].
Are you going to work on developing any other databases now that this one has been put out?
Yeah, we are. So we’re submitting for publication. You know people use xenografts; we’ve profiled a whole bunch of xenografts, which are used for drug screening. So we’ve done that, and we’re going to release that once we’ve published on it; then you can compare these xenograft cancers with normal cancers. So basically we’re going to put up every single sample that we’ve published, [including] several datasets where we haven’t made the data public, but we’re going to make the data public in this sort of format, actually. There are already three more databases that we have already developed. Within this year there’ll be at least three more databases up.
Will most of your users be researchers or commercial interests?
Since we’ve released the dataset we’ve been having a lot of pharmaceutical companies actually logging on to the database. I think biotech as well as the drug development companies as well as genome biologists. A lot of research can be done on this database.