At A Glance
Assistant Professor of Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston
After spending years analyzing microarray data contrasting two known conditions, a group of researchers from Harvard Medical School and Washington University in St. Louis, Mo., decided to study the nucleotide sequences that measure gene expression on the most widely used commercial microarray technology, Affymetrix’s GeneChip. The researchers looked at every probe on the array to see if it corresponded with the gene it was intended to measure. Their study, published in the August issue of the journal Physiological Genomics, found that a significant percentage of the probe sequences (in some cases, as much as 20 percent) didn’t perfectly correspond with the appropriate mRNA, as defined by the RefSeq database. Harvard Medical School Investigator Thomas Mariani spoke with BioArray News this past week about the study, the researchers’ findings, and the implications for microarray researchers and manufacturers.
How did you get involved in using microarrays?
My career has really been focused over the past dozen or so years on the investigation of the regulation of gene expression, particularly within the pulmonary system. I started out using good, old-fashioned Northern blot analysis to look at gene expression at the RNA level, and when gene expression array technology came along, it was sort of a natural progression of my line of investigation. At the time, I was a new faculty member, and I had the luxury of being able to immerse myself in the technology. I realized that this was going to be the wave of the future.
What prompted you to do this experiment on sequence verification in microarray probes?
A group of interesting observations led us to this systematic study. We had been using microarray technology for four or five years before we began this particular study, and like many others in the field we had observed instances where we found a particular gene of interest and we further explored the specific data about that gene and were somewhat perplexed. For instance, when we looked to see where the probe sequences were located, we found [in some cases] that they physically mapped to a region that we were suspicious about whether they were truly within that gene, or maybe they mapped in reality to a sequence that was downstream and maybe within a different gene. I had a couple of post-docs in the lab at the time who made the decision by themselves that they had the capacity to systematically look into this for the Affymetrix technology. That is where it all began.
What were your findings?
We found that a lot of the probes on the Affymetrix platform did not exactly correspond to what we considered the most reliable sequence information for the transcripts that they were supposed to query. We struggled for a while trying to figure out what was the best way to interpret the accuracy of the probe sequences and finally decided that the reference sequence database (RefSeq) was the gold standard. So, we decided to use that as the best estimate for the true transcript sequence [and] ultimately tested the individual probe sequences against that database. We would make a binary judgment on each probe and say, ‘Yes, we can verify it,’ or ‘No, we cannot verify it.’ Verification of the sequence had obvious implications if you could not verify that a sequence was correct, because you would intuitively suspect that it may not give you reliable information. But, we went a couple of steps further to test whether that intuition was right. And, lo and behold, we did find under a number of different circumstances and different experimental designs that the sequences that were verifiable did function better than the sequences that we could not verify.
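The binary yes/no judgment described above can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: it assumes that "verified" means the probe sequence occurs exactly, in the same orientation, within the reference transcript for its intended target. The function names and the probe and transcript identifiers are hypothetical, and the sequences are made up for illustration.

```python
def is_verified(probe_seq, transcript_seq):
    """Return True when the probe occurs exactly within the transcript.

    Assumption: both sequences are given in the same orientation, and an
    exact substring match is the verification criterion.
    """
    if not probe_seq or not transcript_seq:
        return False
    return probe_seq.upper() in transcript_seq.upper()


def verify_probes(probes, transcripts):
    """Map each probe ID to a yes/no verification result.

    probes:      {probe_id: (target_transcript_id, probe_sequence)}
    transcripts: {transcript_id: transcript_sequence}
    A probe whose target transcript is missing counts as not verified.
    """
    results = {}
    for probe_id, (target_id, probe_seq) in probes.items():
        transcript_seq = transcripts.get(target_id, "")
        results[probe_id] = is_verified(probe_seq, transcript_seq)
    return results


# Hypothetical toy data; real probes and transcripts are much longer.
example_transcripts = {"TX_A": "GGACGTACCTTGA"}
example_probes = {
    "probe_1": ("TX_A", "ACGTACC"),   # present in TX_A
    "probe_2": ("TX_A", "TTTTTTT"),   # absent from TX_A
    "probe_3": ("TX_B", "ACGTACC"),   # target transcript not in the reference
}
example_results = verify_probes(example_probes, example_transcripts)
```

A stricter or looser criterion (for example, tolerating a mismatch) would change which probes pass; the sketch only shows the shape of the per-probe decision.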
What are the implications of this finding?
The implication is that if we use the most up-to-date transcript sequence information we can get more reproducible results from the technology. There are implications for Affymetrix [in] changing their platform. It is a platform they cannot change rapidly. And that really was the focus of this study purely because that was the technology that my group, in particular, was invested in. I want to emphasize that this should not be looked at as a negative for Affymetrix’s technology. We actually look at it as very much a positive, because it allows us to apply this commercial technology better and more reliably. It’s kind of an improvement on the application of the technology that is available.
Are there any specific recommendations you would make to microarray users or manufacturers based on these findings?
Obviously, we’ve had a lot of strong reactions to this revelation. Most users are very pleased to hear the information and very interested in the fact that the verified sequences do behave more reliably. The sequence databases are updated so rapidly, particularly the transcript sequence databases, that they are not complete. Just because we can verify the sequence against the reference sequence transcript database does not mean that it’s right or wrong. It just means we can verify it using the most up-to-date information. Everybody can do a better job validating the sequences that they use on the microarrays. What we would say is, instead of using all of the sequence that is there, and knowing that some of it is incorrect, why don’t we just filter out and use only the most accurate sequence and use that for interpretation of our data. That’s the recommendation that we would make. If you’re running a microarray experiment, you might want to look first at the data generated from probes that have been sequence-verified.
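The recommendation to look first at data from sequence-verified probes amounts to a simple filter. The sketch below is illustrative only: it assumes probe-level intensities and verification results are available as plain dictionaries keyed by probe ID, and all names and values are hypothetical.

```python
def filter_verified(intensities, verified):
    """Keep only the intensities of probes marked as sequence-verified.

    intensities: {probe_id: measured_value}
    verified:    {probe_id: bool}
    Probes with no verification result are treated as unverified.
    """
    return {
        probe_id: value
        for probe_id, value in intensities.items()
        if verified.get(probe_id, False)
    }


# Hypothetical values for illustration.
example_intensities = {"probe_1": 120.0, "probe_2": 300.0, "probe_3": 80.0}
example_verified = {"probe_1": True, "probe_2": False}
example_filtered = filter_verified(example_intensities, example_verified)
```

Treating probes without a verification result as unverified is a conservative choice made for this sketch; an analysis could equally keep them in a separate category.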