Fast forward to 2050… When chip-arettes hit the streets in the 1990s, they were oh-so-cool. All the “in” scientists were hybing away —10, 20, even 100 packs a year. Filters or glass, cDNAs or oligos, spotted or photo-lithed. A choice for every taste and budget. And so much software with such fancy names. How could you resist something called weighted pair-group method average agglomerative hierarchical clustering using Pearson correlation?
Who knew they’d be so bad for you? Who knew they’d be so addictive? I cry when I see all those chain-chippers today filling their disks with chip-crud that cannot be confirmed. But the worst, I think, is the second-hand smoke and mirrors. How many good scientists, especially computational biologists, have been taken down by bad data from the chipper next door?
Back to the present... Could it happen? Is it happening already? Read on and draw your own conclusions.
The Statistician General’s Report
Nature Genetics published the equivalent of the surgeon general’s report on microarrays in December 2002. Their 90+ page supplement, The Chipping Forecast II, included perspectives and reviews from many leading microarray statisticians, data analysts, and practitioners. While many articles extolled the virtues of microarray technology, a few were downright scary.
Here are some choice quotes:
• “[T]here is currently no convincing evidence to support a high level of intralaboratory reproducibility, reliability, precision and accuracy of data derived from global gene expression technologies …” said Emanuel Petricoin III et al. in an article about medical applications.
• “The correlation observed between … duplicate spots on a single microarray slide will typically exceed 95%... However, if the same target [is]… hybridized to two different microarray slides, the correlation ... is likely to fall to the 60-80% range … Correlations between samples obtained from individual inbred mice may be as low as 30%. If the experiments are carried out in different laboratories, the correlations may be lower still,” warned Gary Churchill in his article on experimental design and statistics.
• “There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels. … There are many approaches to normalizing expression levels,” reported John Quackenbush in his article on normalization. He went on to explain the many simplifying assumptions these methods rely on, and problems they don’t handle.
• “[T]here is no one-size-fits-all solution for the analysis and interpretation of genome-wide expression data. … There are few unbiased comparative studies of prediction methods … [and] most array data sets lack enough samples to prove a method clearly superior,” opined Donna Slonim in her article on data analysis.
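Quackenbush’s caveats are easier to appreciate with numbers in hand. Here’s a minimal sketch (with made-up intensities) of the simplest approach, total-intensity normalization of a two-channel array. Note that it bakes in exactly the simplifying assumption he warns about: that most genes are expressed equally in both samples.

```python
import numpy as np

# Hypothetical two-channel intensities for 6 genes.
cy5 = np.array([100.0, 220.0, 95.0, 400.0, 150.0, 180.0])
cy3 = np.array([210.0, 430.0, 200.0, 790.0, 310.0, 350.0])

# Total-intensity (global) normalization: assume most genes are
# unchanged, so the summed signal should be equal in both channels.
scale = cy5.sum() / cy3.sum()
cy3_norm = cy3 * scale

# Log2 ratios before and after normalization.
raw_ratio = np.log2(cy5 / cy3)
norm_ratio = np.log2(cy5 / cy3_norm)
```

After scaling, the log ratios center near zero instead of reflecting the overall intensity imbalance between dyes. If the “most genes unchanged” assumption fails, this procedure quietly distorts the data, which is Quackenbush’s point.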
Rodrigo Chuaqui et al. discussed the difficulty of validating microarray studies. They mentioned one in which “the majority of array results were qualitatively accurate; however, consistent validation was not achieved for genes showing less than a four-fold difference on the array. [emphasis added].” They also discussed some specific anomalies they have observed. “[A] significant number of [probes] … produce ‘non-specific’ background signals during the experiment … [and] produce (often strong) signals that are interpreted as ‘equally expressed’ in the biological samples. … [A] subset of target cDNAs will hybridize strongly not only to their intended DNA probe but also to other DNA probes on an array, ranging from a few to several dozen.”
A common theme in the papers is the need for replication going all the way back to the biological source. Churchill said it nicely: “The many sources of variation in a microarray experiment can be partitioned along … three layers. Biological variation … is intrinsic to all organisms … Technical variation … is introduced during the extraction, labeling and hybridization of samples. Measurement error … is associated with reading the fluorescent signals. ... It is tempting to avoid biological replication in an experiment because results will appear to be more reproducible. [This] is illusory, however, and significant findings may simply reflect chance fluctuations in the particular animals chosen for the experiment.”
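Churchill’s layers can be teased apart with a toy nested design. The sketch below (hypothetical numbers: one gene, three mice, two technical replicate arrays each) estimates technical variance from the spread within each mouse and biological variance from the spread between mice:

```python
import numpy as np

# Hypothetical log-expression for one gene: 3 mice (biological units),
# each measured on 2 technical replicate arrays.
data = np.array([[5.1, 5.3],    # mouse 1
                 [6.8, 6.6],    # mouse 2
                 [4.0, 4.2]])   # mouse 3

# Technical variance: spread of replicates within each mouse.
tech_var = data.var(axis=1, ddof=1).mean()

# Biological variance: spread of the mouse means, minus the technical
# noise that leaks into each mean (tech_var / n_replicates).
mouse_means = data.mean(axis=1)
bio_var = mouse_means.var(ddof=1) - tech_var / data.shape[1]
```

With these numbers the biological variance dwarfs the technical variance, which is exactly why skipping biological replicates makes results look deceptively reproducible: the arrays agree with each other, but only about the particular animals you happened to pick.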
Scared yet? I am.
Digging into the Data
I decided to look at the literature and see what scientists are really doing, or at least publishing. Are the chippers heeding the advice of the statistician general’s report, or are they still puffing away at will?
The first challenge was to assemble a dataset of microarray papers. I looked first in PubMed for papers with microarray-related terms in their titles and abstracts, and found 359 papers published in 2003 (through February 22), 2,188 published in 2002, 1,250 in 2001, and just 505 in 2000. A little preliminary work revealed that many papers that report microarray results don’t mention this in their title or abstract. So I followed up with full-text searches at HighWire (which covers 348 journals) and found 682 papers published in 2003 (through February 22), 3,051 in 2002, 1,791 in 2001, and 893 in 2000.
This is obviously too many papers to scrutinize, so I limited my search to Science, Nature, and Cell. The results at PubMed were 0 papers in 2003 (through February 22), 22 in 2002, 19 in 2001, and 18 in 2000. Full-text searches at the respective websites found 24 papers in 2003 (through February 22), 228 in 2002, 165 in 2001, and 146 in 2000.
This is still a lot of papers. I narrowed it down to all the 2002 Science, Nature, and Cell papers that popped up in my PubMed search, and all the 2002 Science papers found by full-text searching. This came to 64 Science papers, 10 from Nature, and three from Cell.
I scanned the papers quickly to find those that reported original microarray results. Of the Science papers, 19 passed the test, as did six from Nature, and all three from Cell.
I read the 28 relevant papers more carefully and tabulated some basic properties. (Note that the totals below add up to more than 28 since some papers fall into multiple categories.)
Organism or system studied: human – 4; mouse – 5; ape – 2; human or mouse cells – 4; fly – 4; mosquito – 1; worm – 3; Plasmodium (the malaria parasite) – 1; yeast – 7.
Type of array used, with the caveat that some papers were vague about this rather important experimental detail: Affymetrix – 13; spotted cDNA – 8; spotted oligos – 4; genomic – 6; macroarray – 1.
Data availability: NCBI’s GEO repository – 4; EBI’s ArrayExpress repository – 0; Stanford Microarray Database – 1; supplementary material on the journal’s website – 5; author or vendor website – 2; not stated – 16. I find it amazing that 16 of 28 papers (almost 60 percent) don’t state whether their data is available or how to get it.
I also cataloged the kinds of experiments discussed in these papers. Nine papers studied the transcriptional targets of specific genes. In the simplest case, the experiments were looking for genes directly regulated by a single transcription factor. More complicated studies looked at direct and indirect targets of multiple genes. Ten papers looked for genes involved in particular biological processes or states. Examples include a search for genes involved in circadian rhythm, and a couple of papers looking for expression patterns that are common to different types of stem cells. Two papers sought to devise diagnostic predictors for two different cancers. Five papers fell into a broad category of genomic studies. These included surveys to find new operons in C. elegans, new transcribed sequences on human chromosomes 21 and 22, and transcription factors that bind to all known regulatory regions in yeast. Two papers did not fit any category: one was on SNP discovery, and the other used expression profiles to assess the functional equivalence of computationally predicted orthologs in mosquito vs. fly.
I tried to analyze the level of replication in each study and to separate biological from technical replication. I was unable to do this in many cases, because the papers did not describe the experimental design in enough detail. What I was looking for is pretty basic: what are the treatment groups (often called arms), how many biological units (e.g., patients or animals) are in each group, and are samples split into technical replicates or merged into pools before the outcomes are measured? Remarkably, many papers don’t provide this information, and those that do often relegate it to online supplementary material. In part, this reflects a perverse tradeoff by journal editors to publish a greater number of inscrutable papers rather than a smaller number of clearly written ones.
In the end, I analyzed the overall amount of replication in each study, without distinguishing biological from technical. Even this should be taken with a grain of salt as papers often made contradictory or unclear statements in this regard. Here are the numbers: no replication or paper unclear – 11; duplicates – 4; triplicates – 8; more than three replicates – 5.
Many studies continued to rely on fold change as their method for declaring a gene’s expression to be “significantly” different in one group vs. another — nine papers were in this class. Twelve papers used more sophisticated methods including ANOVA and various forms of clustering. For seven papers, the method was unclear or unstated. Seven papers reported time series experiments; none used analytical methods that exploit the temporal correlations that are expected in such data.
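To see why statisticians grumble about bare fold-change cutoffs, consider this sketch (invented log2 values): two genes can show the same apparent two-fold change while one is backed by tight replicates and the other by noise. A simple two-sample t statistic separates them; a fold-change threshold does not.

```python
import math

# Hypothetical log2 expression values: 3 replicates per group.
group_a = [8.0, 8.1, 7.9]    # tight replicates
group_b = [9.1, 9.0, 9.2]    # about 2-fold higher, also tight

noisy_a = [7.0, 9.0, 8.0]    # similar means, noisy replicates
noisy_b = [8.0, 10.0, 9.1]

def mean(xs):
    return sum(xs) / len(xs)

def fold_change(a, b):
    # Data are already log2, so fold change is 2**(difference of means).
    return 2 ** (mean(b) - mean(a))

def t_stat(a, b):
    # Two-sample t statistic with pooled variance.
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    sp = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                   / (len(a) + len(b) - 2))
    return (mean(b) - mean(a)) / (sp * math.sqrt(1 / len(a) + 1 / len(b)))
```

Both gene pairs clear a two-fold cutoff, but the t statistic for the noisy pair is an order of magnitude smaller than for the tight pair. A fold-change filter treats them identically.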
Of the 12 studies using spotted arrays, only two employed the dye-swapping procedure recommended by many statisticians.
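For readers wondering what dye swapping buys you, here’s a minimal sketch (hypothetical numbers) of how averaging a swapped pair of arrays cancels gene-specific dye bias:

```python
import numpy as np

# Hypothetical true differential expression (log2) for 4 genes.
true_effect = np.array([1.0, 0.0, -1.0, 0.0])
# Gene-specific dye bias (e.g., Cy5 incorporates better for some genes).
dye_bias = np.array([0.3, 0.5, -0.2, 0.4])

# Array 1: sample A in Cy5, sample B in Cy3 -> bias adds to the ratio.
array1 = true_effect + dye_bias
# Array 2 (dye swap): labels reversed, so after flipping the measured
# ratio back, the same bias is subtracted instead of added.
array2 = true_effect - dye_bias

# Averaging the swapped pair cancels the gene-specific dye bias.
estimate = (array1 + array2) / 2
```

Without the swap, gene 2 (truly unchanged) would look half a log unit up for no biological reason at all.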
The only microarray analysis packages widely mentioned in these papers were the ubiquitous Affymetrix suite and Eisen’s Cluster and TreeView. A couple of papers used GeneCluster from Todd Golub’s group at Whitehead, and one used dChip from Wing Wong and Cheng Li at Harvard. Many of the sophisticated analyses relied on methods developed locally, often by outstanding statisticians, for the specific problem at hand.
Eleven papers reported the results of confirmatory experiments. These generally involved small numbers of genes that showed large expression differences. A few papers also confirmed some negative findings. Nine papers compared their results to information already known to see how well their microarray study picked up known positives and negatives. Ten papers reported no confirmation studies at all.
Worth the Worry
Despite the dire warnings of well-meaning statisticians, chippers continue to hyb away with little concern for the dangers involved.
Not everyone does, of course. About 60 percent of the papers used some replication, which is the first step in responsible chipping. About 40 percent carried out sophisticated analyses that provide some mathematical confidence in the results. And about the same number conducted some experimental confirmation of their microarray results. So maybe there’s hope.
Another possibility to consider is that the statisticians may be wrong. Maybe they’re just worrywarts and do-gooder killjoys. Maybe chipping is actually safe and reliable and produces clean data and good health.
As always, time will tell. But as I’m sure you’ve guessed, my money is on the worrywarts.
Looking for Literature: Further Complicating Microarray Madness
My hunt for microarray papers illustrates many of the annoying problems that bedevil efforts to do systematic literature searches. The first problem I encountered is that MEDLINE does not have a microarray MeSH term, which forced me to fall back on text searching. The next problem is that MEDLINE (and its cousin PubMed) does not support full-text search. In other words, you can’t search for words in the bodies of articles, and my PubMed searches were basically limited to titles and abstracts. The third problem is that many papers that report microarray experiments don’t say anything about this in their title or abstract. The net effect is that many relevant papers are completely missed by PubMed searches.
The lack of full-text searching in MEDLINE and PubMed is an anachronism given modern technology. Publishers send MEDLINE the full text of journals in electronic form, which the MEDLINE curators read to assign MeSH terms. But MEDLINE does not go the next step of providing so much as simple Google-like indexing of the text, or the even simpler step of extracting technical terms from the text and allowing them to be searched. I don’t know if this is a contractual limitation imposed by the publishers or ambivalence on MEDLINE’s part. Whatever the reason, it creates a lot of pain for people trying to assemble bibliographies on specialized topics.
The downside of full-text searching is that it finds lots of irrelevant papers that you have to sift through by hand. Of the 64 Science papers found by my full-text search, only 19 (30 percent) were original research papers that described actual microarray experiments. The others were perspectives, reviews, news stories, articles about new microarray technologies, and articles that mentioned microarrays in passing.
Full-text searching of limited journal collections is available at a number of websites. Stanford’s HighWire offers full-text searching for 348 journals, including Science, but not Nature or Cell. Nature offers full-text searching for 50 journals, including, of course, the complete Nature family. Cell provides eight journals. Other publishers have their own sites.
For this article, I searched these websites manually. I learned some useful facts about the nuances of each site. HighWire and Cell have fairly small limits on how long a query can be; the max is something like 90 characters. All of the sites let you search for phrases that you quote as in Google, e.g., “DNA array.” Also, they all permit the use of ‘*’ as a wild card, e.g., microarray* to search for any word that starts with “microarray.” However, Cell doesn’t let you combine these features, as in “DNA array*”.
To systematically search the literature, we need software that can interact with each of the full-text sites and combine the results. Naturally, the sites are all different, and none seems to offer a computer-friendly means of interaction. This puts a huge burden on the software we need — it must be programmed to talk to each site in its own idiosyncratic way, and must resort to screen scraping to extract the results. Blah. What a pain!
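To make the pain concrete, here’s a sketch of the screen-scraping half of the job. The HTML fragments and page structures below are invented for illustration; each real site would need its own hand-tuned parser, which is exactly the burden I’m complaining about.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragments as they might come back from two
# journal sites' search pages (structure invented for illustration).
SITE_A = ('<ul><li><a href="/p/1">Paper one</a></li>'
          '<li><a href="/p/2">Paper two</a></li></ul>')
SITE_B = '<div class="hit"><a href="/art/9">Paper two</a></div>'

class LinkScraper(HTMLParser):
    """Screen-scrape article titles from anchor tags in a results page."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
    def handle_data(self, data):
        if self.in_link:
            self.titles.append(data.strip())

def scrape(html):
    parser = LinkScraper()
    parser.feed(html)
    return parser.titles

# Combine hits across sites, de-duplicating by title.
combined = sorted(set(scrape(SITE_A)) | set(scrape(SITE_B)))
```

And this is the easy part: a real tool would also have to fetch each site’s results pages, cope with their query-length limits and wildcard quirks, and survive every redesign of every site.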
I am on the prowl for software or a website that does this. If you know of anything, let me know. I’ll pass the information along in a future column.
For references and links to tools and papers mentioned in IT Guy columns, visit www.genome-technology.com.