Xiaole Shirley Liu
Assistant professor, Biostatistics and Computational Biology
Dana-Farber Cancer Institute/Harvard School of Public Health
Name: Xiaole Shirley Liu
Title: Assistant professor, Biostatistics and Computational Biology, Dana-Farber Cancer Institute/Harvard School of Public Health
Professional background: 2003 - present, assistant professor, biostatistics and computational biology, Dana-Farber Cancer Institute and Harvard School of Public Health
Education: 2002 — PhD, biomedical informatics, Stanford University; 1997 — BA, biochemistry and computer science, Smith College.
Awards: 2006 — US Department of Defense Prostate Cancer Research Program New Investigator Award; 2005 — Claudia Adams Barr Award for Innovative Basic Cancer Research.
New-generation Affymetrix oligonucleotide microarrays often have blob-like image defects that force investigators to either repeat their hybridization assays or analyze their data with the defects left in place, according to a paper published this month in the online edition of Bioinformatics [Song JS, et al. Microarray Blob-Defect Removal Improves Array Analysis. Bioinformatics. 2007 Mar 1 (Epub ahead of print)].
However, Affymetrix only provides replacement chips if the defect takes up more than 10 percent of the array. The paper, published by a team of bioinformaticists at Dana-Farber Cancer Institute and the Harvard School of Public Health in Boston, showed that “if the blob [area] is 1 percent of the array area, it could greatly change your analysis result.” It was not immediately clear whether similar defects are found on other oligonucleotide arrays.
To counter these effects, the team designed a software tool that the group claims can filter out the blobs before they can impact downstream array data analysis.
The tool, called the Microarray Blob Remover (MBR), allows researchers to rapidly visualize, detect, and remove various blob defects from the .CEL files of different types of Affymetrix microarrays.
The group also shows in the paper that MBR significantly improves the sensitivity of tiling array analysis compared to leaving the affected probes in the analysis.
To learn more about the MBR tool and the trouble with defective Affymetrix arrays in general, BioArray News last week spoke with paper co-author Xiaole Shirley Liu, an assistant professor of biostatistics and computational biology at the Dana-Farber Cancer Institute and Harvard School of Public Health.
Why did you decide to design this software tool?
We are using some of the newer 5-micron resolution arrays. These are genome-tiling microarrays. You can see these big blobs on the arrays (they kind of look like a bubble) pretty frequently. We talked to our microarray core facility, and they said that this [defect] seems to appear every time Affymetrix has a new platform, especially when there's a new resolution, but that after a while the manufacturing becomes pretty standard.
The core facility is pretty confident that it's not a hybridization bubble. We suspect that it's a manufacturing artifact, probably a defect that occurred during the array synthesis. They said that it happened a lot in the early days of the U133 expression arrays, and now it's very rare. It happened pretty frequently on the 100K SNP arrays, and now it's pretty rare. Recently, it's mostly been appearing in the 5-micron arrays, which include the genome-tiling arrays, exon arrays, and 500K SNP arrays.
Our early estimate, because people at Dana-Farber haven't used the arrays that much, is that on the order of 10 percent of the 5-micron arrays have a little blob like this. At Dana-Farber, the policy is that if the blob area is more than 10 percent of the array area, we will get a free replacement from the company. Below 10 percent, the company says that in terms of downstream analysis, it doesn't influence the result.
So we decided to see whether that's the case. After some analysis, we could clearly see that even if the blob [area] is 1 percent of the array area, it could greatly change your analysis result.
How did you determine that?
We used a spike-in for the experiment. We took the raw data from an unaffected array, then took the probe signal from an array with a defect and superimposed those probe values onto the unaffected array. We then tried different blob sizes and locations. We divided the array up into nine sections, and then we put the blob on each of the sections in different sizes from 1 percent to 9 percent [of the section]. And then we ran a couple of analysis algorithms to see whether the performance would degrade with the size of the bubbles. It doesn't matter which algorithm you use. For both of the algorithms [we used], the analysis results quickly degraded as soon as there was a blob. Even a 1 percent blob is pretty bad.
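The spike-in design Liu describes can be sketched roughly as follows. This is an illustrative reconstruction, not the code used in the paper: the function name spike_in_blob, the square blob shape, and the multiplicative boost factor are all assumptions made for the example (the team transplanted real probe signals from a defective array rather than scaling values).

```python
def spike_in_blob(clean, section, fraction, boost=4.0, grid=3):
    """Return (copy of `clean` with a square blob added, set of blob cells).

    clean    : 2-D list of probe intensities (rows x cols)
    section  : 0 .. grid*grid - 1, row-major index of the section to hit
    fraction : blob area as a fraction of the section area (0.01 to 0.09)
    boost    : hypothetical brightening factor standing in for real
               defect signal (an assumption, not from the paper)
    """
    rows, cols = len(clean), len(clean[0])
    sec_h, sec_w = rows // grid, cols // grid
    top = (section // grid) * sec_h
    left = (section % grid) * sec_w

    # Side length of a square covering roughly `fraction` of the section.
    side = max(1, round((fraction * sec_h * sec_w) ** 0.5))
    side = min(side, sec_h, sec_w)

    # Centre the blob inside its section.
    r0 = top + (sec_h - side) // 2
    c0 = left + (sec_w - side) // 2

    out = [row[:] for row in clean]  # leave the clean array untouched
    mask = set()
    for r in range(r0, r0 + side):
        for c in range(c0, c0 + side):
            out[r][c] = clean[r][c] * boost
            mask.add((r, c))
    return out, mask


# Sweep all nine sections and blob sizes from 1 to 9 percent of a section.
clean = [[100.0] * 90 for _ in range(90)]
for section in range(9):
    for percent in range(1, 10):
        blobbed, mask = spike_in_blob(clean, section, percent / 100)
        # An analysis algorithm would be run on `blobbed` here and its
        # results compared with those from `clean`.
```

Because the unaffected array serves as ground truth, any change in the analysis output can be attributed to the simulated blob alone.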
How exactly does it affect the data?
Most of the probes in these blobs have a higher signal value. So you can imagine that in terms of array analysis, if a few of the affected arrays have an extremely strong distribution of outliers, it will mess up the quantile normalization of all the arrays. We have tried another algorithm that doesn't even use quantile normalization, which is our own analysis method. It has less influence, but it's still pretty sensitive if you have a blob. With our algorithm, we normalize all the probes within the same array by modeling the behavior of each probe based on its probe sequence. If suddenly you have one percent of the probes that are much brighter, it will mess up the probe behavior parameters in our model.
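Quantile normalization forces every array in a set onto a shared intensity distribution, which is why outliers on one array can leak into the others. The toy sketch below is a minimal pure-Python version written for illustration (it is not the implementation used by the group, and the four-probe arrays are made-up values); it shows a single over-bright probe on one array changing the normalized values of a clean array.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of probe intensities.

    The k-th smallest value on each array is replaced by the mean of the
    k-th smallest values across all arrays. Ties are broken by position,
    which is a simplification.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Shared reference distribution: mean of each rank across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n)]

    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


clean = [1.0, 2.0, 3.0, 4.0]
blobbed = [1.0, 2.0, 3.0, 40.0]  # one probe brightened by a hypothetical blob

print(quantile_normalize([clean, clean])[0])    # clean array is unchanged
print(quantile_normalize([clean, blobbed])[0])  # top probe is pulled upward
```

In the second call the brightest probe on the clean array is assigned the average of 4 and 40, so the defect on one array distorts the values of an array that had no defect at all.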
Have you had contact with other researchers outside of your institute that have experienced similar difficulties with these blobs?
Yes, definitely. We are most familiar with the whole-genome tiling arrays, and for humans, to tile the whole genome you need seven arrays. From our estimate, each array has a 10-percent to 20-percent defect rate. Our estimate is that most of the whole-genome tiling sets will be affected. For example, we had a collaborator who ran four samples, and they had exactly four blobs on those four sets.
And the thing is that, even if you can get a free replacement for the array, it takes some time to get a replacement from the company. Ideally, you want to have all seven arrays hybridized and scanned at the same time. Even if you get arrays as replacements for free, it's kind of a logistical hassle. So we decided to see if we can informatically remove the noise in there.
We developed this algorithm to do that. It's pretty simple. It's a Java script: you click and load the .CEL file and it will display the array. If you can visually see a blob, you just click on ‘find blob.’ It takes a few seconds to run, and it will show you where the blobs are. You can see if the regions highlighted are the correct locations. If it is not the correct location, you can change the parameters. Once you find the blob you can then click ‘save to .CEL file.’ This will place these probes into an outliers section, and in our downstream analysis, we have the option to specify that we want to ignore all the probes that are in the outliers section.
After the blobs are removed, there's no statistically significant difference in the performance. What's remarkable is that even if the array has a 9 percent blob, if you remove it the performance is still better than an array with only a 1 percent blob that is left in place in the analysis.
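The interview does not describe how MBR locates blobs internally. As a purely hypothetical illustration of the general idea (flag a contiguous patch of unusually bright probes so downstream analysis can skip it), the sketch below thresholds the intensities and groups bright neighbors with a flood fill. The function name find_blobs and the threshold and min_size parameters are assumptions for this example, not MBR's actual method or settings.

```python
from collections import deque


def find_blobs(intensity, threshold, min_size):
    """Return connected regions of bright probes as lists of (row, col).

    A probe counts as bright if its intensity exceeds `threshold`; regions
    smaller than `min_size` probes are discarded as isolated outliers.
    """
    rows, cols = len(intensity), len(intensity[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or intensity[r][c] <= threshold:
                continue
            # Breadth-first flood fill over 4-connected bright neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and intensity[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_size:
                blobs.append(region)
    return blobs
```

The coordinates returned for each region are the kind of list that could be recorded as outliers and then ignored by downstream analysis. Real array images are noisier than this assumes, so a practical detector would likely smooth the image before thresholding.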
Is the manufacturer aware of this software?
Yes, Affymetrix knows. We gave them a presentation on the software.
What has their reaction to the issue been?
They were a little surprised about the frequency but they knew about it as well. They knew that they had blobs in their arrays, but they weren't aware of how many. And since we are working with the whole-genome tiling arrays we are working with a set of seven. With that rate, at least one out of the seven is affected.
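The arithmetic behind the seven-array figure can be checked quickly. The sketch below assumes defects occur independently on each array at the 10-percent to 20-percent per-array rate Liu estimates; the independence assumption and the function names are ours, not from the interview.

```python
def prob_at_least_one(rate, n_arrays=7):
    """Probability that at least one array in a set has a blob,
    assuming each array is affected independently at `rate`."""
    return 1.0 - (1.0 - rate) ** n_arrays


def expected_affected(rate, n_arrays=7):
    """Expected number of affected arrays in a set."""
    return rate * n_arrays


for rate in (0.10, 0.20):
    print(rate, prob_at_least_one(rate), expected_affected(rate))
```

Under these assumptions, a seven-array set has roughly a 52 percent chance of containing at least one blob at a 10 percent per-array rate and roughly a 79 percent chance at 20 percent, with 0.7 to 1.4 affected arrays expected per set.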
So how popular has the software been?
I know our collaborators are using it, but we don't really have a download tracking tool. We are making the software freely available, open source, open everything. Other researchers like it; they say it's easy to use. What we envision is going to happen is, as a microarray core, as you scan an array the [Affymetrix] GCOS software needs to convert image files to the .CEL files. During this process, the technician can see if there's a blob, then they can just run the algorithm.
As these arrays from Affymetrix mature, it's likely that the rate of this defect occurring will decrease. However, most array companies are trying to increase the number of features they can put on the array. I know Affy is trying to make the features smaller, and if they do that, a new generation of arrays will have those blobs. So this tool hopefully will be useful again with the new generation of arrays.