CRUK Team Designs Tool to Remove Defective Data from Illumina Arrays


Name: Andy Lynch
 
Title: Senior research associate, Cambridge Research Institute, Cancer Research UK
 
Professional Background: 2006-present, senior research associate, computational biology group, department of oncology, Cambridge Research Institute, Cancer Research UK; 2002-2006, Centre for Applied Medical Statistics, University of Cambridge; 2001-2002, Centre for Process Analytics and Control Technology, University of Newcastle, UK
 
Education: 2002 — PhD, statistics, University of Sheffield, UK.
 
Whether you call them “spatial artifacts” or “blobs,” array defects can cause headaches for users who may mistake an unusually high or low signal from an errant probe for something more meaningful.
 
To date, several free software tools have been developed to identify and remove poor-quality data resulting from defects on microarrays. Two of these tools, Harshlight, developed at Rockefeller University, and Microarray Blob Remover, developed at the Dana-Farber Cancer Institute, were designed specifically for use with the Affymetrix GeneChip platform (see BAN 3/20/2007).
 
Now, a team of computational biologists from the Cambridge Research Institute, part of Cancer Research UK, has developed a tool for Illumina’s BeadArray platform that separates good quality data from those altered by array defects.
 
In a paper published this month in Bioinformatics, the CRUK researchers describe how they used the tool, called BeadArray Subversion of Harshlight, or BASH, to exclude spatial artifacts from Illumina microarrays.
 
According to the researchers, the software takes advantage of Illumina’s array design, which places probes at positions that differ from chip to chip rather than at the same location on each chip, the method preferred by most of its rivals.
 
In the Bioinformatics paper, the team applied BASH to screen raw array data for unusual signals likely to be caused by a defect, excluded those signals from the data set, and returned the resulting data to users.
 
To learn more about BASH, BioArray News this week spoke with Andy Lynch, a senior research associate at CRUK. Below is an edited transcript of that interview.
 

 
What is your primary area of research, and for what purposes have you been using Illumina arrays?
 
The Cambridge Research Institute of CRUK has a number of research groups headed up by different principal investigators. We are a computational biology research group but we also have a bioinformatics core in our building. There is some overlap in that sometimes a problem will come up that requires more in-depth research than a core can offer, because they are trying to offer high-throughput analysis. So, that is where members of our groups will perhaps get involved.
 
We also have a certain expertise in designing experiments that they can tap into. We are working with PIs and other groups that are looking at renal cancer, ovarian cancer, and breast cancer, amongst others. We are involved in deciding what kinds of questions can be sensibly answered with the available resources.
 
My primary research is in methods of computational biology, obviously with applications in cancer. There are three main strands of my research. One is looking at DNA copy number variation, another is looking at experimental design, and the third is looking at methods for analyzing data that come out of Illumina technologies, which is where this paper came from.
 
Which Illumina products have you been using in your studies?
 
We have used pretty much all of them as a group. The expression arrays are the ones we use most, but also microRNAs, methylation, copy number, genotyping — we use a wide variety of Illumina technologies, plus, of course, the [Illumina Genome Analyzer] sequencing technology. At the moment, the biggest project is using Illumina expression for a breast cancer study that is going on here.
 
In your opinion, what is a spatial artifact, and what are some of the reasons they occur on Illumina arrays?
 
We would take a spatial artifact to be, at a very basic level, any influence of a probe’s location on the microarray on the measurement we get from it. For Illumina’s BeadArrays, you have got so many replicates of each probe type that it is very easy to see trends across the array if they exist, assuming you have the raw data. If you have a measure for each probe, you can look at the differences of the probes from their common average and see where on the array problems lie.
 
The problem with Illumina is that most people never see that raw data. If you are using the standard Illumina analysis tools, the data you see will be an average of the, say, 30 replicate probes after some outliers have been removed. What you won’t have is the raw information about what those 30 values were and where on the array they came from.
 
So this is one of the areas that our research group is interested in. It is fairly straightforward to have your BeadScan record the raw data as well as the summary data. One of our areas of work is seeing what extra you can get out of Illumina if you have that data.
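[Ed. note: To illustrate the idea Lynch describes, the sketch below shows one way bead-level data could be screened for spatial trends: compare each bead to the average of its replicates and map the deviations across the array. This is a minimal, hypothetical Python example, not the beadarray or BASH code, and the column names and grid size are assumptions.]

```python
# Conceptual sketch: with bead-level data in hand, compare each bead's
# log-intensity to the mean of its replicates and bin the deviations spatially.
import numpy as np
import pandas as pd

def residual_image(beads: pd.DataFrame, grid: int = 50) -> np.ndarray:
    """beads needs columns: probe_id, x, y, intensity (one row per bead)."""
    log_i = np.log2(beads["intensity"])
    # deviation of every bead from the mean of all replicates of its probe type
    resid = log_i - log_i.groupby(beads["probe_id"]).transform("mean")

    # bin residuals onto a coarse grid; blobs or stripes of large |residual|
    # suggest a spatial artifact rather than random measurement noise
    gx = (beads["x"] / beads["x"].max() * (grid - 1)).astype(int)
    gy = (beads["y"] / beads["y"].max() * (grid - 1)).astype(int)
    img = np.full((grid, grid), np.nan)
    binned = resid.groupby([gy, gx]).mean()
    for (r, c), v in binned.items():
        img[r, c] = v
    return img  # e.g. plt.imshow(img) to eyeball where problems lie
```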
 
How could a spatial artifact impact your analysis?
 
The main effect is that it adds an awful lot of noise. Generally, it won’t cause you to see things that aren’t there, but it will obscure interesting things that are there. The good thing about the Illumina platform is that the random design of the arrays means that these defects won’t lead to biases as you would get with other platforms.
 
Quite often, you will get a batch of these chips, and the spatial artifacts will be on all of them. On other platforms, that would lead to a bias, and anything within that [defective] area would be affected. With Illumina chips, there are different probes in that area every time. There is no systematic bias, but the noise enters into the equation every time, and it is hard to see the really interesting things. That being said, we have had a few experiments that have been nearly thrown out because the results make no sense. Fortunately, we have been able to rescue those using these software tools.
 
Still, say you have a fairly expensive, carefully designed experiment, perhaps about £5,000 ($3,280) to £10,000, but at the end of the day you have to run 10 more arrays because you are getting nonsensical results. If you rerun those 10 arrays at a later date, you are adding in a whole new set of variables because you might have a new batch of arrays or a different technician running the experiment. Your beautiful experiment is suddenly being adversely affected by all these variables.
 
Using BASH, you don’t need to run more arrays; you can just go in there, run the program, rescue all the quality data, discard the poorer data, and still answer the original question. Since Illumina has all these replicates on the array, you can cut out a good chunk of the data and still get useful answers for probably all of the probes.
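[Ed. note: The rescue step Lynch outlines, discarding beads in flagged regions and re-summarizing each probe from the surviving replicates, could look roughly like the following. This is an illustrative Python sketch under assumed column names, not part of beadarray or BASH.]

```python
# Conceptual sketch of the "rescue" step: drop beads inside flagged regions and
# recompute each probe's summary from the replicates that remain.
import numpy as np
import pandas as pd

def summarize_without_defects(beads: pd.DataFrame, mask: pd.Series) -> pd.DataFrame:
    """beads: one row per bead (probe_id, intensity); mask: True for beads
    sitting inside a detected artifact region."""
    kept = beads.loc[~mask]
    summary = (
        np.log2(kept["intensity"])
        .groupby(kept["probe_id"])
        .agg(["mean", "std", "size"])
        .rename(columns={"size": "n_beads"})
    )
    # with roughly 30 replicates per probe, losing a sizeable fraction of beads
    # usually still leaves enough replicates for a usable summary value
    summary["usable"] = summary["n_beads"] >= 5
    return summary
```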
 
Has the quality of arrays improved over time, or is it just something that researchers will have to live with as long as they use microarray technology?
 
I think it is probably something that is going to be there for the duration. It has to be said that not all the problems are from manufacturing. Some of them are due to the way the arrays are handled afterwards. For the array we discussed in our paper, it looks to be partially a manufacturing issue.
 
I don’t think we have seen any decline in defects since we started working with Illumina, but we only started looking at the problem this year. We have gone back and looked at historic data sets to see if the problems were there and they were, but we have not had a thorough review to see if the array quality has changed over time.
 
How does BASH work? After you get a readout of the raw data, how does it determine what is quality and what isn’t?
 
Essentially, we are making use of the fact that we have all these replicate readings for the probes on the array. We have the raw data, and instead of having just the summary measurement, we have readings for all 30 replicates scattered across the array. Using BASH, we identify probes that have unusually low or unusually high measurements compared to their replicates on the array.
 
The credit here has to go to Knut Wittkowski’s group at Rockefeller University, who developed Harshlight for Affymetrix. We have taken their principles and implemented them for Illumina, taking advantage of the properties that are unique to this platform. But it just comes down to asking, “Are there more outliers in a region than we expect?”
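[Ed. note: Below is a rough Python sketch of the two-step idea Lynch describes, first flagging beads that disagree with their replicates and then asking whether a region holds more outliers than expected. It is a simplified stand-in for illustration, not the actual BASH or Harshlight algorithm, and the grid size and thresholds are assumptions.]

```python
# Step 1: flag beads that sit far from their replicates.
# Step 2: look for grid cells whose local outlier rate greatly exceeds the
# array-wide rate; those cells are candidate artifact regions.
import numpy as np
import pandas as pd
from scipy import ndimage

def flag_artifact_regions(beads: pd.DataFrame, grid: int = 100,
                          z_cut: float = 3.0, excess: float = 3.0) -> np.ndarray:
    """beads needs columns: probe_id, x, y, intensity."""
    log_i = np.log2(beads["intensity"])
    grouped = log_i.groupby(beads["probe_id"])
    # a bead is an outlier if it is unusually high or low relative to its replicates
    z = (log_i - grouped.transform("mean")) / grouped.transform("std").replace(0, np.nan)
    outlier = (z.abs() > z_cut).astype(float)

    # count outliers and beads per grid cell, then smooth over neighbouring cells
    gx = (beads["x"] / beads["x"].max() * (grid - 1)).astype(int).to_numpy()
    gy = (beads["y"] / beads["y"].max() * (grid - 1)).astype(int).to_numpy()
    n_out = np.zeros((grid, grid))
    n_all = np.zeros((grid, grid))
    np.add.at(n_out, (gy, gx), outlier.to_numpy())
    np.add.at(n_all, (gy, gx), 1.0)
    local_rate = ndimage.uniform_filter(n_out, 3) / np.maximum(
        ndimage.uniform_filter(n_all, 3), 1e-9)

    # a cell is suspect if its outlier rate is several times the overall rate
    return local_rate > excess * outlier.mean()
```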
 
What exactly are the influences from Harshlight?
 
We haven’t used the code or implementation of Harshlight, but we have used its strategy. Harshlight defines three types of spatial artifact and then looks for each in turn. The joy of doing this for Illumina was looking at what had been done for similar issues with Affy. Harshlight’s strategy seemed to be the best way forward for us.
 
How can others access BASH?
 
BASH has been available since October. For a long time, our group has been producing the free beadarray analysis tool within the Bioconductor initiative. That is open source and we have maintained it for some time. [The free tool is referred to as lowercase ‘beadarray’ to distinguish it from Illumina’s BeadArray products. BASH is available as part of the beadarray tool — Ed.]
 
Is the tool curated? How easy is it to use?
 
To be part of Bioconductor, we have to maintain it and have a certain level of documentation. Using beadarray requires analyzing the data in the R software environment rather than in BeadStudio. If you are happy with R, I would say that beadarray is very straightforward.
 
Why do you think a tool like this hadn’t already been developed to deal with this issue?
 
Most people never see the raw data, and so they never see that there is a problem in the first place. If people don’t see the problem, there is no demand for a solution. Our group is probably in the lead for analyzing bead-level data. Hopefully, people will see the problem once they see there is a solution.
 
Have you had any dialogue with Illumina concerning these issues or BASH?
 
We have a good working relationship with Illumina and have published a lot on their platform in the past, but we have not discussed this particular problem with them yet.
 
Is there anything you would like to add to BASH in the future?
 
Well, most BeadArrays are single color. But there are some two-color arrays, and there is a question about whether you should look at these spatial artifacts separately in the two images or combine them; this is something we are considering in more detail.
 
The aim is to have BASH as part of an automated pipeline, just for quality control. The ideal solution for most people might be that you run your experiments, you get your raw data, and then there is an automated process that tells you if the data quality is good and you can continue using BeadStudio. And if there is a problem, then you have got an insurance policy. This would be an ideal setup, but we are not quite there yet. At the moment we are trying to set that up internally. Once we have it working internally, we will make it available for sure.
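[Ed. note: As an illustration of the kind of automated quality gate Lynch describes, the hypothetical check below decides, per array, whether the summarized data can be used as-is or a bead-level rescue is warranted. The inputs and the threshold are assumptions, not an existing pipeline.]

```python
# Minimal sketch of an automated QC verdict per array, assuming an upstream step
# has already reported what fraction of beads fell inside flagged artifact regions.
def qc_verdict(fraction_masked_per_array: dict[str, float],
               max_masked: float = 0.2) -> dict[str, str]:
    """Return a pass/flag decision for each array based on how much of it was masked."""
    return {
        array: ("ok: continue with summarized data" if frac <= max_masked
                else "flagged: rescue from bead-level data")
        for array, frac in fraction_masked_per_array.items()
    }

# example: qc_verdict({"chip1_strip_A": 0.03, "chip1_strip_B": 0.34})
```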
