Biostatisticians at St. Jude Children's Research Hospital in Memphis, Tenn., have developed a new procedure to align SNP microarray signals for copy-number analysis.
According to a recent Bioinformatics paper, for each individual array, the proposed reference alignment procedure, or RAP, uses a set of selected markers as internal references to direct the signal alignment.
[Pounds, et al. Reference alignment of SNP microarray signals for copy number analysis of tumors. Bioinformatics. 2009 Feb 1;25(3):315-21.]
According to the paper, RAP aligns the signals so that each array has a similar signal distribution among its reference markers. An accompanying reference selection algorithm, or RSA, uses genotype calls and initial signal intensities to choose two-copy markers as the internal references for each array.
The paper argues that after RSA and RAP are applied, the signals of two-copy markers have a similar distribution on every array, so that across-array signal comparisons are biologically meaningful. The authors also derive an upper bound for a statistical metric of signal misalignment, which provides a theoretical basis for choosing RSA-RAP over other alignment procedures for copy number analysis of cancers.
In a study of acute lymphoblastic leukemia described in the paper, the authors found that RSA-RAP gives copy number analysis results that show "substantially better concordance with cytogenetics than do two other alignment procedures."
Lead author Stanley Pounds, a biostatistician at St. Jude, told BioArray News that RSA-RAP was developed in response to a perceived problem with using global signal alignment, or GSA, methods on Affymetrix SNP genotyping array data from tumor surveys.
"We are using SNP arrays to look for deletions, amplifications and other mutations in tumors. One thing we learned from that study is that most B-cell leukemias have mutations in B-cell development genes," Pounds said last week. "We are looking at other forms of leukemia and other cancers with SNP arrays, and also moving up to higher resolution," he said. Pounds added that St. Jude has been using the Affymetrix 500K platform and recently moved to the firm's SNP 6.0.
According to Pounds, with the SNP data it had collected, his team discovered problems with standard normalization methods for studying tumors. "The statistical problem we encountered in SNP arrays with GSA is this: GSA tries to take all the signals for all the array features, and make the distribution similar for each array," Pounds said. "Inadvertently, this type of normalization can give misleading results in our applications, where the distribution of the markers’ true copy number is not very similar from tumor sample to control sample."
According to Pounds, one example of the global alignment methods currently in use by some biostatisticians is mean centering, where a mean signal is calculated for an array and a shift or scale transformation is used to map that mean to a target value. The same approach is used for each array.
"Try doing that same exercise with the true copy number values of the markers for an aneuploid tumor and see what happens," Pounds said. "If you go back and compute the difference between tumor and normal tissue marker by marker [using GSA], none of them will give you an accurate answer," he said.
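The exercise Pounds describes can be worked through on a toy example. The following Python sketch is illustrative only: the ten markers, their copy numbers, the target value of 2, and the helper name mean_center are all assumptions made for the example, not data or code from the paper. It applies a shift-based mean centering to the true copy numbers of a diploid control and of a tumor in which six of ten markers are gained.

```python
from statistics import mean

def mean_center(values, target):
    """Shift all values so that their mean equals the target (hypothetical helper)."""
    shift = target - mean(values)
    return [v + shift for v in values]

# Hypothetical true copy numbers for ten markers (not data from the paper).
normal = [2] * 10               # diploid control: every marker has two copies
tumor = [2] * 4 + [3] * 6       # aneuploid tumor: six markers gained to three copies

# Align both "arrays" to the same target mean, as a global method would.
normal_aligned = mean_center(normal, target=2.0)
tumor_aligned = mean_center(tumor, target=2.0)

# Marker-by-marker tumor-minus-normal differences after alignment.
differences = [t - n for t, n in zip(tumor_aligned, normal_aligned)]
print(differences)
```

In this toy case the true differences are 0 for the four unchanged markers and +1 for the six gained markers. After mean centering, the unchanged markers come out at about -0.6 and the gained markers at about +0.4, so every marker is off by the same 0.6 shift and the two-copy markers look like losses.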
To come up with a solution for the issue, Pounds drew upon his experience with quantile normalization. "Reference alignment is a more general form of quantile normalization. Reference alignment tries to map the quantiles for two-copy markers’ signals to a similar distribution," Pounds said.
More specifically, RAP performs a calculation separately for each array. "If you look at the cytogenetics data for tumors, each is different," Pounds said. "If you look at the distribution of underlying copy number for markers for each tumor, each will have a different distribution. It is important that each array is normalized separately," he said.
For each array, RAP takes all of the unnormalized signals and computes their quantiles relative to the unnormalized signals of a set of selected reference markers, which are chosen to represent the two-copy state. These markers can be selected in two ways: with auxiliary copy number data such as cytogenetics, or algorithmically. Once the quantiles are computed, RAP maps them to the corresponding quantiles of the normal distribution, Pounds said.
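The idea of computing quantiles against a reference set and mapping them to normal quantiles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the St. Jude R code: the function name reference_align, the toy signals, the mid-rank handling of ties, and the clipping of signals that fall outside the reference range are all choices made for the example, since the article does not describe how the published procedure handles those details.

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist

def reference_align(signals, reference_idx):
    """Map each signal of one array to a standard-normal quantile.

    signals: unnormalized signals for a single array.
    reference_idx: positions of markers assumed to be in the two-copy state.
    """
    ref = sorted(signals[i] for i in reference_idx)
    n = len(ref)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    aligned = []
    for s in signals:
        # Mid-rank of s among the reference signals (assumed tie handling).
        rank = (bisect_left(ref, s) + bisect_right(ref, s)) / 2
        # Assumption: this offset keeps p strictly inside (0, 1), so signals
        # beyond the reference range are clipped rather than extrapolated.
        p = (rank + 0.5) / (n + 1)
        aligned.append(std_normal.inv_cdf(p))
    return aligned

# Hypothetical array: markers 0-4 are the two-copy references,
# markers 5-6 have elevated signal, marker 7 has reduced signal.
signals = [1.0, 1.2, 0.9, 1.1, 1.05, 1.8, 1.9, 0.4]
reference_idx = [0, 1, 2, 3, 4]

aligned = reference_align(signals, reference_idx)
ref_vals = [aligned[i] for i in reference_idx]
print(sum(ref_vals) / len(ref_vals))  # mean aligned signal of the reference markers
print(aligned[5], aligned[7])         # elevated and reduced markers
```

With these toy inputs, the reference markers' aligned signals are spread symmetrically around zero, the elevated markers land above every reference marker, and the reduced marker lands below them. One consequence of the clipping assumption is that the two elevated markers receive the same aligned value; a fuller implementation would need a rule for signals beyond the reference range.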
After the calculations are complete, the reference markers’ signals follow the chosen distribution and the other markers fall along that distribution relative to the reference marker signals on the original scale. "After everything is finished for every array, the signals for the reference markers follow the normal distribution," Pounds said. "The subsequently computed differences center near zero for the reference markers. You want the difference for two-copy markers to be near zero."
To validate the new approach, the authors used traditional cytogenetic testing methods to find gross abnormalities in tumors, and then used the cytogenetics data as a standard for an array data set. The authors normalized the data using RAP, as well as older normalization methods such as quantile alignment and invariant set alignment.
"We compared the results back to cytogenetics to see what percentage of the known abnormalities is captured using each normalization method," Pounds said. "We found that our method does a much better job than the other methods in capturing real abnormalities as characterized by cytogenetics."
Reaching Out in R
The code for using RAP on the Affy 500K platform is written in R and is currently available on St. Jude's website. Pounds said that he hopes to soon put out a version that is compatible with the SNP 6.0 array. He added that "it would be helpful to put a user friendly front end" on the software and that he is "interested in talking with potential partners to develop this further."
Pounds said that he is unsure how much the inaccuracies caused by global signal alignment may be affecting CNV analysis. "I don't think most standard tools are taking this into account. Whether or not they get the right answer depends on how accurate the assumptions of the methods they use are for the applications they are used in," he said.
"I think that there is room to improve on what standard tools are doing because we know males and females differ in terms of the copy number distribution on sex chromosome and quantile normalization ignores that deviation," he said. "Our method could be used to take those differences into account. That could improve the results used in many more studies."