NEW YORK (GenomeWeb) – A team from the College of William & Mary has described a novel computational method for analyzing gene expression microarray data that they believe could help identify more differentially expressed genes in studies with small sample sizes, or with too small a ratio of samples to gene targets.
The group described the method in a paper in PLoS One this week, and shared data from its application of the technique to a microarray dataset from an experiment perturbing the Notch signaling pathway in Xenopus laevis embryos.
Study authors Margaret Saha and Daniel Vasiliu told GenomeWeb that they came to collaborate on adapting the computational method for microarray analysis after Saha gave a presentation at a biomathematics seminar at the college on the limitations of current methods for analyzing microarray data.
In array studies with small sample sizes relative to their breadth of gene targets, it can be difficult to resolve differentially expressed genes from the background. Saha and Vasiliu wrote in their paper that this has kept hundreds of publicly available datasets (and potentially many more unpublished studies) from being used to their full potential.
Vasiliu had independently been working on an algorithm he calls penalized Euclidean distance, or PED regression, and after hearing Saha's talk he realized that a classifier using PED regression could potentially improve on current statistical methods for array analysis.
In their paper, he, Saha, and their coauthors described the resulting PED-based analysis method, and its application to a dataset from a study of Notch signaling the lab had previously conducted.
According to Vasiliu, analyzing microarray datasets with relatively low numbers of samples has been a long-known challenge. Array analyses, as well as newer techniques like RNA-seq, involve profiling the expression of tens of thousands of genes, if not more. This means that if only a small number of samples are tested, it can be hard to pick out differentially expressed genes from the overall background with sufficient statistical significance.
This is generally known as the "big P low n" dilemma in the field, Vasiliu said, and it has become more of a problem as experiments have moved from looking for large differences in only a few genes to looking for small changes in many genes — mirroring the field's evolving understanding of the contributions of multiple genes to complex diseases and biological processes.
This is particularly problematic, the researchers wrote, because in both basic biological and clinical research, sample sizes can be ultimately limiting; if researchers can't collect an "n" high enough to overcome the number of genes they are interrogating, or the complexity or dimensionality of their dataset, they risk missing real links between gene expression and disease or other biology.
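The problem can be illustrated with a small simulation (purely illustrative; the gene count, group size, and cutoff below are arbitrary assumptions, not figures from the paper). With tens of thousands of genes and only three samples per group, pure noise alone produces thousands of apparently large expression differences, which is why genuine signals are so hard to distinguish:

```python
import random
import statistics

random.seed(0)
n_genes, n_per_group = 30000, 3

def group_mean(n):
    """Mean expression of one group of n samples: pure noise, no real signal."""
    return statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))

# absolute between-group difference for each gene, under the null hypothesis
diffs = [abs(group_mean(n_per_group) - group_mean(n_per_group))
         for _ in range(n_genes)]

# count genes whose noise-only difference exceeds a cutoff that would look
# convincing for any single gene examined in isolation
hits = sum(d > 1.5 for d in diffs)
print(hits)  # on the order of two thousand false leads from noise alone
```

With larger groups the means stabilize and the count of spurious hits collapses, which is exactly the "n" that small studies cannot supply.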
According to the group, other teams have developed algorithms for differential expression detection in studies with low sample sizes, specifically a group of methods known as penalized regression techniques. However, most penalized regression classifiers have relied on some form of cross-validation — splitting the dataset into a training set and a validation set — to assess the impact of individual variables on the classifier itself, the authors wrote.
This splitting requirement means that such algorithms aren't ideal for ultra-small sample sets that can't realistically support both a training and validation cohort.
With Vasiliu's PED, a simulation-based tuning procedure eliminates the need for cross-validation, he explained. In a separate publication currently available on arXiv and awaiting peer review, Vasiliu and colleagues showed that PED compared favorably to similar methods such as elastic net, Lasso, SIS, and ISIS.
In their PLoS One study, he and Saha then applied the method to a microarray dataset from research in Saha's lab examining how X. laevis embryos respond over time to alteration of the Notch signaling pathway. In a previous analysis of this dataset, which comprised only six samples in total, using a method called limma (linear models for microarray data), the researchers had found only a small number of candidate genes, none of which reached statistical significance.
However, some of the genes with the lowest p-values, those closest to statistical significance, were known to be regulated by the Notch signaling pathway, suggesting that the limma analysis may have missed genuinely differentially expressed genes.
"We expected certain results based on the published literature for the control part of [our] experiment," Saha said. "When it came back and we looked at unadjusted P values, it looked like the genes we expected to be differentially expressed were, but when we corrected for multiple hypotheses — given we had 30,000-plus genes in there — virtually nothing came back as significant."
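The effect Saha describes can be sketched with the Benjamini-Hochberg procedure, one common multiple-hypothesis correction (the paper's exact correction method is not specified here, so this is illustrative; the p-values below are made up):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Flag which p-values remain significant under Benjamini-Hochberg
    false-discovery-rate control at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # find the largest rank k whose ordered p-value clears the BH line k/m * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    flags = [False] * m
    for rank, idx in enumerate(order, start=1):
        flags[idx] = rank <= k_max
    return flags

# a raw p-value of 0.001 survives correction among only 10 tests...
few = [0.001] + [0.5] * 9
print(sum(benjamini_hochberg(few)))   # 1 significant call

# ...but drowns among 30,000 tests whose other p-values are unremarkable
many = [0.001] + [(i + 2) / 30000 for i in range(29999)]
print(sum(benjamini_hochberg(many)))  # 0 significant calls
```

This is the pattern the lab saw: genes that look differentially expressed on unadjusted p-values come back empty-handed once the 30,000-plus tests are corrected for.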
When the team applied Vasiliu's PED-regression-based method, they found that it flagged as differentially expressed all of the same genes limma had, and identified additional differentially expressed genes for further investigation.
The method also labeled many more genes as differentially expressed in the array data than it did in permuted controls, evidence that the classifier's calls were unlikely to be random.
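A permuted control of this kind can be sketched generically. The `call_de_genes` function below is a hypothetical stand-in for any differential-expression caller (the actual PED classifier is not reproduced here), and the data are toy numbers; the point is only the comparison between real-label and shuffled-label call counts:

```python
import random
import statistics

def call_de_genes(expr, labels, cutoff=2.0):
    """Toy caller: flag genes whose between-group mean difference exceeds a
    cutoff. A stand-in for a real classifier such as the PED-based method."""
    a = [i for i, lab in enumerate(labels) if lab == "A"]
    b = [i for i, lab in enumerate(labels) if lab == "B"]
    called = set()
    for g, row in enumerate(expr):
        diff = (statistics.fmean(row[i] for i in a)
                - statistics.fmean(row[i] for i in b))
        if abs(diff) > cutoff:
            called.add(g)
    return called

rng = random.Random(0)
labels = ["A"] * 3 + ["B"] * 3
# 1,000 genes; only the first 50 carry a real shift between the groups
expr = [[rng.gauss(5.0 if (g < 50 and lab == "A") else 0.0, 1.0)
         for lab in labels] for g in range(1000)]

real_calls = len(call_de_genes(expr, labels))

# re-run the caller on label-shuffled data: calls here reflect chance alone
perm_counts = []
for _ in range(100):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    perm_counts.append(len(call_de_genes(expr, shuffled)))

print(real_calls, statistics.fmean(perm_counts))
```

A caller that reports substantially more genes on the true labels than on shuffled ones, as here, is behaving non-randomly; one whose counts match the permuted average would be suspect.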
According to Vasiliu, while one of the PED method's main advantages is enabling penalized regression analysis in sample sets too small for the cross-validation necessary with other approaches, it could also be useful for larger studies, like GWAS, where other techniques have still failed to identify disease-linked or otherwise differentially expressed genes.
Vasiliu said his team's arXiv publication provides early evidence that the method could help resolve the complex correlation structures that can characterize GWAS data, and the researchers hope to have the chance to give it a real-world test in this area to confirm its utility and accuracy.
Saha, meanwhile, said her lab is now planning to apply PED regression to RNA-seq experiments, which can involve even smaller numbers of tested samples.