Bioinformatics Battle: CAMDA Conference Attendees Compete for Array Analysis Prize

In a university town legendary for its fierce basketball rivalries, bioinformaticists, molecular biologists, and computer scientists gathered this week for a different kind of competition.

Critical Assessment of Microarray Data Analysis ’01, the second annual conference of its kind sponsored by the Duke Bioinformatics Shared Resource, challenged groups of researchers to analyze a single set of microarray data efficiently and elegantly, using methods that would produce results both statistically robust and biologically significant. At the end of the conference, the attendees voted on the best presentation, and the winner was presented with the prize.

Although this Duke champion would not receive any NBA offers, the conference organizers did present him with a $1,000 check, a plaque, and the highest honor a scientist can receive: the respect of his peers.

This year’s dataset, NCI60, was provided by the National Cancer Institute on its website, and contained raw microarray data from 60 samples that hybridized 12 cancer cell lines to 10,000-spot microarrays. Twelve different groups of researchers from the US, Spain, Germany, and Korea struggled to unscramble this data in a variety of ways — from cluster analysis methods to Bayesian analysis, neural networks, and even knowledge-based text mining.

The winner? Kevin Coombs and his colleagues from MD Anderson Cancer Center in Houston, who used information about biological function and chromosomal location, as well as annotation in databases, to organize the way genes would be analyzed.

This presentation stood out because it emphasized the importance of seeking biological relevance in analysis. “Microarray data is much more than a matrix,” Coombs told the attendees. “You ignore the existing biological knowledge at your own peril.”

In their analysis, Coombs’ group applied an “annotation filter,” searching for annotations for each gene represented on the array, and eliminating those that were poorly annotated, since any analysis of these genes would not yield biologically meaningful results. They found that some “genes” on the array were no longer known to UniGene, or were annotated with only one 3’ or 5’ accession number.
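The article does not give the group's actual filtering rules, so the following is only a minimal sketch of what such an annotation filter might look like. The record layout (`SpotAnnotation`), its field names, and the threshold of two accession numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpotAnnotation:
    """Hypothetical annotation record for one spot on the array."""
    spot_id: str
    unigene_id: Optional[str]  # None if the entry is no longer known to UniGene
    accessions: List[str] = field(default_factory=list)


def is_well_annotated(spot: SpotAnnotation, min_accessions: int = 2) -> bool:
    # Assumed rule: keep a spot only if it still maps to UniGene and is
    # backed by more than a single 3' or 5' accession number.
    return spot.unigene_id is not None and len(spot.accessions) >= min_accessions


def annotation_filter(spots: List[SpotAnnotation],
                      min_accessions: int = 2) -> List[SpotAnnotation]:
    """Return only the spots that pass the annotation check."""
    return [s for s in spots if is_well_annotated(s, min_accessions)]
```

Given a list of such records, `annotation_filter(spots)` returns only the ones worth carrying into the later clustering steps.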

After narrowing down the genes to those with meaningful annotations, they classified the genes into various functional categories, such as “apoptosis,” “transport,” or “lipid metabolism.” The genes in these categories necessarily overlapped, but through this categorization, Coombs was able to choose a small group of functional categories.

After pre-processing the microarray data by normalizing the spot intensity to background, and filtering out spots at the low end of expression levels, the researchers then performed hierarchical cluster analysis using a distance metric they based on the Pearson correlation coefficient.
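As a rough illustration of clustering with a correlation-based distance, the sketch below uses 1 minus the Pearson correlation as the distance and average linkage to merge clusters. The article does not say which linkage rule or exact distance formula the group used, so both are assumptions, and the function names are made up for this example.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (non-constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x: List[float], y: List[float]) -> float:
    # Assumed metric: 0 for perfectly correlated profiles, 2 for anti-correlated.
    return 1.0 - pearson(x, y)


def average_linkage(profiles: Dict[str, List[float]],
                    n_clusters: int) -> List[List[str]]:
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    names = list(profiles)
    dist = {(a, b): correlation_distance(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1: List[str], c2: List[str]) -> float:
        # Average of all pairwise distances between the two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Passing a dictionary that maps sample names to expression profiles, with `n_clusters` set to the number of groups wanted, returns the samples grouped by similarity of their overall expression pattern rather than by absolute intensity.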

Most of the different types of cell lines in this dataset clustered nicely into single clusters for a single type of cancer, but breast cancer did not.

Coombs’ group then further subdivided the genes by their location on the chromosome, and by function, and found that the same patterns existed, with like cancer cell samples clustering together.

This method is useful, Coombs said, because it gives researchers clues to the function of unknown genes that cluster with genes of known function. So, for example, if three genes of unknown function clustered with a previously known apoptosis cluster, it might be inferred that they play a role in apoptosis.
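The inference described here can be sketched as a simple majority rule: if most of the annotated genes in a cluster share one function, tentatively assign that function to the unannotated genes in the same cluster. The function name, the `min_fraction` cutoff, and the data layout below are assumptions for illustration, not the group's method.

```python
from collections import Counter
from typing import Dict, List


def infer_functions(cluster: List[str],
                    known_functions: Dict[str, str],
                    min_fraction: float = 0.5) -> Dict[str, str]:
    """Suggest a function for unannotated genes from their cluster mates."""
    annotated = [g for g in cluster if g in known_functions]
    if not annotated:
        return {}  # nothing to borrow an annotation from
    label, count = Counter(known_functions[g] for g in annotated).most_common(1)[0]
    if count / len(annotated) < min_fraction:
        return {}  # the cluster is too mixed to support a guess
    # Only genes without an existing annotation receive the suggested label.
    return {g: label for g in cluster if g not in known_functions}
```

Any label produced this way is a hypothesis to test at the bench, not a finding.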

While this sort of deduction seems obvious, Coombs emphasized that researchers often go about cluster analysis backwards, first finding a cluster of data, then trying to assign biological functions and meanings, rather than carefully considering biology first before launching into the wilds of abstract statistics.

He expressed hope that microarray data, if analyzed for biological relevance, might one day play a role in what he said MD Anderson researchers are on a mission to do: cure cancer one patient at a time by finding the right treatment for each patient.

 

PCA vs. ICA

Other presentations at CAMDA focused on similar efforts to reduce the complexity of the data by grouping genes or data points together, but instead of using biological categories, they used statistical methods of reducing the dimensionality of data, such as principal component analysis (PCA), independent component analysis (ICA), and partial least squares. Researchers argued about whether PCA or ICA works better. One camp claimed that ICA does better than PCA at sorting out the individual signals of particular pathways, which may overlap, because PCA relies on the fallacy that a gene serves as a principal component of one and only one functional group. Others proposed pattern matching methods that would allow for overlapping in networks of genes.

Qun Shan of the University of California, Berkeley, presented a new clustering algorithm, GeneCut, which he said uses the NCut algorithm, a global normalization algorithm to separate weighted clusters based on within-group vs. between-group similarity. GeneCut prototype software can be downloaded at http://www.cs.berkeley.edu/~fowles/bio.
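The article gives no details of how GeneCut works internally, so the following sketches only the general normalized cut criterion, which scores a split by the similarity cut between two groups relative to each group's total similarity to all items. It searches every two-way split by brute force, which is feasible only for tiny inputs. Practical normalized cut implementations rely on a spectral approximation instead. The function names and the assumption of a symmetric similarity matrix with a zero diagonal are illustrative.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ncut_value(weights: List[List[float]],
               group_a: Sequence[int],
               group_b: Sequence[int]) -> float:
    """Normalized cut score of a two-way split (lower is better)."""
    everyone = range(len(weights))
    cut = sum(weights[i][j] for i in group_a for j in group_b)
    assoc_a = sum(weights[i][j] for i in group_a for j in everyone)
    assoc_b = sum(weights[i][j] for i in group_b for j in everyone)
    return cut / assoc_a + cut / assoc_b


def best_bipartition(weights: List[List[float]]
                     ) -> Tuple[float, Tuple[int, ...], Tuple[int, ...]]:
    """Try every split into two non-empty groups and keep the lowest score."""
    n = len(weights)
    best = None
    for size in range(1, n // 2 + 1):
        for group_a in combinations(range(n), size):
            group_b = tuple(i for i in range(n) if i not in group_a)
            score = ncut_value(weights, group_a, group_b)
            if best is None or score < best[0]:
                best = (score, group_a, group_b)
    return best
```

Because the score divides by each group's total similarity, splitting off a single outlier is penalized, which is the balance between within-group and between-group similarity that the description above alludes to.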

 

No Perfect Algorithm

In the end, conference attendees expressed mixed assessments of CAMDA ’01. More than one postdoc confessed that many of the presentations had been very difficult to understand, and as is often the case with the second year of any conference, many attendees claimed that last year was better.

But what made the conference a difficult one was not only these nostalgic comparisons; microarray data analysis is at a stage now in which many methods have been tried and there is little consensus on which ones really work best.

Roland Stoughton, vice president of informatics at Rosetta Inpharmatics, endeavored to sort out this problem by presenting a ‘cluster of the clusters,’ designed to compare and weed out algorithms.

“Even after thirty years of extensive study, there is no perfect clustering program,” said Stoughton. The company is planning to release its own test data set, in order to further the study of optimal data analysis methods for microarrays.

— MMJ
