A new initiative is underway at National Information and Communications Technology Australia (NICTA) to add bioinformatics to a research program heavily weighted toward telecommunications, networking, signal processing, and the like.
Last month, NICTA announced that it is collaborating with Melbourne-based Peter MacCallum Cancer Center to develop new statistical approaches for analyzing microarray data. That project is a sign that NICTA's bioinformatics research group is "now finally taking final shape," according to Adam Kowalczyk, principal researcher in NICTA's Statistical Machine Learning research program and head of the bioinformatics research group.
There are currently 14 people in the SML group, and three of those, including Kowalczyk, are tackling the challenges of large-scale genomics data sets, he told BioInform.
"In a sense, our activity is finally starting about now because the group is finally taking shape," he said.
Kowalczyk, an artificial-intelligence and data-mining expert with 20 years in telecommunications, said that his own interest in bioinformatics arose "by accident" through his involvement with the KDD Cup data-mining competition in 2002. After winning one of the tasks that year, which required participants to accurately predict gene regulation in yeast, he was offered a year at the Peter MacCallum center, where he cut his teeth on data from cDNA arrays.
Kowalczyk said that he moved to NICTA when it offered him the opportunity to create a bioinformatics "subgroup" within the SML team. The group's "main focus" at the moment is its ongoing work with Peter Mac, which has recently switched from cDNA arrays to Affymetrix chips.
The team is also collaborating with a Canberra-based startup called Diversity Arrays Technology, which has developed a low-cost genotyping method that it is applying to crop selection. The NICTA team is using its machine-learning skills to integrate genome data with phenotypic information from plant breeders in order to improve predictions for selective plant breeding.
Kowalczyk said that his team is engaged in a range of projects with Peter Mac, "but in the bulk of them, we're trying to get some clinically useful information out of array data. We're trying to build some models that are predictive of a cancer type." He said the group is primarily working on analyzing tissue samples to classify cancers of unknown primary origin.
Another focus for the group is developing statistical techniques to help move beyond the linear models typically used in microarray analysis, which Kowalczyk said are not capable of identifying combinations of two or more genes that work together to trigger cancer.
"When people build and analyze data at the moment, they primarily use linear models because there is such high dimensionality of the data in a very small number of samples, so they try not to complicate the situation unnecessarily," he said. However, it is well known that biology works in a nonlinear fashion. Furthermore, he said, "When you increase the number of genes or probes, say from 10,000 to 50,000, the complexity of the problem grows quadratically so if it grows by 5 times, that's 25 times more difficult and that is a real headache, and we would like to develop statistical techniques to deal with this situation."
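The arithmetic behind that estimate can be sketched by counting gene pairs. The sketch below assumes that "complexity" refers to the number of distinct two-gene combinations a nonlinear model might need to consider; that reading, and the helper name gene_pairs, are illustrative assumptions rather than anything Kowalczyk specified.

```python
from math import comb


def gene_pairs(n_probes: int) -> int:
    """Number of distinct unordered pairs among n_probes probes (n choose 2)."""
    return comb(n_probes, 2)


# Probe counts taken from the quote; the pairwise reading is an assumption.
small = gene_pairs(10_000)
large = gene_pairs(50_000)
growth = large / small

print(f"{small:,} pairs -> {large:,} pairs ({growth:.1f}x)")
```

A fivefold increase in probes takes the pair count from roughly 50 million to about 1.25 billion, which is the 25-fold jump he describes.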
He noted that the increasing density of microarrays will "accelerate" this problem, "but from my perspective, that's good because it's a challenge worth spending time on."
Kowalczyk said that his relatively brief exposure to genomic data has already led to new insights about statistical machine learning. As an example, he noted that what at first appeared to be noise in certain microarray data sets turned out to be a statistical phenomenon called "antilearning."
Typically in machine learning, an algorithm is trained on a sample data set so that it can correctly classify new data. However, Kowalczyk said that once in a while, these algorithms would give the exact opposite of the expected answers when they were run on microarray data. "You train the machine, and the machine trains very well, it's guessing right on your training data, but when you give it a new example, all of a sudden, it reverses the answer. It lies to you."
Kowalczyk said that his team has published some papers about antilearning, in which the classification algorithm actually performs worse than random guessing. The papers show that "there are specific structures in the data that facilitate this behavior." For example, the situation is more likely when there is a small number of samples and a large number of features measured for each sample, "exactly what we're doing in microarray analysis."
Kowalczyk said he believes this phenomenon may occur very frequently in genomic data analysis, but has gone unrecognized because researchers attribute it to noisy data or some experimental anomaly.
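One way to see how a classifier can fit its training data perfectly and still be wrong on every new example is a small synthetic construction. The sketch below is a toy illustration only, not the data or the method from the NICTA papers: it assumes eight samples whose pairwise similarities (a Gram matrix) make same-class samples slightly less alike than samples from different classes, and it uses a plain nearest-centroid classifier. The similarity values and all function names are hypothetical.

```python
def build_gram(n_per_class, within, between):
    """Gram matrix of unit-norm points: inner product `within` for
    same-class pairs and `between` for different-class pairs."""
    labels = [1] * n_per_class + [-1] * n_per_class
    n = len(labels)
    gram = [[1.0 if i == j else (within if labels[i] == labels[j] else between)
             for j in range(n)] for i in range(n)]
    return gram, labels


def sq_dist_to_centroid(gram, i, members):
    """Squared distance from point i to the mean of `members`,
    computed from inner products only."""
    m = len(members)
    cross = sum(gram[i][j] for j in members) / m
    centroid_norm = sum(gram[j][k] for j in members for k in members) / (m * m)
    return gram[i][i] - 2.0 * cross + centroid_norm


def predict(gram, labels, train, i):
    """Nearest-centroid prediction for point i using only `train` indices."""
    pos = [j for j in train if labels[j] == 1]
    neg = [j for j in train if labels[j] == -1]
    d_pos = sq_dist_to_centroid(gram, i, pos)
    d_neg = sq_dist_to_centroid(gram, i, neg)
    return 1 if d_pos < d_neg else -1


def training_accuracy(gram, labels):
    """Accuracy when every point is both trained on and tested."""
    idx = list(range(len(labels)))
    return sum(predict(gram, labels, idx, i) == labels[i] for i in idx) / len(idx)


def leave_one_out_accuracy(gram, labels):
    """Accuracy when each point is held out of training before it is tested."""
    idx = list(range(len(labels)))
    hits = 0
    for i in idx:
        train = [j for j in idx if j != i]
        hits += predict(gram, labels, train, i) == labels[i]
    return hits / len(idx)


# Assumed toy geometry: within-class similarity (-0.2) is lower than
# between-class similarity (0.0). This matrix is positive definite,
# so points with these inner products do exist.
gram, labels = build_gram(n_per_class=4, within=-0.2, between=0.0)
train_acc = training_accuracy(gram, labels)
loo_acc = leave_one_out_accuracy(gram, labels)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {loo_acc:.2f}")
```

In this construction each training point is pulled toward its own class centroid because it contributes to that centroid, while a held-out point sits closer to the opposite centroid, so the classifier is right on all training data and wrong on every unseen sample.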
The SML team is currently developing statistical approaches to account for antilearning in microarray data analysis.
Some bioinformatics tools from NICTA's SML group should be ready in "a few months, even weeks," Kowalczyk said. The group also plans to build a "genomic test bench" that will link to several open source machine-learning tools, called LINEAL and ELEFANT, that were developed by researchers at NICTA and elsewhere (http://lineal.developer.nicta.com.au/ and https://rubis.rsise.anu.edu.au/elefant/). The genomic tools will be written in Python, "which will allow us to have algorithms that can easily interface with the web, because obviously in genomics, access to the web is crucial," he said.
Bernadette Toner