Name: Auli Karhu
Position: 2001-present, postdoc, Biomedicum, Department of Medical Genetics, University of Helsinki.
Education: 2001 — PhD, University of Oulu; 1994 — MSc, University of Oulu.
As a self-described “senior PhD” in the department of medical genetics at the University of Helsinki, Auli Karhu is involved in many projects and is a co-author on two microarray-related papers published last month, one in Bioinformatics and the other in Genes, Chromosomes, and Cancer [Vanharanta S, et al. Definition of a minimal region of deletion of chromosome 7 in uterine leiomyomas by tiling-path microarray CGH and mutation analysis of known genes in this region. Genes, Chromosomes, and Cancer. 2007 May;46(5):451-8.].
While the latter paper has ramifications in array comparative genomic hybridization, it is the former paper in Bioinformatics [Laakso M, et al. Computational Identification of Candidate Loci for Recessively Inherited Mutation Using High Throughput SNP Arrays. Bioinformatics. 2007 May 17; (Epub ahead of print)] that could have more of an impact on users of high-density SNP arrays from Affymetrix and Illumina. Affy, for example, launched a chip with 1.8 million features only two weeks ago, while Illumina has promised to launch a 1-million SNP chip by the end of this quarter.
Karhu and her fellow investigators at the University of Helsinki were using Affy’s high-density SNP arrays to study colorectal cancer, but quickly got fed up with existing software analysis programs and decided to build their own. The result, a tool called CohortComparator, integrates SNP data while another tool called RegionAnnotator allows users to annotate the genes identified in CohortComparator. To learn more about the tool and its ramifications, BioArray News spoke with Karhu this week.
Why did your group take it upon itself to create this kind of SNP analysis tool?
Well, there are not really good [programs] available for these studies, at least what we want to study. There are programs like dChip and things like that. Affymetrix has its own allelic imbalance program. But the problem is if you have tens and tens of samples and you want to study them at the same time it is very difficult with these kinds of programs. It is very time consuming and also, for example, using the Affymetrix program, you can only check one sample at a time basically. It doesn’t list those areas, like a homozygous stretch. But then you have to go back to the genotype data and then pick it out and it takes time.
It’s hard to identify what is the exact location of a homozygous stretch because we have 50 control samples and 42 colorectal samples and we want to compare all these samples at the same time, basically. So these old programs are not really designed for these kinds of studies.
You developed a two-tier approach for mining through the data. There are two tools described, the CohortComparator and the RegionAnnotator. How do they work together in the system that you devised?
The program is developed to detect homozygote regions and link these regions in databases. RegionAnnotator automatically queries for instance genes and microarray probes in the region. We tested this system so that we had two samples included [that] we knew had shared homozygous regions because these patients had known recessive colorectal cancer gene mutations. When we tested this it was found among the four position allocations and it could find the right annotations as well. Of course, when you are searching [for] a new gene [you] cannot expect that there are too many, if any, papers published telling you, ‘This is a new colorectal cancer gene.’
But when we use these two tools it can tell us which chromosomal regions are interesting and you can link this region in the databases and it will show you the exact location of this area and the known genes of this area. So it works pretty nicely. I mean there is still more to do with this program. It’s still not totally perfect. But we are developing it more so that we will be able to work with, for example, compound heterozygosity. Then you need the haplotype information. Because these samples are not related, you can’t create direct haplotypes for them. You have to estimate haplotypes for each sample and we are working with that at the moment.
When we started to work with 50K SNP arrays the major problem was that programs couldn’t handle the data. For instance, with 50K array data, we had to split these data sets because the programs couldn’t digest the whole data at one time because there were too many data points. Nowadays it is getting better. Many groups are developing new programs and many programs are already published.
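[Editor's illustration: the annotation step Karhu describes, looking up the known genes and microarray probes that fall inside a region of interest, amounts to an interval-overlap query. The Python sketch below shows only that general idea. It is not RegionAnnotator's code; the `Feature` type, the `annotate_region` function, and the in-memory feature list are assumptions made for this example, and a real tool would query external databases instead.]

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """A hypothetical annotated feature (a gene or an array probe)."""
    name: str
    chrom: str
    start: int  # first base of the feature
    end: int    # last base of the feature (inclusive)


def annotate_region(chrom: str, start: int, end: int,
                    features: List[Feature]) -> List[str]:
    """Return names of features overlapping chrom:start-end, ordered by start."""
    hits = [
        f for f in features
        # Two closed intervals overlap when each starts before the other ends.
        if f.chrom == chrom and f.start <= end and f.end >= start
    ]
    return [f.name for f in sorted(hits, key=lambda f: f.start)]


# Placeholder features with made-up names and coordinates, for illustration only.
example_features = [
    Feature("GENE_A", "chr1", 1_000, 5_000),
    Feature("GENE_B", "chr1", 8_000, 12_000),
    Feature("PROBE_1", "chr1", 9_500, 9_525),
    Feature("GENE_C", "chr2", 2_000, 6_000),
]

print(annotate_region("chr1", 4_000, 10_000, example_features))
```

[A linear scan is adequate for a handful of candidate regions; a genome-wide annotation set would normally sit behind an index or a database query.]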
Which arrays are you using when you talk about this SNP data, and have you had any kind of dialog with the manufacturers over these issues?
We are mainly using Affymetrix SNP and expression arrays, and we have had a dialog with Affymetrix and I understand their point of view. Basically, they are making these products and it is our problem to solve these kinds of problems. I understand that they are not willing to do linkage programs for us. They are giving the opportunity to do these huge data sets and then we have to solve the problem of how to analyze the data and I don’t just mean the genotype. Of course, Affymetrix is giving excellent opportunities to analyze the genotypes. But there are so many data points that it’s not possible that you can just browse through the genotypes by individual; you need functional programs to do that.
You can also visualize the data with the tool you have developed. How is that different from the software you were working with previously?
Well, of course you can have 100 samples in the same picture. Visualizing though is not enough. You need to tell the program, ‘I want to have homozygous regions with certain parameters.’ You need to define what the homozygous stretch is; what its minimum length is, or how many SNPs it should at least contain. You tell the program, ‘OK, these are the controls, and these are the cancer cases.’ It can then look if there are any shared homozygous regions between cancer cases [that] are not found among controls.
Then you go through the cancer cases and when you have found some areas [that] are shared between cancer cases then you can go through the normal control samples. If healthy controls have the same homozygous stretches and the frequency of those stretches among the controls is high enough, then it can’t be the area where the gene we are looking for [is]; it’s just a normal homozygosity that you can find in the genome.
Usually if a stretch is very small it is quite unlikely that it’s the real stuff. Our group has found many cancer genes, and usually these linked regions or haplotypes tend to be quite large, usually megabases.
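[Editor's illustration: the workflow Karhu outlines (define a homozygous stretch by a minimum SNP count and minimum length, find regions shared among cancer cases, then discard regions that are also common in controls) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not CohortComparator's implementation. The function names, the "AA"/"AB"/"BB" genotype coding, and the `min_cases` and `max_control_freq` thresholds are assumptions made for this example, and any call that is not homozygous, including a no-call, is treated here as ending a stretch.]

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

HOMOZYGOUS = {"AA", "BB"}  # assumed genotype coding; anything else ends a run


@dataclass(frozen=True)
class Stretch:
    start: int   # position of the first SNP in the run
    end: int     # position of the last SNP in the run
    n_snps: int


def homozygous_stretches(positions: Sequence[int], genotypes: Sequence[str],
                         min_snps: int, min_length: int) -> List[Stretch]:
    """Find runs of consecutive homozygous calls that meet both thresholds."""
    stretches: List[Stretch] = []
    run: List[int] = []

    def flush() -> None:
        if run and len(run) >= min_snps and run[-1] - run[0] >= min_length:
            stretches.append(Stretch(run[0], run[-1], len(run)))
        run.clear()

    for pos, call in zip(positions, genotypes):
        if call in HOMOZYGOUS:
            run.append(pos)
        else:
            flush()
    flush()  # close a run that reaches the last SNP
    return stretches


def candidate_regions(positions, cases, controls, min_snps, min_length,
                      min_cases, max_control_freq) -> List[Tuple[int, int]]:
    """Regions inside stretches of >= min_cases cases and rare in controls."""
    def coverage(samples):
        # For each SNP, count the samples whose qualifying stretch covers it.
        cov = [0] * len(positions)
        for genotypes in samples:
            for s in homozygous_stretches(positions, genotypes,
                                          min_snps, min_length):
                for i, pos in enumerate(positions):
                    if s.start <= pos <= s.end:
                        cov[i] += 1
        return cov

    case_cov, ctrl_cov = coverage(cases), coverage(controls)
    regions, current = [], None
    for i, pos in enumerate(positions):
        ctrl_freq = ctrl_cov[i] / len(controls) if controls else 0.0
        if case_cov[i] >= min_cases and ctrl_freq <= max_control_freq:
            current = (current[0], pos) if current else (pos, pos)
        elif current:
            regions.append(current)
            current = None
    if current:
        regions.append(current)
    return regions
```

[With thresholds set to favor megabase-scale stretches, as Karhu suggests, short runs of homozygosity drop out before the case-control comparison.]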
Since you are adding to it, how could this tool be better?
It would be better if you could create the haplotypes out of it and if it could create haplotypes for an individual. In this case, the first thing was to see if there are shared homozygous regions. But the reality is that it may be a recessive gene but there is compound heterozygosity. So there are two alleles. [There] might be, for example, two homozygotes that are not sharing the same haplotype. And then there might be individuals that are not homozygous but have one allele from another homozygous individual and then a different allele from another homozygous individual. So, two different kinds of alleles are compound in one individual.
How do you get samples for your studies?
Well, we have very good connections. We have connections with the Finnish Red Cross where we can get healthy samples and in Finland we can ask for controls from specific regions.
We also have access to other samples as well, from Finnish hospitals, for example. We have collaboration with plenty of clinicians and this gives us access to tumor samples as well. Of course we need permission for that, but Finnish legislation is in this sense very good in that we can have access to these kinds of samples.
Is colorectal cancer an area of interest for you? It was discussed in the paper.
It’s one of my major projects. I am also working with other kinds of cancer like pituitary adenomas as well, and then we can have some smaller sets of samples. But we have a huge collection of colorectal cancer samples and so in that case it’s a very good starting point in these kinds of studies.
For the study cited in the paper, we and the others believe that there must be more of these genes [predisposing to] colorectal cancer. It might be that they are just recessive genes. In that sense, the main point of our work was not to create this program. I mean, as you can see from the results we have developed a program to go through our data. But we are continuing with this project and we are trying to identify these new colorectal cancer genes.
When we started this project, we were a bit naive in thinking that we could use existing programs to go through the data. Rather soon we noticed that we needed a new program [that] fulfills our demands. Then we contacted Sampsa Hautaniemi’s group at the University of Helsinki and so it has been very successful working with them. Now we are continuing this project and we are developing the program, so it has been a very fruitful collaboration.