Bioinformaticists looking for a new challenge were given a not-so-subtle hint at this year’s Research in Computational Molecular Biology conference, held April 10-13 in Berlin. “At the next RECOMB, I hope someone has a data compression method [for SNP data] that blows haplotype blocks out of the water,” said population geneticist Andrew Clark during his keynote address.
Clark’s talk anchored a full day of eight haplotype-related presentations — a dramatic increase over last year’s program, which offered only one haplotype paper. Clark tried to stir up a bit of controversy in the emerging discipline by challenging the “party line” that haplotype blocks offer the best means of selecting which SNPs to use in large-scale variation studies like the international HapMap project. While haplotype data does exhibit evidence of “blockiness,” in which groups of neighboring SNPs tend to be inherited together, Clark said this pattern tends to disappear in large samples, and is only apparent “if you toss out the rare SNPs and the rare haplotypes.” This approach, advanced by Eric Lander’s group at the Whitehead Institute, is “overhyped,” according to Clark, and will eliminate many of the rare alleles that cause the bulk of human disease.
“There is no evidence that the degree of blockiness is caused by heterogeneity in the recombination rate,” Clark said. While a correlation does exist, he noted, simulations of data using a homogeneous recombination rate have also resulted in “blocky” patterns.
A self-described “fan of the HapMap project,” Clark reminded RECOMB attendees that the haplotype block theory is only one of several methods to handle the “real problem” behind the initiative — finding genes for common diseases. Haplotype blocks offer one way to reduce a sea of SNPs to an economically feasible subset, but not the only way. “It’s a classic data compression problem,” he reminded the computer scientists in attendance, summoning them to find new approaches to handling large SNP datasets.
After his talk, Clark told BioInform that he recently joined the HapMap advisory board, which has its first meeting in May, and where he will promote the idea of finding new approaches to analyzing haplotype data. The Whitehead research on haplotype blocks “received a huge amount of attention” within the genomics and bioinformatics community, he said, and soared in popularity despite the skepticism of many population geneticists. “I object to the feeling that we have already found the answer to this problem,” he remarked.
Other talks at RECOMB suggested that plenty of researchers have already begun tackling the issues Clark raised, as well as numerous other challenges arising in haplotype analysis — including reconstructing haplotypes from genotype data, distinguishing polymorphisms from sequencing errors, determining where haplotype blocks begin and end, and improving associations between phenotypic data and SNP data.
A team of researchers at Celera Genomics, for example, has a jump on Clark’s assignment, according to Bjarni Halldorsson, who presented a paper on the company’s block-free method for selecting subsets of SNPs. The Celera researchers compared four different computational methods of defining haplotype blocks, and deemed the variability in determining block boundaries too high to provide reliable results. The team developed an alternative approach, called the “minimum informative SNP problem,” which assigns a degree of “informativeness” based on how well a SNP or set of SNPs predicts another SNP or set of SNPs. According to Halldorsson, the block-free approach required fewer SNPs to provide the same degree of informativeness as block-based approaches.
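The flavor of such a block-free approach can be sketched in a few lines of code. The following toy implementation (an illustration only, not Celera’s actual algorithm; all function names here are invented for the sketch) greedily adds tag SNPs until the chosen set predicts the allele of every remaining SNP across a sample of haplotypes, encoded as strings of 0s and 1s:

```python
def predicted(haps, tags, s):
    """True if the alleles at the tag SNPs determine the allele at SNP s
    for every haplotype in the sample (i.e., the tags "predict" s)."""
    seen = {}
    for h in haps:
        key = tuple(h[t] for t in tags)
        if seen.setdefault(key, h[s]) != h[s]:
            return False
    return True

def select_tags(haps):
    """Greedily pick tag SNPs until every SNP in the sample is predicted.
    Ties are broken toward the lowest SNP index."""
    n = len(haps[0])
    tags = []
    remaining = {s for s in range(n) if not predicted(haps, tags, s)}
    while remaining:
        # choose the candidate tag that newly predicts the most SNPs
        best = max(sorted(remaining),
                   key=lambda c: sum(predicted(haps, tags + [c], s)
                                     for s in remaining))
        tags.append(best)
        remaining = {s for s in remaining if not predicted(haps, tags, s)}
    return tags
```

On a sample where SNPs 0 and 1 always travel together, as do SNPs 2 and 3 (for example `["0000", "1111", "0011"]`), two tags suffice to recover all four sites — the kind of saving over block-based selection that Halldorsson described.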
A number of papers addressed the fundamental problem of constructing haplotypes from genotype data: Chromosome pairs are not separated experimentally prior to genotyping, so it’s difficult to assign SNPs to their originating chromosomes. Eleazar Eskin of Columbia University discussed a phylogeny-based approach to this problem, while Gideon Greenspan of the Technion offered a method based on statistical modeling, and Itsik Pe’er of the Weizmann Institute proposed a method of pooling SNP samples and then reconstructing haplotypes from the total allele frequencies in the pooled set.
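The combinatorics behind this problem are easy to see: a genotype with k heterozygous sites is consistent with 2^(k-1) distinct haplotype pairs (for k ≥ 1), which is why the phasing methods above need phylogenetic or statistical assumptions to pick one. A minimal sketch, using a common per-site coding that is an assumption of this example rather than any presenter’s notation, enumerates those pairs:

```python
from itertools import product

def phasings(genotype):
    """Enumerate every unordered haplotype pair consistent with a genotype.
    Per-site coding: 0 = homozygous reference, 2 = homozygous alternate,
    1 = heterozygous (phase unknown)."""
    hets = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product("01", repeat=len(hets)):
        # fixed sites are copied to both haplotypes; het sites split
        a = ["0" if g == 0 else "1" if g == 2 else "?" for g in genotype]
        b = a[:]
        for site, bit in zip(hets, bits):
            a[site] = bit
            b[site] = "1" if bit == "0" else "0"
        pairs.add(tuple(sorted(("".join(a), "".join(b)))))
    return sorted(pairs)
```

A genotype heterozygous at two sites already admits two phasings (`[1, 1]` yields the pairs 00/11 and 01/10), and the count doubles with each additional heterozygous site.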
As further evidence that haplotype analysis is a worthy bioinformatics challenge, Michael Waterman’s lab at the University of Southern California presented two papers on the topic. The first, delivered by Lei Li, discussed a method for reconstructing haplotypes based on a statistical analysis of sequencing errors, the alignment of SNP fragments, and the probability of different haplotypes based on the alignment.
Waterman then discussed his team’s analysis of the Perlegen human chromosome 21 dataset, which contains 36,000 SNPs from 24 individuals. Using a dynamic programming algorithm, Waterman said his lab found that a haplotype block could be defined using about 2,500 SNPs, a significantly smaller set than the 4,500 SNPs Perlegen needed to determine blocks with its own “greedy” algorithm. Waterman said that after discussions with Perlegen, his team then modified the algorithm to “maximize the covered length of the genome with the fewest number of SNPs.” The resulting algorithm was able to calculate that 3,488 “tag SNPs” would be required for 100 percent coverage of a block.
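The published dynamic program is more involved — its block criterion rests on how well common haplotypes cover the sample — but the shape of the recurrence, minimizing total tag SNPs over all ways of cutting the SNP columns into consecutive blocks, can be sketched with a toy block cost (a stand-in, not Waterman’s actual cost function; both function names are invented here):

```python
def block_cost(haps, i, j):
    """Toy cost of treating SNP columns [i, j) as one block: the number of
    columns, chosen greedily, needed to tell apart the distinct haplotype
    patterns seen in that block. A stand-in for a real tag-SNP count."""
    pats = sorted({h[i:j] for h in haps})

    def n_groups(cols):
        return len({tuple(p[c - i] for c in cols) for p in pats})

    chosen = []
    while n_groups(chosen) < len(pats):
        chosen.append(max(range(i, j), key=lambda c: n_groups(chosen + [c])))
    return len(chosen)

def min_total_tags(haps, cost):
    """Dynamic program over block partitions of the SNP columns:
    f[j] = min over i < j of f[i] + cost of making [i, j) one block."""
    n = len(haps[0])
    f = [0] + [float("inf")] * n
    for j in range(1, n + 1):
        f[j] = min(f[i] + cost(haps, i, j) for i in range(j))
    return f[n]
```

Because the recurrence considers every cut point, it finds the globally cheapest partition — the property that let Waterman’s lab undercut the SNP counts produced by a greedy block finder.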
Waterman told BioInform that the haplotype data is “thrilling” to work with. “Looking at this data, you can really see that the human family is a lot smaller than we thought,” he said.
Gene Expression Down, Biology and Statistics Up
The increase in haplotype papers was accompanied by a decrease in other regular topics at RECOMB: A drop-off in talks on gene expression analysis methods was clearly apparent, and there were also fewer talks on phylogeny and sequencing by hybridization than in previous years.
Waterman, who served on the conference program committee, said the falloff in gene expression talks isn’t a sign that microarray analysis is any less important. “There are still lots of very important problems that need to be solved in that area,” he said. Waterman, as well as other observers, remarked that this year’s batch of papers had a stronger biological focus than in the past, along with a much healthier dose of statistical methods. Richard Durbin of the Sanger Institute noted that in previous years, “there was a tendency to identify a problem and apply methods that were distinct from experimentation,” adding that RECOMB 2003 offered more methods “that enable something scientific to be done.”