A year and a half ago, the $100 million International HapMap Project kicked off with the goal of dividing the entire human genome into neat “blocks” containing groups of SNPs that tend to travel together. At the halfway point of the three-year project, it turns out that the genome is a bit more complicated than originally expected, but the map is still on track for initial release within several months, according to one of the project’s leaders.
“I think we will have a haplotype map — at least for the CEPH [Centre d’Etudes du Polymorphisme Humain] population — sometime this fall,” Aravinda Chakravarti, professor and director of the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins University and coordinator of data analysis for HapMap, told BioInform. Chakravarti added that any version of HapMap that is released in the fall will likely be revised, and that additional data will probably be added to it, but that some version of the map, along with “all the haplotype data that supports that,” should be available by that time.
One issue that will affect the quality of the initial version of the HapMap is SNP density. Initially, the project called for coverage of around one SNP per 5,000 bases, but it soon became clear that many more SNPs would be needed for certain regions of the genome in which haplotype block boundaries were difficult to discern. Last year, the project began using sequence data from the ENCODE project, which offers a much higher SNP density — about 120-180 SNPs per 5,000 bases — for ten 500,000-base pair regions in the genome.
Because of the SNP density issue, the initial version of the HapMap is likely to be a bit spotty, Chakravarti said. “There will clearly be lots of regions of the genome where the map will be complete because we found complete [linkage] disequilibrium,” he said, but “there will be some regions where we’ll need markers at a much, much higher density. Rather than nailing every single thing in, I think the vast majority of investigators will find it useful if we just release it at that time.” He noted, however, that the quality of the initial map will be “much, much better” than a typical sequence draft.
Ideally, Chakravarti said, the data analysis group would like to see “a much higher density of SNPs than originally was envisioned — if not for any other reason than that it gives us much more to choose from.” As more SNPs are identified in a region, he said, the accuracy of identifying disequilibrium improves greatly, and the “tag SNPs” that define each haplotype will be much more precisely delineated.
Additionally, he said, more SNPs increase the analysis team’s ability to eliminate false negatives in disequilibrium detection: “We need to know that we can make an inference and say when there isn’t linkage disequilibrium — not because there aren’t markers in that region, but because there really are markers and there still isn’t [disequilibrium],” he said.
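For readers unfamiliar with how disequilibrium between two SNPs is typically quantified, the following is a minimal sketch using the standard textbook measures D and r². It is not the HapMap analysis group's method, which the article does not describe. The function name `pairwise_ld`, the "1"/"0" allele coding, and the toy haplotypes are assumptions made purely for illustration.

```python
def pairwise_ld(haplotypes):
    """Compute D and r^2 for two biallelic SNPs from phased haplotypes.

    Each haplotype is a 2-character string (hypothetical encoding):
    position 0 is the allele at SNP 1, position 1 the allele at SNP 2,
    with "1" for the reference allele and "0" for the alternative.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "1" for h in haplotypes) / n   # allele frequency at SNP 1
    p_b = sum(h[1] == "1" for h in haplotypes) / n   # allele frequency at SNP 2
    p_ab = sum(h == "11" for h in haplotypes) / n    # frequency of the "11" haplotype

    # D measures how far the observed haplotype frequency departs
    # from what independent assortment would predict.
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # One SNP is monomorphic in this sample, so r^2 is undefined.
        return d, None
    return d, d * d / denom


# Toy data: two SNPs that always travel together, and two that do not.
linked = ["11"] * 5 + ["00"] * 5
unlinked = ["11", "10", "01", "00"]

d_linked, r2_linked = pairwise_ld(linked)
d_unlinked, r2_unlinked = pairwise_ld(unlinked)
```

In this toy example the first pair is in complete disequilibrium and the second shows none. The `None` return for a monomorphic site reflects the point made in the quote above: an absence of informative markers is not the same as an absence of disequilibrium.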
In line with this demand, the NIH in mid-April freed up $6.5 million in FY 2004 HapMap funding for a project that will identify at least 2.25 million additional SNPs in 270 HapMap samples in less than one year, at a total cost of $0.01 or less per successful genotype.
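As a rough consistency check on those figures, the sketch below multiplies them out. It assumes, for illustration only, that every SNP is genotyped successfully in every one of the 270 samples and that the per-genotype cost sits exactly at the one-cent ceiling; the variable names are hypothetical.

```python
# Figures taken from the article.
snps = 2_250_000              # minimum number of additional SNPs
samples = 270                 # HapMap samples
budget_dollars = 6_500_000    # FY 2004 funding freed up by the NIH

# Assumption: cost is exactly the stated ceiling of $0.01 per genotype.
cost_cents_per_genotype = 1

genotypes = snps * samples                               # one genotype per SNP per sample
total_dollars = genotypes * cost_cents_per_genotype / 100

print(f"{genotypes:,} genotypes, ${total_dollars:,.0f} at the cost ceiling")
```

Under these assumptions the genotyping itself comes to roughly $6.1 million, which fits within the $6.5 million allocation.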
In addition, at an April meeting of HapMap participants in Baltimore, David Cox, CSO of Perlegen, announced that his company plans to release its own completed haplotype map into the public domain. A Perlegen spokesman confirmed that the company does plan to release its data, but declined to provide a timeline in which it expects to do so.
But even as more data become available, Chakravarti said questions remain as to the best methods for identifying haplotype regions. “I don’t think there will be a method, where we’ll say this is the way that we choose blocks,” he said. It’s likely, he said, that some methods will be good for rare sites, while others will be suited for common sites.
Chakravarti said that the data analysis group will take the next few months to analyze the CEPH data with several available methods for defining haplotype blocks, but “there is not going to be a single canonical way” that the project deems appropriate for all situations.