Want to ramp up your informatics capabilities to keep pace with the international HapMap project? Brush up on your statistics, tweak your visualization tools, and brace for yet another flood of data, say researchers leading bioinformatics development for the $100 million effort to map genetic variation across the human genome.
Bioinformatics groups “need to prepare for dealing with hundreds of thousands of SNPs across the genome,” said Lisa Brooks, program director for the genetic variation and computational genomics programs at NHGRI. “New methods for data reduction are going to be needed, ways of dealing with large amounts of data, visualization. This is also a very large statistical problem, because you have very large comparisons.”
Some of the computational tools to support the public-private project that officially kicked off just a few weeks ago have been under development for some time, albeit in a disjointed manner. Mark Daly of the Whitehead Institute and Aravinda Chakravarti of Johns Hopkins University will lead the effort to assemble these scattered projects into a cohesive HapMap informatics pipeline. However, noted Chakravarti, “There are things we still don’t have and we don’t understand,” a fact that will likely spur a wave of new development in the still-nascent field. “We don’t come up with tools per se and then do the experiment, but rather the tools evolve with the experiment,” said Chakravarti.
Daly, who has honed his haplotype informatics skills working in collaboration with David Altshuler and Stacey Gabriel at the Whitehead, said that many of the tools the project needs to get started are already in place. For example, data coordination will be handled by Lincoln Stein’s group at Cold Spring Harbor Lab — a continuation of the role this group already played for the SNP Consortium. Each of the international sequencing centers involved in the project — including the Whitehead, the Sanger Institute, Riken, the Beijing Genomics Institute, and Baylor College of Medicine — will have its own informatics team responsible for handling the genotyping data for the 200 to 400 people expected to participate in the study. Practical work along these lines is already “being done feverishly,” Daly said. The interesting stuff comes once that data starts pouring in, he added.
While Daly and others in both the public and private sector have conducted a number of pilot haplotype studies already, these groups have had only a very limited amount of data to work with, so the methods used to analyze those data sets remain unproven for large genome regions. “As the project begins to progress and as we begin to collect this data, we’ll be able to learn what the most appropriate methods are,” said Daly. Noted Chakravarti, “Software that works on one problem is a different beast when it’s applied to large amounts of data, especially data that is coming in all the time.”
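The kind of inference at stake can be illustrated with the classic expectation-maximization (EM) estimate of haplotype frequencies from unphased genotypes. The sketch below, in Python with hypothetical names, handles only the simplest two-SNP case, where the double heterozygote is the one phase-ambiguous genotype; it is not any HapMap group’s actual method, and production tools must scale to many markers, missing data, and continuously arriving samples.

```python
def em_two_locus(genotypes, n_iter=100):
    """Estimate two-locus haplotype frequencies by EM.

    Each genotype is a pair (g1, g2), where g1 and g2 count copies
    (0, 1, or 2) of the '1' allele at each SNP. Haplotypes are keyed
    by (allele at SNP 1, allele at SNP 2).
    """
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in freqs}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # Double heterozygote: phase unknown. E-step splits it
                # between the two resolutions by current likelihood.
                cis = freqs[(1, 1)] * freqs[(0, 0)]
                trans = freqs[(1, 0)] * freqs[(0, 1)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(1, 1)] += w
                counts[(0, 0)] += w
                counts[(1, 0)] += 1 - w
                counts[(0, 1)] += 1 - w
            else:
                # Any other genotype determines both haplotypes exactly.
                counts[(1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)] += 1
                counts[(1 if g1 == 2 else 0, 1 if g2 == 2 else 0)] += 1
        # M-step: re-estimate frequencies from the expected counts.
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs
```

Fed genotypes drawn from a population carrying only the coupled haplotypes, the estimate converges toward equal frequencies for (1,1) and (0,0) and near-zero for the others, resolving the ambiguous individual in favor of the cis phase.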
One hurdle facing the analysis team as it evaluates available methods for inferring haplotypes is particularly fitting for a field so tightly focused on the subject of variation: Each research group involved in the project has its own definition of a haplotype block — the region of the chromosome in which groups of SNPs tend to be inherited together. “Various groups have defined it various ways, so we need to settle on some uniform definitions based on the number of samples and the number of SNPs we’re going to study,” said Chakravarti.
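To make the definitional question concrete, here is one toy operational definition, sketched in Python with hypothetical names: group adjacent SNPs whose pairwise r-squared stays above a threshold. This is only an illustration; published block definitions differ (some rely on D-prime confidence intervals rather than r-squared), which is exactly the variation the analysis team must reconcile.

```python
def r_squared(snp_a, snp_b):
    """Pairwise r^2 between two biallelic SNPs coded 0/1 on phased chromosomes."""
    n = len(snp_a)
    p_a = sum(snp_a) / n   # frequency of allele 1 at the first SNP
    p_b = sum(snp_b) / n
    p_ab = sum(1 for x, y in zip(snp_a, snp_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b   # raw linkage-disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom

def greedy_blocks(haplotypes, threshold=0.8):
    """Partition SNPs into blocks wherever adjacent-pair r^2 drops below threshold.

    `haplotypes` is a list of phased chromosomes, each a sequence of 0/1
    alleles. Returns (start, end) index pairs, inclusive.
    """
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    blocks, start = [], 0
    for j in range(1, n_snps):
        if r_squared(cols[j - 1], cols[j]) < threshold:
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_snps - 1))
    return blocks
```

Changing the threshold, the LD statistic, or the adjacency rule changes where the block boundaries fall, which is why a shared definition matters before the centers compare results.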
The question may be somewhat more controversial than it first appears. Richard Judson, senior vice president of informatics at Genaissance Pharmaceuticals, a company that has been engaged in haplotype analysis for over five years, cautioned that the concept of “hot spots of recombination that are very sharp and exist in different populations” is something of an oversimplification. Judson said that research from his group and others, including Perlegen, indicates that “the real blocks … don’t have sharp boundaries and they aren’t conserved between ethnic groups.”
While acknowledging the uncertainties surrounding this aspect of the project, Brooks noted that “there does seem to be something real underlying these blocks,” and that the key is to settle on a consistent working definition to identify regions of the genome that will be of interest for disease association studies. “Definitions are the first step,” she said. “And of course, since we don’t know what the definitions are, the tools for applying it haven’t been developed yet.”
One Big HapMappy Family
So far, the HapMap project seems to have moved beyond the bitter public-private skirmishes that characterized the Human Genome Project. Variagenics, for example, while not directly involved with the project, has offered its wet-lab technology for validating haplotypes to the public effort as a means of checking the accuracy of available inference methods. R. Mark Adams, vice president of bioinformatics at the company, said the technology has much broader utility than the company requires. “We’re not that interested in keeping everything secret. We’re much more interested in getting things into the public discourse,” he said.
The public data also doesn’t seem likely to devalue commercial data resources, as was the case with human genome data. Adams, for example, said the public effort would only enhance Variagenics’ business of elucidating haplotype relationships. “The more information that’s out in the world, the more effectively we can do our work,” he said.
Likewise, Genaissance’s Judson said that although there was an “initial worry” at the company that the public project would simply be placing a free version of its own data into the public domain, “it’s actually a much different beast.”
Describing the public effort as a “low-resolution,” broad view of the whole genome, in comparison to the much deeper analysis that Genaissance performs on functional regions, Judson said the two resources would be complementary. “With the genome-wide map, you’ll look everyplace. You may not have a high enough resolution, but you won’t completely miss something. … Ultimately you want to come down to the local set of SNPs or haplotypes that cause a disease, so you have to do the high-resolution candidate gene thing we do. So maybe this will bring more business our way.”
Chakravarti said the analysis team intends to engage the broader human genetics and population genetics community — even those not directly funded by the HapMap project — to share ideas about analytical methods. The group plans to organize a workshop to bring together researchers who have published on linkage disequilibrium and genome variation to discuss methods for inferring haplotypes and using this information for disease studies and association studies.
Playing with the Data
The ultimate goal of the HapMap project, of course, is to serve as a useful resource for genetic variation data in support of medical association studies. Putting this information into a user-friendly form should present many opportunities for bioinformatics developers looking for a new challenge.
Data presentation is still an untapped area in this field, according to Daly, and is a subject that “a lot of people will have to look very hard at.” For example, he noted, some users might only want to look at a region as an extended annotation in a genome browser — viewing summary statistics of linkage disequilibrium or haplotype blocks — while others will want access to the raw data. For those who will want to use linkage disequilibrium information in a way that won’t squeeze into a genome browser, Daly and several other groups are currently developing “LD browsers” — components or links out of genome browsers that provide a much more detailed look at linkage disequilibrium, relationships among markers, and haplotype patterns across regions.
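As a toy illustration of the summary statistics such a browser might expose, the hypothetical Python sketch below computes the normalized coefficient D-prime for every pair of SNPs in a set of phased haplotypes and renders the result as a crude character-based triangle; a real LD browser would of course plot this graphically and interactively.

```python
def d_prime(snp_a, snp_b):
    """Normalized LD coefficient D' between two biallelic SNPs coded 0/1."""
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for x, y in zip(snp_a, snp_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    # Normalize by the maximum |D| attainable at these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return 0.0 if d_max == 0 else abs(d) / d_max

def ld_triangle(haplotypes):
    """Render pairwise D' for all SNP pairs as a text 'heat map' triangle."""
    shades = " .:+#"  # low to high D'
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    rows = []
    for i in range(n_snps):
        row = []
        for j in range(i + 1, n_snps):
            dp = d_prime(cols[i], cols[j])
            row.append(shades[min(int(dp * len(shades)), len(shades) - 1)])
        rows.append("".join(row))
    return "\n".join(rows)
```

Dense dark cells along the diagonal of such a triangle are the visual signature of a haplotype block; a browser built on this idea would let a user zoom from the block summary down to the underlying genotypes.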
The ability to extend the capability of current genome browsers appealed to Variagenics’ Adams as well: “You could imagine some kind of sliding window along the length of a chromosome arm representing haplotype diversity. What would be beautiful is if you could have fundamental population parameters, like recombination, plotted that are derived directly from the haplotypes.”
First things first, however: the initial goal of the HapMap informatics group is to get a production analysis pipeline in place over the next few months. Once that pipeline is running and data starts coming in from the genotyping centers, the data will be released as quickly as possible, although a release schedule has not yet been determined, Brooks said.
“We are the Genome Institute — we’re very interested in rapid data release — but we do need to figure out what rapid release is,” she said.
The NIH is ponying up $38 million to support the HapMap project. The remainder of the $100 million price tag will be paid by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT); Genome Canada; the Chinese Ministry of Science and Technology; and the Natural Science Foundation of China. The Wellcome Trust is funding the UK portion of the project and the SNP Consortium is coordinating private funding.