Software for SNPs


By Bernadette Toner


Want to ramp up your informatics capabilities to keep pace with the international HapMap project? Brush up on your statistics, tweak your visualization tools, and brace for yet another flood of data, say researchers leading bioinformatics development for the $100 million, three-year effort to map genetic variation across the human genome.

Bioinformatics groups “need to prepare for dealing with hundreds of thousands of SNPs across the genome,” says Lisa Brooks, program director for the genetic variation and computational genomics programs at NHGRI. “New methods for data reduction are going to be needed, ways of dealing with large amounts of data, visualization. This is also a very large statistical problem, because you have very large comparisons.”

Some of the computational tools to support the public-private project that was launched last fall have been under development for some time, albeit in a disjointed manner. Mark Daly of the Whitehead Institute and Aravinda Chakravarti of Johns Hopkins University will lead the effort to assemble these scattered projects into a cohesive HapMap informatics pipeline. However, notes Chakravarti, “There are things we still don’t have and we don’t understand,” a fact that will likely spur a wave of new development in the still-nascent field. “We don’t come up with tools per se and then do the experiment, but rather the tools evolve with the experiment,” says Chakravarti.

Daly, who has honed his haplotype informatics skills working in collaboration with David Altshuler and Stacey Gabriel at the Whitehead, says that many of the tools the project needs are already in place. For example, data coordination, being handled by Lincoln Stein’s group at Cold Spring Harbor Lab, is a continuation of the role that group already played for the SNP Consortium. Each of the international sequencing centers involved in the project (including the Whitehead, the Sanger Institute, Riken, the Beijing Genomics Institute, and Baylor College of Medicine) will have its own informatics team responsible for handling the genotyping data for the several hundred people participating in the study. Practical work along these lines is already “being done feverishly,” Daly says. The interesting stuff comes once that data starts pouring in, he adds.

Building Blocks

While Daly and others in both the public and private sector have conducted a number of pilot haplotype studies already, these groups have had only a very limited amount of data to work with, so the methods used to analyze those data sets remain unproven for large genome regions. “As the project begins to progress and as we begin to collect this data, we’ll be able to learn what the most appropriate methods are,” says Daly. Notes Chakravarti, “Software that works on one problem is a different beast when it’s applied to large amounts of data, especially data that is coming in all the time.”

One hurdle facing the analysis team as it evaluates available methods for inferring haplotypes is particularly fitting for a field so tightly focused on the subject of variation: Each research group involved in the project has its own definition of a haplotype block — the region of the chromosome in which groups of SNPs tend to be inherited together. “Various groups have defined it various ways, so we need to settle on some uniform definitions based on the number of samples and the number of SNPs we’re going to study,” says Chakravarti.

The question may be somewhat more controversial than it first appears. Richard Judson, senior vice president of informatics at Genaissance Pharmaceuticals, a company that has been engaged in haplotype analysis for more than five years, cautions that the concept of “hot spots of recombination that are very sharp and exist in different populations” is something of an oversimplification. Judson says that research from his group and others, including Perlegen, indicates that “the real blocks … don’t have sharp boundaries and they aren’t conserved between ethnic groups.”

While acknowledging the uncertainties surrounding this aspect of the project, Brooks notes that “there does seem to be something real underlying these blocks,” and that the key is to settle on a consistent working definition to identify regions of the genome that will be of interest for disease association studies. “Definitions are the first step,” she says. “And of course, since we don’t know what the definitions are, the tools for applying it haven’t been developed yet.”

One Big HapMappy Family

So far, the HapMap project seems to have moved beyond the bitter public-private skirmishes that characterized the Human Genome Project. Variagenics, for example, while not directly involved with the project, offered its wet-lab technology for validating haplotypes to the public effort as a means of checking the accuracy of available inference methods. R. Mark Adams, vice president of bioinformatics at the company, now part of Nuvelo, says the technology has much broader utility than the company requires. However, Adams says that since the merger, the company’s business model has shifted away from the HapMap work.

Genaissance’s Judson says that although there was an “initial worry” at the company that the public project would simply be placing a free version of its own data into the public domain, “it’s actually a much different beast.”

Describing the public effort as a “low-resolution,” broad view of the whole genome, in comparison to the much deeper analysis that Genaissance performs on functional regions, Judson says the two resources would be complementary. “With the genome-wide map, you’ll look everyplace. You may not have a high enough resolution, but you won’t completely miss something. … Ultimately you want to come down to the local set of SNPs or haplotypes that cause a disease, so you have to do the high-resolution candidate gene thing we do. So maybe this will bring more business our way.”

Chakravarti says the analysis team intends to engage the broader human genetics and population genetics community — even those not directly funded by the HapMap project — to share ideas about analytical methods. The group plans to organize a workshop to bring together researchers who have published on linkage disequilibrium and genome variation to discuss methods for inferring haplotypes and using this information for disease studies and association studies.

Playing with the Data

The ultimate goal of the HapMap project, of course, is to serve as a useful resource for genetic variation data in support of medical association studies. Putting this information into a user-friendly form should present many opportunities for bioinformatics developers looking for a new challenge.

Data presentation is still an untapped area in this field, according to Daly, and is a subject that “a lot of people will have to look very hard at.” For example, he notes, some users might only want to look at a region as an extended annotation in a genome browser, viewing summary statistics of linkage disequilibrium or haplotype blocks, while others will want access to the raw data. For those who will want to use linkage disequilibrium information in a way that won’t squeeze into a genome browser, Daly and several other groups are currently developing “LD browsers”: components or links out of genome browsers that provide a much more detailed look at linkage disequilibrium, relationships among markers, and haplotype patterns across regions.
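To make the "summary statistics of linkage disequilibrium" concrete, here is a minimal sketch of the kind of pairwise number an LD browser might display. It is not taken from the HapMap pipeline or from any group named in this article. It assumes phased haplotypes for two biallelic SNPs, with alleles coded 0 and 1, and computes the standard measures D, |D'|, and r². The function name `pairwise_ld` and the input layout are illustrative assumptions.

```python
def pairwise_ld(haplotypes):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) tuples, one per phased chromosome,
    where a and b are the alleles (coded 0 or 1) at the two SNPs.
    The input layout is an assumption made for this sketch.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")

    # Frequency of allele 1 at each SNP, and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D: observed 1-1 haplotype frequency minus the frequency
    # expected if the two SNPs were independent.
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    # r^2: squared correlation between the two allele indicators.
    r2 = d * d / denom

    # D': D scaled by its maximum possible magnitude given the
    # allele frequencies; the absolute value is reported here.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max

    return d, d_prime, r2


if __name__ == "__main__":
    # Two SNPs that are always inherited together in this toy sample.
    linked = [(1, 1)] * 5 + [(0, 0)] * 5
    print(pairwise_ld(linked))
```

A browser track would typically compute such values for many marker pairs across a region and render them as a summary, while the raw haplotypes remain available for users who want them.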

The ability to extend the capability of current genome browsers appeals to Variagenics’ Adams as well: “You could imagine some kind of sliding window along the length of a chromosome arm representing haplotype diversity. What would be beautiful is if you could have fundamental population parameters, like recombination, plotted that are derived directly from the haplotypes.”
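The sliding-window idea Adams describes can be sketched roughly as follows. This is an illustration under stated assumptions, not Variagenics' or the HapMap group's method: haplotypes are given as equal-length strings with one character per SNP, the window is measured in number of SNPs rather than base pairs, and diversity is computed with the commonly used estimator H = n/(n-1) × (1 − Σ pᵢ²), where pᵢ is the frequency of each distinct haplotype in the window. The function names are hypothetical.

```python
from collections import Counter


def haplotype_diversity(segments):
    """Haplotype diversity H = n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(segments)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(segments)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - homozygosity)


def sliding_diversity(haplotypes, window, step=1):
    """Compute haplotype diversity in windows of `window` SNPs.

    `haplotypes` is a list of equal-length strings, one per phased
    chromosome (an assumed input format for this sketch).
    Returns a list of (start_index, diversity) pairs.
    """
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    n_snps = len(haplotypes[0])
    if any(len(h) != n_snps for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNPs")

    results = []
    for start in range(0, n_snps - window + 1, step):
        # Slice every chromosome to the current window and compare
        # the resulting sub-haplotypes.
        segments = [h[start:start + window] for h in haplotypes]
        results.append((start, haplotype_diversity(segments)))
    return results


if __name__ == "__main__":
    sample = ["0000", "0000", "0011", "0011"]
    for start, h in sliding_diversity(sample, window=2, step=2):
        print(start, round(h, 3))
```

Plotting the resulting values against chromosome position would give the kind of diversity track that could sit alongside other annotations in a genome browser.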

First things first, however: the initial goal of the HapMap informatics group is to get a production analysis pipeline in place over the next few months. Once that is set up and the data starts coming in from the genotyping centers, the data will be released as quickly as possible, but a schedule has not yet been determined, Brooks says.

“We are the genome institute — we’re very interested in rapid data release — but we do need to figure out what rapid release is,” she says.

The NIH is ponying up $38 million to support the HapMap project. The remainder of the $100 million price tag will be paid by the Japanese Ministry of Education, Culture, Sports, Science and Technology; Genome Canada; the Chinese Ministry of Science and Technology; and the Natural Science Foundation of China. The Wellcome Trust is funding the UK portion of the project and the SNP Consortium is coordinating private funding.


This article originally appeared in GT’s sister publication, BioInform, the weekly, integrated informatics news source.

