SNPs: Living HapMappily Ever After


SNPs have come a long way this year. No longer just happy-go-lucky singles, they’re starting to settle down in committed relationships, called haplotype blocks. Genomics researchers, as delighted as anybody to observe young love, have embarked on the HapMap project to figure out which SNPs are hanging together and how committed they really are. It’s too soon to tell whether the newlyweds will be fruitful, but there’s plenty of action to hold our interest.

SNPs are exciting because they offer a means of discovering genes that are involved in common diseases, such as cancer or diabetes. Typically, the goal is to find genes that increase susceptibility to a disease or affect aspects such as severity or age of onset. In most cases there are many genes involved, each contributing a modest effect. And usually genes are only part of the story — environmental factors play a critical role as well.

You’re looking for SNPs whose alleles are correlated with some aspect of a disease. A SNP might directly cause a problem by disrupting the function of an important gene. More often, though, it’s just a marker and the real culprit is a yet-to-be-discovered mutation located nearby.

The culprit mutation has to be very close to the SNP — close enough to be in linkage disequilibrium — or else the correlation could not be detected.


Two SNPs are in linkage disequilibrium (LD) if their alleles are correlated. More concretely, consider two SNPs, one whose major and minor alleles (i.e., letters) are A and G, and the other whose alleles are C and T. These two SNPs give rise to four possible combinations of alleles: AC (the first SNP is A and the second is C), AT, GC, and GT. If the SNPs are not in linkage disequilibrium (that is, if they are in linkage equilibrium), the probability of seeing a particular combination of alleles, say AC, equals the probability of seeing A at the first SNP times the probability of seeing C at the second. If the observed probabilities differ significantly from these calculated ones, the SNPs are in disequilibrium.
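The comparison of observed and expected probabilities can be sketched in a few lines of Python. This is only an illustration: the function name and the toy samples are made up for this example, and the two statistics it reports (D, the observed-minus-expected difference, and r squared, a normalized version of it) are standard LD measures rather than anything specific to the studies discussed here. A formal significance test is left out.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for a list of two-letter haplotypes such as "AC".

    D is the observed frequency of one allele pair minus the frequency
    expected if the two SNPs were independent (linkage equilibrium).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    first_alleles = sorted({h[0] for h in haplotypes})
    second_alleles = sorted({h[1] for h in haplotypes})
    if len(first_alleles) != 2 or len(second_alleles) != 2:
        raise ValueError("each SNP must show exactly two alleles in the sample")

    a, b = first_alleles[0], second_alleles[0]        # e.g. "A" and "C"
    p_a = sum(1 for h in haplotypes if h[0] == a) / n  # P(A at first SNP)
    p_b = sum(1 for h in haplotypes if h[1] == b) / n  # P(C at second SNP)
    p_ab = counts[a + b] / n                           # observed P(AC)

    d = p_ab - p_a * p_b                               # observed minus expected
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared

# Hypothetical samples, invented for illustration only.
independent = ["AC"] * 25 + ["AT"] * 25 + ["GC"] * 25 + ["GT"] * 25
linked = ["AC"] * 50 + ["GT"] * 50

print(linkage_disequilibrium(independent))
print(linkage_disequilibrium(linked))
```

In the first sample every combination is equally common, so the observed AC frequency matches the product of the single-SNP frequencies and D is zero. In the second sample A always travels with C and G with T, which is the extreme case of disequilibrium.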

A set of SNPs in linkage disequilibrium is called a haplotype block. Linkage disequilibrium is statistically expected over short distances — a few Kb — but is statistically surprising over longer distances. Of course, nature is full of surprises, and examples of long-range linkage disequilibrium have been known for years. One famous example is a stretch of almost 700 Kb in the major histocompatibility complex (MHC) class II region that is implicated in type 1 diabetes.

A big, open question is to determine the size distribution of linkage-disequilibriated regions across the genome. What is the average size of such regions? Are there many long stretches of linkage disequilibrium, or are the few known examples just flukes?

This question is of extreme practical importance, because it determines the number of SNPs needed to survey a portion of the genome or scan the entire genome in a gene discovery project.

A number of pilot projects have been undertaken to address this question, including studies of two complete chromosomes (21 and 22). There is a nice review of some of this work by Richard Judson et al from Genaissance.

A team from Perlegen led by David Cox plastered chromosome 21 with 24,000 common SNPs — almost one per Kb. They found that, on average, linkage disequilibrium extended for 7.8 Kb.

A team led by Ian Dunham of the Sanger Institute used only 1,500 SNPs (about one per 20 Kb) and focused on finding long stretches of linkage disequilibrium in chromosome 22. They found 97 regions of interest, each including three or more SNPs. These regions spanned more than nine Mb, covering some 30 percent of the chromosome. The average size was 93 Kb, and the longest was 804 Kb.
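To see why the size of these regions matters so much for cost, here is a back-of-the-envelope sketch. It rests on two loud assumptions that are not from either study: that the genome is roughly 3 billion bases, and that one SNP per region is enough to represent it. The function name is hypothetical, and the chromosome 22 figure described only the long regions that were found, not the whole chromosome, so treat the second number as an optimistic bound rather than a prediction.

```python
GENOME_SIZE_BASES = 3_000_000_000  # assumption: roughly 3 billion bases

def markers_needed(mean_region_kb, genome_size=GENOME_SIZE_BASES):
    """Rough count of SNPs for a genome scan, assuming one SNP per LD region."""
    return round(genome_size / (mean_region_kb * 1_000))

# Average LD extents reported in the two pilot studies.
for label, size_kb in [("chromosome 21 study, 7.8 Kb", 7.8),
                       ("chromosome 22 study, 93 Kb", 93)]:
    print(label, "->", markers_needed(size_kb), "SNPs")
```

Under these assumptions the two estimates differ by roughly a factor of twelve, which is the practical gap the genome-wide mapping effort is meant to close.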

Encouraged by these pilot results, the community has launched a major project, snappily named HapMap, to identify regions of linkage disequilibrium across the entire genome. The first grant applications for this project should be awarded soon. The whole job will probably take several years.


SNPs seem to be getting it together. We’re starting to see natural groupings — haplotype blocks — and the HapMap project has started to get these mapped.

With the HapMap, gene hunters will be able to reduce the number of SNPs that are needed in gene discovery projects. These projects will still cost millions, but it seems a cheap price to pay to discover the genes that cause important diseases, like cancer or diabetes.

The big “if” that remains is scientific. The whole concept is based on the belief that genes are a major factor in common diseases, that a reasonable number of genes account for much of the effect, and that these genes can be found by association with SNPs. We won’t know if this is true until a few pioneering groups take the plunge and give it a try. It should happen soon. Stay tuned.



The mixing of two genomes that each person inherits from his or her parents occurs on a chromosome-by-chromosome basis. In most circumstances, each Mom chromosome finds its corresponding Dad chromosome. They embrace lovingly, exchange genetic material, and create a child chromosome that combines elements of both glued together in the correct order. Typically, each child chromosome receives one or two stretches of DNA from each parent, and the overall child genome contains about 35 such pieces. Each inherited piece is generally 50 to 100 megabases in size. The boundaries between the pieces are called crossovers.

Geneticists measure crossover rates in units of centi-Morgans, named for the great geneticist Thomas Hunt Morgan. A centi-Morgan equals a one percent crossover rate.

The rough rule of thumb in the human genome is that one centi-Morgan corresponds to one megabase. The correct value, according to the latest data from a recently published study by DeCode Genetics, is 829,876 bases per cM.

The crossover rate varies by chromosome. The extremes, according to the DeCode study, are 1,048,148 bases per cM for chromosome 1 and 473,445 bases per cM for chromosome 22.

It also varies by sex, with fewer crossovers in the male lineage. The genome-wide averages are 1,158,301 bases per cM for male and 700,771 for female.
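The bases-per-cM figures are easier to compare once they are flipped into cM per megabase, the form used in the rule of thumb. The sketch below does that conversion with the DeCode numbers quoted above; the function names and the 100 Kb example distance are made up for illustration.

```python
# Bases per centi-Morgan, as quoted above from the DeCode study.
BASES_PER_CM = {
    "genome average": 829_876,
    "chromosome 1": 1_048_148,
    "chromosome 22": 473_445,
    "male": 1_158_301,
    "female": 700_771,
}

def cm_per_megabase(bases_per_cm):
    """Convert bases per cM into cM per megabase."""
    return 1_000_000 / bases_per_cm

def genetic_distance_cm(physical_bases, bases_per_cm):
    """Genetic length in cM of a stretch of DNA, given a uniform rate."""
    return physical_bases / bases_per_cm

for label, value in BASES_PER_CM.items():
    print(f"{label}: {cm_per_megabase(value):.2f} cM per Mb")

# Hypothetical example: a 100 Kb stretch at the genome-average rate.
print(genetic_distance_cm(100_000, BASES_PER_CM["genome average"]))
```

A smaller bases-per-cM figure means more crossovers per megabase, so chromosome 22 recombines more than twice as often per base as chromosome 1. The calculation assumes a uniform rate along the stretch, which real chromosomes do not have.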

The crossover rate also varies widely within chromosomes, a fact of great importance in constructing the haplotype maps discussed in the main text.

— NG



Here’s a thought experiment on all the 12 billion versions of the human genome that exist today: Imagine that each genome were sequenced, and let’s align the sequences (ignoring the complications caused by large insertions, deletions, and rearrangements). The alignment can be visualized as a vast table with a row for each genome and a column for each base position.

Next, for each column, let’s tabulate the numbers of As, Cs, Gs, and Ts that appear. Unless a particular base position is crucial for survival, it’s likely that every letter will appear somewhere in the column. In fact it’s likely that each letter will appear several million times. However, for almost all positions, almost all genomes will have the same letter — hopefully the letter that appears in the official version of the human genome stored in GenBank. Otherwise, it wouldn’t make much sense to talk about the sequence of the human genome.

For a given position, the letter that appears most often is called the major allele, and the others are called the minor alleles.

For most positions, the major allele appears in about 99.9 percent of all genomes, and the minor alleles divide up the rest. In a few positions, though, a second letter is also fairly common and appears in an appreciable fraction of genomes. When this occurs, the position is said to be polymorphic.

An arbitrary, but commonly used, cutoff is to declare a site to be polymorphic if the second letter appears in more than one percent of all genomes. Note that the remaining two letters may also appear, but they will be at the usual sub-0.1 percent level.

It is common practice to ignore the two most rare letters, and regard a polymorphic position as having only two possible values: the major allele that appears in most genomes and the minor allele that appears in one percent or more of genomes. The buzzword is biallelic.

In practice, people often work with polymorphisms whose minor allele appears in more than 10 percent of genomes, as these are easier to find and use. These are dubbed common SNPs.
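The thought experiment translates directly into code: take one column of the alignment, count the letters, and apply the cutoffs. The sketch below is illustrative only; the function name, the returned fields, and the toy columns are invented for this example, while the one percent and 10 percent cutoffs are the ones described above.

```python
from collections import Counter

def classify_site(column, polymorphic_cutoff=0.01, common_cutoff=0.10):
    """Classify one alignment column (a string of A/C/G/T, one per genome).

    Follows the biallelic convention: only the two most frequent letters
    are considered, and the rarer letters are ignored.
    """
    ranked = Counter(column).most_common()
    major = ranked[0][0]
    if len(ranked) > 1:
        minor, minor_count = ranked[1]
    else:
        minor, minor_count = None, 0   # every genome has the same letter
    minor_freq = minor_count / len(column)

    if minor_freq > common_cutoff:
        label = "common SNP"
    elif minor_freq > polymorphic_cutoff:
        label = "polymorphic"
    else:
        label = "not polymorphic"
    return {"major": major, "minor": minor,
            "minor_freq": minor_freq, "label": label}

# Toy columns of 1,000 genomes each, invented for illustration.
print(classify_site("A" * 880 + "G" * 115 + "C" * 3 + "T" * 2))
print(classify_site("A" * 970 + "G" * 30))
print(classify_site("A" * 999 + "G"))
```

The first column has a second letter in 11.5 percent of genomes and counts as a common SNP. The second clears only the one percent bar, and the third is an ordinary position where the minor allele is a rarity.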

There are thought to be 10 million to 30 million SNPs in the human genome — between 0.3 and one percent of the entire sequence. So far, about 3 million have been found and deposited in the public databases.



SNP Databases

dbSNP remains the official repository of public SNP data, but there are now many websites that offer different views of this data or selected subsets. I haven’t checked any of these sites for completeness or accuracy. A few contain data not in dbSNP, e.g., mutations that are not SNPs. Most permit searches by gene name or symbol to find SNPs within or near genes.

The table includes commercial sites that provide online services to the public. Most of these are basically online catalogs to help customers find and order assays. Some of the sites may include proprietary data that is only available to paid subscribers. HGVbase maintains an extensive list of SNP databases, including many disease-specific ones.


Institution Comments URL


US National Center for Biotechnology Information

The SNP Consortium (TSC)

Public SNP repository. Current build (106) has 4,296,024 raw human SNPs, 2,703,719 unique human SNPs, plus other organisms

Gene-based searches done most easily via LocusLink

TSC conducted major SNP discovery project



HGVbase (formerly HGBASE)

Karolinska Institute

Curated mutations from the literature (not just SNPs) and selected, putatively high quality SNPs from dbSNP

Human Gene Mutation Database

University of Wales

Curated data on mutations of all sorts (not just SNPs) associated with disease. 30,641 mutations in 1,245 genes.


University of Tokyo

SNPs in Japanese population. 190,562 SNPs.


University of Utah Genome Center

Environmental Genome Project, 553 genes

CGAP SNP index

National Cancer Institute

SNPs mapped onto UniGene clusters

dbSNP <=> refseq

National Cancer Institute

SNPs mapped onto RefSeq reference sequences and sequences from the Mammalian Gene Collection


University of Washington and Fred Hutchinson Cancer Research Center

Curated SNP data for genes involved in inflammatory responses; 85 genes

Human Chromosome 21 cSNP Database and MAP

University of Geneva and Swiss Institute of Bioinformatics

SNPs from coding regions of chromosome 21

SNP database of Genome Analysis (GAN) Group

International Agency for Research on Cancer (IARC), World Health Organization

Experimentally confirmed SNPs in genes relevant for metabolism of potential carcinogens; 313 SNPs in 54 genes


Gene oriented; 566,295 SNPs in 8,600 genes


Applied Biosystems

106,458 SNPs

SNP Database


dbSNP build 103, searchable by SNP id (rs#) only

Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory and led a bioinformatics marketing team for Compaq Computer. He is currently a senior research scientist at the Institute for Systems Biology and an affiliate professor of bioinformatics at University of Alaska-Fairbanks. Send your comments to Nat at [email protected]

