
The Great SNP Flood


Everywhere you turn, SNP discovery efforts are inundating researchers with data. Join Nat Goodman as he surfs the tidal wave and picks out the best SNP databases.

SNP discovery projects are running at full steam on every conceivable scale. These range from genome-wide efforts, such as HapMap, to meso-scale projects looking at hundreds or thousands of genes relevant to particular diseases or pathways, to good old-fashioned single-gene hunting expeditions.

Not every DNA mutation is a SNP, or single nucleotide polymorphism. Mutations happen all the time (in humans, the rate is about 100 new mutations per genome per generation), but most disappear quickly. A mutation that has survived long enough to be present in many people is called a polymorphism.

Polymorphisms are ancient. Most predate the emergence of our species. The implication is that if you see the same polymorphism in two people, you can assume it reflects the same mutational event from the distant past. An arbitrary, but standard, cutoff is to declare a mutation to be a polymorphism if it appears in more than one percent of all genomes.
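To make the one percent cutoff concrete, here is a minimal Python sketch. It assumes that "appears in more than one percent of all genomes" is measured as the frequency of the second most common allele among the sampled chromosomes. The function names and the sample data are invented for illustration and do not come from any of the databases discussed in this column.

```python
from collections import Counter

POLYMORPHISM_CUTOFF = 0.01  # the one percent convention described above


def allele_frequencies(alleles):
    """Return each allele's share of the sampled chromosomes."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphism(alleles, cutoff=POLYMORPHISM_CUTOFF):
    """True if the second most common allele exceeds `cutoff` in the sample."""
    freqs = sorted(allele_frequencies(alleles).values(), reverse=True)
    return len(freqs) > 1 and freqs[1] > cutoff


# Hypothetical samples: 200 chromosomes observed at a single site.
common_site = ["A"] * 188 + ["G"] * 12  # G at 6 percent
rare_site = ["C"] * 199 + ["T"] * 1     # T at 0.5 percent

print(is_polymorphism(common_site))  # True
print(is_polymorphism(rare_site))    # False
```

In practice the answer depends heavily on who is sampled: a variant can clear the cutoff in one population and fall below it in another.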

SNP Sources

NCBI’s dbSNP is the official repository of public SNP data. Despite the name, dbSNP contains other kinds of polymorphisms, too — including indels and STRs — but the vast majority are SNPs.

In dbSNP there are more than 17 million raw human entries, which coalesce into 9 million unique entries. About half of the unique entries are validated, which in most cases means that the same SNP was reported by multiple laboratories. A relatively small number, half a million entries, have allele frequencies, and a quarter million have genotypes.
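The relationship between raw, unique, and validated entries can be pictured with a toy Python model. This is only a sketch under loud assumptions: it treats two submissions as the same SNP when they share a chromosome, position, and allele pair, and it treats "validated" as "reported by at least two distinct labs". The records are made up, and dbSNP's actual merging and validation rules are more involved than this.

```python
from collections import defaultdict

# Hypothetical raw submissions: (lab, chromosome, position, alleles).
raw_entries = [
    ("lab_a", "chr1", 1000, "A/G"),
    ("lab_b", "chr1", 1000, "A/G"),
    ("lab_a", "chr2", 5000, "C/T"),
    ("lab_c", "chr7", 42, "G/T"),
    ("lab_c", "chr7", 42, "G/T"),
]


def coalesce(entries):
    """Group raw submissions by site and collect the labs that reported each one."""
    labs_by_site = defaultdict(set)
    for lab, chrom, pos, alleles in entries:
        labs_by_site[(chrom, pos, alleles)].add(lab)
    return dict(labs_by_site)


def validated(labs_by_site):
    """Keep only the sites reported by at least two distinct labs."""
    return {site for site, labs in labs_by_site.items() if len(labs) >= 2}


unique = coalesce(raw_entries)
print(len(raw_entries), len(unique), len(validated(unique)))  # 5 3 1
```

Note that the chr7 site is submitted twice by the same lab, so it counts as one unique entry but does not count as validated under this assumed rule.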

The international HapMap project is a major source of public SNP data. The project has contributed about 3 million SNPs to dbSNP, but its main focus is on genotyping. The plan is to genotype 1 million SNPs in 270 individuals; progress to date is about half a million SNPs in 90 individuals.

The HapMap is committed to open data access and releases all data to the public almost immediately. But there’s an interesting twist. Teams submit basic SNP information and allele frequencies to dbSNP promptly, but hold the actual genotypes on the project website until they have enough data to infer haplotypes. Anyone can access the genotypes on the project website after registering and agreeing not to “take any action that would in any way restrict the access of others to the data.” This is to prevent some clever entrepreneur from patenting the haplotypes before HapMap is able to release the information to the public.

A growing number of academic projects are doing meso-scale SNP discovery and validation in genes of special interest. Much of this work is funded by NIH special programs, namely NHLBI’s PGA initiative and NIEHS’s Environmental Genome SNP program. In addition, NCI has two large programs looking for SNPs in cancer-related genes.

These projects generally submit their data to dbSNP, but additional goodies — the latest SNPs and nice analysis tools — are often available on each project’s website. The environmental program also operates a repository that features a nice browser with links to PDB.

Two academic databases that do not routinely submit to dbSNP are ALFRED — the ALlele FREquency Database — at Yale, and the Human Gene Mutation Database at the University of Wales. ALFRED contains hand-curated allele frequency data on about 1,000 polymorphisms in 356 defined populations. The goal is to support population, rather than medical, studies. HGMD contains hand-curated data on about 40,000 mutations in more than 1,500 disease-related genes. The site also has an extensive list of locus-specific databases.

Also, Applied Biosystems and Transgenomic operate SNP databases as online product catalogs. These sites contain some useful information even if you’re not buying.

In general, dbSNP and HapMap are the 1,000-pound gorillas here. The other sites are worth a quick glance on the off chance they have some data on your gene of interest.

For a list of the websites and databases Nat used for this column, check out www.


Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and is co-founder of HD Drug Works, which tests treatments for Huntington’s Disease. Send your comments to Nat at [email protected]

