Looking for matches in all the wrong databases?
Nat Goodman tells how to find what you’re seeking at the major SNP sites
Each person possesses two versions of the genome that differ in about one base per thousand. Multiplied by the 3 billion bases of the human genome, this rate yields 3 million potential differences. Across the 12 billion human genomes on the planet, there are 10 million to 30 million sites that differ in more than one percent of the genomes. Only a small fraction of these affect the function of a gene, but studies suggest that almost all genes harbor at least one functional variation.
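The arithmetic behind these figures is easy to check. Here is a minimal sketch in Python; the variable names are mine, and the two-genomes-per-person step is inferred from the first sentence of the paragraph rather than stated as a population count.

```python
# Back-of-the-envelope check of the figures quoted above.
GENOME_BASES = 3_000_000_000        # bases in the human genome
DIFF_RATE = 1 / 1000                # about one differing base per thousand
GENOMES_ON_PLANET = 12_000_000_000  # as quoted in the text

# Expected number of differences between one person's two genome copies.
differences_per_person = round(GENOME_BASES * DIFF_RATE)

# Assumption: each person carries two genome copies.
people = GENOMES_ON_PLANET // 2

print(f"{differences_per_person:,} potential differences per person")
print(f"{people:,} people implied by 12 billion genomes")
```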
The vast majority of differences among human genomes are single nucleotide polymorphisms — spelling changes in the genome where, for example, one version of the genome might have an A while another has a C.
Roughly 3 million SNPs have already been identified:
• The SNP Consortium has discovered 1.5 million “raw” SNPs yielding about 1 million unique ones
• The public human genome sequencing project unearthed 1 million SNPs
• Celera found 2.4 million SNPs
• Incyte claims to have found 70,000 SNPs, all of which are associated with genes, by comparing EST sequences in its databases, and expects to hit 100,000 by year’s end
• Genaissance scientists published a paper in Science listing 3,899 SNPs in 313 genes that were chosen more or less at random from the roughly 7,000 entries in RefSeq that have complete genomic sequences.
Many SNPs in the Sea
Genaissance sequenced DNA from 82 unrelated individuals. For each gene, the company sequenced the exons (both coding and untranslated), up to 100 bases into the introns, and some of the upstream genomic region. It covered 720 kb per individual, yielding an average of one SNP per 185 bases, and found about two SNPs per gene that change the translated protein. Genaissance claims that it sees about an equal number of functionally important SNPs that affect regulatory sites. The company is pursuing similar studies of other genes, and claims to have discovered 100,000 SNPs in 4,200 genes, with more on the way.
The literature is replete with smaller projects looking for SNPs in specific classes of genes. A few examples:
• A 1999 paper from Eric Lander’s laboratory described a screen of 57 individuals for SNPs in 106 genes. The group found 560 SNPs in about 200kb (1 per 350 bases), of which 185 affected the translated protein (1.7 per gene).
• Another 1999 paper, from Aravinda Chakravarti’s laboratory, described a screen of 74 people for SNPs in 75 genes. The group found 874 SNPs in about 190kb (1 per 217 bases), of which 209 affected the translated protein (2.8 per gene). The same group reported considerable differences in the number of SNPs per gene, ranging from 0 for HSD11B1 to 54 for PTGIS. Interestingly, the public database now reports 16 entries for HSD11B1, which is half the average across all genes.
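The SNP densities quoted for these studies can be recomputed from the raw counts. The sketch below uses only numbers given in the text; the labels and function names are mine, and the sequence lengths are the approximate ones quoted, so the results are rough.

```python
# Figures as quoted in the text; sequence lengths are approximate ("about 200kb").
STUDIES = [
    # (label, SNPs found, bases screened, genes, protein-changing SNPs or None)
    ("Genaissance", 3899, 720_000, 313, None),
    ("Lander lab", 560, 200_000, 106, 185),
    ("Chakravarti lab", 874, 190_000, 75, 209),
]

def bases_per_snp(snps, bases):
    """Average spacing between SNPs in the screened sequence."""
    return bases / snps

def per_gene(count, genes):
    """Average number of SNPs of a given kind per gene."""
    return count / genes

for label, snps, bases, genes, protein_changing in STUDIES:
    line = f"{label}: 1 SNP per {bases_per_snp(snps, bases):.0f} bases"
    if protein_changing is not None:
        rate = per_gene(protein_changing, genes)
        line += f", {rate:.1f} protein-changing SNPs per gene"
    print(line)
```

The Lander figure comes out nearer 357 than 350, which is what you would expect when the 200kb is itself a rounded number.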
Searching High and Low
The major public SNP database is dbSNP, operated by the US National Center for Biotechnology Information. The current version of dbSNP, Build 98, contains almost 3 million raw SNPs that have been coalesced into about 1.8 million unique reference SNPs. The database also includes about 4,500 entries representing other kinds of genetic variations.
You can search dbSNP by reference SNP ID (called an rs number), by raw SNP ID (ss number), or by the local ID assigned by the submitter. One annoying feature is that you have to tell the system what kind of ID you’re using; it should be smart enough to figure this out, since reference IDs always start with ‘rs’ and raw SNP IDs start with ‘ss.’ Other search options include GenBank accession number, UniSTS number, submitter, and publication, among many others.
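Working out the ID type from its prefix takes only a few lines. This is a sketch of the idea, not dbSNP's actual logic: it assumes rs and ss numbers are the prefix followed by digits, and treats anything else as a submitter's local ID. The function name and the sample local ID are hypothetical.

```python
import re

def classify_snp_id(identifier: str) -> str:
    """Guess the kind of dbSNP identifier from its prefix."""
    text = identifier.strip().lower()
    if re.fullmatch(r"rs\d+", text):
        return "reference SNP"
    if re.fullmatch(r"ss\d+", text):
        return "raw SNP"
    # Assumption: anything that is not an rs or ss number is a local ID.
    return "submitter local ID"

for example in ("rs15493", "ss12345", "my_local_snp_7"):
    print(example, "->", classify_snp_id(example))
```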
You can also find SNPs that are associated with specific genes. You can search by gene name or gene symbol, LocusLink ID, or gene ontology keyword. This type of search brings up a LocusLink list of genes. Off to the right of each gene, there’s a row of colored letter icons. Click on the purple V, and you’re taken to the dbSNP display of all SNPs that are annotated as being associated with that gene. You can also get to the same place by starting directly from LocusLink.
The association between genes and SNPs is based on NCBI’s annotation of the human genome. DbSNP declares a SNP to be associated with a gene if it falls within 2kb of the genomic region spanned by the gene as annotated by NCBI.
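The 2kb rule is simple enough to sketch. The code below is my own illustration of the rule as described, not NCBI's implementation: the gene names and coordinates are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the annotated span
    end: int    # last base of the annotated span, inclusive

FLANK = 2000  # the 2kb window described in the text

def genes_for_snp(chrom: str, pos: int, genes: List[Gene],
                  flank: int = FLANK) -> List[str]:
    """Names of genes whose span, padded by `flank` bases, contains the SNP."""
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
    ]

# Hypothetical genes sitting 1.5kb apart on the same chromosome.
genes = [
    Gene("GENE_A", "chr1", 10_000, 20_000),
    Gene("GENE_B", "chr1", 21_500, 30_000),
]

print(genes_for_snp("chr1", 21_000, genes))  # falls in both padded spans
print(genes_for_snp("chr1", 5_000, genes))   # more than 2kb from either gene
```

Note that a SNP lying between two closely spaced genes is reported for both of them under this rule.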
The database describes the “functional implication” of each SNP. For example, does it fall in a coding region, an untranslated region, or an intron? If it’s in a coding region, does it affect the translated protein or is it silent? If it’s within an intron, does it fall in a splice site?
Naturally, the functional implication of a SNP might be different for different transcriptional variants of a gene. The way dbSNP deals with this is by reporting the SNP information separately for each gene model of the gene. This is inconvenient and redundant because most gene models are quite similar, differing perhaps in an exon or two.
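To see why the same SNP can carry different implications in different gene models, consider this simplified sketch. It is not dbSNP's method: the two models are hypothetical, strand is ignored, the two-base splice window is my assumption, and it does not attempt to decide whether a coding SNP is silent.

```python
from typing import List, NamedTuple, Tuple

class GeneModel(NamedTuple):
    name: str
    exons: List[Tuple[int, int]]  # inclusive (start, end) pairs, ascending
    cds_start: int                # first coding base
    cds_end: int                  # last coding base

SPLICE_WINDOW = 2  # assumption: intronic bases this close to an exon

def classify(pos: int, model: GeneModel) -> str:
    """Classify a SNP position against one gene model."""
    for start, end in model.exons:
        if start <= pos <= end:
            if model.cds_start <= pos <= model.cds_end:
                return "coding"
            return "untranslated"
    first, last = model.exons[0][0], model.exons[-1][1]
    if not first <= pos <= last:
        return "outside transcript"
    for start, end in model.exons:
        if 0 < start - pos <= SPLICE_WINDOW or 0 < pos - end <= SPLICE_WINDOW:
            return "splice site"
    return "intron"

# Two hypothetical models of one gene; the second skips the middle exon.
full = GeneModel("model_1", [(100, 200), (300, 400), (500, 600)], 150, 550)
skipped = GeneModel("model_2", [(100, 200), (500, 600)], 150, 550)

for pos in (120, 201, 350):
    print(pos, classify(pos, full), "/", classify(pos, skipped))
```

The SNP at position 350 is coding in the first model and intronic in the second, which is why a per-model report is needed even though most of the rows are redundant.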
You can use the popular genome browsers at NCBI, Ensembl, and the University of California at Santa Cruz to visualize SNPs relative to genes or other genomic features. The NCBI browser is the easiest one to get to, because it is directly linked to LocusLink.
As in previous articles, I found the Santa Cruz browser to be more convenient for visualizing one kind of feature against another. Two aspects of the browser facilitate this: its “full” display mode spreads the features so they don’t get in each other’s way, and the grid lines on the display make it easier to visually compare features from different lines.
One negative is that the Santa Cruz browser did not seem to contain all the SNPs that were in dbSNP. I did not investigate this in detail — it could reflect differences in annotation, or simply that the browser was operating on a slightly out-of-date database.
HGBASE is affiliated with the European Bioinformatics Institute and contains less data than dbSNP. Release 10.0, the most current, has 531,850 records.
A nice though still incomplete system is GeneSNPs, operated by the University of Utah Genome Center. The user interface provides a visual picture of where SNPs fall relative to a gene of interest, in addition to a tabular listing of the SNPs. The visual display understands the functional implication of the SNPs, making it possible, for example, to display only those SNPs that affect the translated protein. Due to funding limitations, this site only contains a modest number of genes.
A related resource is the Human Gene Mutation Database at the University of Wales College of Medicine in Cardiff. This is a highly curated database of mutations implicated in human diseases. It’s not a SNP database per se. The current release (August 14, 2001) contains information on 23,345 mutations in 1,069 genes. There’s a note on the website that Celera gets exclusive first dibs on new data.
DbSNP is an effective database, well-linked to the rest of the NCBI data universe including LocusLink and NCBI’s rendition of the genome. There are some rough edges but no show-stoppers. The biggest problem is that its connections run only to NCBI’s genome, with no equivalent connections to Ensembl, Santa Cruz, or other sites.
Of course, none of this will matter unless SNPs turn out to be useful in the search for genes that affect important diseases. Dozens of companies and academic groups are spending hundreds of millions of dollars on SNP-based gene hunts. We won’t have to wait long for the answer.
Human Gene Mutation Database http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html
Santa Cruz Genome Browser http://genome.ucsc.edu
SNP Consortium http://snp.cshl.org
dbSNP in Detail
DbSNP is downloadable in various formats, including flat files, XML, relational database dumps, and NCBI’s specialized ASN.1 format. If you want the whole database, the XML format is the one. But, beware, it’s large — about 15 GB. A reasonable compromise is the flat-file format, which omits the sequence data and allele frequencies but is only about 1 GB.
I downloaded the flat file version and wrote some simple Perl scripts to analyze it. I found 2,997,034 raw SNPs grouped into 1,806,946 reference SNPs (1.66 raw SNPs per reference SNP). About 80 percent of reference SNPs mapped to a unique chromosome in NCBI’s assembly of the genome. Of these, 27 percent were associated with a gene, 1.2 percent fell in the coding region of a gene, and 0.6 percent potentially affected the gene function by changing the translated protein or disrupting a splice site.
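For a feel for the absolute numbers behind these percentages, here is a rough calculation. It assumes each percentage after "Of these" applies to the mapped reference SNPs, and since the percentages are rounded, the counts are only estimates. The variable names are mine.

```python
RAW_SNPS = 2_997_034
REFERENCE_SNPS = 1_806_946

raw_per_reference = RAW_SNPS / REFERENCE_SNPS

# Assumption: the later percentages are fractions of the mapped reference SNPs.
mapped = REFERENCE_SNPS * 0.80
estimates = {
    "mapped to a unique chromosome": mapped,
    "associated with a gene": mapped * 0.27,
    "in a coding region": mapped * 0.012,
    "potentially affecting gene function": mapped * 0.006,
}

print(f"{raw_per_reference:.2f} raw SNPs per reference SNP")
for label, count in estimates.items():
    # Round to the nearest thousand; the inputs do not justify more precision.
    print(f"{label}: about {round(count, -3):,.0f}")
```

On these assumptions, somewhere under 10,000 of the reference SNPs in the database are candidates for changing how a gene functions.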
I did the same analysis for the subset of the database that came from the SNP Consortium. There were 1,459,259 raw SNPs corresponding to 1,032,408 reference SNPs (1.41 raw per reference). A higher percentage could be mapped to single chromosomes (84 percent), but fewer were associated with genes (22 percent), and many fewer were in coding regions (0.4 percent) or potentially affected gene function (0.2 percent).
I also looked at SNPs identified by Genaissance, essentially all of which are associated with genes. Some 26 percent were in coding regions and 15 percent potentially affected gene function.
One surprising result was the number of Genaissance SNPs associated with multiple genes. Of the 3,674 associated with at least one gene, 599 (16 percent) were associated with two or more, compared to five percent for the database as a whole and 3.6 percent for the SNP Consortium. Inspection revealed that in many cases, the second gene had the name “interim” indicating an unvalidated, hypothetical gene. Excluding those, we were left with 348 multi-gene SNPs (9.5 percent).
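The counting step can be sketched as follows. This is a Python illustration rather than the Perl I used, the toy data are hypothetical, and matching any gene name that contains "interim" is an assumption about how such entries are labeled.

```python
def multi_gene_snps(snp_to_genes, skip_interim=False):
    """Return IDs of SNPs associated with two or more distinct genes."""
    hits = []
    for snp_id, genes in snp_to_genes.items():
        if skip_interim:
            # Assumption: unvalidated genes carry "interim" in their name.
            genes = [g for g in genes if "interim" not in g.lower()]
        if len(set(genes)) >= 2:
            hits.append(snp_id)
    return hits

# Hypothetical toy data, not real dbSNP records.
toy = {
    "snp_1": ["GENE_A"],
    "snp_2": ["GENE_A", "GENE_B"],
    "snp_3": ["GENE_C", "interim_0001"],
}

print(multi_gene_snps(toy))                     # snp_2 and snp_3
print(multi_gene_snps(toy, skip_interim=True))  # snp_2 only

# The percentages quoted in the text, from the counts given there.
GENE_ASSOCIATED = 3674
print(f"{599 / GENE_ASSOCIATED:.1%} multi-gene before filtering")
print(f"{348 / GENE_ASSOCIATED:.1%} multi-gene after filtering")
```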
I looked at ten of these chosen at random: rs15493, rs2231766, rs2231785, rs2231999, rs2233352, rs2233805, rs2233809, rs2233942, rs2234359, rs734094. Even though all genes had real-looking names, many turned out to be hypothetical or otherwise not well characterized. There was only one case (rs15493) where both genes were bona fide.
For these ten cases, I checked out the locations of the genes using both the NCBI and Santa Cruz browsers. In all cases, the NCBI browser showed the genes as overlapping, or sitting next to each other, and the Santa Cruz browser generally concurred.
So, if these multi-gene SNPs are an artifact, the problem lies with the genome annotation, not the SNPs.
NCBI deserves a lot of credit for providing downloadable forms of its databases. Life in bioinformatics would be easier for all of us if every public database followed NCBI’s example.