NEW YORK--The SNP Consortium, a nonprofit entity, was officially formed this month by 10 pharmaceutical companies and the UK's Wellcome Trust for the purpose of identifying and making available in a public database 300,000 single-nucleotide polymorphisms (SNPs) from the human genome. Four leading genome sequencing centers--the Whitehead Institute, Washington University School of Medicine, Stanford's Human Genome Center, and the Wellcome Trust's Sanger Centre--will generate genome-wide SNP data for the venture. A bioinformatics team led by Lincoln Stein at the Cold Spring Harbor Laboratory in New York will curate the information, hold it in escrow for quarterly releases to a public website, and submit it to the US National Center for Biotechnology Information, which will merge the so-called TSC Database with its own publicly funded SNP database.
Pharmaceutical companies participating in the $45 million initiative are: AstraZeneca, Bayer, Bristol-Myers Squibb, Glaxo Wellcome, Hoechst Marion Roussel, Hoffmann-La Roche, Novartis, Pfizer, Searle, and SmithKline Beecham. Arthur Holden, the consortium's CEO and former CEO of UK diagnostics company Celsis, said the project, which plans to map at least 150,000 of the SNPs, "will help answer questions about genetic factors that contribute to disease susceptibility and response to treatment, and suggest directions for future investigation."
Stein told BioInform that the TSC Database will be available to users in several formats: as an Oracle database, as an ACeDB database, and in flatfile format. "We'll have interactive web pages that are nicely formatted HTML text," Stein added, "but people who want complete dumps of the data can get flatfiles."
The major informatics challenge of the initiative, according to Stein, will be to reconcile contradictions that he said will inevitably turn up in the data. "We may find that the same SNP appears on different chromosomes in different people's maps," he explained. "So we will need to develop some combination of computational tools and probably human elbow grease to resolve those contradictions. Most of them will be small, but some are going to require human attention," Stein added.
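As a rough illustration of the kind of computational check Stein describes, the sketch below flags any SNP that has been placed on more than one chromosome so it can be passed to a human curator. It is not the consortium's tool: the function name, the SNP identifiers, and the (SNP, chromosome) record layout are all assumptions made for the example.

```python
from collections import defaultdict


def find_chromosome_conflicts(placements):
    """Return {snp_id: sorted list of chromosomes} for every SNP that
    was reported on more than one chromosome.

    `placements` is an iterable of (snp_id, chromosome) pairs. This
    record layout is hypothetical, chosen only for illustration.
    """
    seen = defaultdict(set)
    for snp_id, chromosome in placements:
        seen[snp_id].add(chromosome)
    # Keep only the SNPs whose reported placements disagree.
    return {
        snp_id: sorted(chromosomes)
        for snp_id, chromosomes in seen.items()
        if len(chromosomes) > 1
    }


# Hypothetical placements as they might arrive from different maps.
placements = [
    ("snp_0001", "7"),
    ("snp_0001", "7"),   # consistent duplicate, not a conflict
    ("snp_0002", "3"),
    ("snp_0002", "X"),   # same SNP, different chromosome: a conflict
]

conflicts = find_chromosome_conflicts(placements)
for snp_id, chromosomes in conflicts.items():
    print(f"{snp_id}: reported on chromosomes {', '.join(chromosomes)}")
```

A check like this only surfaces the contradictions; deciding which placement is right is the part Stein expects will sometimes need human attention.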
Stein and four others at Cold Spring Harbor will use standard bioinformatics tools such as BLAST to identify overlaps between candidate SNPs and the genomic sequence in GenBank. To map the SNPs, Stein said he will rely on a 30,000-gene map that was published last year through the collaborative efforts of the Sanger Centre, Stanford, Whitehead, Genethon, and Oxford University. "It's a good source because it tells us what nearby genes and expressed sequences are," Stein commented. "If we don't get a hit on that map, we will use secondary sources such as Washington University's BAC physical map. If that fails we'll have to go to GenBank features, which may be cytogenetic location or just a chromosome," he elaborated.
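The fallback order Stein outlines (gene map first, then the BAC physical map, then GenBank features) can be sketched as a simple ordered lookup. This is a minimal illustration under stated assumptions, not the Cold Spring Harbor pipeline: each source is modeled as a plain dictionary keyed by a hypothetical SNP identifier, and the location strings are invented.

```python
def locate_snp(snp_id, gene_map, bac_map, genbank_features):
    """Try each mapping source in priority order and return a
    (source_name, location) pair from the first one that has a hit.

    Returns (None, None) if no source can place the SNP. All three
    sources are hypothetical dicts of snp_id -> location string.
    """
    sources = [
        ("gene_map", gene_map),                  # preferred: gene-level context
        ("bac_physical_map", bac_map),           # secondary source
        ("genbank_features", genbank_features),  # last resort: may be coarse
    ]
    for source_name, table in sources:
        location = table.get(snp_id)
        if location is not None:
            return source_name, location
    return None, None


# Invented example data; real records would carry far more detail.
gene_map = {"snp_0001": "chr7, between two mapped genes"}
bac_map = {"snp_0002": "chr3, BAC contig"}
genbank_features = {"snp_0003": "chromosome X"}

for snp in ("snp_0001", "snp_0002", "snp_0003", "snp_0004"):
    source, location = locate_snp(snp, gene_map, bac_map, genbank_features)
    print(snp, source, location)
```

Returning the name of the source alongside the location keeps track of how precise each placement is, since a GenBank feature may give only a cytogenetic location or a chromosome.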
Labs will submit data in a common format that Stein developed based on the National Center for Biotechnology Information's draft SNP data-submission standard. "The only thing that makes this project possible is that we're all using a standard submission format and data representation formats," he remarked.
Technological demands at the end of the pipeline are another matter. Pharmaceutical companies that ultimately receive the SNP data will face technological challenges in using it. Stein said, "The major bottleneck for using this information is going to be developing robust and inexpensive assays for polymorphic alleles. That is, we know where the SNPs are, but some technique such as microarrays has to be applied to detecting these in mass scale on actual human samples."
In terms of informatics, the problem for pharmaceutical companies isn't as daunting, Stein contended. "Once SNPs are identified and mapped, they become no different from other genetic markers and can be treated as classic markers, such as RFLPs or phenotypic traits," he said, adding, "They'll just go into the standard linkage and analysis program."
--Adrienne Burke