NCBI's dbGaP to Bridge Genotype, Phenotype Data


The National Center for Biotechnology Information last week unveiled its latest resource, dbGaP, which was developed to house genotypic and phenotypic data from large-scale genome-wide association studies.

The database, the first resource to enable public access to large-scale genotype-phenotype associations, could "stimulate genome-wide research to a level that's completely unprecedented," says Jim Ostell, chief of NCBI's Information Engineering Branch.

The initial release of dbGaP includes data from two studies: the Age-Related Eye Disease Study, a 600-subject prospective study supported by the National Eye Institute; and the National Institute of Neurological Disorders and Stroke Parkinsonism Study, a case-control study that involved 2,573 subjects.

NCBI also plans to add data from other projects, including the Framingham SNP Health Association Resource Study, as well as other genome-wide association studies focusing on heart disease, women's health, neurological disorders, neuropsychiatric disorders, diabetes, and environmental factors in disease.

Ostell's group has spent the last year working closely with other National Institutes of Health institutes to develop an informatics infrastructure that "really enables a big leap for genomics and clinical science, while at the same time not violating people's privacy or consents," Ostell says.

One of the primary goals in building the database was to Web-enable huge amounts of phenotypic information from study documents, protocols, and questionnaires. "They may be on paper, they may be scanned PDFs, or they may be in people's filing cabinets," Ostell says. "We just accepted the fact that that's the way it is."

Ostell says that NCBI does not plan to impose any particular standards, though dbGaP would adopt any standards that arise from the research community.

"I think it's going to be a long time before everything in these types of studies is standardized, but certainly sections of them could be, and this database will facilitate this process," he says.

— Bernadette Toner

Short reads

Software developer Connexor has announced that BioWisdom is slated to distribute its Machinese platform along with its own Sofia Knowledge Suite. Machinese will be integrated into Sofia Editor to automatically extract information from text-based online literature.

Dresden-based bioinformatics startup Transinsight has entered a three-year collaboration with the Max Planck Institute of Molecular Cell Biology and Genetics to extend its GoPubMed search engine toward biomedical image search and analysis.

Laboratoires Fournier has recently licensed Biobase's TRANSFAC eukaryotic gene-regulation database. Fournier will use TRANSFAC, which contains data on 8,700 transcription factors, to develop therapeutics for metabolic and cardiovascular diseases.

Genomatica has announced that Diversa will utilize its biosimulation technology in an effort to develop more efficient biomanufacturing processes for biologically derived enzyme products.

Almac Diagnostics has licensed GeneGo's MetaCore data-mining platform to develop its microarray-based products for diagnosing and treating cancer.

Integromics, a Madrid-based company specializing in data management, integration, and data-mining solutions, has joined the BioIT Alliance. The alliance, sponsored by Microsoft, aims to develop new IT advances for use in the biomedical field.


US Patent 7,158,926. Cluster availability model. Inventor: Mark Kampe. Assignee: Sun Microsystems. Issued: January 2, 2007.

According to the abstract, this patent covers a method and system for creating a cluster availability model that takes into account availabilities of software components in the cluster. The invention includes "defining a repair model and failure parameters for a repair model, and modeling availabilities of software components based on the repair mode and failure parameters."

US Patent 7,158,892. Genomic messaging system. Inventors: Barry Robson and Richard Alan Mushlin. Assignee: International Business Machines. Issued: January 2, 2007.

This patent describes a computer-based method for relaying data that includes a genomic sequence. According to the abstract, the method identifies at least one genomic base in an input data stream, assigns a base-specific binary code to that base, and groups the base-specific binary codes into a data stream that reflects the sequence. The invention also assigns a "command binary code to at least one command for selectively processing said genomic data stream" and integrates that command code with the genomic data stream to produce an output binary data stream.
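
To make the idea in the abstract concrete, here is a minimal Python sketch of one way such an encoding could work. It is an illustration only, not the scheme in the patent: the 2-bit base codes, the command names and their values, and the header layout (one command byte followed by a 4-byte base count) are all assumptions made for this example.

```python
# Hypothetical 2-bit codes; the patent abstract does not give actual values.
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# Hypothetical command codes, chosen only for illustration.
COMMANDS = {"STORE": 0x01, "DISPLAY": 0x02}


def encode_message(command: str, sequence: str) -> bytes:
    """Pack a command byte, a 4-byte base count, and 2-bit base codes."""
    codes = [BASE_CODES[base] for base in sequence.upper()]
    payload = bytearray()
    for i in range(0, len(codes), 4):
        chunk = codes[i:i + 4]
        byte = 0
        for code in chunk:
            byte = (byte << 2) | code
        # Left-align a final chunk that holds fewer than four bases.
        byte <<= 2 * (4 - len(chunk))
        payload.append(byte)
    header = bytes([COMMANDS[command]]) + len(codes).to_bytes(4, "big")
    return header + bytes(payload)


def decode_message(message: bytes) -> tuple[str, str]:
    """Recover the command name and base sequence from an encoded message."""
    command_by_code = {code: name for name, code in COMMANDS.items()}
    base_by_code = {code: base for base, code in BASE_CODES.items()}
    command = command_by_code[message[0]]
    n_bases = int.from_bytes(message[1:5], "big")
    bases = []
    for byte in message[5:]:
        for shift in (6, 4, 2, 0):
            bases.append(base_by_code[(byte >> shift) & 0b11])
    # Drop the padding bases introduced by the final partial byte.
    return command, "".join(bases[:n_bases])


if __name__ == "__main__":
    message = encode_message("STORE", "GATTACA")
    print(message.hex())
    print(decode_message(message))
```

In this sketch the command code travels in a header ahead of the packed bases, so a receiver can decide how to process the sequence before decoding it. A real system would also need to handle ambiguous bases such as N, which a 2-bit code cannot represent.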

Data point

$25.6 million

Amount San Francisco-based investment firm Vector Capital paid for Tripos' informatics business.

