NCBI's dbGaP to Bridge Genotype, Phenotype Data


The National Center for Biotechnology Information last week unveiled its latest resource, dbGaP, which was developed to house genotypic and phenotypic data from large-scale genome-wide association studies.

The database — the first resource to enable public access to large-scale genotype-phenotype associations — could "stimulate genome-wide research to a level that's completely unprecedented," says Jim Ostell, branch chief at NCBI's Information Engineering Branch.

The initial release of dbGaP includes data from two studies: the Age-Related Eye Diseases Study, a 600-subject prospective study supported by the National Eye Institute; and the National Institute of Neurological Disorders and Stroke Parkinsonism Study, a case-control study that involved 2,573 subjects.

NCBI also plans to add data from other projects, including the Framingham SNP Health Association Resource Study, as well as other genome-wide association studies focusing on heart disease, women's health, neurological disorders, neuropsychiatric disorders, diabetes, and environmental factors in disease.

Ostell's group has spent the last year working closely with other National Institutes of Health institutes to develop an informatics infrastructure that "really enables a big leap for genomics and clinical science, while at the same time not violating people's privacy or consents," Ostell says.

One of the primary goals in building the database was to Web-enable huge amounts of phenotypic information from study documents, protocols, and questionnaires. "They may be on paper, they may be scanned PDFs, or they may be in people's filing cabinets," Ostell says. "We just accepted the fact that that's the way it is."

Ostell says that NCBI does not plan to impose any particular standards, though dbGaP would adopt any standards that arise from the research community.

"I think it's going to be a long time before everything in these types of studies is standardized, but certainly sections of them could be, and this database will facilitate this process," he says.

— Bernadette Toner

Short reads

Software developer Connexor has announced that BioWisdom is slated to distribute its Machinese platform along with its own Sofia Knowledge Suite. Machinese will be integrated into Sofia Editor to automatically extract information from text-based online literature.

Dresden-based bioinformatics startup Transinsight has entered a three-year collaboration with the Max Planck Institute of Molecular Cell Biology and Genetics to extend its GoPubMed search engine toward biomedical image search and analysis.

Laboratoires Fournier has recently licensed Biobase's TRANSFAC eukaryotic gene-regulation database. Fournier will use TRANSFAC, which contains data on 8,700 transcription factors, to develop therapeutics for metabolic and cardiovascular diseases.

Genomatica has announced that Diversa will utilize its biosimulation technology in an effort to develop more efficient biomanufacturing processes for biologically derived enzyme products.

Almac Diagnostics has licensed GeneGo's MetaCore data-mining platform to develop its microarray-based products for diagnosing and treating cancer.

Integromics, a Madrid-based company specializing in data management, integration, and data mining solutions, joined the BioIT Alliance. The alliance, sponsored by Microsoft, aims to develop new IT advances for use in the biomedical field.


US Patent 7,158,926. Cluster availability model. Inventor: Mark Kampe. Assignee: Sun Microsystems. Issued: January 2, 2007.

According to the abstract, this patent covers a method and system for creating a cluster availability model that takes into account availabilities of software components in the cluster. The invention includes "defining a repair model and failure parameters for a repair model, and modeling availabilities of software components based on the repair mode and failure parameters."

US Patent 7,158,892. Genomic messaging system. Inventors: Barry Robson and Richard Alan Mushlin. Assignee: International Business Machines. Issued: January 2, 2007.

This patent describes a computer-based method for relaying data that includes a genomic sequence. According to the abstract, the method identifies at least one genomic base in an input data stream, assigns a base-specific binary code to that base, and groups the base-specific binary code from a data stream that reflects the sequence. The invention also assigns a "command binary code to at least one command for selectively processing said genomic data stream" and integrates the binary code and the genomic data stream to produce an output binary data stream.
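
As a rough illustration of the general idea only, and not the patented method, the Python sketch below packs bases into two-bit codes and prefixes the result with a command byte. The specific bit values, the `CMD_BEGIN_SEQUENCE` code, the two-byte length field, and both function names are assumptions made for this example; none of them come from the patent.

```python
# Illustrative only: the 2-bit codes, the command byte, and the message
# layout are assumptions for this sketch, not values from the patent.
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CMD_BEGIN_SEQUENCE = 0x01  # hypothetical command code


def encode_bases(sequence: str) -> bytes:
    """Pack bases four per byte, two bits each, zero-padding the last byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODES[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def build_message(sequence: str) -> bytes:
    """Combine a command byte, a 2-byte base count, and the packed bases."""
    header = bytes([CMD_BEGIN_SEQUENCE]) + len(sequence).to_bytes(2, "big")
    return header + encode_bases(sequence)
```

The base count in the header lets a receiver tell padding bits apart from real bases when the sequence length is not a multiple of four.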

Data point

$25.6 million

Amount San Francisco-based investment firm Vector Capital paid for Tripos' informatics business.

