Driven by GAIN, Other Projects, NCBI Tackles Challenges of Merging Phenotype and Genotype


The National Center for Biotechnology Information is entering new territory as it prepares to manage data from the National Institutes of Health's recently launched GAIN (Genetic Association Information Network) project.

GAIN, which NIH kicked off in February, plans to genotype 1,000 to 2,000 patient samples from up to seven diseases and to make all the associated genotype and phenotype data publicly available for the scientific community. That raises a number of issues that NCBI is addressing for the first time, including patient privacy, informed consent, and user authentication.

Jim Ostell, chief of the information engineering branch at NCBI, told BioInform that NCBI is well on its way towards completing an infrastructure for managing genotype and phenotype data from GAIN as well as a number of other large-scale association studies supported by NIH.

In addition to GAIN, Ostell said that NCBI will house whole-genome association data from the following initiatives: the Genes and Environment Initiative, which is another large-scale genotyping project that NIH launched earlier this year; the Framingham Genetic Research Study, which is part of the National Heart, Lung, and Blood Institute's long-running Framingham Heart Study that involves genotyping around 9,000 study participants across three generations; four genotyping studies funded by the National Institute of Neurological Disorders and Stroke; a macular degeneration genotyping study under the auspices of the National Eye Institute; and the National Human Genome Research Institute's medical resequencing initiative.

"For us, it's a big spectrum of things that we're trying to do in as uniform a way as possible," Ostell said.

While all these initiatives build on data that NCBI already manages, namely the dbSNP database and the HapMap data, anything related to phenotype "is all new information to NCBI," Ostell said. Consequently, the associations between genotype and phenotype that result from such studies are also uncharted territory for NCBI.

Phenotype data presents three separate challenges, Ostell said. First, there is a great deal of documentation associated with it, such as patient questionnaires and study protocols, that is generally available only in paper form, and often not at all. Second, while the phenotype data itself is usually available in tabular form, there are many types of data formats, and phenotype measurements are not always labeled clearly. A column heading such as "HO112" could mean "blood pressure," for example. Third, there are no established standards for phenotype measures, largely because the meaning of the terms can vary widely from study to study. "Blood pressure" in a general physical exam is different from blood pressure in an NHLBI study, for example, in which it may be taken upon waking, after standing up, and after running, Ostell explained.

While NCBI doesn't intend to address the lack of standards, it is handling the other demands of phenotype data by first uploading the data itself and then running a quick analysis to produce a summary of the content that can be double-checked against the submitter's records.
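The "quick analysis" step can be pictured as a per-column summary of the uploaded table. The following Python sketch is purely illustrative and is not NCBI's code: the column names, the sample rows, and the `summarize_table` helper are all hypothetical. It reports how many values are present and missing in each column, plus either min/mean/max or the distinct values, which is the sort of summary a submitter could compare against their own records.

```python
import csv
import io
from statistics import mean


def summarize_table(text):
    """Return a per-column summary of a tab-delimited phenotype table.

    Hypothetical helper: numeric columns get min/mean/max, other columns
    get their sorted distinct values. Empty cells count as missing.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            # Short rows yield None for absent cells; treat them as empty.
            columns[name].append(row[name] or "")

    summary = {}
    for name, values in columns.items():
        present = [v for v in values if v != ""]
        info = {"n": len(present), "missing": len(values) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            # Not every value is numeric, so list the distinct values instead.
            info["distinct"] = sorted(set(present))
        else:
            if numbers:
                info.update(min=min(numbers), mean=mean(numbers), max=max(numbers))
        summary[name] = info
    return summary


# Made-up example data; "HO112" stands in for an opaque column heading.
SAMPLE = "SUBJECT\tHO112\tSEX\nS1\t120\tF\nS2\t\tM\nS3\t135\tF\n"

if __name__ == "__main__":
    for column, info in summarize_table(SAMPLE).items():
        print(column, info)
```

A summary like this exposes only aggregate values, so a mislabeled or truncated column can be spotted without anyone inspecting individual records.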

The supporting documentation is then shipped off to an external contractor for conversion to XML so that it can be linked to the phenotype data. From there, "we can generate a style sheet that allows us to render this XML back out as an HTML document that looks and feels a lot like the original form or questionnaire," Ostell said. This framework enables users to submit queries via the Entrez interface using phenotypic terms to pull up every study that measured blood pressure, for example, "but we can also show you the question in the context of the rest of the questionnaire or protocol, so you can see how the data was collected at the time."
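As a rough illustration of that round trip, the sketch below parses a small XML questionnaire, finds the question that mentions a search term, and renders the whole form back out as HTML with that question highlighted. It uses Python's standard `xml.etree.ElementTree` as a stand-in for a real style sheet, and the element names, attributes, and variable IDs are invented for the example; the article does not describe NCBI's actual schema.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical questionnaire markup; the real schema is not described in the article.
QUESTIONNAIRE_XML = """
<questionnaire title="Clinic Visit Form">
  <question variable="HO111">Age at examination?</question>
  <question variable="HO112">Blood pressure after five minutes seated?</question>
</questionnaire>
"""


def find_variables(xml_text, term):
    """Return variable IDs whose question text mentions the search term."""
    root = ET.fromstring(xml_text)
    return [
        q.get("variable")
        for q in root.iter("question")
        if term.lower() in (q.text or "").lower()
    ]


def render_html(xml_text, highlight=None):
    """Render the questionnaire as an HTML list, marking one variable's question."""
    root = ET.fromstring(xml_text)
    lines = [f"<h1>{escape(root.get('title', ''))}</h1>", "<ol>"]
    for question in root.iter("question"):
        text = escape(question.text or "")
        variable = question.get("variable", "")
        if variable == highlight:
            text = f"<mark>{text}</mark>"
        lines.append(f'  <li id="{escape(variable)}">{text}</li>')
    lines.append("</ol>")
    return "\n".join(lines)


if __name__ == "__main__":
    hits = find_variables(QUESTIONNAIRE_XML, "blood pressure")
    print(render_html(QUESTIONNAIRE_XML, highlight=hits[0]))
```

Keeping the variable ID on each question is what lets a search hit be shown in the context of the surrounding form rather than as an isolated column name.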

Ostell stressed that this system does not violate any patient privacy because it returns only the documentation and the summary statistics of the contents. "We don't keep any identifying information from the original patients. We don't have names, social security numbers, addresses, any of that kind of stuff," he said.

Researchers who want access to individual patient data must obtain authorization from the center running each study. NCBI will maintain a single log-in account for each user, and the individual authorizing agencies will assign each account permission to use data from particular studies. Consent agreements can vary widely across studies, and even within studies over time, which makes the authorization process quite complex, but the log-in account structure ensures that researchers have immediate access to all the studies they are authorized to view from a single resource. "NCBI itself doesn't directly authorize anybody," Ostell said. "We simply act on behalf of the instructions of the authorizing agencies."
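One way to picture this arrangement is a registry in which each study has a designated authorizing agency and the repository merely records the grants it is told about. The Python sketch below is an assumption-laden illustration, not a description of NCBI's system: the `AccessRegistry` class, the study IDs, and the agency names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """One log-in per researcher; grants are added per study."""
    username: str
    granted_studies: set = field(default_factory=set)


class AccessRegistry:
    """Hypothetical registry: it records grants but never issues them itself."""

    def __init__(self, study_authorities):
        # Maps study ID -> the agency allowed to authorize access to it.
        self._authorities = dict(study_authorities)
        self._accounts = {}

    def account(self, username):
        """Return the single account for this user, creating it if needed."""
        return self._accounts.setdefault(username, Account(username))

    def grant(self, agency, username, study):
        """Record a grant, but only when it comes from the study's own authority."""
        if self._authorities.get(study) != agency:
            raise PermissionError(f"{agency} cannot authorize access to {study}")
        self.account(username).granted_studies.add(study)

    def can_view_individual_data(self, username, study):
        return study in self.account(username).granted_studies

    def visible_studies(self, username):
        """All studies this user may view, available from one log-in."""
        return sorted(self.account(username).granted_studies)


if __name__ == "__main__":
    registry = AccessRegistry({"study_a": "agency_x", "study_b": "agency_y"})
    registry.grant("agency_x", "researcher1", "study_a")
    print(registry.visible_studies("researcher1"))
```

Because every grant is checked against the study's own authority, the registry acts only on instructions from the authorizing agencies, while the researcher still sees all permitted studies through one account.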

On the genotyping side of the equation, Ostell said that NCBI is still in discussions with vendors like Affymetrix and Illumina about "appropriate data content and formats" to ensure uniformity across studies. Like the phenotype data, summary statistics will be available to all users, but individual genotypes will only be available to authorized users.

David Wholley, GAIN director at the Foundation for NIH, described the NCBI system as "a flexible model that in our view really tries to balance the need for broad public access by the scientific community with protection for confidentiality and security of the original participants' data."

Wholley noted that the scope of the effort is outside of NCBI's traditional experience. "It's a more sophisticated and comprehensive level of data representation from an IT standpoint, and I think also from the standpoint of a data access policies and procedures model, there are really some new things here that haven't been done before."

Nevertheless, he said, NCBI was the logical choice for the data "given their stewardship of this kind of data over time and the kinds of projects that were coming on their plate."

Ostell said that the genotyping community is now at the same stage with database management as the genome sequencing community was before GenBank was created. At the time, he said, many researchers argued in favor of distributed resources rather than a centralized repository, although the benefits of the centralized model soon became clear.

"The same holds true for this class of information," he said. "Part of why you're seeing this move toward a centralized database like this is that it actually has been less useful than one might like the way it is now — scattered all over the place."

By Bernadette Toner
