The National Center for Biotechnology Information last week unveiled its latest resource, dbGaP, which was developed to house genotypic and phenotypic data from large-scale genome-wide association studies.
The database — the first resource to enable public access to large-scale genotype-phenotype associations — could “stimulate genome-wide research to a level that’s completely unprecedented,” Jim Ostell, branch chief at NCBI’s Information Engineering Branch, told BioInform this week.
The initial release of dbGaP includes data from two studies: the Age-Related Eye Diseases Study, a 600-subject prospective study supported by the National Eye Institute; and the National Institute of Neurological Disorders and Stroke Parkinsonism Study, a case-control study that involved 2,573 subjects.
NCBI also plans to add data from other projects, including the Framingham SNP Health Association Resource Study, as well as other genome-wide association studies focusing on heart disease, women’s health, neurological disorders, neuropsychiatric disorders, diabetes, and environmental factors in disease.
Ostell’s group has spent the last year working closely with other National Institutes of Health institutes to develop an informatics infrastructure that “really enables a big leap for genomics and clinical science, while at the same time not violating people’s privacy or consents,” Ostell said.
One of the primary goals in building the database was to web-enable huge amounts of phenotypic information from study documents, protocols, and questionnaires. “They may be on paper, they may be scanned PDFs, or they may be in people’s filing cabinets,” Ostell said. “We just accepted the fact that that’s the way it is.”
NCBI staffers extracted the relevant information from those documents to create a series of tables in which rows represent individuals and columns represent specific phenotypic measures. They also tagged each of the documents in XML to make them more easily searchable and to enable dynamic linking between particular sections of protocols or questionnaires and summaries of the phenotypic data.
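As a rough illustration of the table layout described above, the following Python sketch stores one row per coded individual and one column per phenotypic measure, then derives a column summary. The subject codes, measure names, and values are invented for the example and do not reflect dbGaP's actual schema.

```python
from statistics import mean

# Hypothetical phenotype table: one row per coded individual,
# one column per phenotypic measure. All names and values are invented.
phenotype_table = [
    {"subject": "S001", "age": 62, "systolic_bp": 141, "smoker": True},
    {"subject": "S002", "age": 57, "systolic_bp": 118, "smoker": False},
    {"subject": "S003", "age": 70, "systolic_bp": 133, "smoker": False},
]


def column(table, measure):
    """Return every value recorded for one phenotypic measure."""
    return [row[measure] for row in table]


def summarize(table, measure):
    """Aggregate a numeric column without exposing any single row."""
    values = column(table, measure)
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


summary = summarize(phenotype_table, "systolic_bp")
```

A column-level summary of this kind is the sort of aggregate that could be linked back to the questionnaire section that produced the measure.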
NCBI has assigned accession numbers to the phenotypic column headers, and the genotyping chips used in the studies already use NCBI’s RS SNP accession numbers. Ostell said that researchers can now use the database to find novel associations between genes and phenotypes, which would then be assigned a unique identifier of their own.
“You can create a data structure that defines the mapping between those along with p-values and LOD scores, and you can deposit that data structure here and we can give you an accession number for that,” he said. “And then when you publish that paper, you can publish that accession number. And then people could come to the new database and look at the genome through the lens of your particular published association.”
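The article does not specify what such a data structure looks like. One minimal way to sketch it in Python is shown below, assuming a record that ties a single phenotype column to a list of SNPs with their p-values and LOD scores. The class names, the accession formats, and the `deposit` function are all hypothetical stand-ins, not NCBI's format or API.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SnpAssociation:
    rs_id: str        # SNP accession (placeholder values below)
    p_value: float    # significance of the association
    lod_score: float  # linkage evidence for the association


@dataclass
class AssociationRecord:
    phenotype_accession: str  # accession of a phenotype column (made-up format)
    snps: List[SnpAssociation] = field(default_factory=list)
    accession: Optional[str] = None  # filled in on deposit


# Stand-in for the database's accession generator (assumption).
_next_id = itertools.count(1)


def deposit(record: AssociationRecord) -> str:
    """Assign an accession to the record and return it."""
    record.accession = f"ASSOC{next(_next_id):06d}"
    return record.accession


record = AssociationRecord(
    phenotype_accession="PHENO000123",
    snps=[
        SnpAssociation("rs0000001", 3.2e-8, 4.1),
        SnpAssociation("rs0000002", 1.5e-6, 3.3),
    ],
)
accession = deposit(record)
```

The returned accession is what an author would cite in a paper, so that readers could retrieve the same set of SNP-to-phenotype mappings.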
In addition, beginning with the genome, molecular biologists can link to associations and then to phenotypes that might help them form new hypotheses, while clinical researchers may rely on the resource as a pool of supporting data.
“I could say, ‘Well, my blood pressure measurement maps to this set of RS numbers. Does anything else in this database map to those RS numbers?’” Ostell said. “Or I could go on the blood pressure side and say, ‘Find me all the other studies that ever took blood pressure, and then see if any of those columns — even though they’re not identical to mine — map to similar areas of the genome, to my study.’
“And that becomes either a source for different measurements or confirming or supporting evidence without really doing another experiment or recruiting another population,” he said.
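The kind of cross-study lookup Ostell describes can be sketched as a set intersection. The example below assumes each phenotype column has already been mapped to a set of RS numbers; the study names, column names, and RS numbers are invented. It matches only on identical RS numbers, which is a simplification of "similar areas of the genome."

```python
# Hypothetical mapping from phenotype column to associated RS numbers.
associations = {
    "studyA.systolic_bp": {"rs0001", "rs0002", "rs0003"},
    "studyB.bp_sitting": {"rs0002", "rs0003", "rs0009"},
    "studyC.fasting_glucose": {"rs0007"},
    "studyD.bmi": {"rs0003"},
}


def columns_sharing_snps(associations, query_column):
    """Return other columns that map to any RS number of the query column."""
    query_snps = associations[query_column]
    hits = {}
    for other_column, snps in associations.items():
        if other_column == query_column:
            continue
        shared = query_snps & snps
        if shared:
            hits[other_column] = shared
    return hits


hits = columns_sharing_snps(associations, "studyA.systolic_bp")
```

Here the blood pressure column from one study surfaces both a differently named blood pressure column and an unrelated measure that shares a SNP, which is the sort of lead a researcher might follow up.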
Ostell said that there is also some discussion within the statistical community regarding the use of dbGaP studies as controls for other genome-wide association studies.
“I may have thousands or tens of thousands of genotypes in this database, but I only genotyped a thousand people with high blood pressure,” Ostell said. “So it turns out that it may be possible … that I could treat the other 9,000 individuals that were genotyped on the same chip as a control population.”
One way to do this would be to rely on knowledge about the incidence of disease. “I know that 10 percent of the US population has high blood pressure, so I have to assume in this control population that 10 percent of those people — even if they didn’t measure high blood pressure — probably have it,” he said.
Ostell acknowledged that there would be “a certain amount of noise” in this approach, but said that initial studies indicate that “there may be statistical frameworks where you can use the rest of the database as controls.”
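To make the "noise" concrete, here is a back-of-the-envelope Python sketch using the figures from Ostell's example. It assumes a simple mixture model in which unscreened controls contain undiagnosed cases at the population prevalence; this is an illustrative assumption, not the statistical framework Ostell alludes to, and the allele frequencies are invented.

```python
def expected_hidden_cases(n_controls, prevalence):
    """Expected number of undiagnosed cases among unscreened controls."""
    return n_controls * prevalence


def observed_control_frequency(true_control_freq, case_freq, prevalence):
    """Allele frequency seen in controls contaminated with hidden cases."""
    return (1 - prevalence) * true_control_freq + prevalence * case_freq


def corrected_control_frequency(observed_freq, case_freq, prevalence):
    """Invert the mixture to estimate the frequency among true controls."""
    return (observed_freq - prevalence * case_freq) / (1 - prevalence)


# Figures from the example: 9,000 unscreened controls, 10 percent prevalence.
hidden = expected_hidden_cases(9000, 0.10)  # roughly 900 people

# Invented allele frequencies, for illustration only.
observed = observed_control_frequency(0.20, 0.30, 0.10)
recovered = corrected_control_frequency(observed, 0.30, 0.10)
```

Under these assumptions, contamination pulls the control frequency toward the case frequency, which shrinks the apparent difference between groups and costs statistical power rather than creating spurious signals.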
These examples are “just the things that have popped up already as to how it looks like this database will help things along,” Ostell said. “Of course we won’t really know until over the next year or so as it begins to fill up with data and people try things with it.”
No Automation in Sight
The current studies in the database required a fair amount of manual effort to enter into dbGaP. Ostell said that the process may become a bit more automated in the future, but it will likely take a while.
“If people used XML and the web to collect the data in the first place, then we wouldn’t have to do this remapping,” he said, noting that several studies are beginning to use web-based forms to collect data. “So it’s certainly possible that we can push that along a little quicker, and that would ease a certain amount of this data handling and make the process of deposition more automatic, and eliminate certain levels of confusion.”
Standards for measurements and documents, however, are another matter. “There’s a lot of discussion in this area, a lot of work, and not a whole lot of agreement,” Ostell said. “There are cases where one may argue that having eight ways to measure blood pressure is arbitrary and confusing, but it’s also possible that actually it does matter because you did it a little differently on purpose.”
Ostell said that NCBI does not plan to impose any particular standards or to influence any standards initiatives that may be underway, though dbGaP would adopt any standards that do arise from the research community. “We’d just take the standard after it’s been standardized — either officially or de facto — but we ourselves don’t have to be part of the standards fight,” he said.
“I think it’s going to be a long time before everything in these types of studies is standardized, but certainly sections of them could be, and this database will facilitate this process,” he said.
Authorization System on the Way
The dbGaP database has a two-tiered access model in which study documents and summary data are available for all researchers without restrictions, but access to individual-level genotypic or phenotypic data will require authorization. Information that might help identify an individual is not included in the database, and all individual data is coded by the submitting principal investigator. NCBI will not have the key to identify the original study participant. “Even if we wanted to, we couldn’t identify the people,” Ostell said.
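The two-tiered model can be sketched as a simple access check. In the Python example below, the resource names, the `can_access` function, and the idea of a per-study set of authorizations are assumptions made for illustration; the article does not describe how NCBI implements the check.

```python
from enum import Enum


class Tier(Enum):
    OPEN = "open"              # available to all researchers
    CONTROLLED = "controlled"  # requires authorization


# Hypothetical resource types, mapped to the two tiers described above.
RESOURCE_TIERS = {
    "study_documents": Tier.OPEN,
    "summary_data": Tier.OPEN,
    "individual_genotypes": Tier.CONTROLLED,
    "individual_phenotypes": Tier.CONTROLLED,
}


def can_access(resource, study, authorized_studies=frozenset()):
    """Open resources are always readable; controlled ones need authorization."""
    if RESOURCE_TIERS[resource] is Tier.OPEN:
        return True
    return study in authorized_studies


anonymous_summary = can_access("summary_data", "study_A")
anonymous_genotypes = can_access("individual_genotypes", "study_A")
approved_genotypes = can_access(
    "individual_genotypes", "study_A", authorized_studies={"study_A"}
)
```

Even an authorized researcher in this sketch would see only coded identifiers, since the key linking codes to participants stays with the submitting principal investigator.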
Some specifics of the authorization process remain unresolved, Ostell said. One reason for this is that the database must support a broad range of conditions for data access because consent requirements can vary widely across different studies.
In addition, NIH is still discussing whether to implement an additional level of security for access to studies in which participants were drawn from a small geographic area, or those that focus on a potentially “stigmatizing” condition, such as alcoholism or drug addiction.
NCBI has a prototype of the authorization system running now that should be available early next year, Ostell said. The system is integrated with one that NIH uses to authenticate the identity of grant applicants.
Researchers who want access to controlled information in dbGaP can either log in with their existing NIH grants password, or acquire one from the grants group. Once they log in, each study has a series of authorization forms that outline additional requirements or constraints. The researcher signs those forms, and the sponsoring NIH institute uses its own criteria to determine whether the researcher is qualified to have access to the data.
Ostell stressed that there are already mechanisms in place for sharing this kind of information, and that the goal of dbGaP is only to move these current mechanisms into a common database framework that will scale in step with the growing body of genome-wide association study data.