In an attempt to wrangle the exponential growth of data emerging from genome-wide association studies, a two-man team has offered a possible solution that may point the way forward for GWAS databases. Christopher O'Donnell, scientific director of the SNP Health Association Research (SHARe) Project and a member of the Framingham Heart Study and National Heart, Lung, and Blood Institute, along with Andrew Johnson, a postdoctoral fellow at NHLBI, stumbled upon an open access database model for GWAS after undertaking their own data collection for an association study last year. The study used 550,000 SNPs in 9,500 Framingham Heart Study participants as part of the SHARe Project. The researchers wanted to survey the results already published so they could compare them with the results of their own analyses. "At the time, we thought, 'Gee, this should be readily available data out there, shouldn't it?'" says O'Donnell. "So we naïvely went into this thinking we could just put this together ourselves, and I think we learned a number of lessons along the way."
The researchers' paper, published in BMC Medical Genetics, describes how they collected 118 GWAS research papers and deposited the more than 50,000 SNP-phenotype associations and other data reported in them into a freely available series of flat files. The main idea behind their database model is that users need no authorization or subscription to gain access, are under no obligation to comply with a publication embargo, and benefit from much less restrictive criteria for the inclusion of results.
"Our set includes 56,411 association results from 118 papers through March 1, 2008, whereas the NHGRI catalog currently includes only 1,117 associations from 245 papers through January 21, 2009, [so] our larger set of results may allow deeper mining and analysis of GWAS results across studies," Johnson says. "We already know from past examples in the literature that true positive, replicated genes sometimes rank relatively low among GWAS findings, and thus inclusion of more results may be preferable in such a database." In addition, he says that a more inclusive database will allow users with a candidate gene hypothesis in search of evidence to have more information at their disposal.
Despite their innovative ideas, the open access issue is still the 800-pound gorilla here, although Johnson believes the future does look bright. "I would say the major challenge in creating such a database, both here and in the future, is the question of availability of results," says Johnson. "From a glass half-full perspective, 55 percent of studies we surveyed released a moderate to complete amount of their results, so I think there is definitely a drive on the part of many scientists to make such results widely available for scrutiny and further analysis."
Another factor standing in the way of organizing and maintaining a central repository for GWAS data is the research community itself. "There's not a standard ontology that's being used by scientists in the community for defining human phenotypes that are being examined, so that makes combining datasets tough," O'Donnell says. "That is another challenge that probably the whole community needs to face." A further hurdle is making the databases available in a way that completely de-identifies study participants, so that their privacy and confidentiality are not compromised.
Right now, their database exists only as downloadable flat files that supplement their research paper in BMC Medical Genetics. To really take off, it will need a queryable Web interface and serious infrastructural support so that it can be consistently updated and maintained.
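To give a rough sense of what a first step from flat files toward something queryable could look like, here is a minimal Python sketch that loads a tab-delimited file into an in-memory SQLite table and looks up associations by SNP. It is not the researchers' method. The column names (`snp`, `phenotype`, `pvalue`, `pmid`), the function names, and the sample rows are all assumptions made up for illustration; the actual supplementary files may be laid out quite differently.

```python
import csv
import io
import sqlite3

# Placeholder rows only: the columns and values are invented for illustration
# and do not come from the published supplementary files.
SAMPLE_FLAT_FILE = (
    "snp\tphenotype\tpvalue\tpmid\n"
    "rs0000001\tphenotype_a\t1e-9\t00000001\n"
    "rs0000002\tphenotype_a\t3e-4\t00000002\n"
    "rs0000001\tphenotype_b\t2e-6\t00000003\n"
)


def load_associations(handle, conn):
    """Load a tab-delimited flat file (assumed layout) into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assoc "
        "(snp TEXT, phenotype TEXT, pvalue REAL, pmid TEXT)"
    )
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [
        (r["snp"], r["phenotype"], float(r["pvalue"]), r["pmid"])
        for r in reader
    ]
    conn.executemany("INSERT INTO assoc VALUES (?, ?, ?, ?)", rows)
    # An index on the SNP column keeps lookups fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_assoc_snp ON assoc (snp)")
    conn.commit()
    return len(rows)


def associations_for_snp(conn, snp, max_pvalue=1.0):
    """Return (phenotype, pvalue, pmid) rows for one SNP, strongest first."""
    cur = conn.execute(
        "SELECT phenotype, pvalue, pmid FROM assoc "
        "WHERE snp = ? AND pvalue <= ? ORDER BY pvalue",
        (snp, max_pvalue),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_associations(io.StringIO(SAMPLE_FLAT_FILE), conn)
    for row in associations_for_snp(conn, "rs0000001"):
        print(row)
```

Because the p-value threshold is left to the caller, a user with a candidate gene hypothesis could still see lower-ranked results, in keeping with the inclusive approach Johnson describes. A real service would still need the phenotype standardization and ongoing maintenance the researchers mention.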