The developers of HGVbase, a curated database of human sequence variation hosted by the UK’s University of Leicester, have ambitious plans for the resource, which they hope to turn into a “database journal” in which researchers can deposit the results of studies linking genotype and phenotype.
HGVbase (Human Genome Variation database), originally launched in 1998 as HGBASE (Human Genic Bi-Allelic SEquences), was the first curated collection of SNPs in the human genome, according to Anthony Brookes, a professor of bioinformatics at Leicester’s department of genetics.
After other SNP resources started coming online a few years later, including NCBI’s comprehensive dbSNP, Brookes said that he and his colleagues determined that HGVbase should not be “yet another SNP database,” and decided to shift their focus towards creating a repository for genotype-phenotype relationships.
“We realized that there wasn’t a place to put this information out there on the Internet,” Brookes said, “so we spent the last few years doing a lot of groundwork to work out what such a database should be like, what it should contain, what its structure should be, and how do we handle phenotype information. There are an infinite number of phenotypes one could describe. How do you put them all in a compact, intuitive database?”
In that time, Brookes and his team have participated in an international effort to create the XML-based Polymorphism Markup Language, and have also developed a data model called G2P to capture genotype-to-phenotype relationships. This data model will underlie the next phase of the database, to be called HGVbaseG2P, which will serve as a repository for genetic variation and disease association studies.
Last week, the project was awarded £150,000 ($293,800) under the UK-India Education and Research Initiative to incorporate data from the Indian Genome Variation Project, an initiative that began several years ago to capture data on validated SNPs, repeats, and gene duplications in more than 1,000 candidate genes in around 15,000 individuals drawn from Indian subpopulations.
Brookes said that the database will also include data from the scientific literature and other public resources. He stressed, however, that “we’re not trying to be some kind of monolithic center for all genotype information, but certain categories of it — specifically, genetic association studies represented at the summary level.”
Unlike resources such as NCBI’s recently launched dbGaP, which houses raw experimental data related to genotype and phenotype [BioInform 12-22-06], Brookes said that HGVbaseG2P will only be concerned with “summary-level descriptions.”
He noted that he is in contact with the NCBI team developing dbGaP and that “we’re all trying to make sure that we synchronize as much as possible what we’re doing.”
Brookes said that the first version of HGVbaseG2P is “very close to launching,” but did not provide a specific timeline.
Creating a Database Journal
“What we actually are ultimately trying to do is evolve our project into what I would call a database journal,” Brookes said. Currently, he said, in peer-reviewed journals, “only a very small fraction of the research done actually gets published, or gets very much visibility.”
In addition, he noted, journals are “biased toward positive findings,” so many interesting negative results — in which it is proven that a given gene or sequence is not related to a certain phenotype — “never see the light of day.”
In the database world, on the other hand, “there is no route to put the information in there easily, there’s no reward system for it, there’s no incentive, there’s no accreditation, you can’t put it on your CV, and there’s generally very little funding for those databases in the first place,” Brookes noted.
The goal of HGVbaseG2P, he said, is to combine the incentives of journal publishing with the benefits of the database model, in which information is structured for easy searching and is integrated with other resources.
“The ideal solution to our mind would be where we would link up with some journal or journals so that people who put data into our database would get a PubMed ID for that submission. It would be a database link rather than a journal reference, but we’re trying to merge those two worlds,” he said.
“It’s going to be some time before we get to that point,” Brookes conceded. “It might be a few years, or it might even be a few decades before it’s really all working, but we have to start somewhere.”
HGVbase is not the only project working to bridge journal publishing and database development. Last fall, Current BioData, a joint venture between Geneva Bioinformatics and publishing consortium Science Navigation Group, announced plans to develop a curated database of druggable protein targets called TPdb that would include certain features that are more akin to an online journal than a traditional database [BioInform 10-06-06].
In addition, last summer, the European Bioinformatics Institute began a collaboration with the UK’s PubMed Central to develop automated ways to hyperlink all molecular entities in the PubMed Central archive to records in public data resources [BioInform 08-04-06].
“I think people are realizing that the old way of publishing science needs to be upgraded, but it’s more of a cultural change that has to happen,” Brookes said.
As a first step, he said the HGVbase team has developed an online submission application to help researchers deposit association data. “If you imagine your average researcher, they do their study, they get their genotypes, their phenotypes in Excel sheets. They might do some analysis and write a manuscript on it. But asking them to then put that data into a database is very challenging,” he said.
The application “leads them step by step to entering all the information into that form. That then validates everything, checks that everything is complete, and then when they press ‘go’ it will create an XML structure and send it to our database so that it can be incorporated in the database essentially automatically.”
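The workflow Brookes describes (fill in a form, validate it, then generate XML for automatic loading) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the `association_submission` element, and the `validate` and `to_xml` helpers are hypothetical and do not reflect the actual HGVbaseG2P submission tool, the G2P data model, or the Polymorphism Markup Language schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary-level fields; NOT the real HGVbaseG2P or PML schema.
REQUIRED_FIELDS = ("study_title", "marker_id", "phenotype", "p_value")


def validate(record):
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing field: {name}")
    if not problems:
        # Only check the numeric range once every field is present.
        try:
            p = float(record["p_value"])
        except (TypeError, ValueError):
            problems.append("p_value is not a number")
        else:
            if not 0.0 <= p <= 1.0:
                problems.append("p_value must be between 0 and 1")
    return problems


def to_xml(record):
    """Serialize a record as an XML string, refusing incomplete records."""
    problems = validate(record)
    if problems:
        raise ValueError("; ".join(problems))
    root = ET.Element("association_submission")  # hypothetical element name
    for name in REQUIRED_FIELDS:
        ET.SubElement(root, name).text = str(record[name])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values only; not real study data.
    example = {
        "study_title": "Example study",
        "marker_id": "rs0000",
        "phenotype": "example trait",
        "p_value": "0.03",
    }
    print(to_xml(example))
```

In this sketch validation runs before serialization, so an incomplete record never reaches the XML step, which mirrors the order of operations in Brookes's description.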
Brookes said that his team is also working with around 15 other groups in Europe to form a consortium that will help guide development of future genotype-phenotype resources. The effort is still in its earliest stages, he said, but there is broad concern that the growing volume of genotype-phenotype information needs to be structured differently from the data held in traditional bioinformatics resources.
“There are going to be so many different resources built to hold this kind of information over the next decade, and we need to try to start with some commonality in how those things are being built so that they can be unified and tied together sooner rather than later,” he said.