EU Consortium Aims to Coordinate Rising Number of Genotype/Phenotype Databases
A consortium of European research groups and bioinformatics firms has kicked off an effort to integrate the growing body of information on human genotype-to-phenotype relationships enabled by the rise of low-cost genotyping technologies.
The five-year project, called Gen2Phen, for Genotype-to-Phenotype, began earlier this month with €11.9 million ($17.7 million) in funding from the European Union’s Seventh Framework Program. The goal of the project is to create tools and standards that will seamlessly integrate numerous genetic-variation databases and provide a unified view of these disparate data sources.
“This is not a database,” Anthony Brookes, a professor in the department of genetics at the UK’s University of Leicester and the coordinator of the project, told BioInform. “The goal is not to build one big central superior version of [the National Center for Biotechnology Information’s] dbGaP or anything like that. It’s instead trying to help organize, orchestrate, and coordinate the general field of genotype-to-phenotype databasing.”
Brookes said that the project is a response to the “tidal wave of information” that is now coming online from low-cost genome-wide association studies that link genotypic data with phenotypic information.
Brookes said that the informatics community has not yet developed standardized database infrastructures, data standards, or data models to cope with the rise of so-called G2P data. Genomic data is “unidimensional, so it’s very easy to design databases, and very easy to manage and handle,” he noted, whereas “there are a lot more complex challenges involved in handling phenotype data because it’s an infinite universe of information.”
As a result, different databases have chosen to handle and represent this information in very different ways, which has made it difficult for researchers to access data from multiple studies and compare it.
In contrast to many model organism communities, which have successfully created standards for representing and exchanging genotypic and phenotypic information, “the human genetics community has really not done that,” Brookes said.
While he acknowledged that projects like dbGaP and PharmGKB are a step in the right direction, he said these efforts are “nowhere near enough for the absolute torrent of genotype-phenotype information that’s now flowing and will continue to flow.”
Industry to Benefit
The Gen2Phen consortium comprises 15 academic centers and four industry partners, including Decode Genetics and the bioinformatics companies BioBase, PhenoSystems, and Biocomputing Platforms (see below for a complete list of participants).
The project could represent a promising opportunity for commercial players looking to provide the data-management tools required for genome-wide association studies.
For instance, Timo Kanninen, founder and CEO of Biocomputing Platforms, told BioInform via e-mail that the company plans to “build a gateway” between its own genetic data-management and analysis-software tools and the Gen2Phen databases. He said the integration would “mak[e] it possible for our customers to analyze all this data together and making it easy to submit analysis results and data to Gen2Phen and other public databases.”
Samu Karanko, product manager for Biocomputing Platforms’ Gen2Phen team, added that in the long term, the company hopes to benefit from any data-transfer standards that the project develops. In the shorter term, he said the company is looking for “much tighter integration” between its software tools and the Ensembl genome browser “so that you can seamlessly tie together your own results with Ensembl data.”
Biocomputing Platforms has been selling genetic data-management tools since 1994, but Karanko said that the company has witnessed “tremendous growth” over the last several years as genotyping costs have dropped.
“We have been alone in this field for several years, and were pretty much the only provider of commercial off-the-shelf solutions for this type of thing,” he said. “But now that the genome-wide scans have become affordable enough that most research groups can afford them, the field has really exploded in the past year or two.”
He added that the new genotyping technologies have driven demand for sophisticated data-management tools. “Three years ago it was still possible to have all your genotypes in an Excel sheet, but right now … even the genotypes from one chip won’t fit in an Excel sheet anymore.”
The Five-Year Plan
Gen2Phen is organized into several workgroups, each of which will tackle a different aspect of the project (see below for a complete list of workgroups).
Brookes said that the first six to 12 months will be spent on the “Domain Analysis and Community Relations” aspect of the project. This workgroup will reach out to the scientific community to assess what data models, ontologies, and standards already exist that might be useful in building an integrated G2P framework. The goal, he said, is for consortium members to “really become informed about where the missing pieces of the jigsaw are, and what people feel they need.”
The next phase will be dedicated to developing standard data models and terminologies. “The ultimate goal of this is to enable the community to more quickly move toward a truly holistic solution, where essentially every piece of data that’s generated in a lab that’s related to genotype-phenotype relationships has an easy path from the machine that generates that data into … whatever database or repository is out there” on the Internet, Brookes said.
A third aspect of the project involves harmonizing existing genetic and genomic databases. For this, one workgroup will focus on gene- and disease-focused databases while another group looks at genome-wide resources.
For locus-specific databases, Brookes said that the initiative plans to provide a set of “off-the-shelf” database-creation tools that researchers can download and populate with their own data. He said that the consortium is looking at the Leiden Open Variation Database and the Universal Mutation Database as potential starting points for those generic tools.
The initiative is planning a similar effort for genome-wide databases, but has decided to provide only summary-level data in order to circumvent potential privacy issues.
“For whole-genome-wide databases, one does have to think about privacy issues because the depth of information that’s being created now in a lot of these scans would allow you to identify people,” Brookes said. “So our solution there is to deliberately not store individual-level data — either genotypes or phenotypes. Only group-level data. Obviously, that means you can’t decode anyone because there is no one person’s data stored in our systems.”
Brookes said that the system will provide links to all study authors so that researchers can contact them directly if they want access to individual-level data. He acknowledged that some researchers may find fault with this model, but noted that it is exactly the same model that peer-reviewed journals currently follow for genotype-phenotype studies.
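The distinction between individual-level and group-level data can be made concrete with a small sketch. The Python below is illustrative only and is not drawn from the consortium's plans: the record layout, the sample identifiers, the group labels, and the `summarize` function are all assumptions invented for this example. It collapses per-person genotypes at a single marker into per-group allele counts and frequencies, so that the stored summary no longer carries any sample identifier.

```python
from collections import Counter, defaultdict

# Hypothetical individual-level records: (sample_id, phenotype_group, genotype).
# Every identifier and value here is invented for illustration.
records = [
    ("s1", "case", "AA"),
    ("s2", "case", "AG"),
    ("s3", "case", "GG"),
    ("s4", "control", "AG"),
    ("s5", "control", "GG"),
    ("s6", "control", "GG"),
]


def summarize(records):
    """Collapse individual genotypes into per-group allele counts and frequencies."""
    counts = defaultdict(Counter)
    for _sample_id, group, genotype in records:
        # A diploid genotype such as "AG" contributes one count per allele.
        counts[group].update(genotype)

    summary = {}
    for group, alleles in counts.items():
        total = sum(alleles.values())
        summary[group] = {
            "n_individuals": total // 2,  # two alleles per person
            "allele_counts": dict(alleles),
            "allele_freqs": {a: c / total for a, c in alleles.items()},
        }
    return summary


# Only this group-level summary would be stored; sample IDs are discarded.
summary = summarize(records)
```

In a sketch like this, anyone who needs the per-person rows would still have to go back to the study authors, which matches the access model Brookes describes.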
Other workgroups will tackle challenges associated with data integration and access technologies, data flows, and long-term sustainability. The latter is a particularly thorny issue for community-based projects of this type, Brookes noted.
“If we start having thousands of databases, all interoperable, who is going to pay for all those databases, even if they’re standardized and easy to build?” he asked. “We need to talk to funders, companies, journals, and we’ve got to think about incentives and rewards – why would a researcher bother to bring their data forward? They’d love to have it all out there to use, but what incentives or rewards can you give them to bring their own information forward?”
One model that the consortium is considering involves “a bioresource impact factor,” which would be akin to the journal impact factor but would instead track how often certain resources or pieces of data are used in the community.
Brookes said that Gen2Phen’s “vision” for a harmonized, searchable network of all genotype-phenotype databases will likely not be complete by the end of the five-year project. However, he noted, “If we’re going to try and move in that direction, what we can’t have is a thousand different groups running in different directions with different data models and different operating systems.”
He stressed that the consortium “can’t dictate what people should do” in this rapidly developing field, but said that the initiative can work with the community to “suggest some standards that they might think it’s wise to adopt — standard data models, terminologies, nomenclatures, ontologies, strategies, search tools; things that, if they build their systems around them, they still have a lot of flexibility in what they do, but it will be a lot easier, ultimately, to tie everything together.”
Organizations Participating in Gen2Phen
Gen2Phen Workpackage Leaders: