NEW YORK (GenomeWeb) – Scientists from Harvard Medical School, the University of Toronto, the Broad Institute, and other institutions are working on a project dubbed Matchmaker Exchange, which aims to connect disparate databases of genomic and phenotypic information through an application programming interface. The API would make it easier for users to search for and obtain more comprehensive information on genes and phenotypes of interest, and to connect with other researchers who are studying similar cases.
Heidi Rehm, an associate professor of pathology at HMS and director of the Laboratory for Molecular Medicine at the Partners Healthcare Center for Personalized Genetic Medicine, discussed the project during her presentation at the Bio-IT World conference last month. She is one of the roughly 40 scientists from research consortia and clinical sequencing centers that make up the Matchmaker Exchange development team. The researchers first met last year during the American Society of Human Genetics meeting to discuss what they believe are the current needs for data access and to come up with methods of building bridges between data silos.
On the one hand, "lot[s] of labs are sequencing patients with rare disease phenotypes [using] exome or whole genome approaches [and] as they do that, they find … variants in candidate genes for that disorder but [do] not have enough evidence to prove causality between the mutation in the gene and the actual disorder," Rehm told BioInform this week.
On the other hand, she added, "there are lots of people developing databases or [who] already have databases that house these cases with their phenotypes and candidate genes or even whole VCF files." With Matchmaker Exchange, "the idea is can we network all of these databases through a federated system and define an application programming interface that allows you to query from one database into all the others and find matches where the phenotype matches and the gene matches," thus making it possible to "mount additional evidence for a particular gene being causative for a particular disease," she said.
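The federated query Rehm describes can be pictured with a minimal Python sketch. It is not the Matchmaker Exchange API: the `Case` record, the `InMemoryNode` class, the `federated_match` function, and the rule that a match needs at least one shared gene and one shared phenotype term are all assumptions made for illustration, and the gene symbols and HPO-style identifiers are placeholder data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A hypothetical patient record; field names are illustrative only."""
    case_id: str
    source: str            # name of the database holding the case
    genes: frozenset       # candidate gene symbols
    phenotypes: frozenset  # phenotype terms, e.g. ontology identifiers


class InMemoryNode:
    """Stand-in for one participating database (not a real client)."""

    def __init__(self, name, cases):
        self.name = name
        self.cases = list(cases)

    def match(self, query):
        # Assumed rule: a case matches when it shares at least one
        # candidate gene and at least one phenotype term with the query.
        return [
            c for c in self.cases
            if c.genes & query.genes and c.phenotypes & query.phenotypes
        ]


def federated_match(query, nodes):
    """Send the same query to every node and pool the matches."""
    hits = []
    for node in nodes:
        hits.extend(node.match(query))
    return hits


# Made-up example data; identifiers are placeholders.
node_a = InMemoryNode("NodeA", [
    Case("A1", "NodeA", frozenset({"GENE1"}), frozenset({"HP:0001250"})),
    Case("A2", "NodeA", frozenset({"GENE2"}), frozenset({"HP:0001250"})),
])
node_b = InMemoryNode("NodeB", [
    Case("B1", "NodeB", frozenset({"GENE1"}),
         frozenset({"HP:0001250", "HP:0000252"})),
])
query = Case("Q", "local", frozenset({"GENE1"}), frozenset({"HP:0001250"}))
matches = federated_match(query, [node_a, node_b])
print([m.case_id for m in matches])
```

In this toy setup, a query entered at one site returns cases held at both, which is the behavior the API is meant to provide; a real deployment would replace the in-memory nodes with network calls to each database.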
Out of that ASHG meeting came two working groups: the Tiers Workgroup, whose purpose is to define data-sharing activities and goals, and the API Workgroup, which, as the name implies, works on APIs. So far, the technical development team has released the first version of an API, which has been used to link two databases. The first is PhenomeCentral, which contains clinical and genetic information from patients with undiagnosed rare diseases and is developed and maintained by a team at the Center for Computational Medicine at the Hospital for Sick Children in Toronto; the data in this resource comes from Canada's Care for Rare program. The second is GeneMatcher, a database that was developed by researchers at the Baylor-Hopkins Center for Mendelian Genomics and provides access to information on genes linked to inherited disorders. GeneMatcher is also linked to PhenoDB, a resource developed by the same team that holds phenotype information.
With PhenomeCentral linked to GeneMatcher, any researcher can now enter information about both the phenotype and the gene candidate into GeneMatcher and get results culled from both repositories. Users running the search from PhenomeCentral would search by phenotype information alone. Data in PhenoDB is stored using its own phenotyping ontology, but that ontology is similar to the Human Phenotype Ontology that PhenomeCentral uses, so phenotypic terms do not have to match exactly for the system to connect them to the relevant genes, Rehm said.
The Exchange team is open to new databases and it is also accepting datasets from researchers willing to have their information included in one of the existing systems in the Exchange. There aren't any specific criteria for the kinds of datasets it will accept. However, "the richer the phenotype, the more precise the matching approaches that could be taken, [so] obviously, we encourage people to put the detailed phenotype there… but simply providing the candidate gene that they are considering is sufficient," Rehm said. So far, besides PhenomeCentral and GeneMatcher, the Exchange developers are also working to integrate databases and datasets maintained by a number of other groups. Their list includes the Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER); the Leiden Open Variation Database (LOVD); Café Variome; and the data from the National Human Genome Research Institute's undiagnosed diseases program.
Meanwhile, they are writing instruction manuals to help researchers access data from each of the individual databases in the Exchange system. That is because not all the databases are linked through the API yet, Rehm explained. While users can access data from both PhenomeCentral and GeneMatcher by querying either site, those who want data from DECIPHER or LOVD have to visit and search these sites individually. "Over time through APIs we will make it easier and easier so that you don't have to search each of these systems independently," she said. The instruction sheets will eventually be available at Matchmakerexchange.org, although the site is not online at present.
It's likely that the Exchange will eventually evolve into more of a federated system with a single access point, Rehm said. That’s because the group hopes to be able to develop algorithms that can facilitate more sophisticated and complex queries. For instance, if "you had a patient with autism and a candidate gene and you were querying a database for other patients … you are going to get a lot of hits because it's not a very specific phenotype," she explained. "On the other hand, if you are querying the database with a patient with 12 different rare phenotypic features … the collection of those is going to define an incredibly rare phenotype."
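One common way to quantify the specificity Rehm describes is information content, the negative log of how often a term appears in a collection of cases. The Python sketch below is an assumption for illustration rather than the Exchange's method: the term names and counts are invented, and summing per-term values treats the features as independent, which real phenotypes often are not.

```python
import math

# Hypothetical counts of how many cases in a database carry each term.
TOTAL_CASES = 1024
TERM_COUNTS = {
    "common_feature": 256,  # seen in a quarter of all cases
    "rare_feature_a": 4,
    "rare_feature_b": 1,
}


def information_content(term):
    """Return -log2 of the fraction of cases annotated with the term."""
    return -math.log2(TERM_COUNTS[term] / TOTAL_CASES)


def specificity(terms):
    """Sum the information content of every term in a phenotype profile."""
    return sum(information_content(t) for t in terms)


broad = specificity(["common_feature"])
narrow = specificity(["rare_feature_a", "rare_feature_b"])
print(broad, narrow)
```

With these invented counts, a single common feature scores far lower than two rare ones, mirroring the contrast between a broad diagnosis such as autism and a profile built from many rare features.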
Researchers within the Exchange are currently developing matching algorithms that will "give a score for how closely the phenotype matches and then … rank your matches across all the cases that are hit," she said. They are also working on algorithms that can generate gene matching scores based on exact gene match as well as type of variant, for example, loss of function versus missense versus de novo mutations. For now, because there are only a few databases in the Exchange, it’s a relatively straightforward process to match and rank phenotype and gene hits. But as more databases come online, these matching algorithms will become more useful for helping researchers make sense of their data. Duplicating the matching algorithms in every database and enabling these more complex queries across all of them would be far too labor-intensive, so it would probably be more prudent to make the algorithms available through a single resource.
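A rough Python sketch of how such scoring and ranking might fit together follows. The Exchange's actual algorithms were still in development, so everything here is assumed for illustration: Jaccard overlap as the phenotype score, the values in `VARIANT_WEIGHTS`, the equal weighting of the two scores, and the function names.

```python
# Hypothetical weights per variant type; the values are invented.
VARIANT_WEIGHTS = {"loss_of_function": 1.0, "de_novo": 0.75, "missense": 0.5}


def phenotype_score(query_terms, case_terms):
    """Jaccard overlap between two sets of phenotype terms (0.0 to 1.0)."""
    q, c = set(query_terms), set(case_terms)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def gene_score(query_gene, case_gene, case_variant_type):
    """Zero unless the gene matches exactly; otherwise weight by variant type."""
    if query_gene != case_gene:
        return 0.0
    return VARIANT_WEIGHTS.get(case_variant_type, 0.25)


def rank_matches(query, cases):
    """Return (combined score, case id) pairs, best match first."""
    scored = []
    for case in cases:
        p = phenotype_score(query["phenotypes"], case["phenotypes"])
        g = gene_score(query["gene"], case["gene"], case["variant_type"])
        scored.append((0.5 * p + 0.5 * g, case["id"]))
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Made-up cases pooled from several databases.
query = {"gene": "GENE1", "phenotypes": ["P1", "P2"]}
cases = [
    {"id": "c3", "gene": "GENE2", "variant_type": "loss_of_function",
     "phenotypes": ["P1", "P2"]},
    {"id": "c2", "gene": "GENE1", "variant_type": "loss_of_function",
     "phenotypes": ["P1", "P3"]},
    {"id": "c1", "gene": "GENE1", "variant_type": "missense",
     "phenotypes": ["P1", "P2"]},
]
ranked = rank_matches(query, cases)
print([case_id for _, case_id in ranked])
```

Keeping the phenotype and gene scores as separate functions makes it easy to swap either one out, which matters if the same ranking logic is eventually served from a single resource rather than reimplemented in each database.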
However, a federated system is not possible at present because that would require sustained effort and funding, neither of which the project has right now. Although each individual database has its own funding sources, most of the work done for the Exchange itself is on a volunteer basis. There are a few support opportunities that the developers are exploring. The Matchmaker Exchange is one of three so-called flagship driver projects for the Global Alliance for Genomics and Health. In fact, a number of researchers involved in creating the Exchange, including Rehm, also take part in the alliance's data working group, whose roles include developing computer formats and APIs that will be used to represent and exchange genomic data. She and her colleagues are hoping to draw on funds made available for alliance work.
Rehm also hopes to garner some support for the Exchange through her involvement with the ClinGen program. She is a member of one of three research teams that were selected to share a $25 million award from NHGRI and the National Institute of Child Health and Human Development to collect and share detailed data about genomic variants that are relevant to human disease and useful for clinical practice. Her group was tasked with building a framework for evaluating and describing the roles that genomic variants play in disease development. The database developed as part of this program is also on the list of resources that the Exchange team hopes to link to via its API.
The Exchange team is working on a second version of its API. Rehm said the working group in charge of that effort is still mulling what capabilities this version of the API will support, including alternative approaches to searching. "For example, right now the way you do a query for PhenomeCentral or GeneMatcher is you are forced to enter a case into the system to execute search but [some] groups" don't think that should be required, she said. They are also exploring methods of enabling "hypothesis-free discovery," where researchers who haven’t identified a candidate disease-causing variant could simply input genomic sequence data or variant call files along with phenotype information (with appropriate consent from the patient in question) into one of the databases in the system and run queries for genes and variants that are common across all cases where a similar phenotype was observed.
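The "hypothesis-free" idea can be sketched in Python as well: find cases whose phenotype resembles the query, then count which genes carry variants across them. This is only an illustration under stated assumptions. The record layout, the Jaccard similarity test, the 0.5 threshold, and the function names are all hypothetical, and a real system would work from variant call files rather than ready-made gene lists.

```python
from collections import Counter


def similar(a, b, threshold=0.5):
    """True when the Jaccard overlap of two term sets meets the threshold."""
    a, b = set(a), set(b)
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold


def recurrent_genes(query_phenotypes, cases, threshold=0.5):
    """Count genes with variants among cases that resemble the query."""
    counts = Counter()
    n_similar = 0
    for case in cases:
        if similar(query_phenotypes, case["phenotypes"], threshold):
            n_similar += 1
            # Count each gene once per case, however many variants it has.
            counts.update(set(case["variant_genes"]))
    return n_similar, counts.most_common()


# Made-up cases; gene and phenotype labels are placeholders.
cases = [
    {"phenotypes": ["P1", "P2", "P3"], "variant_genes": ["GENE1", "GENE5"]},
    {"phenotypes": ["P1", "P2"], "variant_genes": ["GENE1", "GENE7"]},
    {"phenotypes": ["P9"], "variant_genes": ["GENE5"]},
]
n_similar, gene_counts = recurrent_genes(["P1", "P2", "P3"], cases)
print(n_similar, gene_counts)
```

A gene whose count equals the number of similar cases is shared by all of them, which makes it a natural candidate for follow-up even though no one nominated it in advance.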