Note: This story has been updated to clarify the potential relationship between GenCC and DisGeNet.
CHICAGO – The keepers of several large genomic databases, as well as several prominent sequencing companies and clinical laboratories, have come together to launch an online clearinghouse of curated information on gene-disease correlations.
The Gene Curation Coalition Database (GenCC DB) website publicly debuted last month with data from compendia including ClinGen, the Database of Genomic Variation and Phenotype in Humans Using Ensembl Resources (DECIPHER), Orphanet for rare diseases and orphan drugs, the UK-curated Gene2Phenotype database, Genomics England's PanelApp, and PanelApp Australia. Other charter participants include Ambry Genetics, Illumina, Invitae, Myriad Women’s Health, and the Mass General Brigham Laboratory for Molecular Medicine.
GenCC DB currently contains harmonized data from eight sources, comprising 3,728 submissions on 2,281 genes. Information from the Online Mendelian Inheritance in Man (OMIM) should come online soon, according to organizers from the Broad Institute and Geisinger Health System.
"Conceptually, it's the same thing for genes that ClinVar is for variants," explained Heidi Rehm, medical director of the Broad Institute's Clinical Research Sequencing Platform and chief genomics officer at Massachusetts General Hospital. "Ultimately, all of the groups participating in the GenCC are curating gene-disease relationships, and most of those are focused on monogenic disorders," said Rehm, who chairs the GenCC steering committee.
"It's a way to make people have access to all these resources and then quickly see when we all say the same thing or we say different things and they need to dig in," she added.
The GenCC website will not contain any computational tools. "There are tons of genomic datasets that aggregate information computationally. That is not our focus," Rehm said. She wants the database to concentrate exclusively on gene-disease correlations.
Indeed, there is no personally identifiable patient data on the site that would make the information subject to HIPAA privacy regulations. "It's just gene-level knowledge," Rehm said.
The key to GenCC DB is harmonization of terminology.
"It's really hard to do [research] projects and come together with your resources if you have no way to at scale analyze how you're similar and prioritize genes that are classified differently," Rehm said. "We really felt we needed to harmonize the terms so we could actually compare and then bring all the data together."
Rehm and colleagues have compared datasets to identify differences over the last several years. She said that some disparities were due to true fundamental differences in the groups' approaches to gene curation, but more often, the groups had similar goals yet reached different results.
Some of the participants, particularly the clinical laboratories, did not have a place to share their data other than through the occasional journal article, according to Rehm.
"There have been a few publications, but of course, that's not a good way to keep up to date," Rehm said. "This is a mechanism to not only bring together and harmonize existing public resources, but also then bring in the private laboratories that are doing a lot of this work as well."
For example, in a 2017 paper in Human Mutation, researchers from Ambry Genetics reviewed clinical validity assessments on four years of diagnostic exome sequencing.
Before publicly launching, GenCC conducted a Delphi survey of geneticists worldwide to harmonize terms describing gene-disease validity. Coalition participants settled on a list of classifications: definitive, strong, moderate, limited, disputed evidence, refuted evidence, animal model only, and no known disease relationship. They then mapped assertions from each dataset to these terms.
For some resources whose curation lacks enough "granularity," including OMIM and Orphanet, an additional category, "supportive," is available to describe certain gene-disease associations.
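To make the mapping step concrete, here is a minimal Python sketch of how a contributor's own labels might be translated into the harmonized terms. The category names come from the list above; the source names ("source_a", "source_b"), their labels, and the `harmonize` function are hypothetical and do not reflect any participant's actual vocabulary or GenCC's own tooling.

```python
from enum import Enum


class Validity(Enum):
    """Harmonized gene-disease validity terms named in the article."""
    DEFINITIVE = "Definitive"
    STRONG = "Strong"
    MODERATE = "Moderate"
    LIMITED = "Limited"
    DISPUTED = "Disputed Evidence"
    REFUTED = "Refuted Evidence"
    ANIMAL_MODEL_ONLY = "Animal Model Only"
    NO_KNOWN_RELATIONSHIP = "No Known Disease Relationship"
    SUPPORTIVE = "Supportive"


# Hypothetical per-source vocabularies, invented for illustration only.
SOURCE_TERM_MAP = {
    "source_a": {
        "confirmed": Validity.DEFINITIVE,
        "probable": Validity.STRONG,
        "possible": Validity.LIMITED,
    },
    # A less granular source that only records that an association exists.
    "source_b": {
        "associated": Validity.SUPPORTIVE,
    },
}


def harmonize(source: str, term: str) -> Validity:
    """Translate one source-specific label into a harmonized term.

    Raises ValueError for an unknown source or label, so that unmapped
    terms are surfaced for review rather than silently guessed.
    """
    try:
        return SOURCE_TERM_MAP[source][term.strip().lower()]
    except KeyError:
        raise ValueError(f"no mapping for {term!r} from {source!r}") from None


if __name__ == "__main__":
    print(harmonize("source_a", "Probable").value)
    print(harmonize("source_b", "associated").value)
```

Rejecting unmapped labels outright is one possible design choice; it keeps a person in the loop whenever a source introduces a term that has not yet been reviewed.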
Rehm said that OMIM curators are working to format their data now to conform to the GenCC DB submission template. "There's a little work going on behind the scenes with the OMIM dataset to get it in, but we have an agreement for them to put everything in here," she said.
Rehm also said that DisGeNet, which she likened to a computational aggregator of information, has expressed interest in collaborating with GenCC. She said that she will meet with DisGeNet later this week to compare the scoring systems of that dataset and GenCC.
"Nearly every gene has associations and scores along with a massive amount of information, and it's a little hard to tease out the content for monogenic disease interpretation," she said.
Each participating group has its own manual curation processes in place already, so GenCC is not performing its own curation. "They're all manual and they're focused on Mendelian associations in part because that's what's needed in our field right now," Rehm said.
"We want them to learn and see where our differences are, and that will make every resource better," Rehm continued. "This is a way for all of the gene-level resources to compare with everything, and then we hope that improvement in accuracy disseminates to every resource out there."
Marina DiStefano, clinical laboratory director of Geisinger Health System's precision health program, manages and double-checks the content of the GenCC website. Each entry in the database essentially is an assertion for a gene, a disease, and a mode of inheritance, according to DiStefano, who moved to Danville, Pennsylvania-based Geisinger in September after completing a postdoctoral fellowship in clinical molecular genetics at the Broad.
Each GenCC DB contributor has to explain its own assertion criteria. "Each submitter has a submitter page and they either have a publication as to how they classify or some sort of document that they put together so that everyone can check their work," DiStefano said.
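As a rough illustration of that structure, the Python sketch below models an entry as a gene, a disease, and a mode of inheritance, plus the submitter and its classification, and then flags cases where submitters disagree. The field names, the `find_discordant` helper, and the sample records are assumptions made for this example; they are not GenCC's schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One submission: a gene, a disease, and a mode of inheritance,
    plus who asserted it and how they classified it (assumed fields)."""
    gene: str
    disease: str
    inheritance: str
    submitter: str
    classification: str


def find_discordant(assertions):
    """Group assertions by (gene, disease, inheritance) and return only
    the groups in which submitters gave different classifications."""
    by_key = defaultdict(dict)
    for a in assertions:
        by_key[(a.gene, a.disease, a.inheritance)][a.submitter] = a.classification
    return {
        key: votes
        for key, votes in by_key.items()
        if len(set(votes.values())) > 1
    }


if __name__ == "__main__":
    # Placeholder records, not real gene-disease claims.
    sample = [
        Assertion("GENE1", "disease X", "autosomal dominant", "lab_a", "Definitive"),
        Assertion("GENE1", "disease X", "autosomal dominant", "lab_b", "Limited"),
        Assertion("GENE2", "disease Y", "autosomal recessive", "lab_a", "Strong"),
        Assertion("GENE2", "disease Y", "autosomal recessive", "lab_b", "Strong"),
    ]
    for key, votes in find_discordant(sample).items():
        print(key, votes)
```

Keying on all three fields means two submitters are only compared when they are describing the same gene, the same disease, and the same inheritance pattern.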
The only restriction on use of the information in GenCC DB is that the new site follows the Fort Lauderdale model, which asks users to allow the Broad to be the first to publish on the whole resource. DiStefano said that the GenCC team is hoping to submit a manuscript, likely to Genetics in Medicine, this spring, and that the paper will be posted to a preprint site at that time.
DiStefano said that she has already seen parents of a pediatric patient look at the website in search of gene-disease relationships that might explain their child's condition.
"I think the feedback has been pretty positive and I've been surprised at the number of email signups and even people willing to submit their data already," DiStefano said. As of Jan. 5, 79 people had registered to receive email updates on GenCC DB.