NEW YORK (GenomeWeb) – Color Genomics today unveiled an open-access database containing genetic and clinical information from 50,000 customers it tested for hereditary cancer risk and who agreed to share their data for research.
The knowledgebase, called Color Data, contains more than 2.8 million variants that the Burlingame, California-based company detected in customers after testing them for 30 genes associated with hereditary cancer risk. Color created the database recognizing that researchers studying the role of genetic variants in diseases lack public datasets that contain information on the pathogenicity of variants, as well as the clinical data from patients harboring those variants.
"This is an asset for the scientific community to do hypothesis-driven research … and ask questions about the epidemiology and evolution of disease," said Alicia Zhou, head of research at Color. "We think of this as a data asset built by scientists, for scientists."
Color is announcing the availability of the database today at the American Society of Human Genetics' annual meeting to encourage researchers to use the knowledgebase. The company is also trying to set an example to inspire its peers in genetic testing to do the same, Zhou said.
In recent years, academic and industry labs have been increasingly willing to partake in efforts like ClinGen, where experts are collaborating to define the clinical relevance of genes and variants in healthcare, and a growing number of labs are submitting data to the public variant database ClinVar. Around 1,000 submitters have uploaded more than 736,000 variant records to ClinVar and have classified most of these variants as likely benign, benign, uncertain significance, likely pathogenic, or pathogenic.
However, most labs aren't sharing the information from patients that supports their variant classifications, such as how many tested individuals have a particular variant, what other co-occurring variants they have, whether they were healthy or affected by an illness, and their ethnicity. Limited access to this type of information hinders understanding of the role and prevalence of genetic variants in diseases, makes it harder to definitively classify variants of uncertain significance, and slows down efforts to resolve variant classification discrepancies between labs.
"A lot of the labs summarize data from the literature for variants they submit to ClinVar, but they're not describing the case observations," said Heidi Rehm, chief genomics officer within Massachusetts General Hospital's department of medicine, and a principal investigator for ClinGen.
When ClinVar was initially proposed, the experts involved in the effort knew how important it would be to bring in phenotype information. However, the investigators at the time didn't insist that labs submit extensive phenotypic data, recognizing that doing so would complicate privacy issues and require explicit patient consent, which in turn would put extra demands on doctors and patients to communicate phenotypic information back to labs. So, the investigators focused on encouraging labs to submit their variants and interpretations as quickly as possible, on bringing variant interpretations into the public sphere, and on encouraging efforts to resolve different interpretations.
Currently, when two labs have discrepant variant classifications and want to resolve them, the scientists exchange the case-level data over email. If the labs decide to change a variant classification in ClinVar based on this exchange of information, the detailed patient evidence that supports the updated classification may remain shielded from public view.
"We need better ways of getting access to this case-level data," Rehm said. "This is a huge challenge."
Color is hoping that its database will help mitigate some of these challenges in variant classification research. Zhou highlighted, for example, that researchers can use the database to investigate the variants that co-occur with a pathogenic variant, which may help them resolve variants of uncertain significance.
If researchers are interested in variants that show up in an ethnic group, they can also query Color Data, she noted. Approximately 72 percent of the individuals in the database are Caucasian, around 9 percent are Ashkenazi Jewish, 6 percent have mixed ethnicities, 4 percent are Hispanic, around 5 percent are Asian, and just over 1 percent are African.
Color will continue to add information from consented customers to Color Data and update classifications. The database will be versioned so researchers can cite information they report in publications.
Rehm lauded Color's efforts as a good step toward creating much-needed frameworks for sharing case-level data on variants and highlighted other efforts with similar goals. Within the Global Alliance for Genomics and Health, experts around the world are working on developing standards for genomic data sharing, including creating standardized application programming interfaces (APIs) for exchanging data. These APIs have been deployed in federated data-sharing networks, such as the platform called Matchmaker Exchange, which allows users to identify patients who share novel gene candidates and overlapping phenotypic information.
At a recent meeting on its priorities for the next five years, GA4GH launched the first version of a similar federated platform that will allow researchers to query whether their peers have seen a specific variant in a patient, as opposed to whether they've seen patients with any mutation in a particular gene of interest.
Labs participating in ClinGen expert panels, which review evidence on variants and publish consensus classifications on ClinVar, have also expressed interest in developing a set of principles, rules, and eventually, a platform for sharing case-level data on variants amongst participating labs, instead of doing this work via emails and spreadsheets.
While many of the major genetic testing labs are now submitting to ClinVar, plenty aren't. Public databases shed light on variant classifications that differ between labs, which prompts labs to work with each other to reduce discrepancies and combine data to improve the accuracy of variant classifications.
This requires investment into advancing standardized approaches and data structures that facilitate sharing, which labs may not have committed to yet, Rehm said. "Our community also has to come to an agreement about what patient data can and can't be shared and how best to share the data," she added.
For example, Ambry Genetics, which is among the most prolific submitters to ClinVar, spent around $20 million to launch AmbryShare a few years ago. In contrast to Color's database, AmbryShare includes allele frequency data from more than 11,400 research-consented customers with hereditary breast and ovarian cancer, whom Ambry sequenced on its own dime. In the "frequently asked questions" section of the website, Ambry assures users that the data are de-identified and available in aggregate by variant and disease type.
Public genetic databases raise concerns about the possibility that the identity of those donating their information will be exposed, putting them at risk for discriminatory practices. Such considerations make labs hesitant to share variants, particularly rare variants, with detailed phenotypic information.
Color's database currently contains data from customers who consented to partake in research, and were explicitly told about the database and the risks of sharing their information in an open-access repository. Individuals can withdraw their consent at any time, and Color will remove their information from the database.
The company tells customers that it makes "reasonable efforts to limit queries that identify individuals as a unique or rare carrier of any variants." Database users can search for specific genetic variants and filter them according to gender, age, ethnicity, cancer history, variant classification, and zygosity, but the variant must be present in at least five Color customers.
A Color spokesperson said that the firm decided to allow searches for variants that are found in five or more people because it would allow the database to maintain power for statistical queries and prevent the common techniques used to re-identify people in de-identified datasets. According to experts GenomeWeb spoke to, however, there is no agreed-upon threshold in this regard within the scientific community.
Color's database also restricts users from identifying too many phenotypic characteristics of individuals with a particular variant. For example, users may see that nine men and one woman have a certain variant, and that nine of those individuals are Caucasian and one is African, but they won't be able to tell whether that woman is Caucasian or African.
Variants seen in as few as one person are included in the database, and will show up as part of broader queries, but those rare variants are not searchable themselves as another measure against re-identification. If researchers are interested in studying a very rare variant, Color is open to collaborating with them, but it would require the explicit consent of the customer with that variant.
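The safeguards described above (a minimum carrier count before a variant becomes searchable, and phenotype breakdowns reported one attribute at a time) can be illustrated with a minimal sketch. This is not Color's implementation: the record layout, the `MIN_CARRIERS` constant, and the `query_variant` function are hypothetical names chosen for illustration, and only the threshold of five comes from the article.

```python
from collections import Counter

# Threshold reported in the article; everything else here is hypothetical.
MIN_CARRIERS = 5


def query_variant(records, variant):
    """Return per-attribute counts for a variant, or None if it is too rare.

    `records` is a list of dicts with the hypothetical keys
    'variant', 'gender', and 'ethnicity'.
    """
    carriers = [r for r in records if r["variant"] == variant]
    if len(carriers) < MIN_CARRIERS:
        # Variants below the threshold are not directly searchable.
        return None
    # Report each attribute separately (marginal counts only), so a user
    # cannot see which gender goes with which ethnicity.
    return {
        "carriers": len(carriers),
        "gender": dict(Counter(r["gender"] for r in carriers)),
        "ethnicity": dict(Counter(r["ethnicity"] for r in carriers)),
    }


# Toy data mirroring the article's example: nine men and one woman,
# nine Caucasian and one African carrier of a made-up variant "V1".
records = [
    {"variant": "V1", "gender": "M", "ethnicity": "Caucasian"} for _ in range(8)
]
records.append({"variant": "V1", "gender": "M", "ethnicity": "African"})
records.append({"variant": "V1", "gender": "F", "ethnicity": "Caucasian"})
records.append({"variant": "V2", "gender": "F", "ethnicity": "Asian"})

print(query_variant(records, "V1"))
print(query_variant(records, "V2"))  # None: fewer than five carriers
```

In this toy dataset the result for "V1" shows nine men and one woman, and nine Caucasian and one African carrier, without revealing the woman's ethnicity. A production system would need further checks, since separate counts can still reveal a combination when every carrier shares the same value for an attribute.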
These considerations around how to limit confidentiality breaches, give reasonable assurance of privacy, and garner informed consent can be tricky, and absent standardized policies, patients' clinical data tend to stay within genetic testing labs' internal databases. "I suspect most of the identifiable clinical data will stay put locally," said Robert Cook-Deegan, a professor at Arizona State University's School for the Future of Innovation in Society. "But we still need to build the global infrastructure and sharing norms to enable interpretation of genomic variants, and [Color's database] is a great step in that direction."
Color is hoping that its peers in industry and academia will see the decisions it has made in sharing variant and clinical data and will start to consider how they can do the same. Of course, there will be labs that aren't swayed by Color's example. There are a few labs that don't want to participate in ClinVar because they view their variant classifications as proprietary, and as a differentiating factor in an increasingly competitive genetic testing market, though according to Rehm, this kind of thinking is on the decline.
Myriad Genetics is perhaps the most outspoken in its opposition to public variant databases. Last year, researchers from Myriad published an analysis comparing 4,250 unique BRCA1/2 variant classifications in ClinVar with those in the company's own proprietary database. They found that 14.5 percent were classified differently; for 12.3 percent, at least one entry in ClinVar agreed with Myriad's determination; and 73.2 percent were in agreement. Since nearly 27 percent of BRCA1/2 variant classifications didn't fully match up between Myriad and ClinVar in this analysis, the authors questioned the utility of the public repository.
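The "nearly 27 percent" figure follows from adding the two categories that were not in full agreement. A short check, using only the percentages reported above (the variable names are illustrative):

```python
# Percentages reported for Myriad's comparison of 4,250 BRCA1/2
# classifications against ClinVar.
differently_classified = 14.5
partial_agreement = 12.3
full_agreement = 73.2

# Classifications that did not fully match are the sum of the first two.
not_fully_matching = differently_classified + partial_agreement
total = not_fully_matching + full_agreement

print(f"not fully matching: {not_fully_matching:.1f}%")
print(f"all categories: {total:.1f}%")
```

The first line prints 26.8%, which rounds to the article's "nearly 27 percent," and the three categories together account for 100 percent of the compared classifications.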
This analysis, Rehm pointed out, included a lot of data in ClinVar from outdated literature and older databases. Other research teams have published papers comparing variant classifications between clinical labs in ClinVar and have reported far fewer discrepancies.
By launching an open-access database, Color is opening its variant classifications to such scrutiny, but the company welcomes this. Treating variant data as proprietary "is not the way commercial entities should play in this space," Zhou said. "Our goal is to be very transparent."
The vast majority of the variants in Color Data (precisely 2,754,640 variants) are classified as benign, approximately 58,000 are deemed likely benign, 10,900 are variants of uncertain significance, 1,200 are likely pathogenic, and 4,500 are reported as pathogenic. However, any researcher or lab interested in comparing Color's variant classifications against their own classifications could already do so based on the information in ClinVar. As of last December, Color had made more than 10,700 submissions to ClinVar.
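As a rough consistency check, the per-category counts can be summed and compared with the "more than 2.8 million variants" reported for the database as a whole. Only the benign count is given exactly; the other values are the article's approximations, so the total below is approximate too, and the dictionary is an illustrative layout rather than anything from Color Data itself.

```python
# Variant counts by classification as reported in the article.
counts = {
    "benign": 2_754_640,               # exact figure
    "likely benign": 58_000,           # approximate
    "uncertain significance": 10_900,  # approximate
    "likely pathogenic": 1_200,        # approximate
    "pathogenic": 4_500,               # approximate
}

total = sum(counts.values())
benign_share = counts["benign"] / total

print(f"total variants: {total:,}")
print(f"benign share: {benign_share:.1%}")
```

The sum comes to roughly 2.83 million, in line with the overall figure, and benign variants make up about 97 percent of it.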
Moreover, Color doesn't want to compete based on which lab is better at classifying variants. This is a common stance among labs regularly submitting to ClinVar. "When labs fight over lab classifications, it is a huge disservice to patients," Zhou said. "We as an industry should be moving toward congruous classifications. The more we can get the community and labs to work together on this the better we'll be serving the patient."