NEW YORK – At its annual plenary meeting held this week in Boston, the Global Alliance for Genomics and Health (GA4GH) announced the launch of five new standards that are designed to support and enable responsible genomic data sharing.
The five standards, developed as part of the GA4GH Connect five-year strategic plan, are Crypt4GH, Variation Representation, Phenopackets, Tool Registry Service API, and the Data Security Infrastructure Policy. They address issues in data security, cloud computing, phenotype and variant data exchange, and the ethical implications of personal data use.
"The newly approved standards and updates are a major milestone in our work under GA4GH Connect, and we anticipate several more standards will be approved in the coming months," GA4GH CEO Peter Goodhand, said in a statement. "We are also launching an update to the GA4GH connect roadmap that accelerates our goal of enabling a federated, interoperable network of genomic data tools and resources."
The Tool Registry Service (TRS) API, now in its second iteration, supports the exchange of tools and workflows for analyzing, reading, and manipulating genomic data. It is one of a series of technical standards from the GA4GH Cloud Work Stream that help genomics researchers move analysis algorithms to datasets in disparate cloud environments rather than moving the data around.
Susheel Varma, one of the product leads for the TRS API and ELIXIR Competence Centre project manager for technology and science integration, told GenomeWeb in an interview that the registry and API were designed to provide a service that allows communities of bioinformaticians from different omics fields to store and retrieve tools in a standardized way from different cloud providers.
Currently, most genomics tools and workflows are developed for use within specific environments and stored in registries tied to those environments. Since each registry requires that the tools and workflows it stores meet unique criteria in terms of things like hardware, tools in one registry may not work in other environments. This lack of platform interoperability can complicate the process of replicating studies that use tools developed to work with a specific environment.
"It is a waste of time and resources," Brian O'Connor, director of University of California Santa Cruz's computational genetics program and co-lead of the GA4GH Cloud Workstream, said in a statement. "Developers are building multiple versions of the same exact tool to fit the standards of each individual registry in which they want it to run.”
The TRS API addresses these problems by enabling the exchange of bioinformatics tools and their dependencies packaged with container technologies such as Docker, making it possible for these tools to be moved around and used on different cloud platforms. Specifically, it provides standard mechanisms that let researchers list, search, and register tools across multiple registries. It also supports tools and workflows that make use of standards such as the Common Workflow Language (CWL), the Workflow Description Language (WDL), and Nextflow. The TRS API can also act as a bridge between tool registries. Multiple registries that host different sets of tools and workflows and have implemented TRS can share information with each other, giving researchers access to tools that may not be available on their own platforms.
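As a rough illustration of how a client might list and search tools across several TRS-enabled registries, the Python sketch below builds tool-listing URLs and merges the results. It assumes the `/ga4gh/trs/v2/tools` path and the `descriptorType` filter from the published TRS v2 specification; the registry hostnames, the helper names, and the injected `fetch_json` callable are hypothetical and are not taken from any registry's client library.

```python
from urllib.parse import urlencode

# Path prefix used by TRS v2 endpoints (assumed from the TRS v2 specification).
TRS_PREFIX = "/ga4gh/trs/v2"


def tools_url(base_url, **filters):
    """Build the tool-listing URL for one registry, with optional query filters."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = base_url.rstrip("/") + TRS_PREFIX + "/tools"
    return url + ("?" + query if query else "")


def search_registries(base_urls, fetch_json, **filters):
    """Query every registry with the same filters and merge the results.

    fetch_json is a caller-supplied function (hypothetical here) that takes a
    URL and returns the decoded JSON list of tool records.
    """
    merged = []
    for base in base_urls:
        for tool in fetch_json(tools_url(base, **filters)):
            merged.append(
                {"registry": base, "id": tool.get("id"), "name": tool.get("name")}
            )
    return merged
```

Because every compliant registry answers the same request shape, the only thing that changes between registries in this sketch is the base URL, which is the interoperability point the standard is meant to provide.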
"For the tool provider, it gives you a mechanism to curate and develop a community around a particular tool," Varma said during the interview. "For the user, it gives them a canonical reference for a particular version of a tool that can be used, which also makes their bioinformatics workflow reproducible."
Varma noted that TRS complements existing resources such as Galaxy and Cromwell by making it possible for researchers to stitch together workflows that incorporate tools from these distinct platforms and run them in whatever environment they choose. Developers can also register their tools using the TRS so that these tools can be visible on multiple platforms. In addition, the TRS API works with other APIs developed by the GA4GH Cloud Work Stream group such as the Workflow Execution Service.
According to its developers, two workflow-sharing platforms – Dockstore and Biocontainers – have implemented version two of the TRS API. ELIXIR’s version – which is on Biocontainers – contains over 8,000 tools stored in over 68,000 containers. These tools have been packaged into 610 workflows so far.
The standard is also being used by researchers involved in the International Cancer Genome Consortium's Accelerating Research in Genomic Oncology project and the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine program. "It's been incredible seeing how the specification has grown," Varma said. "It gives communities the view that this is a sustainable standard and that they can use and build upon these standards."
The Variation Representation (VR) specification provides a flexible framework of computational models, schemas, and algorithms for exchanging genetic variation data. The specification, created under the auspices of the GA4GH's Genomic Knowledge Standards Work Stream, was developed with input from national information resource providers, major public initiatives, and diagnostic testing laboratories.
Robert Freimuth, co-lead of the Genomic Knowledge Standards Work Stream and an assistant professor of biomedical informatics at Mayo Clinic, said in a statement that the specification "is a step toward filling the gap between the exchange mechanisms used by the research, translational, and clinical communities, which is necessary for the implementation of genomic and precision medicine."
In emailed comments to GenomeWeb, Freimuth noted that one of the challenges with effectively utilizing genomic data for research or clinical practice "is the difficulty in exchanging test results between systems in a way that is computationally unambiguous." Furthermore, "the need to address this problem grows with each new data set and knowledge base," he said. "The GA4GH VR Specification provides a scalable solution to that challenge."
Larry Babb, the VR specification product lead and a senior principal software engineer at the Broad Institute, noted in a statement that the "specification will allow different communities 'to speak the same language' whether it's diagnostic labs and EHR vendors who are collecting samples or investigators who are accessing them."
Features of the specification include an extensible terminology and information model that provides standard computational data structures for biological concepts such as allele, sequence, variation, and genotype. It includes a machine-readable schema for structuring genetic variation data for electronic exchange, conventions for normalizing data to allow users to compare and interpret datasets collected at different institutions, and unique computed identifiers for variants.
The specification addresses some sources of ambiguity in identifying and sharing sequence variation, Reece Hart, a software engineering consultant and principal author of the VR specification, said in an interview. He used an example to explain: If an additional T is added to a run of five Ts, current standards such as HGVS and VCF might categorize the addition as an insertion or a duplication. Furthermore, depending on the standard used, the added T might be placed on one end of the run or the other.
To get around these issues, the VR specification incorporates ideas from the NCBI's SPDI project, which adjusts sequence position to account for any ambiguity resulting from insertions or deletions. "So, we would say that’s 5Ts replaced by 6Ts," Hart explained. "The value in writing it that way is that you represent the entire bounds of that ambiguity" and "you don't pick a single representative."
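Hart's example can be sketched in a few lines of Python. The function below is a simplified illustration of the idea for pure insertions only, not the normalization algorithm defined in the VR specification or in SPDI; the function name and the interbase coordinate convention are assumptions made for this sketch. It widens the affected region in both directions for as long as the inserted bases could be shifted without changing the resulting sequence, then reports the whole region as a replacement.

```python
def fully_justify_insertion(ref, pos, ins):
    """Expand an insertion to cover every position where it could be placed.

    ref: reference sequence; pos: interbase position (0..len(ref)) where ins
    is inserted. Returns (start, end, ref_span, alt_span), meaning that
    ref[start:end] is replaced by alt_span.
    """
    if not ins:
        raise ValueError("ins must be a non-empty sequence")
    if not 0 <= pos <= len(ref):
        raise ValueError("pos is outside the reference sequence")

    # Roll left: the insertion can move one base left whenever the preceding
    # reference base equals the last base of the (rotated) inserted sequence.
    start, allele = pos, ins
    while start > 0 and ref[start - 1] == allele[-1]:
        allele = allele[-1] + allele[:-1]
        start -= 1

    # Roll right: symmetric check against the following reference base.
    end, allele = pos, ins
    while end < len(ref) and ref[end] == allele[0]:
        allele = allele[1:] + allele[0]
        end += 1

    return start, end, ref[start:end], ref[start:pos] + ins + ref[pos:end]
```

With a reference of `GTTTTTC`, adding a T anywhere inside or at either edge of the run produces the same answer in this sketch: the five Ts at positions 1 to 6 are replaced by six Ts, so no single placement is singled out.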
The specification also offers a mechanism for identifying sequences: it creates digests of genomic sequences and uses these as identifiers rather than the sequence name. Since the digest is based on the sequence itself, it is consistent irrespective of what the sequence is actually named.
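A minimal Python sketch of such a content-based identifier follows. It assumes the truncated-digest scheme described in the GA4GH documentation (SHA-512, cut to the first 24 bytes, then base64url-encoded) and a `ga4gh:SQ.` prefix for sequences; the function names are illustrative, and the published specification remains the authority on the exact serialization rules.

```python
import base64
import hashlib


def sha512t24u(blob):
    """Truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence):
    """Build an identifier from the sequence content alone.

    The 'ga4gh:SQ.' prefix is an assumption based on the VR documentation;
    the sequence's name or source database plays no part in the result.
    """
    return "ga4gh:SQ." + sha512t24u(sequence.encode("ascii"))
```

Two groups that hold the same sequence under different names would compute the same identifier, while any change to the sequence itself produces a different one.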
The developers expect that the specification will simplify the task of finding and exchanging variant information. Rather than working with multiple naming schemas and formats, "we are proposing a lingua franca for how variation is represented among systems, and then a key that is computed from that data itself so that everybody can use exactly the same key," Hart said.
He noted that the VR specification isn't intended to replace standards like HGVS or VCF. "Our goal is to change the way computers talk about variation because we think that we can do that in a way that minimizes the ambiguity," he said.
One context where this standard would be a boon is in the clinical domain, where electronic health record developers are working on ways to capture genetic variation in patient records to support precision medicine, the Broad's Babb said in an interview with GenomeWeb. Many vendors are already working on incorporating sequence information into their systems using models from bodies such as HL7. "The challenge here is if folks get ahead of us and start sporadically developing these systems, they are going to spend a lot of engineering dollars and resources and have a lot of data that isn't necessarily as useful as they would like it to be," he said. As the team continues to develop the specification, they intend to work with vendors and existing clinical standards bodies to get the specification more widely adopted.
In his comments, Freimuth noted that GA4GH driver projects like the ClinGen Allele Registry and the Variant Interpretation for Cancer Consortium (VICC) were critical to the successful development of the VR specification and that their implementation "could catalyze its further adoption." Another GA4GH driver project, the BRCA Exchange, has also implemented the standard.
According to Alex Wagner, an instructor at the Washington University School of Medicine in St. Louis and VICC co-director, the specification makes it easier to locate variations stored in different repositories and incorporate knowledge from these resources. "Despite all the differences in the ways we think about these things, if we are talking about the same variant, we now have the same name for it," Wagner, who also co-led VR's development, said in an interview. "I don't need to know what ClinGen calls it, I just need to know what the variant looks like and I can compute that ID and then ask if anyone else has computed that ID in parallel." Babb, who is a member of the ClinGen team, added that the allele registry has a mechanism for researchers looking to apply the specification to their variants to do so – and the variants can be in the HGVS or VCF formats.
The developers have begun planning future iterations of the VR specification. "The schema was designed specifically to provide opportunities for expansion," Hart said. "Right now, we represent the simplest but also the most prevalent kind of variation that occurs," but they plan to expand the schema to describe more complex variation such as copy number variants. "Beyond that will be other kinds of structural variation, particularly fusions or translocations, haplotypes and genotypes," he said.
For its part, Crypt4GH is a standard file container format that is designed to help researchers share sensitive genomic data securely and keep it safe once it's been shared. According to its developers, current approaches for sharing data use encryption techniques that secure the data during transfer but do not guarantee its safety upon completion of the transfer.
"If the receiver's hard drive were to be hacked or their computer stolen, the sensitive patient information could fall into the wrong hands," Alexander Senf, Crypt4GH product lead and scientific programmer at EMBL's European Bioinformatics Institute, noted in a statement.
Crypt4GH addresses this problem by using a two-fold encryption system to protect the data – both the data and the mechanism for unlocking it are encrypted. To access the data, a researcher needs two keys: a private key to verify their identity and a second key to encrypt the data being transferred.
"The scheme is essentially an envelope encryption," Senf explained in an interview. Specifically, "the bulk of the data is encrypted in a symmetric encryption, and then we [use] a specific algorithm that allows us to keep byte level accessibility [to] the data," he said. "The envelope itself is encrypted using an asymmetric encryption scheme."
Crypt4GH works for both genomic and phenomic data, according to Senf, and with different file formats including BAM, CRAM, and VCF files. "It allow[s] us to encrypt data in such a way that access can be limited to specific people, but at the same time, it can also be included in the libraries that analysis software uses to read the data," he said. Furthermore, the encryption schema allows "access to the data in a streaming fashion, so we don't have to always have the entire file available to use it. That allows us to work on data of any size, so there's no practical limitation on it," he added.
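The byte-level, streaming access Senf describes depends on the data being encrypted in independent fixed-size segments. The Python sketch below shows the offset arithmetic a reader might use to decide which encrypted bytes to fetch for a given plaintext range. It assumes the segment layout from the published Crypt4GH specification (64 KiB plaintext segments, each stored with a 12-byte nonce and a 16-byte authentication tag); the function name and the `header_len` parameter are hypothetical, and no decryption is performed here.

```python
SEGMENT_PLAIN = 65536                   # plaintext bytes per segment (assumed from the spec)
NONCE_LEN = 12                          # per-segment nonce
MAC_LEN = 16                            # per-segment authentication tag
SEGMENT_CIPHER = SEGMENT_PLAIN + NONCE_LEN + MAC_LEN


def encrypted_range(start, length, header_len):
    """Map a plaintext byte range onto the encrypted file.

    Returns (cipher_start, cipher_end, skip): fetch bytes
    [cipher_start, cipher_end) of the encrypted file, decrypt those whole
    segments, then discard the first `skip` plaintext bytes. The final
    segment of a file may be shorter, so cipher_end is an upper bound.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    first = start // SEGMENT_PLAIN
    last = (start + length - 1) // SEGMENT_PLAIN
    cipher_start = header_len + first * SEGMENT_CIPHER
    cipher_end = header_len + (last + 1) * SEGMENT_CIPHER
    return cipher_start, cipher_end, start - first * SEGMENT_PLAIN
```

Only the segments that overlap the requested range need to be read and decrypted, which is why the overall file size does not limit what a reader can work with.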
Crypt4GH has been implemented by researchers at the European Genome-phenome Archive, Australian Genomics Health Alliance, and Wellcome Sanger Institute.
Other standards released at this year's GA4GH plenary include Phenopackets, which provides standards for sharing disease and phenotype information for diagnosing and treating rare and common diseases as well as cancer. The group also announced the release of the Data Security Infrastructure Policy, which provides standards and implementation practices for protecting the privacy and security of shared genomic and clinical data.