Genomic Data Interoperability, Remote Workflow Key to New Global Alliance APIs

CHICAGO (GenomeWeb) – The first set of interoperability standards from the Global Alliance for Genomics and Health's GA4GH Connect strategic roadmap is now available as application programming interfaces. Eventually, GA4GH plans to offer an entire suite of standards intended to facilitate secure sharing, harmonization of reference sequences, and better analysis of genomic data for both research and clinical purposes.

Released this month at GA4GH's sixth plenary meeting in Basel, Switzerland, the first four products include the Beacon API, the Workflow Execution Service (WES) API, the Htsget API, and the Refget API.

"We are now a real-time platform for the discovery of genetic mutations across a global federated network with tiered access to clinical metadata, using one account," Marc Fiume, CEO of DNAstack and Beacon API project leader, said in a presentation at the plenary meeting that was streamed live online.

Beacon, a variant search protocol, is the foundational piece for the roadmap, which itself is an early product of GA4GH Connect, a five-year initiative that GA4GH launched a year ago in hopes of better serving the international genomic data community. The alliance grew out of a need within the genomics community for collaboration, interoperability, and ways of sharing data responsibly in the research context.

In early 2017, GA4GH partnered with the European Life-Sciences Infrastructure for Biological Information (ELIXIR) initiative to establish "beacons" at multiple sites that will make it easier to search European genomic datasets as well as develop protocols for securely sharing phenotype data.

Beacon actually dates to the first Global Alliance conference, according to Fiume. Five beacons were built within a few months of the March 2014 proposal. 

"We thought, wouldn't it be a great testament to interoperability if we can actually search across these simultaneously, in real time?" Fiume recalled. This led to the creation of the Beacon Network. He described it as "a federated search engine" that now contains hundreds of public beacons.

There are now about 100 beacons "lit" at 90 institutions in 34 countries across the globe for sharing research data, according to Gary Saunders, human data coordinator for Cambridge, UK-based ELIXIR and core lead of the Beacon project.

"You can use the Beacon as a protocol to search for a particular allele at a particular site," Saunders said. This, he explained, helps to break down silos that have hindered data sharing in research and clinical genomics alike.

Because human genomic and phenotypic data is sensitive — and now subject to the European Union's General Data Protection Regulation — users have to be authorized and authenticated.

"This can be a process that takes quite some time, and then when you're permitted access, you download the data and you analyze these data locally," Saunders noted.

"The datasets that you then get access to may actually not have any of the information you're interested in. You might be interested in a certain receptor or a certain enzyme within the human genome and particular variants within that locus," he said.

Beacon essentially allows dataset owners to set up beacons on their data and let anyone use the Beacon protocol to query them for a particular allele at a particular site.
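As a rough illustration of that allele-at-a-site query, the Python sketch below models a beacon as a lookup that reveals only whether an allele has been observed. It is a toy, not the Beacon API itself: the `beacon_query` function, the in-memory dataset, the coordinates, and the `exists` field name are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory dataset: (chromosome, position) -> alleles observed there.
# The coordinates below are invented for illustration only.
Dataset = Dict[Tuple[str, int], Set[str]]


def beacon_query(dataset: Dataset, chromosome: str, position: int, allele: str) -> dict:
    """Report only whether the allele was seen at the site, nothing else."""
    observed = dataset.get((chromosome, position), set())
    return {"exists": allele.upper() in observed}


toy_dataset: Dataset = {
    ("17", 1000): {"A"},
    ("7", 2500): {"T", "G"},
}

# A "yes" tells the researcher an access application may be worthwhile;
# a "no" lets them move on without seeing any underlying records.
print(beacon_query(toy_dataset, "17", 1000, "A"))
print(beacon_query(toy_dataset, "17", 1000, "C"))
```

The point of the design is that the response carries no genotype or sample data, so a dataset owner can expose it more freely than the records behind it.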

"If you ask me, 'Have you seen this allele at this position,' I come back and I simply say yes or no," Saunders said. If the answer is no, the user can move on and not waste time on a formal application for data access. But if there is data of interest, that user can then look to register for access.

"It's a first step, a very lightweight protocol which allows users to query datasets before they have to ask for access to that data. It's to open up the data," Saunders added.

The newly approved Beacon API includes something called the ELIXIR Authentication and Authorization Infrastructure, or ELIXIR AAI, to expedite the process of applying for access to multiple databases. "I am known to you and I log in with AAI … then we can allow you to see more datasets and query more datasets behind my beacon," Saunders said.

This API also contains clinical metadata, according to Fiume. In a survey of users of public beacons, the Beacon API development team found that the majority were using beacons for either clinical research or diagnostics, particularly when looking for rare variants and variants of unknown significance. It would make beacons more useful for at least half of them if they could access case-level annotations about individuals, he said.

The 1.0 release allows users to input both structured and custom metadata. "It now allows you to use the beacon for disclosing patient information, phenotypic information, variant annotations, classifications, and also pointers to data that can be used in a cloud-workstream run," Fiume said.

After users gain access through the Beacon protocol, the Htsget API lets them stream data without having to copy and transfer large files. Refget helps people retrieve reference sequences.

"Htsget is this standard for doing secure, real-time streaming of genetic data, typically sequencing reads or genetic variation data, in the form of SNPs and indels," explained Thomas Keane, team leader of the European Genome-phenome Archive and the European Variation Archive at the European Bioinformatics Institute of the European Molecular Biology Laboratory. He also leads the large-scale genomics work stream for GA4GH.

"Refget is this new standard for retrieving reference sequences," and it attempts to standardize nomenclature in the process, Keane said.

"The idea for Refget is to try and figure out a way that we can enable the high-throughput pipelines to not get hampered by these kind of sequence naming issues that then cause inconsistencies across groups when you are just trying to do a genomic analysis. It creates headaches for bioinformaticians," Keane said.

The primary use case for Refget is the CRAM file format for storing sequencing reads. Taken together, Htsget and Refget take a "hash" of a sequence, using a checksum operation to pull and transfer a subset of the file, which CRAM requires anyway.

"This API is just formalizing the process of actually getting the sequence back from the checksum," Keane said. "It's such a fundamental operation for genomics."

Refget takes a reference sequence, normalizes it, then calculates two checksums. "Once you have these two identifiers, these go into a database, and then the API comes in with this URL sequence and the checksum identifier. It looks at what that sequence is in the background and returns back to the sequencing question over a RESTful API," Andy Yates, EMBL-EBI Genomics Technology Infrastructure team leader and a co-leader of the GA4GH genomic knowledge standards work stream, explained in a session from the recent plenary meeting.
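The normalize, checksum, and look-up flow Yates describes can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the Refget implementation: the article does not name the two checksums, so MD5 and a SHA-512 digest truncated to 24 bytes are assumed here, and the normalization rule, the `register` and `fetch` helpers, and the dictionary standing in for the database are all hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def normalize(sequence: str) -> str:
    """Assumed normalization: drop all whitespace and upper-case the bases."""
    return "".join(sequence.split()).upper()


def checksums(sequence: str) -> Tuple[str, str]:
    """Return two identifiers for a sequence (assumed: MD5 and truncated SHA-512)."""
    data = normalize(sequence).encode("ascii")
    md5_id = hashlib.md5(data).hexdigest()
    trunc512_id = hashlib.sha512(data).digest()[:24].hex()
    return md5_id, trunc512_id


# Hypothetical stand-in for the server-side database: identifier -> sequence.
store: Dict[str, str] = {}


def register(sequence: str) -> Tuple[str, str]:
    """Index a sequence under both identifiers."""
    ids = checksums(sequence)
    for identifier in ids:
        store[identifier] = normalize(sequence)
    return ids


def fetch(identifier: str) -> str:
    """Get the sequence back from either checksum."""
    return store[identifier]


md5_id, trunc512_id = register("acgt\nacgt\n")
print(fetch(md5_id) == fetch(trunc512_id) == "ACGTACGT")
```

Because the identifier is derived from the sequence content, two groups that name the same chromosome differently still arrive at the same key.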

The European Genome-phenome Archive (EGA) already has made the Htsget standard part of its own data-access API. With the Broad Institute's Integrative Genomics Viewer or other integrated genome browsers, users can read hashes directly from EGA without having to download a file.

"You can slice out a region that you are interested in from whatever BAM or VCF file. You can come along with SAMtools and you can work on the command line and you can connect directly to EGA and stream out your data. You can slice out a region, and do this all securely," Keane said.

EGA has integrated with the pan-European RD-Connect database to access raw genetic variation information on rare diseases. "You can pull back the raw and supporting sequencing reads for the particular variant that you're looking at without having to download the whole bulky files," Keane noted.
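The region slicing that Keane describes comes down to a client asking an Htsget server for one genomic interval, then stitching together the data blocks the server points to. The Python sketch below shows the shape of that exchange without contacting any server. The host name, record ID, and `fetch` callback are hypothetical, and the query parameter names and the layout of the server's reply follow the Htsget specification as understood here, so they should be checked against the published standard before use.

```python
from typing import Callable, Dict
from urllib.parse import quote, urlencode


def htsget_reads_url(base_url: str, record_id: str, reference_name: str,
                     start: int, end: int, fmt: str = "BAM") -> str:
    """Build a request URL for reads overlapping one region of one record."""
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{quote(record_id, safe='')}?{query}"


def assemble(ticket: dict, fetch: Callable[[str, Dict[str, str]], bytes]) -> bytes:
    """Concatenate, in order, the blocks listed in the server's JSON reply.

    `fetch` is supplied by the caller (for example, a thin HTTP wrapper) and
    receives each block URL plus any headers the server asked to be sent.
    """
    blocks = ticket["htsget"]["urls"]
    return b"".join(fetch(block["url"], block.get("headers", {})) for block in blocks)


# Hypothetical server and record, for illustration only.
print(htsget_reads_url("https://htsget.example.org", "sample1", "chr1", 1000, 2000))
```

Only the requested slice crosses the network, which is what lets tools work on a region without the whole file being downloaded.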

EGA also has a Refget server connected to the European Nucleotide Archive, according to Keane. Yates said that GA4GH is developing a "serverless" model of the API that will run on the Amazon Web Services cloud platform.

The final piece of the recent set of releases is WES, the Workflow Execution Service, which lets researchers run genomic informatics tools and workflows on data in various environments, including clouds and other remote installations. Through WES, users can submit requests to workflow execution systems, then monitor processes that are underway.

For this project, GA4GH decided to standardize and package workflows for deployment to any computing environment. "We can use WES to use exactly the same workflow in a number of different clinical settings. That could be really powerful," Saunders said.

"You can be confident that if I've built a particular workflow, I should be able to run that workflow wherever I want on whatever data I have and be confident I'll get the same answer," according to David Glazer, founder of Google Genomics, engineering director at Google-affiliated Verily, and co-chair of the GA4GH Cloud Work Stream. He also made his comments on the plenary livestream.

"The execution of those portable workflows in your preferred computing environment and your preferred storage environment, that's the point. That's the portability that we're looking for," Glazer added.

Through the WES API, a client can request that a workflow be run on specific data, pass in parameters for the workflow, check on its status, and cancel the run if necessary.
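Those four operations (submit, parameterize, check, cancel) can be illustrated with a small in-memory stand-in written in Python. It is a toy under stated assumptions, not the WES API: the class name, method names, state labels, workflow URL, and parameter names are all hypothetical, and nothing is actually executed.

```python
import uuid
from typing import Any, Dict


class ToyWorkflowService:
    """In-memory stand-in for a workflow execution service; runs nothing."""

    def __init__(self) -> None:
        self._runs: Dict[str, Dict[str, Any]] = {}

    def submit(self, workflow_url: str, params: Dict[str, Any]) -> str:
        """Request that a workflow be run with the given parameters."""
        run_id = uuid.uuid4().hex
        self._runs[run_id] = {
            "workflow_url": workflow_url,
            "params": dict(params),
            "state": "QUEUED",  # assumed state label
        }
        return run_id

    def status(self, run_id: str) -> str:
        """Check on a run that was submitted earlier."""
        return self._runs[run_id]["state"]

    def cancel(self, run_id: str) -> str:
        """Cancel a run that has not finished yet and return its state."""
        run = self._runs[run_id]
        if run["state"] in ("QUEUED", "RUNNING"):
            run["state"] = "CANCELED"
        return run["state"]


service = ToyWorkflowService()
run_id = service.submit(
    "https://example.org/workflows/align.cwl",       # hypothetical workflow
    {"reads": "s3://example-bucket/sample1.cram"},   # hypothetical input
)
print(service.status(run_id))
print(service.cancel(run_id))
```

Keeping the interface this small is what makes it portable: any execution environment that can accept a workflow reference and parameters, report a state, and stop a run could sit behind the same calls.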

"As we make it easy to take a workflow, take an analysis and run it the same way in all these different environments, with different policies, and different physical environments, that increases the pool of available, compatible, and comparable results that we can work with," Yates said. "That's why we want this ability for a tool builder to build their tool, have it run everywhere, and as a researcher, consume those results in a known, compatible way across [the whole process]."