NEW YORK (GenomeWeb) – The Data Working Group of the Global Alliance for Genomics and Health (GA4GH) recently released a new version of an application programming interface it developed to help researchers share and access genomic data.
GA4GH's data working group is the arm of the international alliance focused on establishing and enhancing open standards and formats for storing and representing data, as well as an API that will connect analysis tools to data.
In the last year, the team has been working on a so-called Genomics API, which is built on file formats that have been developed over the last five years for large-scale genomic sequencing projects. The API features cleaner models, an easy-to-use data description schema, and a web-enabled interface. The latest incarnation, dubbed version 0.5, builds on version 0.1 and adds a series of features not available in its predecessor. For example, the earlier API provided definitions for read data but lacked definitions for variants and sequence alignment, and these are now available in 0.5, David Haussler, a professor of biomolecular engineering at the University of California, Santa Cruz, told BioInform. He is also the co-chair and co-founder of the data working group.
The data working group eventually plans to launch a full version of the API that will include more features that are either in development or to be developed, but hasn't yet set a date. However, it is making early versions of the API available for researchers to use to access and exchange information across remote sites.
Planned features for the official release of version 1.0 of the API include expanded descriptions of different kinds of variants, Haussler said. At present, the API only describes mutations such as SNPs and small insertions and deletions, he said, but "we are working towards describing more complex structural rearrangements, such as inversions and segmental duplications." Currently, members of the community use multiple methods to describe these types of variants so "we're trying to standardize that a bit more." The group is also mulling improvements to the reads module of the API, including more standardized ways of describing objects generated by sequencing experiments, as well as methods of extracting and exchanging that information, he said.
The data working group plans to develop and add new modules for the genomics API. One of these will deal with the metadata, providing standardized ways of recording and exchanging information associated with genetic data such as who generated the data, what the initial sample was, and what sequencing instrument was used to generate the data. This is one of the trickier tasks because users will not all be interested in the same sorts of metadata and yet it isn't feasible to capture every bit of information tied to genetic experiments. "It is always the most gruesome task to sort through the metadata [but] we're trying to make that easier," Haussler said. Another module will enable users to access and exchange data from gene expression experiments.
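To make the metadata idea concrete, here is a minimal sketch of what a standardized record covering the three items mentioned above (who generated the data, the initial sample, and the sequencing instrument) might look like. It is purely illustrative: the class name, field names, and example values are hypothetical and are not taken from the GA4GH API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SequencingMetadata:
    """Hypothetical metadata record; field names are assumptions, not GA4GH's."""
    generated_by: str   # who generated the data
    sample_id: str      # what the initial sample was
    instrument: str     # which sequencing instrument was used


# Build a record with placeholder values.
record = SequencingMetadata(
    generated_by="Example Lab",
    sample_id="sample-001",
    instrument="example-sequencer",
)

# Serialize to JSON so the record can be exchanged between sites,
# then rebuild it on the receiving side.
payload = json.dumps(asdict(record), sort_keys=True)
restored = SequencingMetadata(**json.loads(payload))
```

A fixed set of fields like this captures only part of what any one user might want, which is the trade-off the working group describes.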
Since it was first made available, version 0.1 of the API has been used by groups at the European Bioinformatics Institute, the US National Center for Biotechnology Information (NCBI), Google, Genome Savant, and Harvard Medical School's Biomedical Cybernetics Laboratory to power a growing community of applications. Researchers at Harvard Medical School, for instance, are using the API to enable apps for the TBResist initiative "that bridge from raw sequence data to clinically useful phenotypes," Gil Alterovitz, a faculty member at Harvard Medical School and director of the Biomedical Cybernetics Laboratory, said in a statement. "Also, the Substitutable Medical Applications and Reusable Technology Genomics platform is using the Global Alliance interface to enable interoperability between electronic medical record information (HL7) and raw genetic sequence information." Alterovitz is a member of the alliance's data and clinical working groups.
TBResist is a global effort to collect and sequence the complete genomes of multiple strains of drug-resistant tuberculosis from different geographical locations. The goal is to gain a better understanding of the disease and associated co-morbidities and to develop better methods and tools for disease control. Alterovitz told BioInform that researchers involved in the effort are using the Genomics API, as well as the API to the SMART platform, to access and combine clinical and genomic information in "a more holistic" and "efficient" way to better understand drug resistance in TB and potentially develop new therapies to fight an emerging global threat. "TB is one example of an infectious organism but this [is a] platform that can be used for other bacteria and infectious organisms as well," he said.
As analysis tools adopt the new API, researchers will be able to extend their own infrastructure to utilize cloud resources, such as those available from Amazon Web Services, Google Cloud, and Microsoft Azure. David Glazer, co-chair of the data working group's Reads Task Team and engineering director for Google Cloud platform and Google Genomics, said in a statement that "Google already supports Version 0.1 of the API, and we'll be adding support for Version 0.5 soon."
Matt Wood, Amazon Web Services' general manager of data science, said that his firm views "these new APIs as a vital component for collaboration and development of next-generation tools that can run cost-effectively at massive scale." AWS is "proud to support [GA4GH's] efforts, and help in defining new operating models, such as the latest Genomics API," he added.
Besides its work on the Genomics API, GA4GH's data unit also has a hand in three alliance driver projects. One of these is the Matchmaker Exchange, which has the two-fold purpose of simplifying the process of searching for and using information on genes and phenotypes, and connecting researchers with shared interests or similar cases. About 40 scientists from research consortia and clinical sequencing centers, including Harvard Medical School, the University of Toronto, and the Broad Institute, are involved in developing the Exchange.
A second alliance initiative is the Beacon project, which is described as "a project to test the willingness of international sites to share genetic data in the simplest of all technical contexts." As Haussler explained it, beacons are servers that institutions install locally and that external users can query for simple information about the genomic data available at the site. These queries take the form 'do you have a genome that has a T at a specific position on a specific chromosome?' and the server responds 'yes' or 'no.'
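The query-and-answer pattern Haussler describes can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not the Beacon project's actual interface: the `BeaconQuery` and `Beacon` names, their fields, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconQuery:
    """Hypothetical query: is this allele seen at this position on this chromosome?"""
    chromosome: str
    position: int
    allele: str


class Beacon:
    """Toy beacon holding the alleles observed at one site (an assumption for illustration)."""

    def __init__(self, observed):
        # observed: iterable of (chromosome, position, allele) tuples
        self._observed = {(c, p, a.upper()) for c, p, a in observed}

    def answer(self, query: BeaconQuery) -> str:
        # The beacon reveals only whether a match exists, nothing about
        # which genome or individual carries the allele.
        key = (query.chromosome, query.position, query.allele.upper())
        return "yes" if key in self._observed else "no"


# Placeholder data for a single site.
site = Beacon([("1", 1000, "T"), ("2", 2500, "G")])

hit = site.answer(BeaconQuery(chromosome="1", position=1000, allele="T"))
miss = site.answer(BeaconQuery(chromosome="1", position=1000, allele="C"))
```

Here `hit` is "yes" and `miss` is "no"; the response carries no other detail about the underlying data.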
It's a simple service that opens up an opportunity for institutions to discuss their current policies and regulations that govern the sharing of genomic data and explore ways to share data more broadly in a secure fashion, Haussler said. "We want to cultivate an atmosphere in which genomic information is shared as freely as possible while respecting the privacy of the individuals involved who are sequenced," he said. "The obstacles to that are not so much technical as they are social," he said, adding that it is important "to have a very simple technical platform that highlights the social problems … to get that dialogue going. That's what the Beacon is. It's not meant to completely revolutionize scientific research in genomics but to get the dialogue going so that people can actually ask the question 'why can't we share data?'" So far, groups at the Wellcome Trust Sanger Institute, the NCBI, the EBI, and others have set up beacons.
The data working group is also involved in the data sharing efforts of two cancer-centric initiatives. The first of these is the International Cancer Genome Consortium's Pan Cancer Genome Analysis project, an effort to analyze data from 2,000 whole tumor genomes. The second project aims to collect and share information about mutations in the BRCA1 and BRCA2 genes with an eye towards identifying the pathogenic changes within these genes that are associated with cancer, primarily breast and ovarian.
The Genomics API is one of the first products to be developed and distributed by GA4GH. While the data working group has a core developer team, it is open to input from researchers at any institution. They can explore sample apps, build implementations from scratch or from existing samples, and provide feedback on the API and its documentation, the team said.