CHICAGO – A group at Spain's National Center for Genomic Analysis-Center for Genomic Regulation (CNAG-CRG) in Barcelona has harnessed a protocol for accessing sequencing and variant data to help assess potentially pathogenic genetic variants within the context of a European Union-funded program to improve diagnosis of rare diseases.
The CNAG-CRG researchers have built a bioinformatics platform based on HTSget, an application programming interface (API) created several years ago by the Global Alliance for Genomics and Health (GA4GH) standards body, to retrieve "slices" of genome and exome alignments from the European Genome-Phenome Archive (EGA) and other hosts. They can then run visualization software on these much smaller datasets in search of matching variants.
"Our work highlights the impact of developing and implementing interoperability standards, which will be essential for the establishment of large, federated genomics data networks," the CNAG-CRG team explained in a paper published in Cell Genomics this month.
"As a result, it is no longer necessary for over 11,000 datasets to download large alignment files to visualize them locally," the researchers added.
"You don't need to go through the process of having to move the data into specific buckets to pull down a whole genome, but rather just access specifically the part of the exome that you want to access," said corresponding author Sergi Beltran, head of bioinformatics and data analysis at CNAG-CRG.
In the experiment described in the paper, the mean time for accessing a region of an exome or genome and then producing a visualization of the alignment was about 23.9 seconds. Only one of 864 requests, or roughly 0.1 percent, failed.
Beltran called HTSget the "glue" between the RD-Connect Genome-Phenome Analysis Platform (GPAP) and the EGA, an archive that the Barcelona center codeveloped with the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI). One of the listed authors of the Cell Genomics paper, Alexander Senf, formerly of EMBL-EBI, was on the team that developed HTSget.
The work described in the paper is a small piece of Solve-RD, a research consortium involving participants from 21 organizations in Western and Central Europe and the US that combines clinical expertise with genomics and other technologies into a "genetic knowledge web" of genes, variants, and phenotypes in pursuit of better diagnosis of rare diseases. Solve-RD started in January 2018 with €15.4 million ($16.9 million) in funding from the European Union's Horizon 2020 program and is set to expire in June after a six-month extension.
Earlier platforms such as the Database of Genomic Variation and Phenotype in Humans using Ensembl Resources (DECIPHER) cannot visualize patient-specific genomic alignments, according to CNAG-CRG. "Our implementation enables both … the inspection of a causative variant from a solved case and comparing it with a current case under investigation and … the investigation of genomic variants or regions of interest for interpretation purposes," the authors wrote.
Beltran described Solve-RD as something of a "parallel project" to Care4Rare-Solve in Canada. He and Care4Rare-Solve principal investigator Kym Boycott have worked together on several publications.
Solve-RD relies on RD-Connect GPAP, informatics software developed at CNAG-CRG to manage the processing, analysis, and sharing of genomic and phenotypic data. The Cell Genomics paper described GPAP and the EGA as "two key components" of Solve-RD's technical infrastructure.
The GPAP database contains phenotypes and genotypes of more than 26,500 patients and close relatives. Rather than storing large genomic alignments on its own cloud, it downloads "genomic slices" in 1-gigabyte chunks from repositories such as the EGA as necessary to fulfill specific research queries.
The Barcelona team developed a custom genome browser module within RD-Connect GPAP that contains an embedded version of the Broad Institute's Integrative Genomics Viewer (IGV) visualization tool. They said in the paper that this software combination has been tested and validated on about 11,750 datasets.
A GPAP user with a variant of interest can click on an IGV link within the GPAP interface to request a visualization of the appropriate genomic alignments from the EGA. EGA servers locate corresponding BAM or CRAM alignment files, then, following the HTSget protocol, return a slice of the alignment for visualization within the GPAP environment.
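For readers curious about what such a request looks like, the following is a minimal Python sketch of the client side of an htsget exchange, not the GPAP or EGA implementation. It assumes the publicly documented GA4GH conventions: a "reads" endpoint that takes format, referenceName, start, and end parameters, and a JSON "ticket" response listing the data blocks to retrieve. The server address, the record identifier, and the sample ticket are hypothetical placeholders, and the sketch makes no network calls.

```python
import base64
import json
from urllib.parse import urlencode


def build_reads_url(base_url, record_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget 'reads' request for one region.

    Coordinates follow the htsget convention: 0-based start, exclusive end.
    """
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{record_id}?{query}"


def plan_fetches(ticket):
    """Turn an htsget ticket into an ordered list of fetch steps.

    Inline 'data:' URIs are decoded immediately; remote URLs are returned
    together with the headers (for example a byte Range) the client must send.
    """
    steps = []
    for entry in ticket["htsget"]["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            meta, _, payload = url.partition(",")
            if not meta.endswith(";base64"):
                raise ValueError("this sketch only handles base64 data URIs")
            steps.append(("inline", base64.b64decode(payload)))
        else:
            steps.append(("remote", url, entry.get("headers", {})))
    return steps


# Hypothetical ticket, shaped like the examples in the htsget specification.
SAMPLE_TICKET = json.loads("""
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {"url": "data:application/vnd.ga4gh.bam;base64,QkFNAQ==", "class": "header"},
      {"url": "https://blocks.example.org/sample/run1.bam",
       "headers": {"Range": "bytes=65536-1003750"},
       "class": "body"}
    ]
  }
}
""")

if __name__ == "__main__":
    # Hypothetical server and record ID, used only for illustration.
    print(build_reads_url("https://htsget.example.org", "RECORD_ID", "chr1", 1000, 2000))
    for step in plan_fetches(SAMPLE_TICKET):
        print(step[0])
```

A real client would then download each remote block with the listed headers and concatenate all blocks in order, yielding a small, valid alignment file that a viewer such as IGV can load in place of the full BAM or CRAM.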
GPAP, which is recognized by the International Rare Diseases Research Consortium as a "privacy-preserving environment" for data analysis, lets authorized users search and prioritize genetic variants according to characteristics including sequencing coverage, known effects, expected pathogenicity, population frequency, and disease associations. Through its connection to the MatchMaker Exchange platform, GPAP also permits researchers to find and contact consented patients with the same disease, variants, or phenotypes.
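As a rough illustration of what prioritizing variants by such characteristics can involve, the Python sketch below filters on sequencing coverage and population frequency, then ranks what remains by a predicted pathogenicity score. It is not GPAP's code or data model: the field names, thresholds, score scale, and gene labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str               # placeholder gene label
    coverage: int           # read depth at the variant site
    population_freq: float  # allele frequency in a reference population
    pathogenicity: float    # hypothetical predicted score, 0 (benign) to 1


def prioritize(variants, min_coverage=10, max_freq=0.01):
    """Keep well-covered, rare variants and rank them by predicted pathogenicity."""
    kept = [
        v for v in variants
        if v.coverage >= min_coverage and v.population_freq <= max_freq
    ]
    return sorted(kept, key=lambda v: v.pathogenicity, reverse=True)


# Made-up records for illustration only.
EXAMPLE_VARIANTS = [
    Variant("GENE_A", coverage=35, population_freq=0.0002, pathogenicity=0.91),
    Variant("GENE_B", coverage=4, population_freq=0.0001, pathogenicity=0.99),
    Variant("GENE_C", coverage=50, population_freq=0.12, pathogenicity=0.80),
    Variant("GENE_D", coverage=28, population_freq=0.001, pathogenicity=0.95),
]

if __name__ == "__main__":
    for v in prioritize(EXAMPLE_VARIANTS):
        print(v.gene, v.pathogenicity)
```

In this toy example, GENE_B is dropped for low coverage and GENE_C for being too common, which leaves GENE_D ranked ahead of GENE_A.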
"The problem we are trying to tackle is not so much the visualization itself. We have not done anything novel in terms of how we show the information," Beltran said. "It's rather how we can avoid having to transfer very, very large amounts of data to do the computations and storing this information several times."
He said that the work described in the paper is effectively a demonstration that the process could be applied at a larger scale. "It's functional, it can be used for real-world data, and, in principle, even diagnosis," Beltran said.
The CNAG-CRG bioinformaticians wrote in their paper that a preliminary reanalysis of the first 4,400 Solve-RD cases in mid-2021, which drew on this visualization of alignment slices, produced 255 new diagnoses. In one example, they described how visualizing genomic alignments for a TRIP4 variant resulted in the diagnosis of cerebellar hypoplasia and spinal muscular atrophy.
Although the integration has contributed to real-world diagnoses, Beltran said that it is still a research project. "Obviously, we can't call ourselves a diagnosis service," he said, since clinicians decide how to use the information generated, making it essentially a form of clinical decision support.
At least for the processes described in the Cell Genomics paper, Beltran and colleagues are not looking for CE-IVD marking, though they do want to develop their technology to a point where it can be integrated with routine clinical practice.
Beltran said that Solve-RD has three major activities: data reanalysis, generation of new multiomics data, and validation of novel candidate genes.
Solve-RD has collected more than 20,000 existing exome and genome sequences from undiagnosed rare disease patients and close relatives, and is also generating new genomes, exomes, metabolomes, transcriptomes, and epigenomes, according to Beltran.
"Obviously, all these have to be backed up by an informatics infrastructure," he said.
A key goal of Solve-RD is to "comprehensively reanalyze … inconclusive exomes and genomes from undiagnosed patients submitted by partnering European Reference Networks and undiagnosed disease programs from Spain and Italy," according to the Cell Genomics paper. "Therefore, one of the main challenges facing Solve-RD is the ability to effectively collect, store, process, share, and interpret vast quantities of data, provided by more than 51 different centers across Europe, within a secure and collaborative environment."
The Barcelona bioinformaticians are particularly focused on detecting variants in "repetitive regions of the genome where properly identifying short insertions, deletions, and copy number variants is more challenging," the authors wrote.
The EGA did have to overcome two challenges when implementing HTSget. For one thing, the API initially was not compatible with CRAM files or with older versions of BAM files, though GA4GH subsequently fixed these shortcomings. Developers also struggled with how and where to decrypt data when moving specific alignment regions.
The authors also said that Solve-RD could theoretically apply the GPAP technology and HTSget link to RNA sequencing alignments in the context of transcriptomics. Beltran said that work has not started on other omics.