NEW YORK — The choice of which human reference genome a lab uses could influence the variants called, an issue research and clinical labs need to keep in mind as they choose a reference.
The GRCh38 human reference genome came out more than seven years ago and filled in gaps, added alternate scaffolds, and made other updates, and new patches and updates are still being released for it. But not all labs have made the switch from the GRCh37 reference genome, also known as hg19, to GRCh38. One recent survey has suggested that most clinical labs are reluctant to make the switch, as they are not sure whether the benefit is worth the cost of changing workflows.
A new study has found that, for some regions of the genome, the choice of reference genome used to call variants from exome sequencing data may make a difference. Moez Dawood, an M.D./Ph.D. student at Baylor College of Medicine, and his colleagues uncovered a set of more than 200 genes, including ones implicated in Mendelian diseases, that are enriched for discordant variant calls.
"The implication in the entire field [that] moving up to 38 is just going to make things better is not quite what we saw," Dawood said, noting that in some cases the older reference was better than the newer one, but also the reverse. "You really need to be savvy about what you're looking at."
Still, the field is slowly updating its tools to work with GRCh38, though that changeover may be complicated by the release of the telomere-to-telomere human genome assembly.
"We're very quickly coming to a point where even as clinical labs migrate to 38, that there's a whole other frontier that's already live that they have to contend with," said Midhat Farooqi, the director of molecular oncology at the Center for Pediatric Genomic Medicine at Children's Mercy Kansas City.
For their American Journal of Human Genetics study, Dawood and his colleagues generated variant calls for 1,572 exome sequencing samples collected by the Baylor-Hopkins Center for Mendelian Genomics using both the GRCh37 and GRCh38 reference assemblies. While both references gave largely similar results, the researchers noticed that about 1.5 percent of SNVs and 2 percent of indels were discordant between the two reference genomes.
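To make the idea of a per-type discordance rate concrete, here is a minimal Python sketch. It is not the study's pipeline: it assumes the two call sets have already been brought into one shared coordinate system, it defines the rate as calls seen under only one reference divided by all distinct calls (the paper's exact denominator is not given here), and the function names and toy data are invented for illustration.

```python
from collections import Counter


def variant_type(ref, alt):
    """Classify a call as an SNV (single-base substitution) or an indel."""
    return "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"


def discordance_by_type(calls_a, calls_b):
    """Return {type: fraction of distinct calls seen under only one reference}.

    calls_a and calls_b are sets of (sample, chrom, pos, ref, alt) tuples.
    ASSUMPTION: both sets are already expressed in one shared coordinate system.
    """
    total = Counter()
    discordant = Counter()
    for call in calls_a | calls_b:
        kind = variant_type(call[3], call[4])
        total[kind] += 1
        # Discordant means the call is present in exactly one of the two sets.
        if (call in calls_a) != (call in calls_b):
            discordant[kind] += 1
    return {kind: discordant[kind] / total[kind] for kind in total}


# Toy call sets, invented for illustration only.
calls_grch37 = {
    ("S1", "chr1", 100, "A", "G"),
    ("S1", "chr1", 250, "C", "T"),
    ("S1", "chr2", 500, "AT", "A"),
    ("S2", "chr1", 100, "A", "G"),
}
calls_grch38 = {
    ("S1", "chr1", 100, "A", "G"),
    ("S1", "chr2", 500, "AT", "A"),
    ("S2", "chr1", 100, "A", "G"),
    ("S2", "chr3", 900, "G", "GA"),
}

rates = discordance_by_type(calls_grch37, calls_grch38)
```

In this toy data, one of three distinct SNV calls and one of two distinct indel calls appear under only one reference. A real comparison would also need to normalize how indels are represented before matching them.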
The discordant variants the researchers uncovered also tended to congregate in regions of the genome assemblies they dubbed discrete discordant reference patches, or DISCREPs. DISCREPs were themselves enriched for segmental duplications, alternate haplotypes, and known assembly issues and fix patches.
A set of 206 genes was also enriched for discordant calls, the researchers found. These genes included eight associated with known Mendelian phenotypes. Additionally, three known pathogenic or likely pathogenic variants were called differently depending on the assembly used, suggesting the choice of reference genome could influence the molecular diagnosis of some Mendelian disorders.
These findings are largely consistent with previous work from the US Food and Drug Administration's Huixiao Hong. About three years ago, he and his colleagues compared SNVs called by aligning whole-genome sequencing data to each of the two reference genomes and then converting the called SNVs from one reference's coordinates to the other's. They found that 1.5 percent of SNVs were discordantly converted between the two. He and his colleagues additionally traced the spots where these discordant calls arose to "difficult" parts of genome assemblies like repeat regions.
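The conversion step can be pictured with a toy example. The Python sketch below is an illustration under stated assumptions, not the FDA group's method: the `Block` table is a hypothetical stand-in for a real coordinate-mapping file, it ignores strand flips and chromosome renames that real conversion tools handle, and all positions are invented.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One aligned block between builds (hypothetical, not a real chain file)."""
    chrom: str
    src_start: int  # inclusive, source-build coordinate
    src_end: int    # exclusive, source-build coordinate
    dst_start: int  # target-build coordinate matching src_start


def convert(chrom, pos, blocks):
    """Map a source-build position to the target build, or None if unmapped."""
    for b in blocks:
        if b.chrom == chrom and b.src_start <= pos < b.src_end:
            return chrom, b.dst_start + (pos - b.src_start)
    return None


def classify(snvs_src, snvs_dst, blocks):
    """Count source SNVs that convert and match, convert but differ, or fail."""
    result = {"concordant": 0, "discordant": 0, "unmapped": 0}
    for chrom, pos, ref, alt in snvs_src:
        mapped = convert(chrom, pos, blocks)
        if mapped is None:
            result["unmapped"] += 1
        elif (mapped[0], mapped[1], ref, alt) in snvs_dst:
            result["concordant"] += 1
        else:
            result["discordant"] += 1
    return result


# Toy mapping and call sets, invented for illustration only.
blocks = [Block("chr1", 0, 1000, 50), Block("chr1", 2000, 3000, 2100)]
snvs_src = [
    ("chr1", 100, "A", "G"),   # maps to 150, which is also called in the target build
    ("chr1", 2500, "C", "T"),  # maps to 2600, but the target build has its call at 2601
    ("chr1", 1500, "G", "A"),  # falls in a gap between blocks
]
snvs_dst = {("chr1", 150, "A", "G"), ("chr1", 2601, "C", "T")}

summary = classify(snvs_src, snvs_dst, blocks)
```

Here one SNV converts cleanly, one lands at a position the other build's call set does not contain, and one cannot be converted at all. The last two are the kinds of cases that tend to accumulate in repeat-rich regions.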
"The findings from both studies suggest caution is needed when translating identified variants between different versions of the human references in scientific research and clinical labs," the FDA's Hong said in an emailed statement.
The US National Center for Biotechnology Information's Valerie Schneider noted that while the community has long suspected there might be differences in variant calling, the relatively low number of sites reported in the study might also provide reassurance.
Dawood added that both references are affected by discordant calls. He suggested that researchers, especially those focusing on the 206 genes they identified with an enrichment of discordant calls, make their reference choice based on which one better identifies variants in their genes of focus, a plan of action echoed by Schneider.
But, he said, it is a trickier call for clinical labs doing high-throughput exome testing. There, he suggested taking care with clinical annotations involving those 206 genes.
"The implication is don't just blindly move up to 38. There are things to think about. There are times when 19 is out-performing 38," he said.
Still, in his paper, Hong recommended that GRCh38 be used for sequencing-based SNV analysis. Lisa Lansdon, a laboratory genetics and genomics fellow at Children's Mercy, likewise finds GRCh38 to be the better option. "Stemming from our work and from some of the work that is now being published, like this [AJHG] paper, it's becoming more and more apparent that 38 is the stronger build," Lansdon said. She cautioned that "there are still discrepancies in 38 as well. But there are fewer."
However, as she, Farooqi, and their colleagues reported earlier this year in the Journal of Molecular Diagnostics, many labs have yet to make the change to the newer GRCh38 human reference genome, and their own lab has not fully switched either. They got the idea for the survey as they faced the challenges of changing over themselves.
"That prompted the question of: Well, how many labs have switched over and if they have or haven't switched over — we were suspecting that they had not — what were some of the things that were keeping them from doing so?" Farooqi said.
According to their survey of about two dozen clinical labs offering next-generation sequencing-based testing, only 7 percent had already moved over to GRCh38, and most, 54 percent, had no plans to change. Most commonly, labs said they did not think the benefits of changing over outweighed the costs in both time and money. Clinical labs, Farooqi noted, must revalidate their pipelines as well as realign their existing clinical data to GRCh38 for their internal variant database, a time-consuming process.
Still, a change might be coming. Both Lansdon and Farooqi noted that more tools that labs rely on for variant interpretation — like gnomAD and DECIPHER — now also support GRCh38 coordinates. Farooqi additionally pointed out that research labs are more likely to have made the switch, which will also lead the literature to switch and could, in turn, spur clinical labs to make the changeover.
"It's coming to a point where you cannot avoid the problem anymore," he said.
At the same time, though, researchers led by the National Human Genome Research Institute's Adam Phillippy reported in a preprint posted to bioRxiv in May that they generated the first telomere-to-telomere human genome sequence. This sequence, they said, was an even better representation of the human genome than GRCh38.
Uptake of that sequence as a reference could likewise be slow, Lansdon said, if the tools are not there. "What I anticipate is that you'll see research laboratories starting to use it, paving the way for the clinical laboratories," she said. "Eventually when there's enough support, yes, absolutely, I think that some of the clinical laboratories will start to adopt that as well."