This story has been updated to include comments from another AMP session.
CHICAGO – The Genome Reference Consortium's GRCh38 reference genome — also known as hg38 — superseded GRCh37 (hg19) in 2013, yet many clinical molecular laboratories still rely on the older one for variant calling and genome analysis. Lurie Children's Hospital of Chicago, though, is among those that have successfully made the jump.
"In short, hg38 is a better assembly of the genome," Sabah Kadri, director of bioinformatics at Lurie Children's, said this week during the Association for Molecular Pathology's (AMP) virtual annual meeting.
The newer assembly corrects sequencing errors from hg19 and has more coverage for centromeres, but mostly, its alternative loci make it better, Kadri said. She noted that many research labs have moved to hg38, as have popular databases, notably version 3 of the Genome Aggregation Database (gnomAD), but clinical labs have been slow to switch.
Information technology is a key limiting factor, though fear of the unknown might be just as much of a reason, she suggested.
"This is a problem that I know many labs are hesitant to tackle," Kadri, a computational biologist in the hospital's pathology department, said. "Not knowing what the changes are and how they would specifically affect your [next-generation sequencing] assay or your processes in the clinical lab … makes people afraid to make this leap."
She said that labs often have four questions about making such migrations. How much effort does the move require, and does the lab have the capability to put in the effort? How would the migration affect the lab's specific tests, results, databases, and clinical reports? Are the informatics systems in the lab capable of handling the assembly changes? What changes must be made to the bioinformatics pipeline?
Kadri said that Lurie Children's was at this point a little more than a year ago, but successfully completed the move of its 4,700-gene medical exome and related panels for germline testing to hg38 in July. At the time, the hospital was changing its in-house sequencing platform and thought that it would be the right time to migrate to hg38 because it had to revalidate the entire assay menu anyway.
She said that the process applies to somatic testing as much as germline assays.
To make the change, bioinformaticians and pathologists evaluated how the genomic sequence changed in hg38 for their specific clinical territory, how the gene ontology changed, and whether there were effects on variant calling or annotation. "All three of these can have effects on your bioinformatics pipeline and processes," Kadri explained, adding that changes to genome sequences and ontology can also affect assay design.
There were four stages to the migration, with overlap between them: understanding implications of assay design, changes to the genome sequence, changes to the bioinformatics pipeline, and changes to lab informatics, including variant databases. "We were not that systematic from the start," Kadri admitted.
Assay migration design could happen in two ways: conduct tests in hg19 and convert the variants later, or just do everything in hg38. Lurie Children's, an affiliate of Northwestern University Feinberg School of Medicine, chose the latter approach.
"The most important thing is to check whether gene or exon boundaries have changed," Kadri advised. "Don't let this intimidate you." It also is critical to ensure proper alignment to know if spike-in probes cover these new regions.
Probes and exons should be mapped to hg38 to understand the changes. Though Lurie Children's did not have to change its assay design or add any spike-ins, Kadri said that the analysis helped the lab understand the differences in the two assemblies because these differences can affect results downstream.
For some genes, the Broad Institute's Integrative Genomics Viewer (IGV) annotations of the Reference Sequence (RefSeq) database may not match the latest version, which also may not match an earlier hg19 version used to create a given assay, Kadri noted. "We learned the hard way [that] RefSeq annotations in hg19 actually do get updated with annotations in hg38," she said.
"Do not use IGV blindly. Use your own files that are used by your pipeline, or you might end up missing some of these changes," Kadri advised.
Of the 4,700 genes that the Lurie researchers reviewed, they saw exonic changes in about 200 that did not have probes in the hospital's medical exome test. Most were not on the clinical panels, and for those that were, they decided to backfill with Sanger sequencing data. There were 220 additional genes included in the panels with new exonic territory, but since there were probes, Lurie did not have to make any adjustments there, Kadri said.
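The kind of check described above, finding exonic territory that no probe covers, can be sketched in a few lines. This is only an illustration, not the Lurie team's code: the function names, the half-open coordinate convention, and the example coordinates are all assumptions, and both exons and probes are assumed to be expressed in the same assembly already.

```python
from typing import List, Tuple

# (chromosome, start, end); half-open coordinates, all in one assembly (assumed)
Interval = Tuple[str, int, int]


def merge(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching (start, end) intervals."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered(exon: Interval, probes: List[Interval]) -> List[Tuple[int, int]]:
    """Return the stretches of an exon that no probe overlaps."""
    chrom, start, end = exon
    same_chrom = [(s, e) for c, s, e in probes if c == chrom]
    gaps: List[Tuple[int, int]] = []
    cursor = start  # leftmost exon position not yet known to be covered
    for p_start, p_end in merge(same_chrom):
        if p_end <= cursor:
            continue  # probe lies entirely before the uncovered part
        if p_start >= end:
            break  # probe lies entirely after the exon
        if p_start > cursor:
            gaps.append((cursor, p_start))
        cursor = max(cursor, p_end)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Hypothetical coordinates, for illustration only
exon = ("chr1", 100, 200)
probes = [("chr1", 90, 120), ("chr1", 150, 180), ("chr2", 100, 200)]
print(uncovered(exon, probes))
```

An exon that returns an empty list is fully covered. Any other result marks territory that would need a spike-in probe or an orthogonal method such as Sanger sequencing.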
"We found very little change in the exonic landscape between hg19 and hg38," Andrew Skol, a bioinformatician and statistical geneticist at Lurie Children's, said during another AMP session.
In the second stage of understanding the effects at the genomic sequence level, the hospital started with pipeline resource files, mapped them to hg38, then evaluated the changes. "We realized that it's not as straightforward as that," Kadri said. "There are a lot of nuances in these mappings."
Most gene intervals will map perfectly, though. For those that mapped with differences, the team examined the discrepancies and created a mapping toolkit called the Reference Genome Mapper, or ReGe, which uses the National Center for Biotechnology Information's Genome Remapping Service (NCBI Remap) and lets others take a similar look at their specific genes of interest.
The ReGe toolkit categorizes and organizes the remap at the gene level, performing a coverage analysis to make it easier to track and evaluate the effects of the conversion. "It points out where attention needs to be paid," Skol explained, adding that his team is also developing a web app for simple searches.
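To make the idea of gene-level categorization concrete, here is a minimal sketch of how remapped intervals might be sorted into buckets and rolled up per gene. It is not ReGe's implementation, which the sessions did not describe in detail. The category names, the `Remap` record, and the gene labels are hypothetical, and the input is assumed to be already parsed from a remapping tool's output.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple


class Remap(NamedTuple):
    gene: str
    source_length: int         # interval length in the old assembly
    target_lengths: List[int]  # lengths of the pieces found in the new assembly


def categorize(r: Remap) -> str:
    """Assign one remapped interval to a (hypothetical) category."""
    if not r.target_lengths:
        return "unmapped"
    if len(r.target_lengths) > 1:
        return "split"
    if r.target_lengths[0] == r.source_length:
        return "same_length"  # length preserved; the sequence may still differ
    return "length_changed"


def summarize(remaps: List[Remap]) -> Dict[str, Counter]:
    """Count categories per gene."""
    by_gene: Dict[str, Counter] = defaultdict(Counter)
    for r in remaps:
        by_gene[r.gene][categorize(r)] += 1
    return dict(by_gene)


def needs_review(summary: Dict[str, Counter]) -> List[str]:
    """Genes with at least one interval that did not map cleanly."""
    return sorted(g for g, counts in summary.items()
                  if set(counts) - {"same_length"})


# Hypothetical records, for illustration only
remaps = [
    Remap("GENE_A", 100, [100]),
    Remap("GENE_A", 50, [50]),
    Remap("GENE_B", 80, [75]),
    Remap("GENE_C", 60, []),
    Remap("GENE_C", 40, [20, 20]),
]
print(needs_review(summarize(remaps)))
```

A summary like this narrows manual review to the genes that did not map cleanly, though an interval that keeps its length can still carry sequence changes and may need a separate check.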
In the bioinformatics stage of conversion, the Lurie team evaluated data quality and considered changes to the bioinformatics pipeline.
To speed up the process, they ran the differences between hg19 and hg38 through the ReGe toolkit to predict the regions that might have mapping issues. Two reference controls from the Genome in a Bottle (GIAB) consortium helped improve accuracy, Kadri said.
She said that the "million-dollar question" is whether the alternative loci should be incorporated into the hg38 implementation, which Lurie decided not to do. "Our bioinformatics is just not sophisticated [enough] yet to manage the alignment quality issues that accompany these loci," Kadri said.
Lurie chose to tweak its bioinformatics pipeline to add a multi-mapping module in an effort to improve data quality for variant calling. "It's not a perfect solution for sure, but it definitely helps us flag these regions, and then we do confirmation using an orthogonal assay," Kadri said.
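Kadri did not describe how the multi-mapping module works, so the following is only one plausible way to flag such regions, not Lurie's method. It assumes that per-read mapping qualities have already been collected for each target region, and it relies on the convention, followed by many aligners, of giving a mapping quality of 0 to reads that align equally well to more than one place. The function name, region labels, and 20 percent threshold are hypothetical.

```python
from typing import Dict, List


def flag_regions(mapqs_by_region: Dict[str, List[int]],
                 max_ambiguous_fraction: float = 0.2,
                 ambiguous_mapq: int = 0) -> List[str]:
    """Return regions where too many reads have ambiguous placement.

    A read counts as ambiguous when its mapping quality is at or below
    `ambiguous_mapq`. Regions with no reads are flagged too, because
    their quality cannot be assessed.
    """
    flagged: List[str] = []
    for region, mapqs in mapqs_by_region.items():
        if not mapqs:
            flagged.append(region)
            continue
        ambiguous = sum(1 for q in mapqs if q <= ambiguous_mapq)
        if ambiguous / len(mapqs) > max_ambiguous_fraction:
            flagged.append(region)
    return sorted(flagged)


# Hypothetical mapping qualities, for illustration only
example = {
    "exon_1": [60, 60, 60, 60],
    "exon_2": [0, 0, 60, 60],
    "exon_3": [],
}
print(flag_regions(example))
```

Flagged regions would then go to confirmation by an orthogonal assay, as Kadri described, rather than being reported from NGS calls alone.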
In the final stage of migration, Lurie is looking at how the move is affecting variant annotation. "Your variant annotation changes are going to really depend on your tertiary analysis system," she said. A custom lab system might require more changes than an off-the-shelf installation. Lurie uses Alamut commercial software and has had few issues, she said.
This also could affect the laboratory information management system (LIMS), though. As long as the genome assembly is tracked as a tabled entry, no database updates were needed, Kadri said. However, the Chicago hospital's LIMS automatically designs primers for Sanger confirmation at the time a variant is selected. Because this design was built with hg19, Lurie had to implement a new hg38 module.
For copy number calling, Lurie has a single piece of software for NGS and microarrays, but arrays are still being processed in hg19 and NGS in hg38, so management of this process requires two separate databases.
Justin Zook, leader of the human genomics team at the US National Institute of Standards and Technology's Material Measurement Laboratory, said during the same AMP session that Lurie took a proper approach with overlapping systems because a single reference does not allow for proper alignment of reads. Some genes changed copy numbers from GRCh37 to GRCh38, for example.
GRCh38 uses a different assembly model with 261 alternative loci, representing many haplotypes. "This helps to correct some issues with GRCh37," Zook said. These removed false gaps that caused problems with read alignment, though some analytics software has not yet caught up.
"It's actually quite hard with current tools to use these alt loci and annotate variants on them," Zook said.
Zook, who coleads the Genome in a Bottle Consortium (GIAB), gave an overview of past, current, and future reference genomes for variant calling that continue to advance the field. He noted that the Telomere-to-Telomere (T2T) consortium and the Human Pangenome Reference Consortium are developing new reference assemblies.
He suggested that a combination of GRCh38 and the T2T genome assembly may be better than GRCh38 alone in some instances. T2T is an international team led by investigators at the University of California, Santa Cruz, and the National Human Genome Research Institute.
A planned update from GRC called GRCh39 has been "indefinitely postponed," Zook said, while the consortium evaluates new models and sequence content for the human reference assembly.
Meanwhile, T2T is developing the first nearly complete sequence of a human genome, according to Zook. The consortium released version 1.0 of this assembly in September, expanding coverage of centromeres and heterochromatin.
"This is a really big advance in a new assembly," Zook said. "The idea is that maybe this could be another reference that you could use."
He said that the T2T release makes improvements over GRCh38 in terms of variant calls. For example, GRCh38 is missing one copy of GPRIN2. A new reference, from cell line CHM13, provides better coverage in that region, particularly with longer Pacific Biosciences HiFi reads. It also corrects mismapped reads in the existing copy of GPRIN2.
In some cases, though, GRCh37 and T2T may be better than GRCh38 because the latter has extra copies of CBS and KCNE1, which can lead to mismapped reads, even with long reads. "It's useful to view any reference with skepticism," Zook said.