BALTIMORE – The Genome in a Bottle Consortium has added a benchmarking dataset for 273 medically relevant autosomal genes that were previously excluded from its reference data due to their repetitiveness or polymorphic complexity.
The curated benchmark set contains 17,000 single-nucleotide variations, 3,600 insertions and deletions, and 200 structural variations for human genome reference GRCh37 and GRCh38. The report also identified false duplications in these assemblies that, when masked, can improve variant recall.
The new benchmarks can help clinical labs better detect and validate pathogenic variants that were previously overlooked, according to experts in the field, aiding disease diagnosis and further research into these medically relevant genes.
To build its benchmark dataset, the GIAB team generated a haplotype-resolved whole-genome assembly using long-read sequencing data from Pacific Biosciences technology. Published Monday in Nature Biotechnology, the report was spearheaded by scientists from the National Institute of Standards and Technology, DNAnexus, and Baylor College of Medicine, along with collaborators.
A worldwide public-private consortium led by NIST, GIAB characterizes human genomes in order to provide a reference standard and reference methods for researchers and clinical laboratories. So far, the group has focused on seven genomes, including a pilot genome and two family trios. In June 2020, the consortium released structural variant calling benchmark data, also in Nature Biotechnology. In the new study, to further complete GIAB's reference data, researchers targeted 395 challenging genes that are medically relevant but were not well resolved by GIAB's previous benchmarks.
Among them, there were highly homologous genes such as SMN1 and SMN2, mutations in which can result in spinal muscular atrophy; NCF1, NCF1B, and NCF1C, which are associated with chronic granulomatous disease; and genes with structural variants like RHCE, which is linked to blood disorders. Additionally, there were genes associated with cardiovascular diseases, cancer, and other genetic disorders.
According to Fritz Sedlazeck, associate professor at the Human Genome Sequencing Center at Baylor College of Medicine and a senior author of the paper, the almost 400 genes included in the study are "particularly tricky" because of their high repetitiveness and polymorphism. As a result, they are difficult to analyze with short-read sequencing since "you don't know if a segment that you just sequenced falls into this one region or those other regions," he explained.
"One little bit of criticism [of GIAB] sometimes is that we are shying away from these tricky regions [of the genome]," Sedlazeck said. Focusing on this core set of challenging genes can not only improve people's understanding of their polymorphism but also enhance method development to further characterize these genes in clinical diagnostics, he added.
For their study, the team adopted a de novo assembly method called hifiasm using Pacific Biosciences HiFi reads. "Independent of using a reference genome, this method is able to stitch together the long reads into the two haplotypes that every person has," said Justin Zook, a researcher at NIST and a GIAB coleader who is also a senior author on the paper.
After aligning the de novo assemblies to the reference for variant calling, the team curated the variant regions. "We looked at the genes as a whole to make sure the assembly was properly resolving the whole gene," Zook said. "Then we also asked for volunteers in the GIAB community to take their best variant call sets for these genomes and compared them to our new benchmark." After quality control, the researchers were able to finalize a set of 273 curated genes out of the initial 395. Additional efforts involving a broader cohort are still needed to further benchmark the unresolved genes, according to the study authors.
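The comparison Zook describes, scoring a variant call set against a benchmark, can be illustrated with a minimal Python sketch. It assumes the simplest possible approach, exact matching of (chromosome, position, REF, ALT) records, which is not necessarily how the GIAB team performed its comparison. The function name `score_callset` and all variant records are hypothetical toy values, not data from the study.

```python
def score_callset(benchmark, calls):
    """Score a call set against a benchmark by exact record matching.

    Each variant is a (chrom, pos, ref, alt) tuple. This is a naive,
    illustrative comparison; it is an assumption of this sketch, not
    the method used in the study.
    """
    benchmark, calls = set(benchmark), set(calls)
    tp = len(benchmark & calls)   # benchmark variants that were called
    fn = len(benchmark - calls)   # benchmark variants that were missed
    fp = len(calls - benchmark)   # calls absent from the benchmark
    recall = tp / (tp + fn) if benchmark else 0.0
    precision = tp / (tp + fp) if calls else 0.0
    return {"tp": tp, "fn": fn, "fp": fp,
            "recall": recall, "precision": precision}


# Hypothetical toy records, for illustration only.
toy_benchmark = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "CT", "C"),
    ("chr2", 75, "G", "T"),
    ("chr2", 300, "T", "TAA"),
]
toy_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "CT", "C"),
    ("chr2", 75, "G", "T"),
    ("chr2", 410, "C", "A"),   # not in the benchmark: a false positive
]

result = score_callset(toy_benchmark, toy_calls)
print(result)
```

In this toy case, three of the four benchmark variants are recovered and one call is spurious, so both recall and precision come out to 0.75.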
One of the "fascinating things" Sedlazeck said he learned from the study was that there were a number of genes that were successfully assembled and their variants called, yet the team still failed to benchmark them. For instance, although the hifiasm assembly was able to resolve the entire LPA region, which contains multiple tandem-duplicated copies and is associated with cardiovascular disease, the authors were unable to benchmark the variants. This is because the repeat can be represented in different ways in different variant call formats, hampering the team's ability to form a reliable benchmark across the board.
"[Different] variant callers are like different dialects for a language," said Chen-Shan Chin, senior director of deep learning in genomics at DNAnexus and another senior author of the paper. Just like various dialects describe the same thing with different vocabulary, Chin explained, variant callers present the same variant in different ways, making it difficult to establish a benchmark to compare their accuracy.
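A small Python sketch can make the "dialect" problem concrete. In a tandem repeat, deleting one repeat unit can be written at more than one position, and the resulting records look different even though they produce the same sequence. The reference string, the two records, and the helper `apply_variant` below are hypothetical toy examples, not taken from the study or from any specific variant caller.

```python
def apply_variant(ref, pos, ref_allele, alt_allele):
    """Apply one VCF-style variant (1-based position) to a reference string."""
    start = pos - 1
    if ref[start:start + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference sequence")
    return ref[:start] + alt_allele + ref[start + len(ref_allele):]


# Toy reference containing four tandem copies of "CA" (hypothetical).
REF = "GGCACACACATT"

# Two different records that both delete a single "CA" unit.
left_aligned = (2, "GCA", "G")    # deletion written at the start of the repeat
right_shifted = (8, "ACA", "A")   # same deletion written near the end

hap_left = apply_variant(REF, *left_aligned)
hap_right = apply_variant(REF, *right_shifted)

print(left_aligned == right_shifted)  # the records differ as written
print(hap_left == hap_right)          # the resulting sequences are identical
```

Comparing the records directly would count them as two different variants, while comparing the sequences they produce shows that they are the same. Larger and more complex repeats, such as the LPA region described above, offer many more equivalent spellings.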
As part of the study, the researchers also identified genes in either the GRCh38 or GRCh37 reference that are falsely duplicated. "This is something we didn't expect to find going into the study," said Zook, adding that the reason why some challenging genes were excluded from previous GIAB references was that there were errors in the reference genomes.
It remains to be seen how the new benchmark data will affect research and clinical diagnostics. DNAnexus, a DNA sequencing data analysis and management company and one of the sponsors of the study, said it currently does not have any near-term plans to develop a commercial product based on the data presented in the paper.
"There's still a lot of roads between this [study] and being able to drop it into a diagnostic or potentially into drug development in terms of attacking these [genetic] regions," said John Ellithorpe, the company's president. "But we're particularly excited about the development of the space and pulling in long-read technology into potential diagnostics in this area."
"I think it's a really important study," said Christian Marshall, molecular laboratory director at the Hospital for Sick Children in Toronto, who was a peer reviewer for the study. "As a clinical lab director, we're always interested in benchmarking our results … you have to have good reference benchmarks to be able to do that."
However, "a substantial amount of disease-causing genes that exist in those areas are hard to benchmark," Marshall said. "So, you have to rely on other ways to do that. And it becomes quite difficult using other orthogonal methods."
According to Marshall, one of the issues with currently available reference benchmarks is that they often omit highly repetitive or highly homologous regions of the genome that are hard to resolve with short-read sequencing technologies. As a result, he said, many of the genes in these regions tend to be ignored during benchmarking.
There are targeted technologies, such as long-range PCR sequencing, that can validate specific pathogenic variants, Marshall said. But these methods are often "very targeted and painstaking," he noted, and require "a lot of work." That said, Marshall thinks improved benchmarks reaching into these more complex regions of the genome, such as the ones released in this study, will be "imperative" for test validation.
Despite the advantages of long-read sequencing demonstrated in this paper, Marshall thinks the technology still faces barriers to wide adoption by clinical labs for genome-wide sequencing. "I think in research, it's much easier to use other technologies," he noted. "But in the diagnostic lab, short-read sequencing has been the main technique that people have been using for a long time."
Compared with short-read sequencing, Marshall said, long-read sequencing is "a little bit harder to deal with and a little bit less mature in terms of the analysis and interpretation of the data." Plus, it is logistically difficult for a diagnostic lab to have multiple NGS platforms running in parallel, which "takes a lot of energy and resources" to meet the clinical regulatory requirements, he added.
Nonetheless, Marshall said the benchmark reference in this study is "universal" and will be "extremely useful" for analyzing any samples, regardless of whether they are sequenced using long-read or short-read methods, as the benchmarks provide a way to check accuracy. Additionally, "as you have these benchmarks, algorithms can be tuned to be more and more accurate [to detect variants]," he pointed out.
Benedict Paten, an associate professor of biomolecular engineering at the University of California, Santa Cruz, also said that with the reference developed by the long-read method, researchers can now "tune our algorithms for short reads to at least get more out of those regions." UCSC is part of the GIAB consortium, but Paten was not an author of the recent study.
Paten said that as GIAB expands its reference, "the new challenge will become integrating information from diverse, different genomes together."
"I think it just comes back to that question of whether or not having benchmark sets against a single reference is actually sufficient for characterizing all new samples and all new genes, and I would argue it probably isn't," he said. "That's where pangenome [methods] come in."
Zook said he also considers building a pangenome as the way forward to home in on the unresolved challenging genes in the recent study. The Human Pangenome Reference Consortium, he said, is currently sequencing over 300 different genomes from diverse populations and plans to build a pangenome representation. "That could be the basis for a new reference that everyone compares other genomes to," he said.