HOUSTON – New technologies that allow researchers to account for more of the genomic diversity in humans are leading to better variant detection.
At the American Society of Human Genetics conference here yesterday, researchers presented work showing how de novo assemblies and graph genome representations could identify important but previously undetected genomic content.
"People trying to identify clinically relevant variants are actually limited by the reference genome. We're only shedding light onto the biology we can see and right now that's the reference genome," said Karen Wong, a graduate student at the University of California, San Francisco. Anything that's not on it can't be seen unless you're fancy and can do de novo sequencing. If we can think outside the box, that can improve the whole of human genomics."
Wong was one of three speakers who presented data from papers published over the last two years, and included some unpublished data.
The session comes as some researchers are breaking away from using a linear reference sequence and launching new projects to better incorporate genomic diversity. This year, the Genome Reference Consortium announced it has indefinitely postponed the release of the next reference sequence, GRCh39, and last month the National Institutes of Health announced $29.5 million in funding for two research centers that will be working on the "pangenome," a graph-based reference genome that will incorporate variation from more than 300 individuals around the world.
Using 10x Genomics' Linked Reads product for short-read sequencing, researchers at the University of California, San Francisco, led by Wong, assembled de novo genomes from individuals representing five different human populations, leading to the identification of non-reference unique insertions (NUIs) ranging from a few hundred base pairs to entire gene duplications, many of which "may have functional significance," Wong said. Some were identified as exons not present in the GRCh38 reference genome.
Wong's team published their initial study of NUIs in 17 individuals from five ancestry groups in 2018 in Nature Communications, reporting 1,800 NUIs comprising 2.1 megabases of previously undescribed genomic content.
"Among these, 64 percent are considered ancestral to humans since they are found in non-human primate genomes," the authors wrote. "Furthermore, 37 percent of the NUIs can be found in the human transcriptome and 14 percent likely arose from Alu-recombination-mediated deletion."
Now, in collaboration with the Academia Sinica in Taiwan, they've assembled more than 300 genomes and identified 172,000 NUIs.
While Wong's presentation highlighted some of the inadequacies of existing linear reference genomes, she noted that the linear form was still important for her work.
"We're putting data back to where they belong in a linear fashion," she said. And context for NUIs is very important. "An insertion landing in an intron will be very different than one landing in an exon," she said.
Later in the session, researchers from Seven Bridges Genomics, a bioinformatics firm based in Charlestown, Massachusetts, and Belgrade, Serbia, and the University of California, Santa Cruz Genomics Institute discussed the emerging graph genome technology that will bake variation into the reference. The graph concept allows for representation of SNPs, indels, duplications, and inversions in a reference sequence and could help identify variants that a linear reference misses.
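To make the graph idea concrete, here is a minimal Python sketch of a sequence graph holding one SNP and one insertion. It is a toy illustration only: the class, the node IDs, and the sequences are all invented for this example and do not reflect the Seven Bridges pipeline or the UCSC vg toolkit, which rely on compact indexes rather than enumerating paths. Duplications and inversions are left out for brevity.

```python
from collections import defaultdict


class VariationGraph:
    """Toy sequence graph: nodes carry sequence, directed edges join them."""

    def __init__(self):
        self.seq = {}
        self.edges = defaultdict(list)

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def haplotypes(self, start, end):
        """Yield every sequence spelled by a path from start to end.

        Exhaustive enumeration only works for small acyclic graphs.
        """
        stack = [(start, self.seq[start])]
        while stack:
            node, spelled = stack.pop()
            if node == end:
                yield spelled
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, spelled + self.seq[nxt]))


def build_example():
    """Build a hypothetical locus with a SNP (A/G) and a 3-bp insertion."""
    g = VariationGraph()
    g.add_node(1, "ACGT")
    g.add_node(2, "A")    # reference allele of the SNP
    g.add_node(3, "G")    # alternate allele of the SNP
    g.add_node(4, "TT")
    g.add_node(5, "CAG")  # insertion absent from the linear reference
    g.add_node(6, "GA")
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]:
        g.add_edge(src, dst)
    return g


def read_maps(graph, read, start, end):
    """Return True if the read matches exactly along some path in the graph."""
    return any(read in hap for hap in graph.haplotypes(start, end))


graph = build_example()
linear_reference = "ACGTATTGA"  # the path 1 -> 2 -> 4 -> 6
read = "TTCAGG"                 # a read that spans the insertion

print(read in linear_reference)       # exact match against the linear reference
print(read_maps(graph, read, 1, 6))   # exact match against the graph
```

In this sketch the read has no exact match on the linear reference but does match a path through the graph, which is the kind of gain in mapping sensitivity the speakers described.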
A "simple" graph-based informatics pipeline made for whole-genome trio sequencing — where two parents and their child are sequenced, often to help diagnose a rare disease — helped make extra calls for approximately 300,000 SNPs and 45,000 indels, on average, across three families, said Amit Jain, graph technology director at Seven Bridges. A "bespoke graph reference" made from the parents' genotypes to be compared against the child's genome helped call more than 35,000 more SNPs and more than 15,000 extra indels, a Seven Bridges spokesperson added in an email.
Researchers from Seven Bridges published a technical report on benchmarking studies of its Graph Genome Pipeline in January in Nature Genetics. "Using a graph genome reference improves read mapping sensitivity and produces a 0.5 percent increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework," the authors wrote.
They noted that a graph genome concept could also include transcriptomic data, where "the transcriptome could be represented as genomic deletions, allowing RNA-seq reads to be directly aligned across exon–exon junctions" and that it could enable a "personalized graph genome … in order to provide an optimized scaffold for somatic variant detection in matched cancer genomes."
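The idea of treating the transcriptome as genomic deletions can be sketched in a few lines of Python. The example below is an assumption-laden toy, not the authors' implementation: the gene, its sequences, and the function names are invented. A single extra edge skips the intron, so a read spanning the exon-exon junction matches a path in the graph even though it has no contiguous match in the linear genome.

```python
# Toy gene: exon1 - intron - exon2. All sequences are made up for illustration.
NODES = {"exon1": "ATGGCC", "intron": "GTAAGTTTTCAG", "exon2": "GATTGA"}

# Genomic adjacency plus one "deletion" edge (exon1 -> exon2) that skips
# the intron, standing in for the splice junction.
EDGES = {"exon1": ["intron", "exon2"], "intron": ["exon2"], "exon2": []}


def spell_paths(node, path=(), spelled=""):
    """Yield (path, sequence) for every walk from node to a node with no exits."""
    path = path + (node,)
    spelled = spelled + NODES[node]
    if not EDGES[node]:
        yield path, spelled
        return
    for nxt in EDGES[node]:
        yield from spell_paths(nxt, path, spelled)


def paths_containing(read, start="exon1"):
    """Return the paths whose spelled sequence contains the read exactly."""
    return [path for path, seq in spell_paths(start) if read in seq]


linear_genome = "".join(NODES[n] for n in ("exon1", "intron", "exon2"))
spliced_read = "GCCGAT"  # last 3 bases of exon1 followed by first 3 of exon2

print(spliced_read in linear_genome)   # contiguous match in the linear genome?
print(paths_containing(spliced_read))  # paths in the graph that contain the read
```

A real aligner would tolerate mismatches and use an index instead of spelling out every path. The sketch only shows why a junction-spanning read needs no special split-alignment step once the junction is an edge in the graph.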
And graph reference representations helped to better detect structural variations with large effects, particularly insertions, according to Charles Markello, a graduate student in Benedict Paten's lab at UCSC. He added that graph genomes can reduce bias and allele skew when mapping reads, as shown by simulated data sets.
UCSC has developed a variation graph (vg) toolkit, which is available on GitHub and was described in 2018 in Nature Biotechnology. The vg toolkit can "construct or import a graph, modify it, visualize it, and use it as a reference," the authors wrote, as well as "accurately map new sequence reads to the reference using succinct indexes of the graph and its sequence space" and "describe variation between a new sample and an arbitrary reference embedded as a path in the graph."
In July, the group posted a preprint to bioRxiv describing an extended vg toolkit for structural variant (SV) genotyping. The authors wrote that their "method of mapping reads to a variation graph leads to better SV genotyping compared to other state of the art methods," adding that their work "shows the benefit of directly utilizing de novo assemblies rather than variant catalogs to integrate SVs in genome graphs."
"We envision a future in which the lines between variant calling, alignment, and assembly are blurred by rapid changes in sequencing technology," they wrote.
Wong said "objectively, graph [representation] is clearly better" than linear representation of the genome. But she was interested to see if the ASHG crowd, especially the more clinically minded ones, would buy in. "I want to ask them if they actually want to use a graph genome," she said.