Variant Detection Gets Boost From Graph Genomes, De Novo Assemblies


HOUSTON – New technologies that allow researchers to account for more of the genomic diversity in humans are leading to better variant detection.

At the American Society of Human Genetics conference here yesterday, researchers presented work showing how de novo assemblies and graph genome representations could identify important but previously undetected genomic content.

"People trying to identify clinically relevant variants are actually limited by the reference genome. We're only shedding light onto the biology we can see and right now that's the reference genome," said Karen Wong, a graduate student at the University of California, San Francisco. "Anything that's not on it can't be seen unless you're fancy and can do de novo sequencing. If we can think outside the box, that can improve the whole of human genomics."

Wong was one of three speakers who presented data from papers published over the last two years, along with some unpublished data.

The session comes as some researchers are breaking away from using a linear reference sequence and launching new projects to better incorporate genomic diversity. This year, the Genome Reference Consortium announced it has indefinitely postponed the release of the next reference sequence, GRCh39. Last month, the National Institutes of Health announced $29.5 million in funding for two research centers that will work on the "pangenome," a graph-based reference genome that will incorporate variation from more than 300 individuals around the world.

Using 10x Genomics' Linked Reads product for short-read sequencing, researchers at the University of California, San Francisco, led by Wong, assembled de novo genomes from individuals representing five different human populations. The assemblies led to the identification of non-reference unique insertions (NUIs) ranging from a few hundred base pairs to entire gene duplications, many of which "may have functional significance," Wong said. Some were identified as exons not present in the GRCh38 reference genome.

Wong's team published their initial study of NUIs in 17 individuals from five ancestry groups in 2018 in Nature Communications, reporting 1,800 NUIs comprising 2.1 megabases of previously undescribed genomic content.

"Among these, 64 percent are considered ancestral to humans since they are found in non-human primate genomes," the authors wrote. "Furthermore, 37 percent of the NUIs can be found in the human transcriptome and 14 percent likely arose from Alu-recombination-mediated deletion."

Now, in collaboration with Academia Sinica in Taiwan, they've assembled more than 300 genomes and identified 172,000 NUIs.

While Wong's presentation highlighted some of the inadequacies of existing linear reference genomes, she noted that the linear form was still important for her work. 

"We're putting data back to where they belong in a linear fashion," she said. And context for NUIs is very important. "An insertion landing in an intron will be very different than one landing in an exon," she said.

Later in the session, researchers from Seven Bridges Genomics, a bioinformatics firm based in Charlestown, Massachusetts, and Belgrade, Serbia, and from the University of California, Santa Cruz Genomics Institute discussed the emerging graph genome technology that will bake variation into the reference. The graph concept allows for representation of SNPs, indels, duplications, and inversions in a reference sequence and could help identify variants.
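To make the graph idea concrete, here is a minimal sketch in Python of how a reference might carry variation as alternative paths. It is an illustration only, not the Seven Bridges or UCSC data model: the `VariationGraph` class, the `build_toy_graph` helper, the node numbering, and the toy sequences are all hypothetical. Each node holds a stretch of sequence, a SNP becomes two parallel nodes, and an insertion becomes an optional node that a path may pass through or skip.

```python
from collections import defaultdict


class VariationGraph:
    """Toy sequence graph (hypothetical): nodes hold DNA, edges join them."""

    def __init__(self):
        self.seq = {}                   # node id -> DNA sequence
        self.edges = defaultdict(list)  # node id -> successor node ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def haplotypes(self, start, end):
        """Return every sequence spelled by a path from start to end."""
        stack = [(start, self.seq[start])]
        found = []
        while stack:
            node, spelled = stack.pop()
            if node == end:
                found.append(spelled)
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, spelled + self.seq[nxt]))
        return sorted(found)


def build_toy_graph():
    """Build a made-up graph with one SNP and one insertion."""
    g = VariationGraph()
    g.add_node(1, "ACG")   # shared flank
    g.add_node(2, "T")     # SNP, reference allele
    g.add_node(3, "C")     # SNP, alternate allele
    g.add_node(4, "GA")    # shared middle
    g.add_node(5, "TTT")   # insertion absent from the linear reference
    g.add_node(6, "CC")    # shared flank
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]:
        g.add_edge(src, dst)
    return g


if __name__ == "__main__":
    for hap in build_toy_graph().haplotypes(1, 6):
        print(hap)
```

In this toy, one SNP and one insertion yield four possible sequences from a single graph, whereas a linear reference would hold only one of them.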

A "simple" graph-based informatics pipeline made for whole-genome trio sequencing — where two parents and their child are sequenced, often to help diagnose a rare disease — helped make extra calls for approximately 300,000 SNPs and 45,000 indels, on average, across three families, said Amit Jain, graph technology director at Seven Bridges. A "bespoke graph reference" made from the parents' genotypes to be compared against the child's genome helped call more than 35,000 more SNPs and more than 15,000 extra indels, a Seven Bridges spokesperson added in an email. 

Researchers from Seven Bridges published a technical report on benchmarking studies of the company's Graph Genome Pipeline in January in Nature Genetics. "Using a graph genome reference improves read mapping sensitivity and produces a 0.5 percent increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework," the authors wrote.

They noted that a graph genome concept could also include transcriptomic data, where "the transcriptome could be represented as genomic deletions, allowing RNA-seq reads to be directly aligned across exon–exon junctions" and that it could enable a "personalized graph genome … in order to provide an optimized scaffold for somatic variant detection in matched cancer genomes."
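The transcriptome idea in that quote can be sketched in a few lines of Python. This is a simplified illustration, not the authors' pipeline: the function names, the toy exon and intron sequences, and the use of exact substring matching in place of real alignment are all assumptions. The intron is treated as a stretch the graph may skip through a deletion edge, so a read spanning the exon-exon junction finds a matching path.

```python
def spell_paths(nodes, edges, start, end):
    """Return the sequence spelled by each path from start to end."""
    stack = [(start, nodes[start])]
    spelled = []
    while stack:
        node, seq = stack.pop()
        if node == end:
            spelled.append(seq)
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, seq + nodes[nxt]))
    return spelled


def read_matches(read, nodes, edges, start, end):
    """True if the read occurs exactly along some path (a stand-in for alignment)."""
    return any(read in seq for seq in spell_paths(nodes, edges, start, end))


# Hypothetical toy gene: two exons separated by one intron.
NODES = {"exon1": "ATGGCC", "intron1": "GTAAGTTTTCAG", "exon2": "GATTGA"}

# Linear genome: the only route runs through the intron.
LINEAR_EDGES = {"exon1": ["intron1"], "intron1": ["exon2"]}

# Graph with the intron also modeled as a deletion: exon1 can jump to exon2.
SPLICE_EDGES = {"exon1": ["intron1", "exon2"], "intron1": ["exon2"]}

# A made-up RNA-seq read that crosses the exon1/exon2 junction.
JUNCTION_READ = "GGCCGATT"

if __name__ == "__main__":
    print(read_matches(JUNCTION_READ, NODES, LINEAR_EDGES, "exon1", "exon2"))
    print(read_matches(JUNCTION_READ, NODES, SPLICE_EDGES, "exon1", "exon2"))
```

Against the linear layout the junction read has no exact match, while the graph with the deletion edge spells it directly.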

And graph reference representations helped to better detect structural variations with large effects, particularly insertions, according to Charles Markello, a graduate student in Benedict Paten's lab at UCSC. He added that graph genomes can reduce bias and allele skew when mapping reads, as shown by simulated data sets. 

UCSC has developed a variation graph (vg) toolkit, which is available on GitHub and was described in 2018 in Nature Biotechnology. The vg toolkit can "construct or import a graph, modify it, visualize it, and use it as a reference," the authors wrote, as well as "accurately map new sequence reads to the reference using succinct indexes of the graph and its sequence space" and "describe variation between a new sample and an arbitrary reference embedded as a path in the graph."

In July, the group posted a preprint to bioRxiv describing an extended vg toolkit for structural variant (SV) genotyping. The authors wrote that their "method of mapping reads to a variation graph leads to better SV genotyping compared to other state of the art methods," adding that their work "shows the benefit of directly utilizing de novo assemblies rather than variant catalogs to integrate SVs in genome graphs."

"We envision a future in which the lines between variant calling, alignment, and assembly are blurred by rapid changes in sequencing technology," they wrote.

Wong said "objectively, graph [representation] is clearly better" than linear representation of the genome. But she was interested to see if the ASHG crowd, especially the more clinically minded ones, would buy in. "I want to ask them if they actually want to use a graph genome," she said.
