New Pangenome Bioinformatics Toolkit Offers "Multiscale" View of Variants in Diverse Genomes


NEW YORK – A new set of tools for genome analysis using the concept of the human pangenome is available for researchers, enabling visualization of variants at different scales, from single-nucleotide polymorphisms to large structural variants.

In a paper published in Nature Methods this week, researchers led by Jason Chin of GeneDx, the genetic testing firm formerly known as Sema4, described the Pangenome Research Tool Kit (PGR-TK) and validated its ability to resolve complex regions of the human genome, such as the human leukocyte antigen (HLA) locus, certain Y chromosome genes, and the Genome in a Bottle consortium's list of challenging medically relevant genes.

"Like binoculars, it allows you to adjust focus and see specific structures at different scales," said Gustavo Stolovitzky, CSO at GeneDx.

As the pangenome idea continues to pick up steam, the PGR-TK arrives with a "unique" take on pangenome analysis, said Heng Li, a bioinformatician at Harvard University who has developed other pangenome analysis tools and who was not involved in the paper. "It visualizes complex structural variations in an intuitive way. This allows users to closely investigate these complex events, which can't be easily achieved with other tools."

"It's a nice first milestone to getting graph genomes into the hands of medical researchers," said Fritz Sedlazeck, a bioinformatician at Baylor College of Medicine and a coauthor on the paper. "A lot still needs to be done, but the trajectory is right."

PGR-TK joins a growing list of bioinformatics tools built to make use of the pangenome reference, a concept that seeks to place individual genomes in the context of the rich diversity of human genetic variation that exists throughout the world.

As the pangenome is based on nearly gapless assemblies, it affords the ability to analyze regions of the human genome that were previously too complex to allow linear alignment.

By using a graph structure, pangenome references can represent challenging regions, such as long repeats and large insertions or deletions. In addition, by comparing each genome with many others (there are already 47 haplotype-resolved genomes in the Human Pangenome Reference Consortium database), researchers can get a better sense of whether a variant may be associated with a certain phenotype.

"There are many pangenome toolkits already that focus on building a whole-genome graph first," Chin said. "That's computationally intensive. In a lot of cases, we're interested in a certain region. Our tool is built to allow you to fetch and focus on a couple of regions that you're interested in first." This region-first approach makes the toolkit computationally efficient and, in principle, could allow a researcher to analyze a cohort of genomes at the same time.

The ability to focus on a region of interest, and at different scales within that region, is especially powerful, Sedlazeck said. He pointed to a preprint he and Chin posted to bioRxiv a year ago. In it, they used Chin's toolkit to analyze the LPA gene, which is associated with cardiovascular disease (CVD) risk. One region of the gene consists of 5.5 kb repeat units, and the number of repeats is inversely correlated with CVD risk. "We see that there are interesting variants inside these copy numbers, not every copy is the same," he said. Some individuals have copy number variants that should suggest a higher risk, but their phenotypes aren't bearing that out. "Something is missing," Sedlazeck said. "[Chin] and I believe that this missing extra is there inside these repeats."

Stolovitzky stressed that the toolkit is not yet certified for use in the clinic; however, he suggested that as soon as it is, GeneDx will seek to implement it and include its findings in reports for healthcare providers. "If you have one person's genome you're interested in, it could be very useful if you know what you're looking for," he said.

The toolkit is also an important part of the company's pivot to whole-genome sequencing, especially sequencing based on long reads.

Here, efficiency is also key. "This is a very fast algorithm," Stolovitzky said. "I think it is going to allow us to study big cohorts in order to make these genotype-phenotype associations, which will eventually inform the clinical use of [long-read] technology that will access these complex architectures."