NEW YORK – Graph-based genomes, such as the one being built by the Human Pangenome Reference Consortium (HPRC), are already being used to do more than just genetic variant analysis.
Last week, researchers from the University of California, Santa Cruz published a paper in Nature Methods showing how a graph genome helps analyze haplotype-specific expression in bulk RNA-seq without needing to characterize a sample prior to the experiment. They've developed a bioinformatics toolkit for "pantranscriptomics" that helps map RNA to the more complex pangenome reference.
"With this toolkit, we are employing this more diverse data that we can now get from the pangenome to improve the measurement of gene expression data, something that can widely vary between individuals," said UCSC professor Benedict Paten, a senior author of the study. "The aim is to make the impact of this more diverse data felt on studies that are looking at gene expression, resulting in better analysis for cell models, organoid models, and other research applications."
And researchers from Canada's McGill University, led by Guillaume Bourque, have released studies showing the utility of graph genomes in chromatin immunoprecipitation sequencing (ChIP-seq) and the assay for transposase-accessible chromatin by sequencing (ATAC-seq).
"The fact that 20 years later we now have a different way of thinking about the human genome is a pretty great change in technology," Bourque said. "It's an exciting area of new science."
The era of the pangenome is just getting started. Even the draft pangenome reference has only been released as part of a preprint as the authors await publication in a peer-reviewed journal. But these studies suggest that pangenomes could prove useful for just about any assay where the first step is to map reads back to a reference.
Launched in 2019 with $29.5 million from the National Human Genome Research Institute, the pangenome project has progressed alongside the efforts of the Telomere-to-Telomere consortium. Karen Miga, a researcher at UCSC who was not involved in the new study, is central to the leadership of both T2T and HPRC.
Gapless human genome assemblies, such as the one the T2T team released in June 2021, are critical to creating the pangenome, which combines data from many genomes, encompassing a greater diversity of human genetic variation.
While the pangenome consortium has availed itself of new wet lab methods, such as more accurate long-read sequencing, at its core it is a bioinformatics project. As such, it has spurred new computational methods, such as a "semi-automated" diploid genome assembly, published in Nature in October.
In that vein, the pantranscriptomics toolkit offers researchers ways to analyze RNA with a richer reference that helps account for splicing.
Sequences next to each other in an RNA molecule can come from nonconnected areas of the genome, making it challenging to correctly align them to a reference. Moreover, splice sites are not uniform across the human population and can vary between individuals. Add to that the fact that expression can come from either the maternal or the paternal chromosome, and it can be hard to tell exactly where a read should map.
The UCSC pipeline identifies which areas of the genome the RNA sequencing data comes from, including the splice sites, and marks those points on the pangenome reference. Those marked points are then compared to a pantranscriptome consisting of haplotype-specific transcripts generated from the reference data contained within the pangenome. Finally, the pipeline uses this comparison between the mapped data and the transcripts in the pantranscriptome to estimate levels of gene expression and to identify which haplotypes the expressed transcripts come from.
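To make that last step concrete, here is a minimal sketch of how expression might be split between haplotype-specific transcripts once reads have been mapped. It is not the UCSC toolkit's code: the function name `estimate_abundance`, the transcript labels, and the toy read counts are all hypothetical, and the sketch uses a plain expectation-maximization loop that ignores transcript length, alignment quality, and the graph itself.

```python
from collections import defaultdict


def estimate_abundance(read_compat, n_iter=100):
    """Toy EM estimate of relative transcript abundance.

    read_compat: one entry per read, listing the (hypothetical)
    haplotype-specific transcripts that read is compatible with.
    Returns a dict mapping transcript id -> estimated read fraction.
    """
    # Reads compatible with no transcript carry no information here.
    reads = [compat for compat in read_compat if compat]
    transcripts = sorted({t for compat in reads for t in compat})

    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in reads:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        theta = {t: counts[t] / len(reads) for t in transcripts}

    return theta


# Toy data (assumed): 6 reads match only haplotype A's transcript,
# 2 match only haplotype B's, and 4 cannot tell the two apart.
toy_reads = (
    [["GENE1_hapA"]] * 6
    + [["GENE1_hapB"]] * 2
    + [["GENE1_hapA", "GENE1_hapB"]] * 4
)
estimates = estimate_abundance(toy_reads)
print(estimates)
```

On this toy input the estimates settle at roughly 0.75 for the haplotype A transcript and 0.25 for haplotype B: the ambiguous reads end up divided in the same 3-to-1 ratio as the reads that distinguish the two haplotypes.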
Jonas Sibbesen, an author on the paper who is now an assistant professor at the University of Copenhagen, noted that while there are some existing tools that use a graph-based approach for transcription analysis, they only work on smaller gene regions like the HLA variable region. "This is the first tool that can do this genome-wide," he said.
Running the toolkit is "very achievable for any computer server," said Jordan Eizenga, a postdoc in Paten's lab and an author of the study, but there are computational costs to using it. "There is some penalty in terms of both memory used and speed" in comparison with linear reference-based approaches, he said.
Bourque's lab has explored using personalized graph reference genomes for epigenetic studies. So far, they've shown that these references can improve peak calls. In ChIP-seq experiments, the newly identified peaks were enriched for indels and SNVs and are likely to differ between individuals.
In a study of mobile element insertions, graph references revealed an increase in ChIP- and ATAC-seq peaks of around 2 to 3 percent. Those data were published in a bioRxiv preprint in May. In 2020, Bourque's lab also published a study in Genome Biology on graph-based methods for analyzing ChIP-seq data.
"My students were very disappointed because 2 percent is not a huge difference," Bourque said. "But what's important is that the 2 percent is actually associated with genetic differences between individuals. Those are the most interesting regions. For many applications, what you want are the peaks that differ. That's where the graph really helps."
Bourque suggested that graph-based genomes can help analyze additional types of data, such as those generated by single-cell and Hi-C assays.
"Typically, the first step of any of these genomic analyses is you have reads and map onto the genome," he said. "Now everything has to change because we're changing that first step."
Computational biologists have their work cut out for them: "There are a lot of technical challenges that need to be addressed in terms of adapting all the tools that are downstream of that initial step," he said.