This story has been updated to correct Deanna Church's comments. She was referring to compatibility with ENCODE, not GENCODE.
NEW YORK – Researchers from the Telomere-to-Telomere consortium (T2T) have assembled "the first truly complete 3.055 billion base pair sequence of a human genome," according to a new bioRxiv preprint posted last week.
The gapless assembly of all human chromosomes except the Y chromosome adds more than 150 Mb of previously unknown content to the human genome, mostly segmental duplications and satellite repeats from the centromeric regions and acrocentric arms of certain chromosomes. While the genome is technically diploid, it comes from a complete hydatidiform mole, a type of uterine growth that has two copies of the same haplotype. But the new methods used here, especially graph-based assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies, have paved the way toward getting not just one complete genome, but enough to represent the majority of human genomic diversity.
"This is the last base camp before the summit," said Adam Phillippy, a bioinformatician at the National Human Genome Research Institute and a co-chair of T2T. "With the Human Genome Project (HGP), while it created billion-dollar industries and unlocked huge discoveries, there was always a nagging feeling in the back of my head of 'Gee, it's not really done,'" he said.
He hopes the genomics community will see the T2T-CHM13 assembly as an achievement in itself and use it as a linear reference genome. "All comparisons show that this genome is much more representative than GRCh38," he said. "It's much more human."
"The authors did a nice job of demonstrating value," said Deanna Church, VP of mammalian business at genome editing firm Inscripta and a former staff scientist at the National Center for Biotechnology Information, where she helped lead development of the Genome Reference Consortium's GRCh38 reference genome. "It's a better assembly than GRCh38," she said.
Though the US government-backed HGP was declared finished in 2003, it never delivered a genome that would satisfy a completionist. After a fraught race with Celera Genomics, a private effort led by former NIH researcher Craig Venter, the sides called it a draw when they delivered two draft genomes in 2001. A 2007 paper on Venter's genome claimed to be the first diploid genome of a named individual, and a 2008 paper claimed to deliver a complete genome of an individual (DNA structure pioneer and former Cold Spring Harbor Laboratory Director James Watson) using next-generation sequencing technology, but both still contained gaps, as did reference genomes including 2009's GRCh37 and 2013's GRCh38.
The limitations of these previous efforts were essentially technical. Centromeric regions and segmental duplications can be hundreds of kilobases long, making them impenetrable to Sanger and short-read sequencing methods.
In 2018, researchers led by Matt Loose of the University of Nottingham and Nick Loman of the University of Birmingham in the UK published a genome assembly for which they used a new protocol to generate nanopore reads with N50 lengths greater than 100 kb and up to 880 kb. This brought the number of gaps down to about 100. From working together on that project, Phillippy and Karen Miga, a satellite repeat researcher at the University of California, Santa Cruz, became convinced that a gapless genome was possible and launched the T2T consortium later that year.
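For readers unfamiliar with the N50 statistic cited above, the following is a minimal sketch of how it is commonly computed for a set of read lengths: the length at which reads of that size or longer account for at least half of all sequenced bases. The function name and the sample lengths are illustrative assumptions, not taken from the 2018 study.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    Sort lengths from longest to shortest and walk down the list until
    the running total covers at least half of all bases; the length at
    which that happens is the N50.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


# Hypothetical read lengths in kilobases, for illustration only.
example_read_lengths_kb = [20, 30, 40, 50, 60]
example_n50_kb = n50(example_read_lengths_kb)
```

An N50 above 100 kb therefore means that at least half of the sequenced bases sit in reads of 100 kb or longer, which is what lets such reads span long repeats.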
They chose to work with the complete hydatidiform mole and its less complicated genome, which was hugely important, Church said, perhaps as important as the sequencing technology used in the project. "It really simplifies the problem. There might be a little bit of heterozygosity, but it's a simplified way of representing one true haplotype," she said.
In June 2020, T2T published a paper on the first gapless chromosome assembly, of the human X chromosome, created using ultra-long nanopore reads as well as PacBio sequencing, optical genome mapping from Bionano Genomics, linked reads technologies from 10x Genomics (now discontinued) and Illumina, in addition to long-range interaction data generated with Hi-C assays.
But to get the whole enchilada, the T2T team pivoted to an approach based mostly on PacBio's HiFi read technology, which creates a consensus sequence that is more than 99.9 percent accurate by only using reads with at least a Q20 quality score. While nanopore ultra-long reads were easy to assemble, the error rate was too high and the researchers had to use other technologies to make the assemblies sufficiently accurate. So-called "polishing" with Illumina short reads or even PacBio reads was actually a source of error, Phillippy noted.
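As a rough guide to the quality figures above, the sketch below shows the standard Phred relationship between a quality score Q and per-base accuracy, accuracy = 1 - 10^(-Q/10). Under that definition Q20 corresponds to 99 percent accuracy and 99.9 percent corresponds to Q30, so the Q20 figure is best read as a minimum threshold rather than the typical accuracy. The function names are illustrative assumptions.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred score."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1.0 - accuracy)


# Print accuracy for a few commonly quoted thresholds.
for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.4%} accurate")
```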
While working on HiCanu, a graph-based assembly algorithm published last year, Phillippy and postdoc Sergey Nurk realized that they could use PacBio's HiFi data to assemble centromeres.
"Right when the pandemic was starting, [Nurk] brought me the first assembly graphs and we saw that all the chromosomes were essentially coming together in one piece," Phillippy said. "Then the problem was just figuring out the right way to walk through the data."
To create T2T-CHM13, the team built a "conservative" genome graph from HiFi reads using code taken from HiCanu and Miniasm, an assembler developed by the Broad Institute's Heng Li. Using ultra-long reads from Oxford Nanopore, the researchers were able to find the right paths through the graph to generate a consensus sequence for each chromosome. There was a polishing step at the end; however, Phillippy said there were not many corrections made that way. "We did use Illumina at the final stage to call variants," along with nanopore and HiFi data, he noted. In addition to using Google's DeepVariant, a curation team manually identified some corrections that were added to the genome.
All told, the project took about a year with sequencing costs of at least $50,000. "There were a lot of other datasets that went into validation and were gradually amassed over the years," Phillippy said. "So, $50,000 would be my rough estimate if we were going to do it all over again knowing what we know now. The true cost of the project was higher." His lab is planning to develop a new assembly algorithm that can integrate HiFi and nanopore data, which could be released in a year or less.
The methods used helped the researchers account for about 182 Mb of new sequence, around 8 percent of the human genome, most of which had been previously inscrutable due to its repetitive nature or near-identity to other genomic regions. Centromeric satellite repeats comprised about 180 Mb and segmental duplications comprised 68 Mb, though there is overlap between the two categories, the manuscript authors noted.
The assembly added more than 3,000 new genes, including about 150 protein-coding genes. The number of segmental duplications, also called low copy repeats, grew to 41,528 from 24,097. And 66.1 Mb has been assigned to the very short arms found on the five acrocentric chromosomes that can contain ribosomal DNA, satellite repeats, and segmental duplications, which had essentially been ignored in prior assemblies.
"While I think [T2T-CHM13] is much more than an incremental improvement, it's unclear to me that the tools are there to enable adoption of it," Church said. "Without resources like GnomAD (Genome Aggregation Database) or ENCODE, I'm not convinced we'll see widespread adoption of the assembly." She noted that while GRCh38 was a better assembly than its predecessor, many researchers still use GRCh37, the previous version. "The amount of effort it takes to transition to a new assembly is huge," she said. "It's not just about the bases in the sequence."
Neither of those tools will be available for use with the new assembly without additional studies. "Presumably almost all of [GnomAD and ENCODE] data will be from short-read sequencing experiments and so will not be easily mapped to the new repetitive regions of the genome," Phillippy said. Moving the GnomAD data to the new reference would provide minimal benefit and discovering new variants in the new genomic regions will take more long-read sequencing. "GnomAD is a tremendous resource, and to rebuild it on top of new sequencing technology will probably take a long time," Phillippy said.
Similarly, ENCODE data will need to be replicated with new methods, he said, but reanalyzing it with respect to the new reference "would likely improve the results by reducing bias and other errors caused by the incomplete GRCh38 reference." He added that the T2T team is working on a manuscript that directly addresses these issues.
The new genome is presented as a linear structure, so existing alignment software tools should work fine, Phillippy said. Like many other resources, it contains genome content mostly associated with European heritage. "If you're happy with the reference genomes that already exist, I see no reason why you need to change processes," he said. "We're talking about maybe a few hundred genes impacted by improvements." But the equation changes for studies of structural variants, he said: "Most of the new long-read studies I'm involved with are planning to use it." Phillippy also noted that he has discussed with members of the GENCODE team, part of the ENCODE project that maps protein-coding genes, a multi-year effort to curate the new genomic regions and put them into future GENCODE releases.
Both Phillippy and Church agreed that the ultimate goal should be many phased genomes, representing individuals from all over the world, something that is being pursued by the Human Pangenome Reference Consortium. A gapless genome with distinct, phased haplotypes could be ready in as little as a year, Phillippy said, with many more hopefully soon to follow.
"Once we get this done, the next ones are exponentially easier," Phillippy said. Then, with multiple reference genomes available to align to, the bioinformatics field will really have to respond with new approaches to alignment and variant calling.
"Let's see what 100 genomes look like and then build tools around that," he said.