NEW YORK (GenomeWeb) – Although advances in next-generation sequencing have driven down the cost of sequencing human genomes such that large-scale sequencing projects are now feasible, one limitation is that the human reference genome lacks representation from diverse populations. Novel, population-specific variation will not be detected if genome sequences are aligned to that reference. While researchers have known about this problem for some time, a team led by Seoul National University and Macrogen has illustrated the importance of having a population-specific reference genome in the de novo assembly of a Korean genome.
In a study published today in Nature, the researchers described using a combination of technologies to generate a de novo assembly of a Korean genome known as AK1. They identified more than 11,000 novel structural variants, including sequences that seemed to be unique to Asian ancestral groups, and were able to close many of the gaps in the human reference genome.
Stephan Schuster, a professor at Nanyang Technological University in Singapore who was not part of this study but who collaborates with the Macrogen group on the GenomeAsia 100K project, said that the study is "a huge advance." The AK1 assembly is the most contiguous human genome assembly that has been published, he added, and shows that now "is the time to move toward generating high-quality reference genomes to study populations."
Schuster added that "the problem of sequencing thousands of genomes has more or less been solved" with the increasing throughput and decreasing cost of short-read sequencing platforms. But the utility of being able to do that is "limited by working with one reference genome," he said.
As demonstrated in the Nature study, there is a lot of variation, even in known disease-associated regions, that is specific to certain populations or ancestral groups. Those variants, particularly structural variants, could be missed in an individual if there is not a good population-specific reference genome to align to.
The current human reference genome has the most utility for individuals of European descent, so groups like Macrogen and others are looking to generate de novo assemblies that can serve as references for other populations.
In addition, thanks to advances in technology, the Macrogen team was even able to fill in some gaps in the current reference, GRCh38, that were previously intractable using short-read technology.
Jeong-Sun Seo, lead author of the study, chairman at Macrogen, and a professor at Seoul National University, told GenomeWeb that the group plans to continue to improve the AK1 genome, aiming "to create an assembly that is continuous from telomere to telomere with the complete representation of the highly repetitive centromere."
To sequence and assemble the AK1 genome, the Macrogen team used a combination of technologies that included Pacific Biosciences' RSII platform, BioNano Genomics' Irys, Illumina sequencing, BAC clone sequencing, and 10X Genomics' GemCode.
The majority of the sequence data was generated by PacBio's RSII, using the P6-C4 chemistry. The researchers ran 380 SMRT cells to sequence the genome to 101X coverage. The PacBio data alone yielded 3,128 contigs with an N50 length of 17.9 megabases. Next, they used BioNano Genomics' Irys system to place the contigs into larger scaffolds, which resulted in 2,832 scaffolds with an N50 of 44.8 megabases.
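For readers unfamiliar with the N50 statistic quoted above, it is the length of the shortest contig (or scaffold) in the set of longest sequences that together cover at least half of the total assembly length. The following is a minimal sketch of the standard calculation, not the authors' pipeline; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that pushes the running sum to half.
        if running * 2 >= total:
            return length


# Toy example with made-up lengths (hypothetical, not AK1 data).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

A higher N50 means that more of the assembly sits in long, unbroken pieces, which is why the jump from contigs to scaffolds is reported as an increase in N50.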
The team also sequenced the genome to 72X coverage on Illumina's HiSeq X Ten for "polishing," and sequenced a BAC library with a combination of Sanger, Illumina, and PacBio technology, which confirmed the assembly and also enabled haplotyping. In addition, they used 10X Genomics' GemCode in conjunction with Illumina sequence data to produce linked reads to help with the haplotyping.
They also performed transcriptome sequencing, using both Illumina and PacBio technology.
The final assembly is "characterized by marked contiguity that has not been achieved by non-reference assemblies of the human diploid genome so far, and improves on the previous best N50 length by 18 Mb," the authors wrote.
The longest scaffold of 113 megabases spanned chromosome 5 completely. In addition, eight other chromosomal arms could each be represented with a single scaffold, including both the short and long arm of chromosome 20, Seo said.
Seo added that the BAC library in particular "confirmed the robustness of the assembly and enabled accurate haplotype phasing of the genome even in complex regions," which included some medically relevant genes like CYP2D6 and the hypervariable MHC class II region.
The researchers were also able to use the AK1 assembly to close gaps in the human reference genome, GRCh38. They closed 65 gaps completely and resolved 40, using local realignment and reassembly as well as spanning reads. In addition, they shortened 72 of the remaining 85 gaps with 663 kilobases of sequence. The gaps were in regions of the genome that are difficult to sequence with short reads, such as tandem repeats.
"If you look at the stats," the AK1 genome "beats any genome that has been published so far, in terms of scaffold contiguity and contig contiguity," Schuster said. However, he estimated that doing the sequencing and assembly "was so expensive it would be prohibitory" to repeat this with other genomes. In addition, he noted that the Macrogen researchers put in an "enormous amount of manual work rectifying the assembly errors."
The cost could be reduced in part by using PacBio's newer instrument, the Sequel, instead of the RSII, Schuster said, but that would still leave the BAC library component, which would add significantly to the cost.
Seo said that the project cost around $1.7 million in total. He did not anticipate making major changes to the combination of technologies used, but said that the group is interested in developing new algorithms "that can take advantage of different technologies for de novo assembly." He said the researchers also plan to incorporate Hi-C sequencing technology to improve the assembly's continuity, and to construct 100,000 BAC clones from the AK1 genome and sequence them with PacBio technology. "In particular, we will be exploring heterochromatic regions and segmental duplications of the genome that are often misassembled even in the human reference genome," he said.
The AK1 reference genome could eventually have an impact in the clinic, particularly due to its structural variant findings.
Population-specific reference genomes are especially important for studying structural variants, Kai Wang, an associate professor at Columbia University, who led an effort to assemble a Han Chinese genome de novo, told GenomeWeb.
Trying to call structural variants by short-read sequencing and alignment to the GRCh38 reference "is not very reliable," he said, especially for individuals not of European descent. The false-negative and false-positive rates are high, "partly because of the technology and partly because ethnicity-specific structural variants may not be accurately detected by aligning to GRCh38," he said. "But if we have a population-specific genome, it's a lot more possible to call those accurately."
He added, however, that for SNVs, the GRCh38 reference would still be reliable, since the database for clinically important SNVs is very comprehensive.
Jonas Korlach, chief scientific officer at PacBio and an author of the Nature study, agreed with Wang that population-specific reference genomes are particularly important for identifying structural variation.
The study "represents a powerful example of the movement toward ethnically diverse reference genomes to support global precision medicine initiatives," he said in an emailed statement. "The exacting standards of reference quality genome projects demand that the full complement of genetic variation be represented, not just the single nucleotide variations."
Of the more than 18,000 structural variants the Macrogen team found, nearly 12,000 were novel. "These structural variations are reflected in the transcriptome and in the regulatory regions of the genome, which are becoming increasingly important in understanding the full range of genetic variations that cause complex diseases," Seo said.
He said that the team would continue to perform functional analyses of the unique structural variants, and identify which ones are Asian-specific and which ones may be clinically relevant.
In the current study, Seo said, the researchers first focused their analyses on structural variations that were not from repetitive regions or within duplications.
Then, to figure out which were population-specific versus previously undetected universal variants, they compared their results with high-coverage genomes from the 1000 Genomes Project as well as with high-coverage Asian genomes, and assessed the frequency of the variants to determine whether they were population-specific. For instance, out of 853 identified insertions, 45 were Asian-specific.
"With the increase in numbers of samples and improvements in our methods, we expect to find even more Asian-specific structural variations that explain the underlying population demographics," Seo said.
Seo added that further analysis of the structural variants has found that many appear to be shared with the chimpanzee and gorilla. "Therefore, we think we have found a set of ancestral structural variations that have been evolutionarily maintained within the Asian population," he said.
Creating the AK1 assembly was the first part of a larger project known as the Asian Genome Project. Phase 2 of the project is to sequence the genomes of 10,000 individuals from various disease cohorts and populations. Seo said that the group has now completed sequencing the genomes of 3,000 Japanese, 3,000 Korean, and 1,000 Mongolian case and control samples and is sequencing additional individuals from the Chinese population to "understand the population substructure in Northeast Asia."
The Macrogen group is also working with Schuster on the GenomeAsia 100K project to construct "additional population-specific reference genomes in understudied populations," Seo said.
In the future, Seo envisions the AK1 reference genome being used as the standard for clinical genomics in Asia. The Macrogen team is currently participating in a precision medicine initiative in Korea and recruiting individuals for the Asian Genome Project, gathering both health records and genomic DNA, with a particular focus on Northeast Asians, Seo said. "The Asian reference genome, together with curated data, will enable precision medicine in Asia," he said.