Chinese Genome Assembled From Long-Read Sequence Data

NEW YORK (GenomeWeb) – Using long-read sequencing and physical mapping, researchers from the University of Southern California and elsewhere have generated a de novo assembly of the genome of a Chinese individual.

By combining their sequencing approach with transcriptome sequencing, the researchers reported in Nature Communications yesterday, they were able to fill in gaps in the human reference genome as well as annotate spliced genes not present in GENCODE and missed by short-read approaches.

"Improved understanding and better characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations," USC's Kai Wang, the senior author of the study, and his colleagues wrote.

Using Pacific Biosciences' RS II sequencer, he and his colleagues sequenced DNA obtained from a healthy Chinese man to 103X coverage. From those data, they generated 44.2 million sub-reads with a mean length of 7 kilobases and an N50 length of 12.1 kilobases. They further generated a 2.9 gigabase genome assembly with 5,843 contigs using the Falcon algorithm, along with 206 megabases of associated contigs or alternative haplotypes.
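For readers unfamiliar with the N50 statistic quoted above, it is the length L such that reads (or contigs) of length L or longer account for at least half of all sequenced bases. The sketch below illustrates that definition only; the function name `n50` and the toy lengths are assumptions made for illustration, not the study's data or code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")


# Toy example (hypothetical lengths in kilobases, not from the study).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

Because long sequences contribute more bases than short ones, N50 typically sits above the mean length, which is consistent with the 12.1-kilobase N50 and 7-kilobase mean reported for the sub-reads.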

By folding in short reads generated on the Illumina HiSeq X platform, the researchers further polished their assembly and corrected errors. They also compared their assembly to a physical map of the same DNA sample generated using BioNano Genomics' IrysChip, aligned it against the human reference genome, examined its consensus accuracy versus the reference, and aligned the RefSeq transcripts to other assemblies to conclude that their assembly is complete and of high quality.

With this de novo assembly in hand, Wang and his colleagues were able to fill in some of the 966 N-gaps present in the human reference genome. For instance, they reported that more than a quarter of those gaps, including 148 gaps on primary chromosomes, could be filled or partially filled by their assembly.

They also identified more than 9,800 deletions and 10,000 insertions in their assembly and found that about half are short tandem repeats or mobile element insertions. By filtering out those structural variants shared with the haploid genome reference CHM1 or found in segmental duplications, they homed in on likely functional structural variants particular to their assembly. One homozygous deletion they found eliminates the 10th and 11th exons of C1orf168, a deletion that had previously only been noted as a heterozygous change in East Asian populations and thus might be an Asian-specific structural variant.

All in all, Wang and his colleagues identified 12.8 megabases in their assembly that were not present in the human reference genome, and only about a third of that sequence could not be mapped to other Asian genome assemblies. This, the researchers said, indicates that the sequences specific to their assembly might be found in Asian populations.

The researchers also examined the transcriptome of the Chinese individual using PacBio's long-read RNA sequencing (Iso-seq). They generated four libraries of different transcript sizes and used short-read RNA sequence data as an error-correcting measure. From their Iso-seq data, the researchers predicted more than 58,000 high-quality consensus isoforms for some 30,000 loci, and noted 57 isoforms that didn't overlap with any GENCODE transcripts.

Wang and his colleagues also scoured their data for functional variants that might be of clinical relevance. While they unearthed 2,432 variants, of which 20 were predicted to be pathogenic, an allele frequency calculation determined that 18 of these 20 variants had minor allele frequencies of more than 1 percent and were thus unlikely to be highly penetrant disease-causing variants. The other two variants, the researchers noted, were listed as pathogenic in error, leading them to caution others to be careful in how they interpret pathogenic variants from databases.
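The frequency check described above can be sketched in a few lines. This is only an illustration of the general idea, assuming a simple 1 percent cutoff on minor allele frequency; the `Variant` class, the `split_by_frequency` helper, and the variant names and frequencies are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str                      # hypothetical identifier
    minor_allele_frequency: float  # fraction between 0 and 0.5


def split_by_frequency(variants, threshold=0.01):
    """Split variants into (common, rare) around a frequency threshold.

    Variants with a minor allele frequency above the threshold are
    treated as common, and so unlikely to be highly penetrant
    disease-causing variants; the rest need closer review.
    """
    common = [v for v in variants if v.minor_allele_frequency > threshold]
    rare = [v for v in variants if v.minor_allele_frequency <= threshold]
    return common, rare


# Made-up candidates flagged as pathogenic by a database.
candidates = [
    Variant("variant_A", 0.12),
    Variant("variant_B", 0.004),
    Variant("variant_C", 0.03),
]
common, rare = split_by_frequency(candidates)
print([v.name for v in common])  # frequency above 1 percent
print([v.name for v in rare])    # still needs manual review
```

A variant that passes such a filter is not thereby confirmed as disease-causing; as the researchers' two mislabeled variants show, database annotations themselves can be wrong.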

"[W]hile short-read-based alignment and variant calling based on [a] reference genome remain a common practice to assay personal genomes, de novo assembly by long-read sequencing may reveal novel and complementary biological insights," Wang and his colleagues wrote. "Furthermore, long-read RNA sequencing may identify novel transcripts that can be missed by short-read RNA sequencing."