BGI Team Generates Haplotype-Resolved De Novo Human Genome Assembly

NEW YORK (GenomeWeb) — A team of researchers from BGI-Shenzhen has generated a haplotype-resolved de novo assembly of the genome of an Asian man, which they say is the most complete de novo assembly of an individual's genome to date.

For the assembly, the scientists, led by BGI's Gane Ka-Shu Wong and Jun Wang, combined whole-genome shotgun and fosmid pool sequencing, using next-generation sequencing technology from Illumina and BGI's Complete Genomics. In the 5.15-gigabase assembled genome, they identified a large number of previously undetected insertions and deletions, as well as almost 7.5 megabases of novel coding sequence, including at least six predicted genes. Their results were published online in Nature Biotechnology this week.

Previous projects, including work by groups at the University of Washington and Stanford University, the Max Planck Institute for Molecular Genetics, and Complete Genomics, have already demonstrated long-range haplotyping of human genomes using NGS technology. But those studies mapped sequence data to an existing reference genome, which limited their ability to detect intermediate-sized indels, complex and long structural variants, and novel sequences, the BGI scientists wrote.

Their new assembly, they suggested, along with other genomic data available for the same individual, a Han Chinese dubbed YH whom BGI first sequenced in 2008, could serve as a reference standard for developing new sequencing and assembly techniques and for functional studies of RNAs and proteins, similar to the standards developed by the National Institute of Standards and Technology's Genome in a Bottle Consortium. NIST released its first human DNA reference material for whole-genome variant assessment — from a HapMap individual from Utah — earlier this month.

For its project, the BGI team generated about 600,000 fosmid clones from YH DNA, which they sequenced in pools of 33 fosmids using the Illumina HiSeq 2000, generating 1,700 gigabases of sequence data in total. Because only a small fraction of the genome was present in each pool, the probability of two fosmids from the same genomic region ending up in the same pool was very small, so the sequence assembled from each pool is effectively haploid.
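The pooling argument can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the fosmid insert length of 40 kilobases and the genome length of 3.1 gigabases are assumptions, not figures from the article, and the function name is hypothetical. It treats fosmid start positions as uniformly random and pairs as independent.

```python
def pool_overlap_probability(fosmids_per_pool=33,
                             fosmid_len=40_000,            # assumed typical insert size
                             genome_len=3_100_000_000):    # assumed haploid genome size
    """Approximate chance that at least one pair of fosmids in a pool overlaps."""
    # Two fragments of length L overlap when their start positions lie
    # within L of each other, which happens with probability of about 2L/G.
    p_pair = 2 * fosmid_len / genome_len
    # Number of distinct fosmid pairs in one pool.
    n_pairs = fosmids_per_pool * (fosmids_per_pool - 1) // 2
    # Treat pairs as independent (an approximation).
    return 1 - (1 - p_pair) ** n_pairs


if __name__ == "__main__":
    print(f"{pool_overlap_probability():.4f}")
```

Under these assumptions, the chance that any pair overlaps is on the order of 1 to 2 percent per pool, and the chance for a given fosmid is far lower, which is consistent with the claim that collisions are rare.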

They also generated about 330 gigabases of whole-genome shotgun sequence data on the Complete Genomics platform and used about 200 gigabases of Illumina HiSeq 2000 data from previously constructed libraries with short and long inserts. In addition, they used 67 gigabases of transcriptome data and 105 gigabases of methylation data from previous studies.

From the whole-genome shotgun data, they generated a de novo assembly of the YH genome. Separately, they assembled the fosmid data, one pool at a time, resulting in 23 gigabases of fosmid-assembled haploid (FAH) sequences, which they used to improve the completeness and continuity of the draft assembly. This resulted in a reference genome, called YHref.

Next, using heterozygous SNP markers, they phased the FAH sequences into two haplotype groups and merged them with overlapping sequences from YHref to construct a haplotype-resolved diploid genome (HDG) sequence.
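To illustrate the general idea of sorting haploid fragments into two groups by their alleles at heterozygous SNPs, here is a toy greedy sketch. It is not the BGI pipeline: the data layout (each fragment as a dictionary mapping a SNP position to allele 0 or 1) and the function name `phase_fragments` are assumptions made for this example.

```python
def phase_fragments(fragments):
    """Greedily assign haploid fragments to haplotype group 0 or 1.

    fragments: list of dicts mapping heterozygous SNP position -> allele (0 or 1).
    Returns (groups, hap_a), where hap_a holds the alleles inferred for
    haplotype 0; haplotype 1 is its complement at every position.
    """
    hap_a = {}
    groups = []
    for frag in fragments:
        # Count SNPs where the fragment matches or contradicts haplotype 0.
        agree = sum(1 for pos, allele in frag.items()
                    if hap_a.get(pos) == allele)
        disagree = sum(1 for pos, allele in frag.items()
                       if pos in hap_a and hap_a[pos] != allele)
        group = 0 if agree >= disagree else 1
        groups.append(group)
        # Extend haplotype 0 at positions seen for the first time,
        # flipping the allele if the fragment belongs to haplotype 1.
        for pos, allele in frag.items():
            hap_a.setdefault(pos, allele if group == 0 else 1 - allele)
    return groups, hap_a


if __name__ == "__main__":
    frags = [{1: 0, 2: 1}, {2: 0, 3: 0}, {3: 1, 4: 1}, {1: 1, 2: 0}]
    print(phase_fragments(frags))
```

A fragment that shares no SNP with anything seen so far is placed in group 0 arbitrarily here; a real phasing method would instead start a new phase block, and would also have to tolerate sequencing errors.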

The HDG comprises 5.15 gigabases of sequence, with a haplotype N50 of 484 kilobases, and has an estimated error rate of 8 × 10⁻⁵. In it, the researchers identified 3.27 million SNPs, 745,000 short indels, 18,000 intermediate indels, 13,000 long indels, 111 inversions, and 167 translocations.
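For readers unfamiliar with the N50 statistic, the following minimal sketch shows the standard definition: the length of the shortest block in the smallest set of longest blocks that together cover at least half of the total length. The function name and the example lengths are illustrative, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest block down, accumulating length until
    # at least half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))  # prints 5
```

A haplotype N50 of 484 kilobases therefore means that half of the phased sequence sits in haplotype blocks at least that long.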

About 7 percent of the SNPs and 59 percent of the short indels are novel, meaning they appear neither in 1000 Genomes Project data nor in dbSNP. Most of them lie in repetitive regions of the genome. Based on a validation of a subset of the novel variants, about three-fifths of the SNPs and a little more than half of the indels are likely to be real.

About 40 percent of variants detected only in the haplotype-resolved variant set appear to be false positives, owing to biases in fosmid coverage, systematic sequencing errors, and repetitive sequences. Repeats are "still the biggest challenge for de novo assembly of short-read data," the authors noted.

Conversely, the haplotype-resolved data also missed a number of variants, many in regions that were not well covered by the fosmid pools, and others that were inadvertently removed by quality control measures.

In the future, the researchers wrote, their pipeline could be applied to the de novo assembly of other types of genomes, including polyploid genomes and those with high levels of heterozygosity. It could also be adapted to include data from emerging technologies that do not require fosmids, such as Complete Genomics' long fragment reads.