AGBT: WGS with PacBio Yields High-Quality Assemblies, IDs Important Structural Variants


NEW YORK (GenomeWeb) – Pacific Biosciences' niche has primarily been whole-genome sequencing of microbial organisms or resolving complex regions of larger genomes, but as its throughput and average read length have continued to increase over the years, customers have correspondingly been using the technology for larger and larger genomes. Now, a number of customers are using the technology to de novo sequence whole human genomes, according to presentations at last week's Advances in Genome Biology and Technology meeting in Marco Island, Fla.

PacBio scientists have de novo sequenced the genome of J. Craig Venter, who said in a presentation that his company, Human Longevity, plans to use PacBio technology to generate whole-genome assemblies of 30 reference genomes. Macrogen CEO Jeong-Sun Seo presented on his use of the technology to generate an Asian-specific reference genome, and Richard McCombie, professor of human genetics at Cold Spring Harbor Laboratory, said his lab and the Ontario Institute for Cancer Research have used PacBio sequencing to de novo sequence a human breast cancer genome from a cell line. Meanwhile, users such as Gene Myers, founding director of the Systems Biology Center at the Max Planck Institute, are creating better bioinformatics tools, such as Daligner, to aid in the assembly of PacBio-generated sequence data.

During a PacBio-sponsored lunch workshop at AGBT, CEO Mike Hunkapiller said that company scientists had de novo sequenced Venter's genome to a total coverage of around 84x. Reads greater than 20kb were used as seeds for the de novo assembly and they comprised around 40x coverage of the genome.

The company used an assembly algorithm, FALCON, to generate an assembly consisting of around 3,000 primary contigs and an additional 4,761 associated contigs. The primary contig N50 was 10.4mb, while the primary contig N90 was 1.3mb. The median read length was 17kb. Hunkapiller noted that the major histocompatibility complex, a notoriously difficult-to-sequence 4-5mb region, was contained in one 9mb contig. PacBio used its latest chemistry, P6/C4, along with the bioinformatics tools Daligner and FALCON and DNAnexus' cloud platform.
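For readers unfamiliar with the contig N50 and N90 figures quoted above, the following is a minimal sketch of how such statistics are conventionally computed: contigs are sorted from longest to shortest, and the Nx value is the length of the contig at which the running total first reaches x percent of the assembled bases. The function name `nx` and the toy contig lengths are illustrative assumptions, not values or code from PacBio's assembly.

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length of the contig at which the cumulative sum of
    lengths, taken from longest to shortest, first reaches x percent
    of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


# Made-up contig lengths (arbitrary units), for illustration only.
toy_contigs = [10, 8, 6, 4, 2]
n50 = nx(toy_contigs, 50)
n90 = nx(toy_contigs, 90)
print(n50, n90)
```

A higher N50 means that more of the assembly sits in long contigs, which is why the statistic is commonly reported as a measure of contiguity.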

Previously, Hunkapiller has said that a new market for PacBio's RS II system will be customers who purchase Illumina's HiSeq X Ten machine, a system geared for extremely high-throughput whole human-genome sequencing, such as for population studies. Those customers, he has said, have realized that while Illumina's technology can generate thousands of genomes at a lower price point, there are still many regions in the genome where short-read sequencing struggles, including repetitive regions, areas of high GC content, and highly homopolymeric regions. PacBio sequencing, by contrast, can produce high-quality reference genomes.

Venter's Human Longevity was one of the first companies to invest in the HiSeq X Ten and has the capacity to sequence 100,000 genomes per year. Venter said that he plans to use the company's two RS II systems to generate "30 reference genomes of ethnographic diversity," with the goal of interpreting the human genome. The "secret to success," he said, is "long-range continuity."

Macrogen, another early HiSeq X Ten customer, also purchased PacBio instruments to complement the X Ten platform and help with generating an Asian-specific reference genome. Seo said during the workshop that Macrogen needed such a reference as part of the Asian Genome Project. Recent whole-genome sequencing studies of Asian individuals have uncovered a number of novel, Asian-specific variants that are not present in the human reference genome, he said. For instance, a 2010 Nature publication of the first whole-genome sequencing of a Japanese individual found 3mb of sequence that was not contained in the reference genome.

Phase 1 of the Asian Genome Project included the sequencing of 851 healthy individuals, and Phase 2, which is ongoing, aims to sequence 10,000 individuals from disease cohorts, including cancer, neurodegenerative, and genetic disorders. The project includes major hospitals and sequencing centers from Korea, Japan, China, and India. However, there have been major challenges due to gaps in the assembly when aligning back to the reference that are likely associated with ethnic differences. "Only a small amount of the structural variation can be detected," Seo said.

As such, he said it is critical to create an Asian reference genome. Macrogen tested de novo sequencing using PacBio technology in combination with HiSeq sequencing, PacBio technology alone, and PacBio sequencing and assembly with gap filling by BAC clones.

Seo said that the PacBio technology in combination with a BAC-based approach yielded the best results. The scientists sequenced the genome to 72x coverage with average read lengths of 13.4kb. They used Daligner and FALCON for the assembly, and generated 5,522 primary contigs and around 4,000 associated contigs, with a contig N50 of 7.3mb.

The group identified a number of unique sequences and structural variations. For instance, on chromosome 20 alone, the team identified 196 novel insertions and 260 novel deletions, with an average size of 634bp.

Some of those novel structural variants are disease-related, Seo said. For instance, the team identified a structural variant in the gene PADI4 that is associated with rheumatoid arthritis in Korean populations, but is not found in European populations. In addition, he said the group found a large insertion in the NINL gene, which is related to pigmentation. That gene is the "most differentially expressed gene between Asian and Caucasian populations," he said, and the de novo PacBio sequencing identified a novel 8kb insertion in that gene.

Seo said that one of Macrogen's future goals is to build a medical-grade reference genome.

These higher-quality medical genomes come at a cost, though. Currently, generating a "reference medical grade de novo genome" using the RS II will run around $40,000 for a 50x coverage genome, PacBio CSO Jonas Korlach told GenomeWeb. However, by the end of the year, with improvements to read length and throughput, the cost will drop to $10,000 per genome, he said.

PacBio has always positioned itself as being able to resolve complex regions, including structural variants, due to its long reads. However, at last week's AGBT, Cold Spring Harbor's McCombie presented on the use of the technology to de novo sequence a whole cancer genome for the first time. McCombie said that since the advent of next-generation sequencing, researchers have been able to sequence whole genomes at a much lower cost and get very good SNP data, but "miss a lot of structural variants." So, he wanted to see whether the long reads of PacBio could be used to "look at higher resolution at the part of the genome that we don't see even with Sanger sequencing."

The team chose a well-studied breast cancer cell line with HER2 amplification as a proof of principle to test the technology. Aside from being a very well-characterized genome, it is also very structurally altered. "It's really a mess," McCombie said.

The team de novo sequenced the genome using just the RS II and is comparing that assembly to other assemblies. The project was started in November and is an ongoing collaboration with researchers from the Ontario Institute for Cancer Research, he said, and the results presented were preliminary.

McCombie also highlighted performance improvements to the RS II system since the group started the project in November 2014. Then, the teams were achieving mean read lengths of 6.2kb and yields of 213mb per SMRT cell. Since switching to the newer chemistry though, the throughput per cell has increased five-fold to over 1gb per SMRT cell and mean reads have increased to 11.3kb.

DNAnexus performed the assembly of the genome in just 21 hours, making it the fastest genome assembled. The contig N50 was 2.56mb with the maximum contig 23.5mb.

Average coverage of the genome is 54x, McCombie said, with approximately 12x coverage contained in reads over 20kb and 50x coverage in reads longer than 10kb.
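To put the per-cell yields and the 54x coverage figure in perspective, the following rough sketch estimates how many SMRT cells would be needed at each chemistry's yield. It assumes a 3.2-gigabase genome, which is a commonly used approximation for the haploid human genome and not a figure from McCombie's talk, and it treats "over 1gb" as exactly 1 gigabase, so the second estimate is an upper bound. The function name `smrt_cells_needed` is hypothetical, and the result is not a statement of how many cells the project actually ran.

```python
def smrt_cells_needed(coverage, genome_size_bp, yield_per_cell_bp):
    """Estimate the SMRT cells needed to reach a target fold coverage.

    Uses integer ceiling division so a partial cell counts as a whole one.
    """
    total_bases = coverage * genome_size_bp
    return -(-total_bases // yield_per_cell_bp)


GENOME_SIZE_BP = 3_200_000_000  # assumption: ~3.2 Gb human genome
TARGET_COVERAGE = 54            # average coverage reported in the talk

# Yields per SMRT cell quoted in the article.
cells_old_chemistry = smrt_cells_needed(TARGET_COVERAGE, GENOME_SIZE_BP, 213_000_000)
cells_new_chemistry = smrt_cells_needed(TARGET_COVERAGE, GENOME_SIZE_BP, 1_000_000_000)
print(cells_old_chemistry, cells_new_chemistry)
```

Under these assumptions the estimate falls from roughly 800 cells to fewer than 200, which illustrates why the roughly five-fold yield improvement matters for whole-genome projects.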

As expected from the karyotyping results, the coverage per chromosome is highly variable. Coverage of chromosome 17, which contains the amplified HER2 gene, is much higher. The Illumina sequencing data show similar overall results, he said, with higher coverage of chromosome 17 due to the HER2 amplification, but coverage drops off, even within the HER2 gene, due to the presence of short repeats.

Focusing in on the HER2 gene, the PacBio technology detects recombinations and breakpoints. Illumina sequencing detected some, but not all, of the breakpoints, and also had false positives, McCombie said.

In addition, sequencing on the RS II detected some complex structural variations that were not identified from HiSeq data, such as an inverted duplication involving the HER2 gene.

"Looking more deeply, there were a series of events," McCombie said, that can be pieced together by following the detected fusions, translocations, and breakpoints, including translocations involving chromosome 17 and chromosome 8, as well as a known fusion between RARA and PKIA.

"What we think happened," McCombie explained, is that the first event was a translocation involving the HER2 gene on chromosome 17 into chromosome 8. "Chromosome 8 is incredibly amplified," he said, indicating the event was an early one. That translocation was followed by a second duplication within chromosome 8, followed by a partial duplication and inversion, and then finally, the HER2 region within chromosome 8 was duplicated one last time.

McCombie said that the team is making its data freely available. All the raw data from the HER2 cancer line is available now, and the group will soon have the whole-genome assembly and methylation analysis available. In addition, he said that the team is planning to analyze 100 individual cells.