PacBio Users Report Progress in Long Reads for Plant Genome Assembly, Tricky Regions of Human Genome

Pacific Biosciences and several users recently reported on progress with the company's long-read sequencing technology in de novo and hybrid assemblies of microbes and plants, as well as applications in analyzing difficult-to-sequence regions of the human genome.

At last month's Advances in Genome Biology and Technology conference in Marco Island, Fla., researchers from Cold Spring Harbor Laboratory discussed how they have used the PacBio RS to sequence and generate assemblies of duckweed, flatworm, and rice genomes. Researchers at Mt. Sinai Hospital, meanwhile, demonstrated that a human genome can be sequenced with the PacBio at between 10-fold and 15-fold coverage, and said they are working on ways to incorporate PacBio in clinical workflows in order to better assess repetitive regions of the genome and large structural variations.

At the conference, Jonas Korlach, PacBio's chief scientific officer, also presented data that the company has generated in collaborations with the Netherlands' National Institute for Public Health and the Environment, the US Food and Drug Administration, and the Centers for Disease Control and Prevention on de novo assembly of pathogens such as Bordetella pertussis and Listeria and Salmonella species.

At AGBT, Eric Antoniou, manager of Cold Spring Harbor Laboratory's sequencing center, discussed how his team used the RS to sequence duckweed, a potential new biofuel, comparing PacBio's C2 chemistry with its recently launched XL chemistry.

Sequencing with the XL chemistry increased the mean read length from 2,645 bases to 4,695 bases and the maximum read length from 16 kilobases to 22 kilobases. The share of long reads also rose: more than 50 percent of the sequence generated with the XL chemistry was contained in reads longer than 5,035 bases, compared with a corresponding threshold of 3,863 bases for the C2 chemistry.

"Looking at data from each SMRT cell," he noted, "even the smallest maximum subread length is longer than the largest from the C2 chemistry."

In a separate project at CSHL between Greg Hannon's laboratory and Mike Schatz's laboratory to sequence the 700-megabase flatworm genome, Antoniou said that researchers compared sequence data generated from two different XL libraries made on the same day by different lab technicians, using the same library kit.

One library generated "the longest read I've ever gotten," at just over 26 kilobases, Antoniou said, while the other library's maximum read length was around 21 kilobases.

Additionally, library number two had a greater number of long reads, with over half of the sequence data contained in reads at least 5,122 bases long, compared with the first library, whose N50 read length was 4,640 bases.
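The "over half of the sequence data contained in reads at least N bases long" figure is the read-length N50. As a rough illustration of how such a number is derived, the sketch below computes it for a small set of made-up read lengths; the function name and the example values are hypothetical and are not data from Antoniou's presentation.

```python
def n50(lengths):
    """Return the read-length N50: the largest length L such that reads
    of length >= L together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until
    # the running sum reaches half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths, in bases (illustration only)
example_reads = [2000, 3000, 4000, 5000, 6000]
print(n50(example_reads))  # 5000
```

In this toy example the two longest reads (6,000 and 5,000 bases) already hold 11,000 of the 20,000 total bases, so the N50 is 5,000 bases.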

The key difference between the two libraries was that in the library with longer read lengths, no fragments under 1 kilobase were included, Antoniou said, illustrating that it is "worth spending time making your library."

The labs also compared an Illumina-only assembly to a PacBio-only assembly for the flatworm genome, which is "very complex with large duplications and repetitive rearrangements," Antoniou said.

The team generated 40-fold coverage of the genome with Illumina reads and 7-fold coverage of the genome with error-corrected PacBio reads. The N50 contig size with the PacBio-only reads increased to 7 kilobases from 1.7 kilobases with only Illumina reads, Antoniou said.

Antoniou also showed data the laboratory has previously presented of its work on the 430-megabase rice genome, around 40 percent of which consists of repetitive regions.

Compared to the C2 chemistry, sequencing with XL chemistry increased the mean read length to 3,241 bases from 2,392 bases and nearly doubled the maximum read length to 20 kilobases. Forty-eight percent of the error-corrected reads were longer than 5 kilobases, a 77 percent increase from the C2 chemistry result, Antoniou said.

The lab has also tested a number of different assembly techniques, including hybrid PacBio and Illumina assemblies and Illumina-only assemblies with multiple libraries with different insert sizes.

An Illumina-only assembly created from libraries with different insert sizes and the AllPaths assembly algorithm produced an N50 contig size of just over 18 kilobases, while an Illumina/PacBio hybrid assembly doubled the N50 contig size to just over 36 kilobases.

Tackling Pathogens

Also at AGBT, PacBio's Korlach presented on a number of applications for which the PacBio is particularly useful, including the sequencing of B. pertussis, which causes whooping cough. The genome consists of one circular chromosome around 4 megabases in size and around 10 percent of it is in repeat elements.

Prior to sequencing B. pertussis with PacBio, there were only two assembled genomes for the pathogen, one of which required more than 130,000 Sanger reads and another that used a combination of 454 sequencing and Sanger sequencing, which assembled the genome into 300 contigs and still required over 10,000 additional Sanger reads to fill in gaps, Korlach said.

Having a more complete picture of the B. pertussis genome is particularly important because whooping cough has been on the rise, said Korlach, and even though there is a vaccine available, it is only effective in 80 percent of cases and there has recently been an "emergence of vaccine escape strains."

In a project with Frits Mooi at the Netherlands' National Institute for Public Health and the Environment, PacBio has sequenced nine strains of the pathogen and done de novo assemblies using only PacBio reads and the company's HGap assembly tool.

For each strain, the company used between four and eight SMRT cells and assembled the genome into one contig. Included in the nine were some of the first vaccine escape strains, Korlach said.

"Now we have a detailed view of large structural variants that cause differences in these strains," Korlach said in his presentation.

Looking at the virulence genes also illustrates large differences between the strains, and by constructing phylogenetic trees, it will be possible to understand the relationships between the different strains and how they cluster and have evolved. The phylogenetic tree also showed that the two strains that had previously been sequenced were "quite far" from the main group, Korlach said.

Additionally, in the first two strains sequenced with Sanger sequencing, four mobile elements were identified, and PacBio sequencing identified five additional mobile elements in the nine strains.

"The de novo approach highlights that there's quite a diversity of different phage and prophage elements that are present in these different strains," Korlach said.

The company has also sequenced several Salmonella strains, including one for an FDA study to test how quickly complete genomic information could be generated from an ongoing outbreak, Korlach said.

From a Salmonella outbreak that occurred in Arizona last October, the company was able to sequence a clinical isolate and generate a complete assembly in less than one week, identifying that the genome was contained on one chromosome with two plasmids containing "never before seen sequence," Korlach said.

Those novel sequences can now be used to create diagnostics specific for that particular outbreak, he added.

Aside from the Arizona isolate, the company sequenced and assembled seven other Salmonella strains and also generated epigenomic information from each of the strains, identifying a "large amount of diversity" between the methylomes of the strains. In one strain, the team identified three different methylation types, as well as a relatively new DNA base modification, phosphorothioation, "where one of the non-bridging oxygens in the backbone of the DNA is modified to a sulfur atom," Korlach said. While not much is known about the function of these types of modifications, there are indications that they may be involved in oxygen stress response, he added.

In a separate collaboration with the CDC and the FDA, PacBio sequenced 16 strains of Listeria and identified similar differences between the strains' methylomes, including an "interesting signature" in three of the strains that involved a modified thymine, Korlach said.

Human Genomes

While the primary applications of the PacBio RS to date have been on microbial and other smaller genomes, some researchers are starting to move into human genomes.

Eric Schadt, director of the Institute for Genomics and Multiscale Biology at Mt. Sinai Hospital and former chief scientific officer at PacBio, is looking to implement PacBio sequencing into Mt. Sinai's clinical sequencing pipeline. The hospital's core sequencing facility is equipped with two HiSeq 2000s, two HiSeq 2500s, a MiSeq, a PacBio RS, and an Ion Proton. The core facility shares a CLIA license with the Genetic Testing Laboratory at Mt. Sinai.

In a presentation at AGBT, Schadt said that he has sequenced the human genome on the PacBio, generating around 12 million reads and over 10-fold mapped coverage. The mean read length was 4,066 bases, with a mean subread length of 2,766 bases. Error-corrected reads had an accuracy of more than 99 percent.
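As a back-of-the-envelope check on those figures, fold coverage is simply total sequenced bases divided by genome size. The sketch below applies that to the reported read count and mean read length; the 3.1-gigabase human genome size and the function name are assumptions added here for illustration, not numbers from Schadt's talk.

```python
def fold_coverage(num_reads, mean_read_length, genome_size):
    """Estimate raw fold coverage as total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


# Assumption: approximate haploid human genome size, in bases
HUMAN_GENOME_BASES = 3.1e9

# Reported in the talk: around 12 million reads, mean read length 4,066 bases
raw = fold_coverage(12_000_000, 4066, HUMAN_GENOME_BASES)
print(f"{raw:.1f}x")  # 15.7x
```

This raw estimate is an upper bound. Mapped coverage is lower because not every sequenced base aligns to the reference, which is in line with the "over 10-fold" mapped figure quoted above.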

Moving forward, he said the lab is looking to implement PacBio sequencing into its clinical workflow.

For instance, the lab currently does carrier screening on an Illumina genotyping platform. While the lab has been working on a next-gen panel for this test, there are around 20 genes that have repeat expansions as a pathogenic marker, which make them difficult to test with short-read sequencing.

Schadt said that the lab is interested in "complementing [the next-gen panel] with long-read data and developing a hybrid panel." In a test of the PacBio on some of the problematic genes, such as the Huntington's disease gene, the "repeats are completely spanned," he said. In the gene CACNA1A, which is associated with neurologic disorders, PacBio sequencing covered two repeat expansions 1 kilobase apart within a single read.

Schadt said that the lab has also tested PacBio's ability to sequence through other tri-nucleotide repeats within the human genome. Of 10,000 tri-nucleotide repeats that consist of at least 50 repeated units, 10-fold coverage of the genome with PacBio covered 84 percent of them, he said.

PacBio sequencing was also able to identify a 500-base "drop out" in the major histocompatibility complex region that produced mapping artifacts with Illumina sequencing, as well as areas of heterozygosity in the MHC locus that were missed by Illumina and 454 sequencing.

Schadt said that the lab is now close to gaining regulatory approval for a hybrid Illumina/PacBio exome sequencing test, which would include exome sequencing on the Illumina with a targeted capture of 5 megabases to be done on the PacBio that would include difficult-to-sequence regions.