Researchers have sequenced and de novo assembled the Drosophila melanogaster genome on Pacific Biosciences' RS II — the first time an animal genome has been sequenced and assembled solely with PacBio technology — and have produced a genome with fewer gaps and longer contigs than the current reference.
Sergey Koren, a bioinformaticist at the National Biodefense Analysis and Countermeasures Center and the University of Maryland who developed PBcR, software for error correction of PacBio reads, presented the Drosophila assembly at the International Plant and Animal Genome meeting in San Diego earlier this month.
Additionally, PacBio is planning this year to increase its throughput four-fold to achieve 1 gigabase of data per SMRT cell and average read lengths greater than 10-15 kilobases, as well as to improve sample prep and introduce new methods for assembly of diploid genomes.
The Drosophila genome, estimated to be around 140 megabases but potentially as large as 220 megabases, was sequenced in six days using 42 SMRT cells to 90-fold coverage, with average read lengths of 10 kilobases. Using the Celera assembler, the researchers constructed a haploid assembly in 128 contigs with an N50 length of 15 megabases and a maximum contig length of 24.6 megabases. Total turnaround time from sample to final assembly was six weeks.
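As a rough check on those run figures, the sketch below works out the total bases and the average yield per SMRT cell implied by 90-fold coverage over 42 cells. It assumes the coverage figure refers to the 140-megabase genome estimate, which the article does not state; the function and variable names are illustrative only.

```python
# Back-of-the-envelope sequencing yield, using figures quoted in the article.
# Assumption: "90-fold coverage" is relative to the 140 Mb genome estimate.

def total_bases(genome_size_bp, fold_coverage):
    """Total sequenced bases needed to reach the given fold coverage."""
    return genome_size_bp * fold_coverage


def yield_per_cell(genome_size_bp, fold_coverage, n_cells):
    """Average bases produced per SMRT cell."""
    return total_bases(genome_size_bp, fold_coverage) / n_cells


GENOME_SIZE_BP = 140_000_000   # lower genome-size estimate from the article
FOLD_COVERAGE = 90
SMRT_CELLS = 42

run_total = total_bases(GENOME_SIZE_BP, FOLD_COVERAGE)
per_cell = yield_per_cell(GENOME_SIZE_BP, FOLD_COVERAGE, SMRT_CELLS)

print(f"Total bases: {run_total / 1e9:.1f} Gb")
print(f"Average per SMRT cell: {per_cell / 1e6:.0f} Mb")
```

With the 140-megabase estimate this works out to 12.6 gigabases in total, or about 300 megabases per SMRT cell, the same order as the roughly 250 megabases per cell implied by a four-fold increase to 1 gigabase.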
PacBio scientists collaborated on the project with researchers involved in the Berkeley Drosophila Genome Project, and researchers from the University of Maryland and the University of Manchester.
According to Sue Celniker, co-director of the Berkeley Drosophila Genome Project, the PacBio-only assembly is a huge improvement over the reference genome, which is currently in its fifth iteration. Researchers involved in the Berkeley Drosophila Genome Project have spent over 10 years working on the reference genome using a combination of Sanger sequencing, BAC clones, and other manual and labor-intensive approaches. Yet with just one next-gen sequencing technology and in just six weeks, the PacBio approach pieced together regions that have proved particularly troublesome, such as heterochromatin and the Y chromosome, she said.
"There's been some persistent repeats that we couldn't get through, that [PacBio] did," she told In Sequence. "Having those very long reads allows you to get through large arrays of repeats."
Researchers are still evaluating and comparing the PacBio assembly to the reference, so Celniker said she could not precisely say how many of the remaining gaps the PacBio assembly was able to close.
However, it is already clear that in some cases the long reads were able to generate a more contiguous sequence than the reference. For instance, chromosome 2R was reduced to two pieces in the PacBio haploid assembly from 27 pieces in the reference. Chromosome 2L was reduced to between 4 and 6 pieces from 6 pieces, and chromosomes 3L and 3R were reduced to 1 and 3 pieces in the PacBio assembly from 22 and 15 pieces, respectively.
Additionally, in the most recent release of the Drosophila reference genome, only around 1 percent of chromosome Y is represented. While the BDGP researchers have since assembled around 7.5 percent of the Y chromosome, the team anticipates that more than half of the Y chromosome will be assembled with the PacBio data.
Part of the reason the Y chromosome is so sparsely represented in the reference genome is that the fly DNA was taken from embryos, so there was no way to know whether male or female DNA was being used, Casey Bergman, a senior lecturer in computational and evolutionary biology at the University of Manchester, told In Sequence. In the PacBio collaboration, by contrast, only male flies were used, he said.
Bergman's lab became involved with the project last summer after it released a dataset of whole-genome shotgun sequences of the Drosophila reference strain, generated using PacBio technology, as well as Illumina sequences that it used to error-correct the PacBio reads. PacBio then contacted Bergman to collaborate on generating data and doing de novo assembly using its newer sequencing chemistry.
Bergman said that this Drosophila genome validates PacBio's technology for use in de novo assembly, and shows the value of long reads. Genomes that have been assembled using short-read sequencing technology, like the panda genome, are put together in contigs that are tens of kilobases long, he said. But the Drosophila assembly has an N50 of 12 megabases. "That is chromosome-sized segments. It is what was declared finished for many genomes 10 years ago, and is of much higher contiguity and sequence quality," he said.
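N50, the contiguity measure quoted here, is the contig length at which contigs of that length or longer account for at least half of all assembled bases. A minimal sketch of the calculation follows; the contig lengths are made-up toy values chosen for illustration, not data from the Drosophila or panda assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the assembly size; the length of the
    contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy assemblies (arbitrary units, hypothetical values).
few_long_contigs = [24, 15, 12, 8, 5, 3, 2, 1]   # total 70
many_short_contigs = [2] * 35                    # same total, fragmented

print(n50(few_long_contigs))    # 15
print(n50(many_short_contigs))  # 2
```

Both toy assemblies contain the same number of bases, but the fragmented one has a far lower N50, which is the kind of difference Bergman describes between short-read and long-read assemblies.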
Short-read sequencing technology is valuable for applications like identifying genes or fragments of genes, and enables many genomes to be sequenced cost-effectively — but it doesn't give you the long-range architecture, Bergman said.
The PacBio-only assembly also has some advantages over the hybrid PacBio/Illumina assembly, Bergman said.
One problem with error correction, he said, is that Illumina technology does not sequence well through repetitive regions, so the Illumina-corrected reads in those repetitive regions are not as good. "You don't really get the gain in the regions of the genome where you need them for the long-range assemblies," he said.
Adam Phillippy of the National Biodefense Analysis and Countermeasures Center, who worked on the assembly, agreed. In theory, a hybrid assembly approach is beneficial because it combines two orthogonal technologies and can take advantage of the strengths of both, he said. And indeed, in many genomic regions, a hybrid assembly works well. But since short reads do not align well to certain regions, like repeats, it is difficult to use them for error correction in those regions.
"Short reads are notoriously hard to map against a repetitive genome," Phillippy said. "It's much easier to align long reads to long reads, so you assemble the repeats much more effectively."
Phillippy and Koren last year published a study in Genome Biology, estimating a cost of about $1,000 for de novo sequencing and assembly of microbes with PacBio technology. Additionally, the researchers compared self-correction to hybrid correction and found that self-correction was often better in terms of accuracy and contiguity.
Phillippy said that he expects these conclusions for microbial genomes to carry over to larger genomes, especially as throughput and read lengths continue to increase, and the Drosophila genome is the first evidence of that.
Further improvements
Looking ahead, Jonas Korlach, PacBio's CSO, said that the company is planning further improvements to its read lengths and throughput this year.
The company plans to increase throughput to 1 gigabase per SMRT cell and average read lengths to greater than 10-15 kilobases. The increase in read length will come from several factors, Korlach said. The company continues to study different polymerases and is working out ways to optimize the signal from the nucleotide.
For instance, in its latest sequencing chemistry, P5-C3, the company incorporated a protective scaffolding strategy, which reduces photo damage to the polymerase and enables longer reads. Korlach said that the company continues to improve upon this strategy. Additionally, the company has found that "nicks or damage to the DNA template can stall the polymerase and thereby reduce read length," so researchers are looking at ways to do more "efficient DNA damage repair during sample preparation."
Korlach added that the company is also looking at ways to improve loading efficiency, which would also increase throughput. Each SMRT cell contains 150,000 zero-mode waveguides, each of which has the potential to be occupied by a polymerase and template complex. However, the current method of loading is limited by Poisson statistics, meaning that only about one-third of the ZMWs will be occupied by exactly one polymerase-template complex, with the remainder holding either none or more than one, Korlach said. "However, we believe through improvements in loading, we can at least double the amount that we are currently loading per SMRT cell."
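The one-third ceiling follows from the Poisson distribution itself. The sketch below assumes the simplest model, in which the number of complexes landing in each ZMW is Poisson-distributed with mean lam; that model and the names used are illustrative assumptions, not PacBio's own description of its loading process.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k complexes in a ZMW when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def loading_fractions(lam):
    """Fractions of ZMWs that are empty, singly loaded, or multiply loaded."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}


ZMWS_PER_CELL = 150_000  # figure quoted in the article

for lam in (0.5, 1.0, 2.0):
    fractions = loading_fractions(lam)
    usable = fractions["single"] * ZMWS_PER_CELL
    print(f"mean {lam}: single-loaded fraction {fractions['single']:.3f}, "
          f"about {usable:,.0f} ZMWs")
```

Under this model the singly loaded fraction peaks at a mean of one complex per ZMW, where it equals 1/e, or roughly 37 percent, with about 37 percent of wells empty and 26 percent holding more than one complex. Loading more or less sample only lowers the usable fraction, which is why getting past that limit would require a different loading method rather than a different concentration.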
These improvements, which will be delivered over the year, will come in the form of software upgrades and a new sequencing kit, Korlach said. None will require a hardware upgrade or installation.