SAN FRANCISCO (GenomeWeb) – Researchers from Pacific Biosciences have demonstrated how the company's circular consensus sequencing method can be used to de novo sequence and assemble a human genome. The team, which described the protocol in a BioRxiv preprint, sequenced a Genome in a Bottle reference sample to 28-fold coverage.
Aaron Wenger, a principal scientist of bioinformatics at PacBio, said that the company plans to offer the method as a supported protocol on the Sequel instrument in the first half of the year, but customers can also follow the steps in the paper. In addition, he said, the company plans to develop bioinformatics tools specific to this type of sequencing read and continues to work on optimizing the chemistry.
Overall, the group found that its approach led to highly accurate variant calling as well as a high-quality de novo assembly. They achieved precision and recall above 99.91 percent for SNVs, 95.98 percent for indels, and 95.99 percent for structural variants. In addition, the researchers were able to phase nearly all variants into haplotypes. The de novo assembly produced a contig N50 above 15 megabases and was 99.998 percent concordant with the Genome in a Bottle benchmark. The team was also able to correct some errors present in the GIAB dataset.
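For readers unfamiliar with the contig N50 figure cited above, the statistic can be sketched in a few lines of Python. This is a minimal illustration of the usual definition (the contig length at which the longest contigs, added together, first cover at least half of the total assembly), not the evaluation code used in the preprint. The function name `n50` and the toy contig lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in megabases), not data from the paper.
toy_contigs = [2, 8, 3, 5, 2, 4]
print(n50(toy_contigs))
```

With these toy lengths the total is 24, so the running sum crosses the halfway mark of 12 at the second-longest contig.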
"It's a very nice demonstration," said Wigard Kloosterman, an associate professor at the University Medical Center Utrecht in the Netherlands who was not involved with the study but whose lab was one of the first to sequence a human genome with Oxford Nanopore's MinIon. "This is the first paper that shows the combination of high-accuracy sequencing plus long reads, so that makes it really exciting technology," he said. The researchers also showed, he added, how this data could be used to "separate out the haplotypes of the human genome and produce separate read sets for each haplotype."
Jared Simpson, an investigator with the Ontario Institute for Cancer Research, who receives some research funding from Oxford Nanopore, said that although he has not evaluated the BioRxiv paper in depth yet, the main results show that the "CCS assembly is both accurate and highly contiguous and I think sets a new state-of-the-art for human genome assemblies."
He noted, however, that the method "required many sequencing runs and significant computational resources, so it isn't clear how widely this technique can be applied."
The CCS method described in the preprint is the same concept that has always been available for PacBio's sequencing technology. The main difference, said Wenger, is that with new chemistry launched last fall, the CCS reads are much longer.
Historically, Wenger said, CCS could not be done with DNA molecules longer than around 1 to 3 kilobases, but with the upgraded chemistry and some protocol modifications, the researchers were able to circularize DNA fragments that were, on average, 13.5 kb long.
The first step, Wenger said, was to make a standard library but to "use size fractionation to get a narrow band of molecules all approximately the same size." Typically, the goal is to get as many of the really long molecules as possible, but for the CCS protocol, the goal is to keep the molecules around the same size. In this case, the average fragment size was 13.5 kb.
The second key to enabling the longer read lengths was a pre-extension step prior to starting the sequencing reaction. In the past, read lengths were limited due to the polymerase's tendency to fall off the DNA molecule when it encountered any type of damage. With the updated protocol, a pre-extension step is performed after size-selecting DNA molecules and circularizing them with hairpin adaptors. During that step, the polymerase runs as normal prior to being imaged on the sequencer. At the end, only the molecules that still have polymerase attached are sequenced.
In the paper, the researchers let the pre-extension run for 12 hours, followed by 24 hours of sequencing per SMRT cell. Sequencing generated 89 gigabases of CCS read data, or on average 2.3 gigabases of CCS reads per SMRT cell. To sequence the genome to 28X coverage, they ran 39 SMRT cells, but they showed that the same quality could be achieved with about half the number of SMRT cells, Wenger said. On average, each SMRT cell costs about $800 to run, so a 28X genome would cost around $31,000 while a 15X genome would cost about $16,000, although SMRT cell costs vary a bit by geographic region, he added.
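The cost figures above follow from simple arithmetic, sketched here in Python. The constants are the numbers quoted in the article. The function names and the assumption that the number of SMRT cells scales linearly with target coverage are illustrative only and are not taken from the paper.

```python
import math

# Figures quoted in the article.
TOTAL_CCS_GB = 89.0       # gigabases of CCS reads from the full run
SMRT_CELLS_RUN = 39       # SMRT cells used to reach 28-fold coverage
FULL_COVERAGE = 28        # fold coverage achieved with those cells
COST_PER_CELL_USD = 800   # approximate cost to run one SMRT cell


def cells_for_coverage(target_coverage):
    """Estimate SMRT cells needed, assuming cells scale linearly with coverage."""
    return math.ceil(SMRT_CELLS_RUN * target_coverage / FULL_COVERAGE)


def run_cost_usd(cells):
    """Approximate consumables cost for a given number of SMRT cells."""
    return cells * COST_PER_CELL_USD


yield_per_cell_gb = TOTAL_CCS_GB / SMRT_CELLS_RUN
print(f"Average yield per SMRT cell: {yield_per_cell_gb:.1f} Gb")

for coverage in (28, 15):
    cells = cells_for_coverage(coverage)
    print(f"{coverage}X: {cells} cells, about ${run_cost_usd(cells):,}")
```

Under the linear-scaling assumption, a 15-fold genome needs 21 cells rather than exactly half of 39, so the estimate lands slightly above the rounded figure of about $16,000 that Wenger cited.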
Wenger noted that the firm's new 8-million-well chip, which is now in the hands of some early-access customers, would enable around eightfold higher throughput and a reduction in costs.
The researchers also evaluated bioinformatics tools for variant calling. Wenger said one advantage of the CCS method is that the process produces more accurate reads than the standard PacBio sequencing method, with individual CCS reads having an average accuracy of 99.9 percent. That enabled tools, such as GATK, which were developed for short-read sequencing, to be used to call variants.
However, those tools were built "with implicit assumptions about the error profile," Wenger said, and that profile differs between Illumina sequencing and PacBio sequencing. For PacBio reads, the dominant errors are indels, while for Illumina reads they are mismatches, which led to lower accuracy when GATK was used to call indels.
The PacBio team also worked with Google to use that company's DeepVariant, a bioinformatics tool based on deep learning that adjusts to different data types. The researchers found that when the DeepVariant tool was trained with Illumina reads, it performed poorly on the PacBio CCS reads, particularly for indels. However, training DeepVariant on the CCS reads boosted the accuracy for both SNVs and indels.
"That was a very interesting result," Wenger said. In a blog post, the Google team described the model it made for the PacBio CCS data and said that it plans to include a PacBio CCS model in its next release of DeepVariant.
Wenger said that going forward, the main focus will be on reducing cost and increasing throughput. Already, lower cost is possible simply by sequencing to a lower coverage, he said. The team showed in the paper that sequencing to 15-fold coverage yielded performance similar to that of sequencing to 28-fold coverage. In addition, he said, the team is working on increasing the average DNA fragment size, which would boost yields, adding that 20-kilobase lengths should be possible in the near future. Another improvement the team is working on is to use the chip more efficiently. The current SMRT chips have 1 million reaction chambers where sequencing takes place, but in this paper, only around 240,000 of them were used. "We're working to take advantage of the full capacity of the chip," Wenger said. The launch of the new version of the chip with 8 million sequencing chambers will also boost yield and reduce cost.
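To put the chip-usage numbers in perspective, the short Python sketch below works out the fraction of reaction chambers used in the paper and projects a per-chip yield for a larger chip. The well counts and the 2.3-gigabase average yield come from the article. The helper `scaled_yield_gb` and the assumption that yield grows in direct proportion to the number of productive chambers are hypothetical simplifications, not PacBio projections.

```python
# Figures quoted in the article.
WELLS_PER_CHIP = 1_000_000      # reaction chambers on the current SMRT chip
WELLS_USED = 240_000            # approximate chambers used in the paper
NEW_CHIP_WELLS = 8_000_000      # chambers on the new chip
YIELD_PER_CELL_GB = 2.3         # average CCS yield per SMRT cell in the paper

utilization = WELLS_USED / WELLS_PER_CHIP
print(f"Chambers used: {utilization:.0%}")


def scaled_yield_gb(base_yield_gb, base_wells, base_fraction,
                    new_wells, new_fraction):
    """Project yield, assuming it is proportional to productive chambers."""
    base_productive = base_wells * base_fraction
    new_productive = new_wells * new_fraction
    return base_yield_gb * new_productive / base_productive


# Same utilization on the larger chip: an eightfold increase in chambers.
projected = scaled_yield_gb(YIELD_PER_CELL_GB, WELLS_PER_CHIP, utilization,
                            NEW_CHIP_WELLS, utilization)
print(f"Projected yield on the larger chip: {projected:.1f} Gb")
```

At unchanged utilization, the proportional model simply multiplies the yield by eight, which matches the roughly eightfold throughput gain Wenger described for the new chip.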
Finally, Wenger said, the team is looking to take advantage of the unique data type to design new base calling algorithms. For instance, in the paper, the researchers showed that they could use haplotype information to call variants. Wenger said the next step would be to design algorithms from the ground up that "use that long-range structure to improve variant calling," which would have a particularly big impact on indel calling.
Kloosterman said there are advantages to both PacBio's and Oxford Nanopore's sequencing platforms. He does not have experience using the Sequel system, but his lab has been using Oxford Nanopore's PromethIon system for about one year. He said he is able to sequence a human genome at 30-fold coverage on one PromethIon flow cell for around €2,000 ($2,272) but cautioned that this is not a true apples-to-apples comparison since his lab uses the sequencing solely to call structural variants. "We're well aware that looking at smaller variants like indels and SNVs is still very difficult with those reads," he said. "We're all aiming for the best quality sequence and the longest reads," he added. With regard to the longest read lengths, Kloosterman cited researchers such as Matt Loose from the University of Nottingham in the UK, who have "pushed that to the limits with Oxford Nanopore," demonstrating reads that are megabases in length. "But, everyone is aware of the noisiness of those reads and the complications that has for applications like assembly and variant calling." By contrast, a "key point of this paper is that it has the best of both worlds, the long reads and accurate sequences, which gains a lot of power in terms of interrogating human genomes."