ST. PAUL, MINN. – Researchers from the University of Minnesota have developed a library preparation method for Pacific Biosciences' Sequel platform that addresses the challenge of sequencing DNA molecules that are too long for short reads but not long enough to make long-read sequencing cost-effective.
Their solution: link up the sequences to get a molecule that can be analyzed using PacBio's circular consensus sequencing protocol.
In a paper published last week in Scientific Reports, the researchers, led by Nisha Kanwar and Burckhard Seelig of the University of Minnesota, described a version of concatenation sequencing that they developed for their work in protein engineering.
Their approach uses the Golden Gate assembly method of cloning to link DNA sequences about 800 bases long. Each original sequence gets linked to four others, creating a molecule about 5 kb long. "This is the Death Valley. Illumina can't do it anymore, but for PacBio, it's wasting their power," Seelig said of sequences of that length. "Our method is not a new invention but significantly improves upon the previously published [methods]."
"This paper shows that you can improve accuracy while not sacrificing length of read," Tim Whitehead, a protein engineer at the University of Colorado who was not involved with the study, said in an email. "It is a really elegant demonstration of how library preparation can be used with existing commercial sequencers to improve error rates. This is an incredibly important problem for these sorts of directed evolution campaigns where you are sequencing variants with very high pairwise identity, so [low] error rates are crucial."
Seelig's lab focuses on trying to find new, more efficient enzymes by starting with huge libraries of trillions of DNA molecules and winnowing those down.
"When we make a library, we have no way to even analyze that library," he said. In vitro selection helps find winners, but each round takes time and money. "Ideally, you do fewer cycles and use deep sequencing at an early stage to see what you've got. It allows us to see things that are not the winners, but runners-up that for some other reason might be even more interesting."
To add to the technical challenges, Seelig is also constrained by funding. "We're an academic lab, we can afford $1,000 here and there, but not much above $15,000," he said.
These were the constraints that led Seelig and Kanwar, then a postdoc in the lab, to come up with their PacBio-based concatenation sequencing protocol. They declined to say what kind of enzyme they were working on, citing a pending publication.
But they had pools of sequences they wanted to learn about and needed a method that would let them see single-base differences, which could be important for protein function.
"This was a means to an end," Kanwar said. "We were more interested in data than building a technology at the time." Some other sequencing methods, such as one using unique molecular identifiers with Illumina NGS, required more computing power or expertise than she had access to at the time. "That's why we ended up with this route of stitching the molecules together," she said.
The method adapts concatenation sequencing, an approach developed by Roche Sequencing Solutions and published in 2017, and solves some of its issues. Seelig said his team has not sought intellectual property based on their tweaks.
Improvements included linking the fragments using overhangs so that the concatenation is directional, making each construct a defined length, and using short connector sequences. "You're wasting sequencing ability if you have really long adapters," Seelig said.
The results were good enough that the team wanted to make it available to other researchers. "The resolution that we got from our different stages of the experiment is really high," Seelig said. "We really could distinguish tiny differences between families and big differences between rounds of directed evolution."
For this, they made an additional request of their bioinformatics collaborators at the University of California, Los Angeles, Celia Blanco and Irene Chen. "We could have said, 'Just do it for us,'" Seelig said. "But we were more ambitious and asked, 'Can you write this up and make it for dummies like me?'" The result was DeCatCounter software, which is now available on GitHub. "It's user friendly for people who don't have much bioinformatics experience at all. It should be able to easily adjust this to whatever your sequencing parameters are," Seelig said.
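As a rough illustration of the kind of counting step such a tool has to perform, the sketch below splits a concatenated read at short connector sequences and tallies the recovered fragments. It is not DeCatCounter's code or interface. The connector sequences, the function names `deconcatenate` and `count_variants`, and the toy reads are all invented for the example, and exact string matching stands in for the error-tolerant matching that real reads would need.

```python
from collections import Counter

# Hypothetical connector sequences (not from the paper). Because the
# assembly is directional, they are expected in this fixed order.
CONNECTORS = ["AAGGTT", "CCTTGG"]


def deconcatenate(read, connectors):
    """Split a read at each connector, in order.

    Returns the list of fragments, or None if a connector is missing,
    so that incomplete or garbled reads can be discarded.
    """
    fragments = []
    pos = 0
    for connector in connectors:
        idx = read.find(connector, pos)
        if idx == -1:
            return None
        fragments.append(read[pos:idx])
        pos = idx + len(connector)
    fragments.append(read[pos:])  # whatever follows the last connector
    return fragments


def count_variants(reads, connectors):
    """Tally every fragment recovered from reads that split cleanly."""
    counts = Counter()
    for read in reads:
        fragments = deconcatenate(read, connectors)
        if fragments is not None:
            counts.update(fragments)
    return counts


# Toy data: one read that splits cleanly, one that lacks the second connector.
good_read = "ATGCCC" + "AAGGTT" + "ATGCCA" + "CCTTGG" + "ATGCCC"
bad_read = "ATGCCC" + "AAGGTT" + "ATGCCA"
variant_counts = count_variants([good_read, bad_read], CONNECTORS)
```

In this toy case the first read yields three fragments and the second is dropped. A defined construct length and short connectors, as described above, are what make this kind of fixed-order splitting plausible.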
Total costs included the Golden Gate assembly, library multiplexing, and PacBio sequencing on a single flow cell that had nine different barcoded DNA libraries on it, Seelig said.
His hope is that both protein engineers and other researchers could find use for it. "It's for anything 400 bp and beyond that. Most proteins are around that intermediate length," he said. "PacBio can sequence really long, but only a few proteins are really that long."
Seelig and Kanwar ran their experiment on the PacBio Sequel I, but they said the method should also work on the newer Sequel II, which would only increase efficiency.
The Sequel II, they said, enables accuracy of 99.9 percent on reads of 10 kb to 15 kb. "Using these hardware improvements in the case of our [approximately] 870 bp libraries, we would concatenate 10 instead of five genes per amplicon, leading to an additional twofold increase in throughput," the authors wrote. Overall, such advancements could increase sequencing depth sixteenfold and allow analysis of protein libraries with longer DNA sequences.
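The twofold figure in that quote follows from simple arithmetic, which the short sketch below works through. The helper names `amplicon_length` and `fold_gain` are made up for illustration, and connector length is set to zero as a simplifying assumption, since the article does not give it. The sixteenfold figure depends on factors not broken down here and is not reproduced.

```python
def amplicon_length(genes_per_amplicon, gene_bp, connector_bp=0):
    """Approximate amplicon length in bp: genes plus the connectors between them."""
    return genes_per_amplicon * gene_bp + (genes_per_amplicon - 1) * connector_bp


def fold_gain(new_genes_per_amplicon, old_genes_per_amplicon):
    """Relative gain in genes sequenced per read."""
    return new_genes_per_amplicon / old_genes_per_amplicon


GENE_BP = 870  # approximate library length quoted by the authors

five_gene_bp = amplicon_length(5, GENE_BP)   # 4350 bp
ten_gene_bp = amplicon_length(10, GENE_BP)   # 8700 bp
gain = fold_gain(10, 5)                      # 2.0
```

Under these assumptions, a 10-gene amplicon of roughly 8.7 kb would still sit below the 10 kb to 15 kb read lengths cited for the Sequel II.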
Whitehead suggested that "with some minor tweaks" the method could be compatible with other sequencing platforms, such as those from Oxford Nanopore Technologies.
"Researchers sequence a lot of short things because they know they can do it on Illumina," Kanwar said. "Maybe they're not looking at longer [reads] because of that factor. If that's the case, this would be a nice method to help change that."