NEW YORK (GenomeWeb) – An Illumina-developed sequencing strategy that keeps long DNA molecules contiguous through indexed transposition and amplification steps is showing promise as a means of producing haplotype-resolved human genomes and improving de novo genome assemblies generally, new studies suggest.
The method, known as "contiguity-preserving transposase sequencing," or CPT-seq, was built around the observation that a DNA transposase enzyme called Tn5 — which is used to introduce adaptor and/or index sequences into DNA fragments — actually remains associated with unbroken DNA molecules after transposition, Frank Steemers, director of Illumina's Advanced Research Group, explained in an email message.
"[L]ong DNA molecules remain intact and contiguous even after transposition with adapter sequences," Steemers said, adding that these complexes can then be recombined in different ways for subsequent sample preparation steps, including further labeling.
As they reported in Nature Genetics this weekend, Steemers and his co-authors took advantage of this transposase complex characteristic with the goal of coming up with a speedy, simple, and straightforward way of accurately phasing variants in genomes in a manner that could eventually be parallelized and automated.
Indeed, their results so far suggest that CPT-seq can accurately phase de novo and compound heterozygous variants without the help of sequence data from a parent-child trio or other relatives, and without statistical information based on other individuals from the same population.
To achieve this, the researchers first label long DNA fragments with adaptor-transposase complexes. Those fragments are then diluted, recombined, and divvied into new pools before transposase enzymes are released to produce smaller DNA chunks that are labeled a second time during an indexing PCR amplification step.
In this manner, the team creates thousands of differently labeled DNA pools without using more than 96 wells at any stage of the protocol. When these pools are sequenced, meanwhile, they reveal information about the long DNA molecules from the original step that would otherwise be lost.
"[W]e developed a combinatorial indexing approach that effectively generates 10,000 'virtual' compartments from 96 physical compartments," Steemers said. "This virtualization approach reduces reagent costs, and greatly improves the robustness of the assay since we don't have to dilute down to sub-haploid genome content within each physical compartment."
"The real innovation here is the ability to have, effectively, more pools than are possible through physical constraints," agreed University of Washington genome sciences researcher Jay Shendure, a co-author on the CPT-seq haplotyping study. "The effective number of compartments increases up to nearly 10,000, while the physical number of compartments we are using in any stage of this protocol is just 96."
In a complementary study in Genome Research, Shendure and collaborators from the University of Washington and Illumina outlined an assembly technique — and newly developed software called fragScaff — that folds CPT-seq data into de novo draft genome assemblies to bolster their contiguity and completeness.
For both the haplotyping and genome assembly applications, the researchers began by separating high-molecular weight DNA fragments from a starting pool into individual wells of a 96-well plate.
Fragments in each well were then transposed using an approach that resembles the Nextera kit, Shendure explained. But rather than fragmenting the DNA immediately, the transposase enzyme was allowed to remain at a given site after the first transposition step.
"You're effectively loading or decorating these transposase complexes onto the high-molecular weight DNA without actually fragmenting it," he said.
DNA fragments marked with different adaptors could then be pooled once more, diluted, and re-distributed in a new combination into another 96 wells before releasing the transposases and performing a second round of PCR-based indexing that not only creates a distinct double label on each fragment, but also amplifies the DNA and introduces the appropriate sequencing adaptors.
"You have many high-molecular weight fragments entering the process, but any one fragment is going to get one and only one combination of indexes from the first stage and the second stage," Shendure said.
Theoretically, the number of pools with distinct barcode combinations could differ depending on the type of plate or compartment used, he explained. But given a standard 96-well format, the final number of pools with discernible barcodes is 9,216 — subsets of the genome that can be sequenced while maintaining locational information and fed into assembly or haplotyping software.
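The 9,216 figure follows directly from pairing each of the 96 first-stage indexes with each of the 96 second-stage indexes. A minimal sketch of that arithmetic is below; the function names are illustrative only and do not come from either study or from any Illumina software.

```python
from itertools import product

WELLS_PER_PLATE = 96  # standard 96-well format described in the article


def virtual_compartments(first_stage_wells: int, second_stage_wells: int) -> int:
    """Number of distinct (first index, second index) combinations."""
    return first_stage_wells * second_stage_wells


def index_pairs(first_stage_wells: int, second_stage_wells: int):
    """Enumerate every two-stage index combination as a (stage1, stage2) tuple."""
    return list(product(range(first_stage_wells), range(second_stage_wells)))


pairs = index_pairs(WELLS_PER_PLATE, WELLS_PER_PLATE)
print(len(pairs))  # 9216
```

A different plate format would simply change the two well counts passed in, which is the sense in which the number of pools could differ by compartment type.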
For their Genome Research study, for instance, Shendure and his colleagues generated CPT-seq data in an effort to augment and improve de novo human, mouse, or fly genome assemblies, significantly enhancing the contiguity and N50 contig lengths of the draft assemblies produced by shotgun and mate-pair sequencing.
They noted that the mid-range CPT-seq contiguity data offers assembly cues that are complementary to those provided by Hi-C — a source of long-range sequence information that Shendure and others described in Nature Biotechnology last year.
"Three [kilobase mate-pair library sequencing] and shotgun data will give you short range information. The Hi-C works amazingly well provided you have good-sized scaffolds or contacts going into it," Shendure said.
"It's this middle ground that I think we still need good technologies for," he added.
With that in mind, his team developed a new assembly algorithm called fragScaff that's directed at "filling the void" between short-range and long-range genome regions using additional CPT-seq data.
The software takes the set of contigs that come out of an input assembly produced from shotgun and mate-pair sequencing and uses the additional CPT sequence data to more fully define locational relationships between these contigs, Shendure explained.
From the draft de novo genome assembly and the CPT-seq information, then, the researchers "look for an unusual number of coincidences, in terms of reads mapping to the end of one contig and reads mapping to the end of another contig," he said. "What we often see are these extreme outliers, where we have way more coincidences than you'd expect by chance … We can effectively define that as a link."
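The idea Shendure describes, counting how often two contig ends turn up in the same indexed pools and flagging extreme outliers as links, can be sketched roughly as follows. This is not the fragScaff implementation: the data layout, the function names, and the simple z-score cutoff are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, stdev


def shared_pool_counts(end_pools):
    """Count pools shared by each pair of contig ends.

    end_pools maps a contig-end name to the set of pool IDs in which
    reads mapped to that end (hypothetical input format).
    """
    counts = {}
    for a, b in combinations(sorted(end_pools), 2):
        counts[(a, b)] = len(end_pools[a] & end_pools[b])
    return counts


def call_links(end_pools, z_cutoff=3.0):
    """Return pairs whose shared-pool count is an extreme high outlier.

    The z-score threshold is an illustrative stand-in for whatever
    statistic a real scaffolder would use.
    """
    counts = shared_pool_counts(end_pools)
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []  # no variation, so nothing stands out
    return [pair for pair, c in counts.items() if (c - mu) / sd > z_cutoff]
```

In this toy version, two contig ends that share 20 pools while every other pair shares none would be reported as a single link; a real data set would have a noisier background of chance coincidences to separate from true adjacency.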
At the moment, the Hi-C sequence data is layered on after CPT-seq has been used to improve a given de novo assembly. Ideally, though, Shendure noted, it would be advantageous to perform both types of analysis and assembly simultaneously.
The fragScaff software has been designed to deal fairly specifically with data from CPT-seq experiments, though Shendure pointed out that it may also be useful for dealing with sequence data produced using any approach that involves large numbers of clone dilution pools.
Though fragScaff does demand another form of sequence data, he noted that the amount of CPT-seq needed to significantly bolster an assembly's contiguity is relatively minor compared to the sequencing horsepower behind the original input assembly.
Likewise, a modest amount of additional CPT-seq makes it possible to obtain haplotyping information that would otherwise require more complicated computational or experimental phasing.
When using CPT-seq data for phasing in their current experiments, Illumina's Steemers said, he and his team are generating roughly 60 billion bases of extra sequence to make haplotyping sense of a variant call file produced from a human genome sequenced to the standard depth of 30-fold coverage, which itself requires roughly 100 billion bases of Illumina sequence.
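To put those figures on a common scale, the base counts can be converted to approximate fold coverage. The sketch below assumes a haploid human genome size of about 3.1 billion bases, a value that is not stated in the article, so the results are rough.

```python
HUMAN_GENOME_BASES = 3.1e9  # assumption: approximate haploid human genome size


def fold_coverage(sequenced_bases: float, genome_bases: float = HUMAN_GENOME_BASES) -> float:
    """Average depth of coverage implied by a given amount of sequence."""
    return sequenced_bases / genome_bases


baseline = fold_coverage(100e9)  # sequence behind the 30-fold genome
extra = fold_coverage(60e9)      # additional CPT-seq sequence for phasing
print(round(baseline, 1), round(extra, 1))
```

Under that assumption, the extra CPT-seq sequence amounts to roughly 19-fold coverage, or about 60 percent on top of the sequence already generated for the variant calls.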
In their proof-of-principle experiments in Nature Genetics, the researchers were able to achieve haplotyping blocks spanning up to a million bases or so, with around one to two "long-switch" errors occurring per 10 million bases, on average.
Their experiments using DNA from a HapMap trio with well-characterized haplotype patterns suggest that CPT-seq in combination with variant call files from genome sequence data can phase more than 95 percent of SNPs genome-wide.
A fraction of variants are still missed, which the study's authors attributed to factors such as biases introduced during PCR amplification, other stages of the library prep process, or sequencing itself.
Steemers noted that the Illumina-developed CPT-seq method has been patented and is available for potential commercial development. The price tag for CPT-seq is still somewhat unclear since the method is currently done through a research protocol, he said, but it is expected to remain relatively inexpensive given the simplicity of each step in the protocol.
Data generated from CPT-seq experiments can be fed either into custom haplotyping software or into software such as fragScaff for genome assembly. Shendure's team has so far had limited success in trying to combine the two applications into a single analysis, though he hopes to see further experimental and analytical development aimed at outputting haplotype-resolved diploid de novo assemblies from an amalgamated pipeline.
In the nearer future, Steemers said the team is collaborating with several groups to test and validate the use of CPT-seq for phasing genomes in the clinical research setting. They are also interested in further streamlining the CPT-seq workflow and coming up with additional applications for the approach such as phasing for particular parts of the genome.
"We believe that the method is applicable to targeted phasing applications as well," Steemers said, adding that "[t]his could be of great interest to the clinical research market, particularly in [human leukocyte antigen] and [Illumina's VeraCode ADME panel] sequencing."