NEW YORK (GenomeWeb) – Researchers from Pacific Biosciences and their collaborators have developed two algorithms to assemble long-read sequencing data into phased diploid genomes.
The investigators, led by Johns Hopkins University's Michael Schatz, tested their open-source algorithms FALCON and FALCON-unzip on Arabidopsis thaliana sequences. As they reported in Nature Methods this week, they found that their approach yielded contiguous, complete, and accurate phased genome assemblies. They then applied their approach to the trickier-to-assemble genomes of the Cabernet Sauvignon variety of Vitis vinifera and the coral fungus Clavicorona pyxidata.
"The new genomic information that will be generated with this approach will accelerate the development of new diseaseresistant wine grape varieties that produce highquality, flavorful grapes and are better suited to environmental changes," the University of California, Davis' Dario Cantu, who led the wine part of the study, said in a statement.
FALCON is a diploid-aware long-read assembler that error-corrects raw reads and computes an initial assembly using a string graph of read overlaps. The string graph includes sets of haplotype-fused contigs as well as bubbles that represent regions that differ between homologous sequences, the researchers noted. FALCON-unzip is a haplotype-resolving tool that refines those contigs to provide a set of phased primary contigs and associated haplotype contigs, or haplotigs.
Schatz and his colleagues evaluated their approach using three Arabidopsis genomes: the inbred lines Col-0 and Cvi-0, and a hybrid of those two lines. After generating long-read sequencing data, they assembled each of these Arabidopsis genomes using FALCON.
They reported that their assemblies had contig N50 sizes of 7.4 Mb for Col-0 and 6.0 Mb for Cvi-0 — some 10 times to 100 times more contiguous than recently published assemblies and approaching the contiguity of the TAIR10 assembly, which was generated through BAC sequencing.
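For readers unfamiliar with the metric, contig N50 is the length at which contigs of that size or longer account for at least half of the assembled bases. A minimal Python sketch of the calculation follows; the contig lengths are made up for illustration and are not data from the study.

```python
def n50(lengths):
    """Return the contig N50 for a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in Mb (total 25 Mb).
example_lengths = [8, 7, 5, 3, 2]
print(n50(example_lengths))
```

Because the metric rewards a few long contigs, it is usually read alongside a completeness measure rather than on its own.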
The researchers also gauged the completeness of their assemblies using the BUSCO software to identify highly conserved plant orthologs. BUSCO, they reported, could identify 95.6 percent of genes in their Col-0 assembly and 94.8 percent of genes in their Cvi-0 assembly, compared with 95.7 percent in the TAIR10 reference. This suggests that the team's new assemblies are nearly as complete as the reference.
Additionally, by aligning primary contigs and haplotype contigs from the hybrid assembly to the parental assemblies, the researchers found that most haplotigs showed SNPs or structural variants relative to only one of the parental genomes, suggesting their phasing approach was accurate.
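The logic of that check can be sketched in a few lines of Python. This is an illustration only, not the researchers' pipeline: the `HaplotigCounts` record, the `tolerance` cutoff, and the variant counts are all hypothetical, and the sketch assumes variants have already been counted from alignments to each parental assembly.

```python
from dataclasses import dataclass


@dataclass
class HaplotigCounts:
    """Hypothetical per-haplotig summary; not a FALCON-unzip output format."""
    name: str
    variants_vs_parent_a: int  # SNPs plus structural variants vs. one parent
    variants_vs_parent_b: int  # same count vs. the other parent


def assign_parent(h, tolerance=0):
    """Label a haplotig by which parental assembly it matches.

    `tolerance` is an assumed cutoff for how many variants still count
    as a match; the study's actual criteria are not given here.
    """
    matches_a = h.variants_vs_parent_a <= tolerance
    matches_b = h.variants_vs_parent_b <= tolerance
    if matches_a and not matches_b:
        return "parent_a"
    if matches_b and not matches_a:
        return "parent_b"
    if matches_a and matches_b:
        return "uninformative"  # identical to both parents in this region
    return "mixed"  # differs from both: possible phase switch or error


def fraction_single_parent(haplotigs, tolerance=0):
    """Share of informative haplotigs that match exactly one parent."""
    labels = [assign_parent(h, tolerance) for h in haplotigs]
    informative = [lab for lab in labels if lab != "uninformative"]
    if not informative:
        return 0.0
    consistent = sum(lab in ("parent_a", "parent_b") for lab in informative)
    return consistent / len(informative)


# Made-up counts for illustration only.
demo = [
    HaplotigCounts("h1", 0, 42),
    HaplotigCounts("h2", 17, 0),
    HaplotigCounts("h3", 5, 9),
    HaplotigCounts("h4", 0, 0),
]
print(fraction_single_parent(demo))
```

A haplotig that differs from both parents is the warning sign here, since a correctly phased sequence from a hybrid should trace back to a single parental line.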
Schatz and his colleagues then applied their assembly approach to the highly heterozygous and repeat-heavy V. vinifera cv. Cabernet Sauvignon, which is a cross of the Cabernet Franc and Sauvignon Blanc cultivars. Using both a BUSCO-based approach and an alignment to the V. vinifera reference genome, the researchers reported that their assembly was fairly complete. They noted that V. vinifera harbored homologous regions with high variation rates, which they attributed to the fact that it's outcrossed.
They similarly used their approach to assemble the genome of the heterozygous diploid fungus, C. pyxidata, which grows on hardwood trees across North America. It has been resistant to short-read assembly efforts, the researchers said, but their approach yielded a rather contiguous assembly. The researchers also uncovered regions of low heterozygosity within the C. pyxidata genome, which they said could be due to selective pressures or the result of inbreeding.
Davis' Cantu noted that the new data on the wine grapes will not only help researchers understand what traits Cabernet Sauvignon inherited from each parental line, but also help them breed new vines with improved traits. Traits that could enable vines to better withstand the drought and heat that are expected to worsen with climate change could be especially important, he said.