NEW YORK (GenomeWeb) – A team of researchers has introduced a new method for de novo human genome sequence assembly and phasing that combines short-read sequencing with linked-read sequencing and genome mapping.
As reported today in Nature Methods, the team, which includes researchers from the University of California, San Francisco, BioNano Genomics, and 10X Genomics, expanded on a method described last year by Icahn School of Medicine at Mount Sinai's Matthew Pendleton and his colleagues.
But rather than rely on long-read sequences from Pacific Biosciences as Pendleton et al. did, the UCSF-led team swapped that part of the procedure for linked-read data from 10X Genomics. 10X data has previously been shown to be useful for determining human germline and cancer genome haplotypes.
As a pilot study, UCSF's Pui-Yan Kwok and colleagues used the method to sequence and assemble the genome of an individual from the HapMap Project, and found that their approach performed about as well as or better than others.
As the researchers noted in their paper, high-quality human genome assemblies have been hampered by the repetitiveness of the human genome and its similarity to other eukaryotic genomes, its diploid nature, and the lack of low-cost DNA sequencing platforms that can generate accurate long reads.
"With this proof-of-principle study, we have shown that these limitations can be overcome by using three complementary sets of mapping-sequencing data that can be generated in parallel in a short time by an average laboratory at reasonable cost," Kwok and colleagues wrote.
The new method described by Kwok and colleagues relies on two parallel tracks. In one, Illumina sequence reads are assembled into scaffolds using the SOAPdenovo short oligonucleotide analysis software. To order those scaffolds into longer blocks, researchers fold in sequence data generated on the 10X GemCode platform and use the program fragScaff to generate a new scaffold. At the same time, they generate a sequence motif physical map using BioNano Genomics' Irys System, which is then combined with the 10X scaffold to yield a final hybrid assembly map.
The hybrid-assembled scaffolds are then phased using 10X's Long Ranger software, with the BioNano Genomics maps helping to resolve some repetitive regions.
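The order of steps described above can be sketched as a small dependency graph. This is only an illustration of the workflow as reported here, not the researchers' actual pipeline code: the step labels are made-up names, no real command-line invocations of SOAPdenovo, fragScaff, or Long Ranger are shown, and the sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# Step names are illustrative labels, not real commands or file names.
PIPELINE = {
    "illumina_reads": set(),
    "10x_linked_reads": set(),
    "bionano_genome_map": set(),
    "soapdenovo_scaffolds": {"illumina_reads"},
    "fragscaff_scaffolds": {"soapdenovo_scaffolds", "10x_linked_reads"},
    "hybrid_assembly": {"fragscaff_scaffolds", "bionano_genome_map"},
    "long_ranger_phasing": {"hybrid_assembly"},
}


def execution_waves(graph):
    """Group steps into waves; steps within one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

In this sketch the three data sets land in the first wave, which mirrors the researchers' point that the mapping and sequencing data can be generated in parallel.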
As a pilot study, the researchers used this approach to assemble and phase the human HapMap sample NA12878.
The initial Illumina assembly of NA12878 yielded more than 14,000 scaffolds with a scaffold N50 of 0.59 Mb, while the 10X and the BioNano Genomics maps had fewer scaffolds and higher scaffold N50 values. The hybrid assembly, meanwhile, contained just 170 scaffolds and had an N50 size of 33.5 Mb, which the researchers said was a 57-fold improvement over the initial Illumina assembly.
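For readers unfamiliar with the statistic, the scaffold N50 is the length at which scaffolds of that size or longer account for at least half of the total assembled length. The following is a minimal sketch of that calculation using an invented toy list of scaffold lengths, not data from the study; it also reproduces the fold-improvement arithmetic from the two N50 values quoted above.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    The N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths (arbitrary units), invented for illustration only.
    toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
    print("toy N50:", n50(toy_scaffolds))

    # N50 values reported in the article, in megabases.
    illumina_n50_mb = 0.59
    hybrid_n50_mb = 33.5
    print("fold improvement:", round(hybrid_n50_mb / illumina_n50_mb))
```

A larger N50 with fewer scaffolds, as reported for the hybrid assembly, indicates that more of the genome sits in long contiguous pieces.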
In addition, the researchers reported a median phase block size of 4.7 Mb and said that some 2.8 million SNVs, or 97.2 percent, could be phased.
By comparing against the reference genome, the researchers found that their assembly was more accurate than the ALLPATHS-LG assembly of this genome that was published in 2011 and was 95.2 percent comparable with the Pendleton et al. approach. In addition, they noted that 95.7 percent of all exons were present in their new assembly.
While Kwok and colleagues said their approach represents an improvement upon others, they noted that it has a number of limitations.
For instance, they noted that because the 10X approach relies on high-molecular-weight DNA preparations, using archival samples might not be possible. In addition, linked reads are generated through random k-mer amplification of 50-kilobase to 100-kilobase molecules, and some regions of those molecules might not be amplified, leaving N-base gaps. To minimize these gaps, multiple sequencing libraries of various sizes would have to be generated, increasing the amount of work involved.
They also pointed out a few potential improvements. The contig and scaffold N50 lengths could be increased, they said, by using a larger range of insert sizes, and 10X sequence data could be used to extend contigs or fill in gaps between neighboring contigs.