NEW YORK (GenomeWeb) – Baylor College of Medicine and Harvard Medical School researchers have developed an approach that can assemble large genomes with a reported 99 percent accuracy.
While genome assembly approaches have been able to produce scaffolds that are megabases in length, they typically haven't been able to produce scaffolds that encompass the full lengths of chromosomes, Baylor College of Medicine's Erez Lieberman Aiden and his colleagues noted. But by combining Hi-C contact data with draft genome assemblies, they reported in Science today that they have been able to generate chromosome-length scaffolds.
In particular, they applied this approach to assemble a human genome with 23 scaffolds that corresponded to the 23 human chromosomes and spanned 99 percent of their length. They also assembled the genomes of two mosquitoes that serve as disease vectors.
"The ability to rapidly and reliably generate genome assemblies with chromosome-length scaffolds should accelerate genomic analysis of many organisms," the researchers wrote in their paper.
Hi-C measures how often different loci make contact with one another, which provides insight into how genomes are folded. But because the distance between loci along a chromosome affects how often they make contact, the researchers reasoned that Hi-C data could also help them figure out which loci lie near one another along the length of a chromosome, and that they could use that information to order contigs.
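The reasoning above, that contact frequency falls off with distance and can therefore reveal contig order, can be illustrated with a toy sketch. It is not the researchers' algorithm: the contact model (frequency equal to one over the distance), the contig names, and the brute-force search are all assumptions chosen only to keep the example small.

```python
from itertools import permutations


def toy_contacts(true_order):
    """Hypothetical contact model: frequency = 1 / distance along the chromosome."""
    pos = {name: i for i, name in enumerate(true_order)}
    return {
        (a, b): 1.0 / abs(pos[a] - pos[b])
        for a in true_order
        for b in true_order
        if a != b
    }


def order_by_contacts(names, contacts):
    """Pick the ordering whose neighboring contigs share the most contacts.

    Brute force over all permutations, so only usable for a handful of contigs.
    An ordering and its reverse score the same, so the lexicographically
    smaller of the two is returned.
    """
    def score(order):
        return sum(contacts[(a, b)] for a, b in zip(order, order[1:]))

    best = max(permutations(names), key=score)
    return min(best, best[::-1])


# Five made-up contigs whose true order is hidden from the ordering step.
true_order = ["C", "A", "E", "B", "D"]
contacts = toy_contacts(true_order)
print(order_by_contacts(sorted(true_order), contacts))  # ('C', 'A', 'E', 'B', 'D')
```

Real Hi-C data are noisy and a genome has tens of thousands of contigs, so an exhaustive search like this would not scale; the sketch only shows why distance-dependent contact frequencies carry ordering information.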
Lieberman Aiden and his colleagues first used Hi-C data to find and fix errors within initial assemblies. As they detailed in their paper, to uncover misjoins they homed in on regions where the contact pattern of a scaffold suddenly changed. Using an algorithm, they anchored, ordered, and oriented the resulting sequences based on their contact frequencies before merging the contigs and scaffolds. This way, they said, they generated scaffolds with both strong sequence homology and similarities in long-range contact frequencies. To power this, the researchers developed a high-performance computing system they called Voltron that is based on the IBM Power Systems platform, according to IBM. They also relied on Mellanox and NVIDIA tools.
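The misjoin step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the paper: the scaffold is divided into equal bins, the toy contact model, the window size, and the 25 percent threshold are all hypothetical, and a boundary is flagged when the bins on either side of it share far fewer contacts than is typical for the scaffold.

```python
from statistics import median


def toy_contact_matrix(positions):
    """Hypothetical contacts between bins: 1 / (1 + true distance)."""
    return [[1.0 / (1 + abs(p - q)) for q in positions] for p in positions]


def boundary_support(matrix, window=2):
    """Mean contact between the bins just left and just right of each boundary.

    Boundary b sits between bin b-1 and bin b, for b = 1 .. n-1.
    """
    n = len(matrix)
    support = []
    for b in range(1, n):
        left = range(max(0, b - window), b)
        right = range(b, min(n, b + window))
        values = [matrix[i][j] for i in left for j in right]
        support.append(sum(values) / len(values))
    return support


def find_misjoins(matrix, window=2, fraction=0.25):
    """Flag boundaries whose support drops below a fraction of the median."""
    support = boundary_support(matrix, window)
    typical = median(support)
    return [b for b, s in enumerate(support, start=1) if s < fraction * typical]


# Eight bins on one scaffold; bins 4-7 really belong far away (a misjoin at 4).
true_positions = [0, 1, 2, 3, 100, 101, 102, 103]
print(find_misjoins(toy_contact_matrix(true_positions)))  # [4]
```

A scaffold flagged this way would be split at the reported boundary before the pieces are anchored, ordered, and oriented.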
To test this approach, the researchers sequenced a human cell line using an Illumina platform and created a draft assembly using DISCOVAR de novo that consisted of 73,770 scaffolds, which they then improved using in situ Hi-C data. After splitting, anchoring, ordering, and orienting 30,539 scaffolds — they left out the smallest scaffolds — they had 23 scaffolds between 28.8 megabase pairs and 225.2 megabase pairs in length.
When they compared this Hi-C-informed assembly to the human reference genome, the researchers reported that the 23 scaffolds corresponded to the 23 human chromosomes and spanned 99 percent of their length and 91 percent of their sequence. Of the more than 29,000 scaffolds included in the Hi-C-informed assembly, 99.7 percent were assigned to the correct chromosome, the researchers said.
Lieberman Aiden and his colleagues also applied this approach to generate assemblies for the Aedes aegypti and Culex quinquefasciatus mosquitoes, both of which transmit disease. Using in situ Hi-C data they generated, they turned an Ae. aegypti genome assembly of 4,756 scaffolds — based on Sanger reads — into an assembly of three large scaffolds that corresponded to the three Ae. aegypti chromosomes. Of the 1,826 markers included in a genetic map of Ae. aegypti that could be mapped in this new assembly, 1,822 were in agreement between the map and assembly.
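As a quick check on the marker figures above, the share of mappable markers that agree with the new assembly can be computed directly. The function name is hypothetical, and how the researchers defined agreement for an individual marker is not specified here.

```python
def percent_agreement(concordant, total):
    """Percentage of markers that agree between the genetic map and the assembly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * concordant / total


# Figures reported for Ae. aegypti: 1,822 of 1,826 mappable markers agree.
print(round(percent_agreement(1822, 1826), 2))  # 99.78
```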
The researchers similarly generated a Cx. quinquefasciatus genome assembly with three chromosome-length scaffolds. They were also able to gauge where centromeres were located and to examine genome evolution among mosquitoes.
"Overall, our results show that incorporating Hi-C data into genome assembly provides a rapid, inexpensive methodology for generating highly accurate de novo assemblies with chromosome-length scaffolds," the researchers wrote.
Lieberman Aiden and his colleagues noted, though, that their assemblies still contain errors, especially for local ordering of small, adjacent contigs, but added that additional data or more sophisticated analysis might be able to improve those results.