Using data generated by Hi-C sequencing to stitch together and organize short-read scaffolds, a team from the University of Washington has shown that it's possible to produce chromosome-scale de novo assemblies for mammalian genomes.
As they reported earlier this month in Nature Biotechnology, the researchers came up with an algorithm called LACHESIS to harness chromosomal interaction information gleaned from Hi-C sequencing experiments in ways that help with clustering, ordering, and orienting contigs produced with the short-read assembler ALLPATHS-LG.
Using that strategy, the team put together chromosome-level de novo genome assemblies for Drosophila, mouse, and human with sequences generated on Illumina instruments.
This proof-of-principle publication marks the first time that chromosome-scale assemblies have been described for mammals sequenced exclusively with short-read approaches, senior author Jay Shendure, a genome sciences researcher at the University of Washington, told In Sequence.
In the case of the human assembly, for example, adding a Hi-C dataset to datasets obtained by shotgun paired-end and short jump mate-pair sequencing made it possible to assign sequence scaffolds to the appropriate chromosomes with 98 percent accuracy. The order and orientation of those scaffolds were accurate roughly 99 percent of the time, study authors reported.
Based on their success so far, members of the team are now scaling up the application of the assembly method to other plant and animal genomes, Shendure said.
"There's a lot of work that could be done to do better on the informatics — potentially a more integrated approach that takes all this data into account within the framework of a single assembly algorithm," he said. "But … with these three kinds of data, the information is there and you can potentially generate these chromosome-scale assemblies."
By applying the de novo assembly approach to short-read sequence data generated for a well-characterized cancer cell line, meanwhile, he and his co-authors identified both new and known sequence translocations in the cancer genome, prompting enthusiasm about using LACHESIS to help characterize other cancers.
The work represents the latest in a series of applications for Hi-C sequencing, a method initially developed to explore three-dimensional interactions between regions of DNA.
In the same issue of Nature Biotechnology, researchers from the Ludwig Institute for Cancer Research in La Jolla, the University of California, San Diego, and elsewhere introduced a scheme called HaploSeq for phasing variants in the genome using information from Hi-C sequencing data (IS 11/5/2013).
University of California, Davis Genome Science researcher Ian Korf told IS that the use of Hi-C sequence data in the short-read genome assembly sphere "solves certain types of problems."
"It's better for putting contigs on the right chromosomes and to order and orient some of those contigs," said Korf, an organizer with Assemblathon, an assembly methods competition, who was not involved in the new study, "but it doesn't fill in all the gaps that there are. And there's a huge number of gaps."
Those gaps are partly a consequence of sequence repeats that may muddle attempts to connect different blocks of sequence, Korf noted, but also reflect spots in the genome not sufficiently covered by short-read sequences from the outset.
For his part, Korf argued that for those bent on achieving a fully finished genome, the best strategy is still Sanger sequencing with bacterial artificial chromosomes. But given the relative affordability of short-read sequencing, most investigators have shifted to generating large volumes of short-read sequence that can be used to produce adequate, but far-from-finished, draft assemblies.
Shendure, too, said that the feasibility of producing high-quality genome assemblies has "actually gotten progressively worse" in the years since the human genome project, in part due to the increased reliance on relatively inexpensive short-read sequence data that fail to provide long-range contiguity information.
In a GigaScience study published this summer, for example, Korf, Shendure, and others compared de novo assembly methods as part of an Assemblathon 2 analysis.
That work hinted that many available assembly methods produce workable draft genomes while also highlighting the inherent limitations of sequence datasets typically included in such de novo assemblies, Shendure said. "Even the best assemblies were highly, highly fragmented compared to what you want, which is chromosome-scale assemblies."
As part of their efforts to develop methods for fleshing out genome contiguity in the realm of genome assembly, haplotype phasing, and so on, Shendure and his colleagues hit on the idea of tapping into some of the chromatin interaction information that can be obtained by Hi-C sequencing.
The group relied on intra-chromosomal interactions as a source of assembly information. Such interactions represent the "noise" in typical Hi-C experiments, Shendure noted, "but the fact that that noise extends across even hundreds of megabases makes it a great sort of signal to exploit for scaffolding assemblies."
After doing an initial local assembly step with shotgun and mate-pair reads and ALLPATHS-LG assembly software, the researchers turned to their newly developed algorithm — dubbed "ligating adjacent chromatin enables scaffolding in situ," or LACHESIS — to further arrange pieces of sequence using contiguity information in the Hi-C datasets.
That additional data, it turned out, made it possible to not only cluster contigs within a given chromosome, but also to determine their order and orientation, Shendure explained.
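The clustering idea can be illustrated with a toy sketch. It assumes only that contigs from the same chromosome share far more Hi-C read pairs than contigs from different chromosomes, and it greedily merges the most strongly linked groups until a requested number of groups remains. The function name `cluster_contigs`, the `links` dictionary, and the link counts are hypothetical and chosen for illustration; this is not the LACHESIS implementation, which also orders and orients contigs within each group.

```python
from itertools import combinations


def cluster_contigs(links, contigs, n_groups):
    """Greedy average-linkage clustering of contigs by Hi-C link counts.

    links    -- dict mapping frozenset({a, b}) to the number of Hi-C read
                pairs joining contigs a and b (hypothetical input format)
    contigs  -- iterable of contig names
    n_groups -- number of groups to stop at (e.g. expected chromosome count)
    """
    groups = [{c} for c in contigs]

    def avg_linkage(g1, g2):
        # Mean link count over all contig pairs spanning the two groups.
        total = sum(links.get(frozenset((a, b)), 0) for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > n_groups:
        # Pick the pair of groups with the highest average linkage.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda p: avg_linkage(groups[p[0]], groups[p[1]]),
        )
        groups[i] |= groups[j]  # i < j, so deleting j keeps index i valid
        del groups[j]
    return groups


# Made-up link counts: a-b and c-d are strongly linked, cross links are rare.
example_links = {
    frozenset(("a", "b")): 50,
    frozenset(("c", "d")): 40,
    frozenset(("a", "c")): 2,
    frozenset(("b", "d")): 1,
}
example_groups = cluster_contigs(example_links, ["a", "b", "c", "d"], 2)
```

In this made-up example, the sparse links between a and c and between b and d play the role of the inter-chromosomal interactions that can occasionally pull contigs from different chromosomes together.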
"There are hundreds of these [ALLPATHS] contigs that constitute a chromosome," he said. "What we can do with the Hi-C data is then scaffold these into an entire chromosome."
In the case of the human genome, for example, the team initially used ALLPATHS-LG to assemble existing paired-end and 3-kilobase mate-pair sequences into a 2.74-billion-base draft assembly with a scaffold N50 of 437 kilobases.
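For readers unfamiliar with the N50 statistic, the sketch below uses the common definition: the length of the scaffold at which, counting from the longest scaffold down, at least half of the total assembled bases are covered. The function name `n50` and the example lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Made-up scaffold lengths; total is 30, so half is 15.
example_n50 = n50([10, 8, 6, 4, 2])
```

A higher N50 means more of the assembly sits in long scaffolds, which is why the figure is widely used as a shorthand for contiguity.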
With the help of Hi-C sequences from a human embryonic stem cell line, the researchers then used LACHESIS to scaffold, order, and orient the initial ALLPATHS-LG contigs, ultimately producing contiguous, chromosome-scale assemblies with sequences that spanned centromeric regions of the chromosomes.
More than 98 percent of shotgun sequences were successfully scaffolded using this approach, the study's authors reported.
Contigs grouped to the appropriate chromosome around 99 percent of the time, though they noted that sequences from human chromosomes 20 and 21 and from chromosomes 19 and 22 grouped together, perhaps due to Hi-C interactions between those chromosomes.
The team determined that contigs within a given chromosome were in the correct order and orientation around 94 percent of the time. Across the complete genome, meanwhile, LACHESIS appeared to order and orient contigs with around 99 percent accuracy.
Likewise, the team demonstrated that the ALLPATHS-LG and LACHESIS combination could be used for assembling a mouse genome de novo from short-read shotgun, mate-pair, and Hi-C datasets.
It also put together a Drosophila assembly using shotgun and Hi-C reads in the absence of jumping read data, albeit with somewhat less sequence successfully clustered to fruit fly chromosomes and a slight dip in scaffold orientation and ordering accuracy.
When doing such de novo assemblies in the future, all three short-read sequence data types would likely be generated using DNA from the same individual, Shendure said, though results from the current study suggest that may not be crucial for accuracy in the case of human assemblies.
"Human genomes are not that variable from one another," he said. "The limitations on accuracy are probably not due to the fact that we used data from two different individuals."
The accuracy of LACHESIS seems to be limited more by interactions between different chromosomes that sometimes get picked up by Hi-C sequencing and by problems carried forward from ALLPATHS-LG, such as gaps in parts of the genome marked by segmental duplications or other highly repetitive sequences.
Going forward, Shendure noted that there may be ways of coming up with a more integrated assembly method that combines some of the steps currently used for assembly.
With respect to sequences missed at the moment, on the other hand, he speculated that longer reads and/or improved algorithms for short-read data may continue to stretch out genome contiguity.
Even so, he and his colleagues demonstrated that the LACHESIS method already appears to hold promise for assembling not only normal genomes, but also those from tumor samples.
In the current study, the researchers showed that by generating 154 million Hi-C sequence read pairs and using these data in the newly described assembly strategy, they not only rediscovered known translocations in the HeLa cervical cancer cell line, but also found new, lower-frequency candidate rearrangements.
There are still issues that need to be worked out when applying that approach to other cancer studies, Shendure explained, including limited specificity and the relatively large amounts of starting material needed for standard Hi-C library preparation protocols. Still, he said, "it could potentially be useful in [the cancer research] context as well."