NEW YORK (GenomeWeb) – Methods for teasing apart in vivo chromatin proximity patterns are fast becoming part of the assembly toolkit for researchers seeking genome assemblies with chromosome-level resolution. But the specific information that can be gained from such Hi-C experiments may still vary from one lab — or one service provider — to the next.
This month, both the Santa Cruz, California-based company Dovetail Genomics and Phase Genomics in Seattle, Washington introduced Hi-C-based assembly services. In parallel, investigators involved in large-scale sequencing studies have started field-testing both companies' approaches to look at the information gained with Hi-C data in general and to compare and contrast assembly patterns produced by each firm's service.
During a Pacific Biosciences workshop held at the Plant and Animal Genomes conference in San Diego in mid-January, neurobiologist Erich Jarvis presented assembly data for hummingbird genomes sequenced with a wide range of sequencing and assembly strategies.
Jarvis was first author on a 2015 study in Science outlining phylogenetic relationships between four-dozen bird species — part of a larger collection of studies using data from a slew of newly sequenced bird genomes. He is also a co-leader of the Genome 10K (G10K) initiative and co-founder of the Bird 10,000 Genomes (B10K) effort.
As the latter projects ramp up, Jarvis and his colleagues set out to systematically evaluate available sequencing and assembly methods en route to hammering out the most effective methods for producing accurate and complete genome assemblies.
The comparative analysis was "primarily for G10K and B10K," though the team hopes the resulting methods will provide a resource for other vertebrate sequencing efforts and the genomics field as a whole.
The analysis spanned many of the usual sequencing and assembly suspects: sequence data generated with Sanger, Illumina, Pacific Biosciences, and Oxford Nanopore instruments, multiple assembly algorithms, and sequence scaffolding strategies offered by BioNano Genomics, 10X Genomics, NRGene, Arima Genomics, Dovetail Genomics, and Phase Genomics.
"The hummingbird project is great because nearly every approach has been thrown at it," Dovetail Genomics Co-founder Richard Green, who co-directs the University of California at Santa Cruz paleogenomics lab, said in an email. "Also, the assessment is done by a neutral, third-party group."
On the Hi-C front, for example, the investigators first asked, "Does Hi-C, regardless of the company, give you a better assembly?" Jarvis said in an interview. "The answer that I presented in the talk is that, regardless of whether it's Phase Genomics' or Dovetail's approach, we got better contiguity of the scaffolds [with Hi-C], a dramatic improvement."
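As a rough illustration of how scaffold contiguity is often summarized, the sketch below computes N50, a widely used contiguity statistic. The article does not say which metric Jarvis' team relied on, so the choice of N50 is an assumption, and the function name and toy scaffold lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (made-up numbers): four contigs joined into two scaffolds.
before_scaffolding = [5, 5, 5, 5]
after_scaffolding = [10, 10]
print(n50(before_scaffolding))  # 5
print(n50(after_scaffolding))   # 10
```

The total assembly length is the same in both toy cases; only the way the sequence is joined changes, which is the kind of improvement scaffolding is meant to deliver.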
Even so, Jarvis noted that Hi-C data also appeared to introduce some errors into the hummingbird genome, revealed as inconsistencies in hummingbird haplotypes.
"We don't know what's causing it, but [the Hi-C data] is introducing errors that the original PacBio assembly or Illumina assembly did not start out with," he said. "You should get no more than two different choices between haplotypes. If you start to see more variability than an n of 2 in your sequences, contigs, or scaffolds, you know there's some kind of error that's been introduced."
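The "n of 2" rule Jarvis describes can be sketched as a simple check: in a diploid genome, any one site should show at most two distinct haplotype alleles, so a third suggests an assembly error. The snippet below is a minimal sketch of that idea only, not the team's actual method; the function name, the input layout, and the site labels are all hypothetical, and a real pipeline would also have to account for sequencing errors.

```python
def sites_exceeding_ploidy(allele_calls, ploidy=2):
    """Flag sites with more distinct alleles than the ploidy allows.

    allele_calls: dict mapping a site label to a list of observed
    allele strings (hypothetical input format for illustration).
    Returns a dict of flagged sites with their sorted distinct alleles.
    """
    flagged = {}
    for site, alleles in allele_calls.items():
        distinct = set(alleles)
        if len(distinct) > ploidy:
            flagged[site] = sorted(distinct)
    return flagged


# Made-up calls: the second site shows three alleles in a diploid genome.
calls = {
    "scaffold_1:100": ["A", "A", "G"],
    "scaffold_1:250": ["A", "C", "T"],
}
print(sites_exceeding_ploidy(calls))
```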
For the hummingbird genome analysis, Jarvis noted that scaffolds produced with Phase Genomics Hi-C appeared to have somewhat more accurate phasing. On the other hand, Dovetail Hi-C scaffolding applied in combination with PacBio-based contigs produced the longest hummingbird genome scaffolds and pointed to the presence of an apparent misjoin in the hummingbird genome.
Green said the Dovetail team was encouraged by Jarvis' assessment of its Hi-C scaffolding of the hummingbird genome, calling it "the most contiguous and the most accurate."
For his part, Phase Genomics Co-founder and CEO Ivan Liachko attributed some of the discrepancies between the Dovetail Genomics and Phase Genomics Hi-C-based scaffolds to broader differences in the extent to which the PacBio-based contigs were corrected.
"At the very end of the process, there's a gap-filling step that happens," Liachko said in an interview. "At the time that he presented [the hummingbird assemblies], our version had not been gap-filled yet and Dovetail's had been."
According to Jarvis, the main difference was in the number of scaffolds or chromosomes predicted by each approach: Hi-C scaffolds from Phase Genomics pointed to 43 hummingbird chromosomes, while Dovetail Hi-C led to 26.
He rebutted tweets that went out during his talk claiming the "best" hummingbird genome assembly was achieved with PacBio long reads, Dovetail Hi-C, and PBJelly gap filling. "What I actually said was that the best approach is PacBio long reads, Hi-C, and PacBio gap filling. I didn't say Dovetail Hi-C."
Hi-C methods, in general, stretch back to a paper published in Science in 2009. There, a team led by investigators at the Broad Institute, the Massachusetts Institute of Technology, and the University of Massachusetts Medical School introduced Hi-C as a sequencing-based tool for gauging three-dimensional chromatin interactions in an unbiased way.
In 2013, University of Washington genome sciences researcher Jay Shendure and his colleagues demonstrated in a Nature Biotechnology study that it was possible to tap the long-range chromosomal interaction patterns produced with Hi-C "for assigning, ordering, and orienting genomic sequences to chromosomes, including across centromeres."
Liachko, who collaborated with Shendure as a member of the University of Washington's Genome Sciences department, said the Phase Genomics approach to Hi-C builds on the methods and LACHESIS algorithm described in that study. The company officially launched in 2015 as a University of Washington spinout.
While it retains elements of LACHESIS, Liachko said Phase Genomics' new Proximo service is a "much more robust, much more accurate, and a much more professional-grade product." He noted that the Hi-C pipeline has "been reworked and improved in several ways that are proprietary," while the original Hi-C protocol was "extremely laborious."
That has made it possible to apply Hi-C to organisms or tissue types the Phase Genomics team could not tackle in the past, Liachko added, including small samples and samples from a wide range of sources and treatment protocols.
Dovetail Genomics has made its own improvements to Hi-C protocols in an effort to make the in vivo proximity ligation experiments more consistent and reliable. The company's full-service Hi-C library preparation and assembly protocol can be applied to "nearly any source," Green said, from plant or animal samples to primary tissues or tissue culture sources.
Dovetail initially focused on in vitro chromatin interaction data, particularly information gleaned from its Chicago libraries. By introducing Hi-C services, the company now offers the option of combining "biology-free" signals from Chicago, used to orient and order assemblies, with Hi-C assembly clues spanning millions of bases, Green explained.
Hi-C in conjunction with Chicago "offers the best of both worlds," creating a combination that is "both highly accurate and highly contiguous," he said, noting that the Hi-C and Chicago data have applications beyond genome assembly for researchers interested in exploring three-dimensional chromosome architecture. Dovetail offers bundled pricing for customers who want Hi-C and Chicago profiling on the same sample.
Phase Genomics is pursuing additional applications for Hi-C as well. The company has a beta-stage Hi-C-based strategy called ProxiMeta that's focused on assembling the genomes of individual microbes in metagenomic collections — an application that Shendure, Liachko, and colleagues described in the journal G3 in 2014.
Over at G10K, Jarvis said the team has not settled on a Hi-C technology just yet, but will likely select one soon as it gets ready to tackle the next round of vertebrate genomes. He expects to see ongoing improvements in approaches offered by all of the companies and believes Hi-C methods that allow scaffolding and accurate haplotype phasing will be particularly advantageous in the future.
The G10K team currently plans to produce at least one reference-quality genome for each of the vertebrate genera with assembly quality metrics that are still being defined. Members of Jarvis' team have also submitted an application to the MacArthur Foundation's 100&Change grant competition proposing a digital "Noah's Ark" genome library focused on sequencing and storing data for some 8,000 endangered vertebrate species.
In line with the "kitchen sink" approach to assembly that Jarvis recommended at PAG, the researchers are currently planning to use PacBio sequencing, in combination with Hi-C scaffolding, PBJelly filling, Illumina sequence polishing, and BioNano Genomics assembly data.