NEW YORK (GenomeWeb) – A team from Pacific Biosciences and Phase Genomics has modified PacBio's FALCON-Unzip diploid genome assembler to come up with a new algorithm that assembles and phases genomes from PacBio long reads and Hi-C proximity ligation data.
Researchers from PacBio, Phase Genomics, the University of Adelaide, and the US Department of Agriculture's Agricultural Research Service used the new assembly algorithm — called FALCON-Phase — to put together a phased, diploid genome with PacBio long reads and Hi-C data from a hybrid bull that was produced by crossing Angus and Brahman cattle from the Bos taurus taurus (taurine) and Bos taurus indicus (indicine) sub-species, respectively. FALCON stands for fast aligning of long reads for consensus and assembly.
"We've seen over the years people saying they throw the kitchen sink at their genome assembly and try a lot of technologies," said Sarah Kingan, a bioinformatics scientist at PacBio. "What we're seeing from this project is that PacBio and Hi-C are kind of all you need to have a diploid, phased, chromosome-scale genome assembly."
The same hybrid bull genome was initially phased using a trio-based approach known as TrioCanu, which was developed by National Human Genome Research Institute investigators Adam Phillippy, Sergey Koren, and colleagues and described in a bioRxiv preprint earlier this year. The new FALCON-Phase assembly, reported in a bioRxiv preprint last week, was validated using that TrioCanu assembly as a truth set, given the detailed information available from the pedigree-based data.
"We wanted to see if we could do the same thing that TrioCanu could do without parental information," explained Zev Kronenberg, a computational biologist at Phase Genomics and co-developer of the FALCON-Phase algorithm. "We validated it against TrioCanu because it's a really good truth set."
The team plans to use the algorithm for assembling and phasing the genomes of other animals in the future. That will help to understand whether the genetic distinctiveness of taurine and indicine cattle enabled the phasing of the hybrid animal, noted co-author John Williams, director of the University of Adelaide's Davies Research Centre.
"The reality is, we're working with an animal [bred] from sub-species that are genetically diverse," Williams said, so it remains to be seen how well the approach will work for individuals with more genetically similar parents.
Williams and co-author Stefan Hiendleder, a researcher at the Davies Research Centre and the University of Adelaide's Robinson Research Institute, have been working on bovine genomes for years and were involved in developing the first bovine genome back in 2004.
The investigators are currently focused on untangling gene expression and other differences between taurine and indicine sub-species, which are both thought to have been domesticated on the northern Indian subcontinent but diverged significantly since then.
The Bos taurus taurus sub-species is typically found in Europe, Williams noted, and has been highly selected for meat or milk production. In contrast, cattle from the Bos taurus indicus sub-species are more common in India, Africa, South America, and other locales. While they have not been subjected to intense selection for productivity, the indicine cattle are more robust and resistant to environmental and disease challenges.
In parts of Australia, Brazil, and the southern US, taurine and indicine cattle are crossed as a strategy for producing animals that can withstand disease, humidity, or intense heat, Williams noted.
"They cross these two sub-species to create animals that are more robust and more resistant to the environment," he said. "So commercially, they are very important."
As part of their research on taurine cattle, indicine cattle, and reciprocal crosses between the sub-species, the University of Adelaide researchers teamed up with PacBio, Phase Genomics, and Timothy Smith at the USDA-ARS to sequence a crossbred bull born to an indicine dam and a taurine sire.
"We gave them the crossbred individual to sequence and said, 'Can you separate the sequence into two genomes: the one that comes from the [indicine] and the one that comes from the [taurine] parent?'" Williams recalled. "That was the challenge that we put out."
Using Illumina short reads from the parental animals and 80-fold PacBio long-read coverage for the offspring bull, the TrioCanu team used trio binning and assembly to phase the genomes passed on by each parent. Based on their results, Phillippy and his bioRxiv preprint co-authors pointed to trio binning as "a new best practice for diploid genome assembly that will enable new studies of haplotype variation and inheritance."
Indeed, Williams noted, the resulting assembly appears to be "one of the highest quality genome sequences created."
"What's really exciting is that they are real genomes [representing genomes inherited from the Angus and Brahman parents]," he explained. "Most of the genome assemblies we see are a composite of the two haplotypes in an individual, so they are virtual assemblies."
The researchers are continuing to polish the sequence for use as a reference genome, while generating additional expression data for the offspring.
Still, the success of that strategy did not dissuade the team from pursuing its original goal of separating two haploid genomes from a single individual.
That idea was still percolating at this year's Plant and Animal Genome conference, where Kingan and Kronenberg took interest in the challenge and started laying the groundwork for FALCON-Phase.
"There weren't really any tools available out there to do exactly what we were trying to do," Kronenberg said, noting that researchers had used Hi-C data for scaffolding and SNP phasing, but not for phasing full genome assemblies.
Broadly speaking, the FALCON-Phase solution they came up with involves breaking up the haplotypes produced by the initial FALCON-Unzip assembly and reassembling them into two phased genomes with Hi-C data.
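As a rough illustration of that idea, the Python sketch below assigns paired haplotype blocks to two phases by following Hi-C contact counts. It is a toy greedy heuristic written for this article, not the FALCON-Phase algorithm itself; the function name `phase_blocks`, the block names, and the contact counts are all hypothetical.

```python
def phase_blocks(blocks, contacts):
    """Greedily sort paired haplotype blocks into two phases (toy example).

    blocks   -- list of (hap_a, hap_b) name pairs, one pair per phase block
    contacts -- dict mapping frozenset({name1, name2}) to a Hi-C link count
    Returns two lists, one per phase, in the same block order.
    """
    def links(name, group):
        # Total Hi-C links between one haplotype block and a phase group.
        return sum(contacts.get(frozenset((name, other)), 0) for other in group)

    phase0, phase1 = [], []
    for hap_a, hap_b in blocks:
        # Score both possible placements of this pair against earlier blocks.
        keep = links(hap_a, phase0) + links(hap_b, phase1)
        swap = links(hap_a, phase1) + links(hap_b, phase0)
        if swap > keep:
            hap_a, hap_b = hap_b, hap_a
        phase0.append(hap_a)
        phase1.append(hap_b)
    return phase0, phase1


# Hypothetical contact counts: blocks from the same parental chromosome
# share more Hi-C links than blocks from different parents.
example_blocks = [("b1_x", "b1_y"), ("b2_x", "b2_y"), ("b3_x", "b3_y")]
example_contacts = {
    frozenset(("b1_x", "b2_y")): 10,
    frozenset(("b1_y", "b2_x")): 9,
    frozenset(("b1_x", "b2_x")): 1,
    frozenset(("b2_y", "b3_x")): 8,
    frozenset(("b2_x", "b3_y")): 7,
    frozenset(("b2_x", "b3_x")): 2,
}
print(phase_blocks(example_blocks, example_contacts))
```

In this made-up example the second block pair is swapped so that each phase collects the blocks that share the most Hi-C links. A production tool would need to handle noisy contacts, unpaired sequence, and the misassemblies discussed elsewhere in this story.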
It takes a bit more PacBio coverage than usual to ensure that there is enough sequence to sufficiently cover the maternal and paternal copies of each chromosome in the resulting diploid assembly, Kingan explained. "You don't quite need double, but you do need more than a typical inbred assembly would require."
On the Hi-C side, meanwhile, the amount of data needed for the phasing assembly is on par with that already generated to scaffold assemblies.
"One of the main advantages here is, you can take what people already have and create a much better assembly without requiring any more sequencing," Phase Genomics co-founder and CEO Ivan Liachko said, noting that many teams err on the side of excess coverage and will already have enough extra PacBio sequence data to use FALCON-Phase.
Prior assembly approaches have flirted with phasing, but were typically unable to separate chromosome-scale haplotypes or parts of the genome with low genetic heterozygosity.
In a genome assembly for a highly inbred water buffalo, for example, Williams and his team phased roughly half of the genome using the FALCON-Unzip assembler and Dovetail Chicago Hi-C data, though some haplotypes could not be separated and phase switches occurred.
The researchers may now reassemble the FALCON-Unzip-generated genome with FALCON-Phase, Williams said, since the algorithm is expected to be compatible with any Hi-C data, not just that generated at Phase Genomics.
The new approach extends the local phasing previously possible with FALCON-Unzip, explained FALCON-Unzip co-developer Michael Schatz, a researcher affiliated with Johns Hopkins University and the Cold Spring Harbor Laboratory, who was not involved in the FALCON-Phase work.
"Hi-C data can span much longer distances than contiguous long reads, often reaching hundreds of kilobases to megabases, so you get much better connectivity and phasing than long reads alone," Schatz said in an email. "This is also an important advance over previous works that used Hi-C data for scaffolding without phasing, because the older scaffolding algorithms will often artificially merge the homologous chromosomes together, thereby masking the presence of any heterozygosity and sometimes leading to assembly errors."
"They demonstrate the power of this technique for mammalian genome assembly by comparing their results to those of a trio assembly of a bull genome that had been assembled using the very clever TrioCanu approach," Schatz added. "They show similar results to the gold standard trio assembly, showing it is possible to achieve very high quality results even when the trio is not available."
The FALCON-Phase team is working to streamline and speed up the algorithm, which currently takes about a day to run, and to bring the software up to a more professional grade.
Phase Genomics offers assemblies as part of its service, though the algorithm is freely available through GitHub. Along with new genomes, it can be used to upgrade existing FALCON-Unzip assemblies.
"One of the things that we're most excited about is that for folks who already have FALCON-Unzip assemblies, this gives them the opportunity to go back and, without having to generate [new] PacBio data, to reassemble everything," Liachko said. "If you just get the Hi-C data, you can essentially not only scaffold everything into chromosomes, but also phase it."
In their comparison with the TrioCanu phasing data, the researchers found that the approach has limitations across parts of the genome that are prone to misassembly, Kronenberg explained, including segmental duplications and other regions with poor short-read mapping.
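One common way to score a phased assembly against a truth set is to count phase switches, the points along a chromosome where the assembled haplotype flips from one parent to the other. The Python sketch below shows that calculation under simplifying assumptions: each block already carries a parental label, and the function name `count_switches` and the labels are hypothetical. It is not the comparison procedure used in the preprint.

```python
def count_switches(assigned, truth):
    """Count phase switches along one assembled haplotype (toy example).

    assigned -- parental label given to each block, in chromosome order
    truth    -- label for the same blocks from a trusted assembly
    A switch is counted wherever agreement with the truth set flips
    between neighboring blocks.
    """
    if len(assigned) != len(truth):
        raise ValueError("assigned and truth must cover the same blocks")
    agree = [a == t for a, t in zip(assigned, truth)]
    return sum(1 for prev, cur in zip(agree, agree[1:]) if prev != cur)


# Hypothetical labels: the haplotype strays to the sire's sequence for
# two blocks and then returns, which counts as two switches.
truth_labels = ["dam"] * 6
assigned_labels = ["dam", "dam", "sire", "sire", "dam", "dam"]
print(count_switches(assigned_labels, truth_labels))
```

A haplotype that is consistently labeled with the other parent scores zero switches here, since which phase is called "first" is arbitrary; only changes along the chromosome are penalized.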
Additional research will also provide a clearer view of FALCON-Phase's ability to phase genomes for organisms with very high or low levels of heterozygosity. For example, Olivier Fedrigo, director of Rockefeller University's Vertebrate Genome Laboratory, noted that excess heterozygosity may be just as problematic for assemblies as too few genetic differences between parents.
"That's not only a problem for FALCON-Phase, it's a problem for every assembly. Every algorithm will have this issue," added Fedrigo.
He is part of an effort called the Vertebrate Genomes Project that aims to produce reference-quality genome assemblies for all vertebrate species using a combination of technologies.
Fedrigo noted that the VGP team is keen to try FALCON-Phase after learning about it from Kingan during a weekly call with PacBio, particularly since VGP already has data from a cattle trio that can be used to cross-validate the results from the newest phasing method with patterns identified from TrioCanu.
"Part of what we want to do is phasing," he said. "We're already using FALCON-Unzip and we were planning to do the phasing at the end. It looks like, for that approach, they do phasing early on."
"Definitely this is the right direction," Fedrigo added, noting that the VGP researchers have also been working with Phillippy on scaffolding strategies that incorporate Hi-C data directly onto the FALCON-Unzip graph early in the assembly process.
"FALCON-Unzip makes mistakes, so if you apply FALCON-Phase on mistakes made by FALCON-Unzip, you may propagate the mistakes and may not be able to phase accurately," Fedrigo explained. But incorporating the Hi-C data at an earlier stage may mitigate some potential sources of errors.
For their part, investigators at PacBio plan to release a new human genome assembly soon, Kingan noted, and will phase that genome using FALCON-Phase, as well.