Goat Genome Demonstrates Benefits of Combining Technologies for De Novo Assembly


NEW YORK (GenomeWeb) – This week, a group of researchers published a de novo assembly of the goat genome, highlighting how combining multiple technologies can improve the accuracy and contiguity of genome assemblies.

The group, led by researchers at the US Department of Agriculture and the National Human Genome Research Institute, used a combination of long-read sequencing, optical maps, scaffolding technology, and short-read sequencing to de novo assemble a goat reference genome that is 400 times more contiguous than the previously published assembly. The researchers published their results in Nature Genetics this week.

The study is a demonstration of how combining technologies can yield more complete genome assemblies, and the approach will also have applications in breeding programs, according to Adam Phillippy, head of the genome informatics section at the NHGRI and a senior author of the study.

Other research groups have also seen the value of combining orthogonal technologies to get high-quality de novo genome assemblies. Last year, for instance, researchers from Seoul National University and Macrogen published a Korean reference genome, illustrating the large amount of diversity that their hybrid approach could capture. And at a recent conference, several researchers discussed their groups' attempts at sequencing and de novo assembly using combinations of technologies.

The goat genome project was kicked off about three years ago when Timothy Smith from the USDA approached Phillippy's group to discuss the possibility of creating an assembly algorithm that would work on PacBio data for the goat genome. At the time, PacBio's technology was primarily being used on smaller genomes, Phillippy said, and no algorithm existed for larger genomes. But, over the next several years, the researchers generated sequence data and developed algorithms, and then added in orthogonal scaffolding and mapping technologies to improve the quality.

In the Nature Genetics study, the researchers first used the Pacific Biosciences RSII system to sequence the goat genome to 69x depth. That resulted in 3,074 contigs with an N50 of 3.8 megabases. Phillippy said that this project has been ongoing for several years and that the technologies have evolved rapidly. When the team started the project, for instance, they were using the XL-C2 PacBio chemistry, but as the project progressed, chemistry upgrades became available, and they finished the project on the P5-C3 chemistry. In total, they generated 194 gigabases of sequence data with a mean read length of 5.1 kilobases.

That data gave "a nice assembly" with long contigs, but "we had no idea how they went together to form chromosomes," Phillippy said. So, they turned to the Hi-C scaffolding technique, which was performed by Phase Genomics, to get chromosome structure, and to Bionano Genomics' Irys system for optical mapping.

The PacBio and Hi-C combination yielded the same contig N50 as the PacBio data alone, but placed the contigs into 31 scaffolds with a scaffold N50 of 88.8 megabases. "Essentially, you have each chromosome in a single scaffold," Phillippy said. Adding in the optical maps from the Irys system reduced the total number of contigs to 1,780, with a contig N50 of 10.2 megabases. The contigs were also placed within 31 scaffolds, with a scaffold N50 of 87.3 megabases.
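For readers unfamiliar with the N50 metric quoted throughout: it is the length L such that contigs (or scaffolds) of length L or longer account for at least half of the total assembled bases. The following Python sketch illustrates that standard definition on made-up lengths. The function name n50 and the toy numbers are illustrative assumptions, not code or values from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in megabases (not from the goat assembly).
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))

# Joining the two longest pieces into one 16 Mb scaffold keeps the total
# at 30 Mb but raises the N50, which is why scaffold N50 values in the
# article are so much larger than contig N50 values.
toy_scaffolds = [16, 5, 4, 3, 2]
print(n50(toy_scaffolds))
```

Because the metric depends only on the length distribution, a high N50 says nothing about whether the pieces were joined correctly, which is one reason the researchers cross-checked the scaffolds with independent technologies.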

"The optical mapping increased the quality and confidence of the initial scaffolds," Phillippy said. The three technologies—PacBio, Bionano, and Hi-C—ended up being complementary to each other, he added.

The PacBio reads were able to span short tandem repeats and shorter stretches of complex regions, Phillippy said. The optical maps from Bionano help resolve large structural variants, like segmental duplications that are 50 kb to 100 kb or longer, but they don't resolve the smaller-scale complexity that PacBio's technology can sequence through, Phillippy said. Optical maps alone, however, don't yield chromosome-scale assemblies, which is where the Hi-C data comes in.

Next, the researchers used Illumina data to polish the assembly and correct errors at the base level. The final genome, ARS1, included 680 contigs assembled into 31 scaffolds with a contig N50 of 18.7 megabases and a scaffold N50 of 87.3 megabases. The assembled genome was about 2.9 gigabases in size.

The previous goat reference genome was de novo assembled in 2012 by a team of Chinese researchers who used a combination of Illumina and fosmid-end sequencing, as well as optical mapping technology developed by OpGen.

The ARS1 genome filled 94.3 percent of the gaps in that assembly. The remaining gaps appeared to be false gaps due to errors in the assembly, the authors wrote. The ARS1 genome still has 649 sequence gaps, the authors noted. In addition, the ARS1 genome "has 1,000-fold fewer ambiguous bases and improves even the core gene annotation over the short-read assembly," the authors wrote.

"It's a night and day difference," Phillippy added. Contig size, contiguity, and overall accuracy are improved, he said.

Phillippy anticipated that combining technologies would continue to be the best way to generate high-quality de novo reference genomes, at least for the time being. There are three important factors to consider: long contigs, chromosome-scale scaffolds, and accurate base calls.

Phillippy said he is now involved with the Genome 10K (G10K) initiative to sequence 10,000 vertebrate genomes and the Bird 10,000 Genomes initiative, both of which aim to generate high-quality reference genomes.

Based in part on what the researchers learned from the goat project, Phillippy said the consortium members are looking to use a variety of technology combinations for de novo assembly. In the ongoing pilot, a group led by Rockefeller University professor Erich Jarvis used essentially every technology possible, including not just the ones in this study, but also Sanger sequencing, linked reads from 10x Genomics, and nanopore sequence data.

Phillippy said one goal of the project is to try to figure out which combination of technologies produces the best genome most cost-effectively. The ultimate goal is to get a high-quality de novo assembly for less than $10,000, he said. In the Nature Genetics paper, the authors estimated that the assembly cost around $100,000, but Phillippy noted that the researchers were using older versions of PacBio chemistry. "We're not quite there for a reference genome, but getting close" to $10,000, he said.

Recently, a number of other researchers have discussed the benefits of high-quality de novo assemblies, and how using multiple technologies in combination can help improve such assemblies. For instance, at the Advances in Genome Biology and Technology conference in Hollywood Beach, Florida, last month, groups from the Broad Institute, the Ontario Institute for Cancer Research, and Calico Life Sciences discussed their efforts to assemble the mosquito, human, and naked mole rat genomes, respectively.

Daniel Neafsey, associate director of the Broad Institute's Genomic Center for Infectious Diseases, described work his group has done on the genome of Aedes aegypti, the mosquito responsible for transmitting the Zika virus.

The genome was first assembled in 2007, and was estimated to be about 1.38 gigabases in size. The first assembly was in 36,000 pieces, Neafsey said. The genome is highly repetitive and only about one-third of it is single copy or low copy number, he said.

In an attempt to improve on that assembly, Neafsey's team sequenced the genome to 100x using PacBio technology. Neafsey's team then tested two assembly algorithms, FALCON-Unzip, an open source algorithm designed for PacBio data, and Canu, an algorithm designed for single-molecule sequencing that will work on both PacBio and Oxford Nanopore Technologies data.

Using FALCON-Unzip and Canu, the researchers were able to reduce the number of contigs to 3,642 and 1,504, respectively, from 36,000. In addition, the contig N50 was increased to 1.67 megabases and 2.48 megabases, respectively, from 83 kilobases.

Neafsey said the team is now testing a variety of scaffolding techniques, including those from Bionano Genomics, 10x Genomics, and Dovetail Genomics. Dovetail Genomics offers Hi-C-based scaffolding services.

"We're still looking at how these approaches are complementary and hope to use all three in the final version," he said during his presentation.

Margaret Roy, head of de novo sequencing at Calico Life Sciences, a California-based research and development firm funded by Google that studies the biology of aging, also saw benefits from combining technologies to assemble the 2.54-gigabase naked mole rat genome. The naked mole rat has a long life span and is "extraordinarily cancer resistant," she said in a presentation at AGBT, explaining Calico's interest in de novo assembling the genome.

Her team first performed sequencing and assembly with PacBio technology to 130x coverage. That generated 493 contigs with a contig N50 of 22.6 megabases. Next, they used Bionano Genomics' technology to help with scaffolding. That was able to generate 126 scaffolds with a scaffold N50 of 42.3 megabases. Roy said the team is still working on the scaffolding and also plans to integrate data from 10x Genomics' Chromium system. So far, she said, they have only been able to phase around 240 megabases, or 9 percent of the genome. Phasing has been difficult in part because the naked mole rat is inbred, she said.

Jared Simpson, an informaticist at the Ontario Institute for Cancer Research, described work he has done as part of a consortium to use Oxford Nanopore Technologies' MinION to sequence the human genome.

After the initial sequencing was performed on the MinION, Phillippy and Sergey Koren at the NHGRI assembled the data using Canu, producing a contig N50 of 3 megabases. Next, Simpson said, he used Hi-C data to scaffold the contigs, increasing the N50 to 45.8 megabases. The researchers are also adding in Illumina data to increase the base accuracy. Thus far, homopolymers continue to be "one of the main sources of residual errors," he said.

The assembly is a "work in progress," Simpson said. "But, the results are promising."