SAN DIEGO (GenomeWeb) – Members of the international Vertebrate Genomes Project (VGP) team this week provided details on the technology and assembly strategies being used for the first phase of an ambitious effort to establish reference genome assemblies for all existing vertebrate species.
As part of the Genome 10K organization, and in partnership with investigators from efforts such as Bird 10K, Bat1K, and others, the VGP team has set its sights on "near gapless," chromosome-scale genome assemblies for around 66,000 extant vertebrate species, with haplotypes phased whenever possible.
In a pair of posters presented at the Plant and Animal Genome conference here this week, investigators from the Rockefeller University Vertebrate Genomes Laboratory (VGL) outlined the anticipated pipeline for phase 1 of the effort, which will focus on mammals, birds, reptiles, amphibians, and fish from 260 vertebrate orders.
The VGL will be one of three sequencing hubs for the VGP, along with the Wellcome Trust Sanger Institute in the UK and the Max Planck Institute of Molecular Cell Biology and Genetics in Germany. It was established at Rockefeller last spring and is co-directed by Rockefeller researchers Olivier Fedrigo and Erich Jarvis, who also chairs the G10K organization.
G10K started out as a project to sequence 10,000 vertebrate genomes but changed last year into an organization that oversees several vertebrate genome projects.
The VGL researchers spelled out their current sequencing protocol — as well as the broader VGP's assembly approaches and progress to date — in one of the posters.
For example, the team starts with long reads generated on two Pacific Biosciences Sequel instruments, at a minimum of 60-fold coverage for each vertebrate genome, and typically adds 10x Genomics linked reads, optical mapping data produced with the Bionano Genomics Saphyr, and Arima Genomics' Hi-C profiles.
Those data are uploaded to a DNAnexus storage and analysis platform, where they move through an assembly pipeline designed to meet the VGP's ambitious quality control standards, Fedrigo and his co-authors explained on the poster. Among other criteria, the team aims to achieve contig N50 lengths of at least a million bases for each genome. Finally, the researchers plan to annotate the genome assemblies with the help of Illumina or PacBio Iso-Seq transcriptome sequencing data produced from two tissues per species.
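For readers unfamiliar with the contig N50 metric, the following minimal Python sketch shows how it is commonly computed: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the assembly's total length. The function names and the example contig lengths are hypothetical illustrations and are not taken from the VGP pipeline; only the one-million-base target comes from the poster.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def meets_contig_target(lengths, target=1_000_000):
    """Check a hypothetical assembly against the one-million-base N50 target."""
    return contig_n50(lengths) >= target


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    example = [4_000_000, 2_500_000, 1_200_000, 600_000, 300_000]
    print(contig_n50(example))
    print(meets_contig_target(example))
```

Because N50 is weighted by length, a handful of very long contigs can lift the value even when an assembly also contains many short fragments, which is one reason it is reported alongside other quality criteria.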
The precise technology and approaches are expected to evolve as the project progresses, Fedrigo said. Even now, the sequencing team has weekly discussions about what works and what doesn't. For example, he noted that future iterations of the pipeline may rely on additional Oxford Nanopore, Dovetail Genomics, or Phase Genomics data.
The researchers have already generated various levels of sequence data for several phase 1 species, including the Anna's hummingbird, which has been used to benchmark a wide range of sequencing and assembly methods since 2015; the kakapo; the Canadian lynx; and the barnacle goose.
Sadye Paez, a project manager with Jarvis' neurogenetics of language lab at Rockefeller, provided more details on VGP methods development, motivations, and other aspects of the project in a related PAG poster this week. She noted that the researchers expect to generate data for roughly a dozen genomes per week for phase 1 of the VGP.
At that pace, she said, the first phase of the project would be complete in under two years, even after allowing time to obtain some additional samples and analyze the data. Over the longer term, the researchers expect to tackle around 1,045 genomes for VGP phase 2, which will be focused on representatives from each vertebrate family. The third, genera-focused phase will expand that to nearly 9,500 taxa, while the full collection of 66,000 or so taxa is anticipated to be completed in phase 4.
As the effort ramps up, the VGP investigators are also finding ways to spread the word to the public and other researchers about the potential benefits of creating what they call a Digital Genome Ark. The genome collection "will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information," Paez and her co-authors wrote on their poster.
The project is funded in part by participating researchers with an interest in one or more of the species, the researchers explained, though the VGL team has also set up a crowd-funding page to help push the massive sequencing, assembly, and analysis effort along.
In an open letter to G10K participants last month, Jarvis said that the project has commitments of more than $2.1 million so far, which will cover the cost for approximately a quarter of the 260 species to be sequenced in phase 1. "To make this project affordable, we negotiated large discounts with sequence companies for reagents and other items for the VGP," he wrote.