NEW YORK (GenomeWeb) – The Vertebrate Genomes Project, an international effort under the auspices of the Genome 10K consortium, today publicly released its first 15 high-quality reference genome assemblies, representing all five vertebrate classes, which serve as a proof of principle for the project's ultimate goal of generating similar assemblies for all 66,000 vertebrate species on Earth.
Insights gained from these genomes could help with species conservation efforts, functional studies of the genetics underlying vertebrate traits and diseases, and phylogenomic analyses, according to the organizers.
At the Genome 10K annual conference yesterday at Rockefeller University, project coordinators and participants provided an update on the VGP's status, and discussed challenges encountered, such as obtaining permits for samples, extracting ultra-high molecular weight DNA, and speeding up genome assembly and annotation processes.
Earlier this year at the Plant and Animal Genome conference, VGP investigators provided a first outline of the project. The VGP plans to proceed in four phases: the first phase, to be completed by the end of 2020, will encompass about 260 species across vertebrate orders; the second phase will cover all 1,000 or so vertebrate families; the third phase will expand to the roughly 10,000 vertebrate genera; and the fourth and last phase will cover all 66,000 or so vertebrate species. The estimated cost for the entire project is about $600 million, of which a fraction has been raised to date.
The project is closely coordinating with a number of genome sequencing initiatives that have some overlap, including the Bird 10,000 Genomes (B10K) project, the Bat 1K initiative, and the Earth Biogenome Project.
For phase I, the VGP has chosen 89 fish species, 58 mammals, 52 birds, 33 reptiles, and 29 amphibians, as well as four invertebrate species. These represent all vertebrate orders that diverged at least 50 million years ago from their most recent common ordinal ancestor, meaning they developed soon after the last mass extinction 66 million years ago that wiped out the dinosaurs. They include about 10 critically endangered species, for example the kakapo, a flightless parrot from New Zealand. Whenever possible, an individual that carries both types of sex chromosomes will be sequenced for each species. The goal is to generate near-error-free, haplotype-phased, and complete genome assemblies for all species.
"We need to get through phase I to show this is a serious project and we're getting it done," said Erich Jarvis, chair of the G10K consortium and a professor at Rockefeller University who spearheads the VGP.
For some species, time is of the essence: in the last few years, four bird species alone have gone extinct without sufficient DNA being collected from them, Jarvis said.
For phase I, the project has decided to use four complementary technologies to generate sequence and long-range genome data: Pacific Biosciences long-read sequencing technology; 10x Genomics linked reads, which involve the use of Illumina short-read sequencing technology; Bionano Genomics optical DNA maps; and Hi-C proximity ligation data from Arima Genomics. In addition, it is generating RNA-seq or PacBio Iso-seq data to help annotate the genomes.
Other technologies, in particular Oxford Nanopore's long-read sequencing platform, are currently being evaluated for future phases, according to the organizers. Also, Hi-C proximity ligation data from Dovetail Genomics and Phase Genomics might be used going forward.
The cost for each species is estimated to range from $2,400 to $73,500, depending on genome size, which varies from 0.16 gigabases to 6 gigabases, according to the VGP website. Consumables costs alone are currently on the order of $15,000 per gigabase but could fall to less than $1,000 per genome if current trends continue, according to Gene Myers, a G10K council member and a director at the Max Planck Institute of Molecular and Cell Biology and Genetics (MPI-CBG) in Dresden, Germany.
Jarvis said that while a VGP genome assembly is currently about five to eight times as expensive as a short-read assembly, it is worth the extra cost. For example, his students have in the past spent up to a year cloning and sequencing genes for functional studies because the genes' sequences were inaccurate in a draft genome assembly. Also, according to Myers, the high-quality assemblies will provide new insights into repetitive sequences, which were not available in the past.
To complete phase I, the VGP needs to raise about $6 million, of which it had collected about $2.5 million as of last month. According to Jarvis, this includes philanthropic donations and contributions from G10K scientists' budgets. In addition, the Max Planck Society just awarded €1 million ($1.17 million) in funding that will go towards reagents for the project, he said.
So far, the VGP has funding commitments for 60 percent of the fish species and 70 percent of the bird species, he said, but mammals, reptiles, and amphibians "need help." One challenge is that mammalian genomes tend to be two to three times more expensive to sequence than those of other vertebrates, he said, because they require more sequencing data due to their size and complexity.
Data for the project is currently generated at three sequencing hubs: the Vertebrate Genomes Lab at Rockefeller University, a lab at the Wellcome Sanger Institute, and a lab at the MPI-CBG. Together, these three institutions have contributed at least $4 million for instrumentation, Jarvis said. Other sequencing hubs may join the project going forward, he said, for example BGI in China, which has collaborated with the G10K consortium in the past on sequencing vertebrate genomes with short-read technology, and which recently ordered 10 PacBio instruments.
According to Olivier Fedrigo, who heads the sequencing hub at Rockefeller, the three labs will each emphasize different species — for example, his lab will mostly do birds and reptiles, the Sanger Institute will focus on fish, and the MPI-CBG will mostly do both bats and fish — but they will share sequencing mammals. The Rockefeller lab is equipped with four PacBio Sequels, one 10x Genomics Chromium instrument, and one Bionano Genomics Saphyr instrument. It outsources Illumina sequencing to the New York Genome Center and currently sends out Hi-C analyses to Arima Genomics.
The data generated by the four technologies runs through an assembly pipeline that is in part supported by DNAnexus in the cloud. In short, PacBio reads are first assembled into phased contigs. Following that, the other data types are sequentially added to generate scaffolds and put the contigs into chromosomes. All data and assemblies are uploaded to the cloud using Amazon Web Services.
Once the assemblies have reached a certain quality standard, they are submitted to public repositories for annotation and alignment, including the National Center for Biotechnology Information's GenBank and RefSeq databases, the European Bioinformatics Institute's Ensembl, and the University of California, Santa Cruz's genome browser.
According to Adam Phillippy, chair of the VGP Assembly Working Group and head of the Genome Informatics Section of the National Human Genome Research Institute, the VGP requires a contig N50 size of at least 1 megabase and a scaffold N50 size of 10 megabases. In addition, 90 percent of the genome must be assembled into chromosomes, the base quality must reach at least Q40 (one error per 10,000 bases), and the data must be haplotype phased.
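As a rough illustration of how these thresholds could be checked, the Python sketch below computes an N50 from a list of sequence lengths and converts a Phred-scaled quality value into an error rate. The function names and the way the inputs are supplied are assumptions made for this example and are not taken from the VGP's pipeline; the haplotype-phasing requirement is not modeled.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def phred_to_error_rate(q):
    """Convert a Phred-scaled quality value to an expected per-base error rate."""
    return 10 ** (-q / 10)


def meets_quoted_thresholds(contig_lengths, scaffold_lengths,
                            chromosome_fraction, base_quality):
    """Check the numeric thresholds quoted in the article (hypothetical helper).

    chromosome_fraction is the share of the genome assigned to chromosomes (0 to 1);
    base_quality is a Phred-scaled value such as 40.
    """
    return (
        n50(contig_lengths) >= 1_000_000          # contig N50 of at least 1 megabase
        and n50(scaffold_lengths) >= 10_000_000   # scaffold N50 of at least 10 megabases
        and chromosome_fraction >= 0.90           # 90 percent assembled into chromosomes
        and base_quality >= 40                    # Q40 or better
    )
```

For example, `phred_to_error_rate(40)` evaluates to 0.0001, matching the one-error-per-10,000-bases figure cited above.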
The first 15 genome assemblies released today represent the most complete genomes for 14 species, including four mammals (two bat species, the Canada lynx, and the duck-billed platypus), five fish species, three bird species (with two genomes for the zebra finch — from a male and a female), one reptile, and one amphibian. Among them is the genome of the kakapo, a flightless parrot found only in New Zealand, of which fewer than 150 individuals are alive.
The data are available through the Genome Ark, an open-access database that is hosted by AWS free of charge. According to the VGP's data use policy, which is based on the Sanger Institute's, scientists are free to use the data but VGP members reserve the right to publish analyses first for a certain period of time, with some exceptions. Three of the genome assemblies have already been transferred to a public archive, and others are currently being submitted.
One of the challenges the project has encountered is the slow speed of the assembly process. "We have a lot of data queued up for assembly while we're waiting to speed up our assembly pipeline," said Phillippy. A big improvement this year came from the addition of PacBio's FALCON-unzip assembler to the DNAnexus environment, he said, which made it possible to assemble PacBio data from several species in parallel in the cloud. "This will be key for scaling to thousands of genomes," he said.
Over the coming year, his team will also work on moving scaffolding, which currently represents a bottleneck, into the cloud. In addition, the group continues to collaborate with PacBio and Phase Genomics on a new algorithm, called FALCON-Phase, to combine data from the two companies' platforms into a phased assembly. Other improvements will come from optimizing the order in which the four technologies are added in the assembly process and using, when available, trio sequence data, with short-read sequence data for the parents, Phillippy said.
What makes the assembly process difficult is that all species differ in their degree of heterozygosity, repeat sequences, sequence composition, and DNA quality, he said. For extreme cases, he hopes new technologies can help, such as Oxford Nanopore's sequencing platform, which has produced megabase-long reads.
Another challenge for the VGP has been the extraction of ultra-high molecular weight DNA from tissue or blood samples, which is required to generate long sequence reads and long-range mapping data. This starts with the collection of specimens and their storage, followed by the dissection of tissues and their shipment to the sequencing hubs. All DNA is currently extracted at the hubs, Fedrigo said, because shipping DNA can cause it to break into smaller fragments.
According to Jacqueline Mountcastle, a scientist at Rockefeller who heads the VGP sample preparation working group, the goal is to obtain genomic DNA fragments of at least 250 kilobases, and to do a single DNA prep that is suitable for all four technologies. This currently takes between seven and 10 days and is a largely manual process, she said. VGP members have compared different tissue types and preservation methods and have recommendations for different kinds of species, for example spleen or muscle samples for mammals, and whole blood for birds. For small animals, for example frogs, they have extracted DNA from an entire body to obtain sufficient material.
Jarvis said an additional bottleneck has been the ability to obtain permits from governments for shipping samples to the sequencing labs, both within the US and from other countries. According to one VGP member, it took nine months to receive permission to ship a sample from Colombia to the US, for example.
At the moment, the VGP generates about two new genome assemblies per week, Jarvis said, but it hopes to speed up to six genomes per week by the end of this year, and to 125 per week in the future.
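To put those rates in perspective, the short Python sketch below estimates how long it would take to assemble all 66,000 species at each quoted pace. It assumes a constant weekly rate and 52 working weeks per year, which are simplifications made for this example and not figures from the project.

```python
TOTAL_SPECIES = 66_000   # approximate number of vertebrate species cited by the VGP
WEEKS_PER_YEAR = 52      # assumption: assemblies are produced every week of the year


def years_to_finish(genomes_per_week, total=TOTAL_SPECIES):
    """Years needed to assemble `total` genomes at a constant weekly rate."""
    return total / genomes_per_week / WEEKS_PER_YEAR


if __name__ == "__main__":
    # Rates quoted by Jarvis: current, end-of-year target, and long-term target.
    for rate in (2, 6, 125):
        print(f"{rate:>3} genomes/week: about {years_to_finish(rate):.1f} years")
```

Under these assumptions, the current pace would take more than 600 years, six genomes per week more than 200 years, and 125 per week roughly a decade.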