NEW YORK – The Genome 10K's Vertebrate Genomes Project (VGP) and the Earth BioGenome Project (EBP) are moving ahead in their quests to sequence the genomes of all living vertebrate species and all eukaryotic species, respectively.
During a presentation at a joint G10K-VGP/EBP meeting at Rockefeller University yesterday, members of the VGP, EBP, and affiliated projects outlined their progress and plans, including recent funding. They also discussed the significance of high-quality reference genomes for conservation projects, countering criticism that the genomic resources the projects generate may not come fast enough to address the rapid extinction of species through climate change and other human activities.
The VGP, first announced in early 2018, aims to generate high-quality genome assemblies that are nearly error free, complete, and haplotype-phased for all 71,500 or so living vertebrate species (a number that was recently revised from a previous 66,000). It plans to proceed in three phases, at an estimated total cost of about $600 million. The first phase will include one representative species for each of 260 vertebrate orders. VGP members decided to combine data from several technologies for this part of the project, including long reads from Pacific Biosciences, linked reads from 10x Genomics, optical maps from Bionano Genomics, and Hi-C proximity ligation data from Arima Genomics. Most of the data is being generated at sequencing labs at Rockefeller University, the Wellcome Sanger Institute, and the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden.
A year ago, the VGP announced the completion and release of the first 15 reference genome assemblies, representing 14 species and 13 orders. Since then, the project has generated genome assemblies for another 101 species (100 vertebrates and a starfish), representing 77 additional taxonomic orders, which are either finished or in the final stages of assembly. About 60 assemblies have already been made available through the Genome Ark database and the remainder will be posted within the coming weeks. They will also be annotated and displayed in public genome browsing and analysis databases such as the UCSC Genome Browser, which recently added 24 assemblies.
Erich Jarvis, chair of the VGP and a professor at Rockefeller University, said that his team has learned two lessons. First, genomes "are full of repeats" and require sequence data that spans those repeat regions. Second, the two haplotypes of each genome are difficult to assemble, which is why the project has used a trio approach for some of the recent assemblies, utilizing sequence data from the parental genomes.
In order to scale the assembly process, the project started to move some of the algorithms to the cloud, he said, which has also been challenging. "We spent a lot of time this year retooling the algorithms," in addition to training more scientists in using them, he said.
Adam Phillippy, head of the VGP assembly group and a researcher at the National Human Genome Research Institute, confirmed that his group has been "struggling with scaling up to thousands of genomes" and stressed the need for investments in new analysis tools.
The assembly tools are "still under active development," said Arang Rhie, also a researcher at NHGRI, who encouraged other scientists to use the VGP dataset to help develop better methods.
Rhie told GenomeWeb that the project is considering adding ultralong nanopore reads to the second phase of the VGP, noting that they have shown good results in other projects, though she cautioned that those reads are still difficult to scale at the moment.
According to Gene Myers, a researcher at the MPI in Dresden, another challenge for scaling up is "feeding the machines," which he said takes too much human effort at the moment and will require more automation and lab information management systems going forward. He also said that the project is "not taking full advantage, informatically, of the data generated" and that new analysis methods could potentially yield even better assemblies.
The VGP continues to include critically endangered species, according to Jarvis, such as the vaquita porpoise, of which only a few dozen animals are left. "If it goes extinct, at least we will have its genome for eternity," he said.
While there are no centralized funding sources for the project, the Howard Hughes Medical Institute, the Wellcome Sanger Institute, Rockefeller University, the Max Planck Institute, and the National Institutes of Health have invested in sequencing infrastructure. In addition, VGP scientists have raised $4.8 million through crowdfunding, out of the $6 million required to complete the first phase of the project, and continue to raise funds.
The VGP is just one of 21 projects affiliated with the Earth BioGenome Project, a network of 26 partner organizations in 14 countries that has the overall goal of sequencing and annotating the genomes of the 1.5 million known eukaryote species within 10 years. The estimated price tag of the project is $4.7 billion, which EBP chair Harris Lewin, a professor at the University of California, Davis, pointed out is less than the $5.4 billion (in 2012 dollars) cost of the Human Genome Project.
The first phase of the EBP aims to sequence one representative species for each of the approximately 9,300 eukaryotic taxonomic families.
Gary Schroth, vice president and distinguished scientist at Illumina, said during the meeting that his company will donate "100 genomes worth of Illumina data" for high-quality reference genomes. The in-kind donation will come in the form of reagents for the sequencing centers producing the data, he told GenomeWeb, and the sequence reads will be used to generate 10x Genomics linked reads and Hi-C proximity ligation data.
Another EBP-affiliated project is the Darwin Tree of Life project at the Wellcome Sanger Institute, which recently won £8 million ($9.8 million) over two years to get the effort off the ground. According to Mark Blaxter, who recently joined the Sanger Institute to lead the project, the focus will be on species native to the British Isles, which he said are "a perfect ecological laboratory" and could become a test case for the larger EBP.
The Darwin Tree of Life project is collaborating with a number of institutions across the UK to collect samples and develop new techniques for sequencing single-cell organisms and complex genomes, including the Natural History Museum in London; the Royal Botanic Gardens, Kew; the Earlham Institute; Exeter University; and Edinburgh University.
Blaxter said the goal is to sequence 60,000 UK species over the next 10 to 12 years, starting with about 1,000 species per year in the first few years and about 5,000 per year after that.
The California Conservation Genomics Project, another EBP-affiliated project, just won a $10 million grant from the state of California, according to Brad Shaffer, director of the UCLA/La Kretz Center for California Conservation Science, who spoke at yesterday's meeting via video link. The primary goal of the project is to help conserve threatened and endangered species in California. The initial three-year phase aims to sequence 100 individuals for each of 150 species and use the resulting data on genomic diversity to set conservation goals and priorities.
In contrast to the VGP, the project will not produce high-quality reference genomes but instead focus on generating large numbers of lower-quality genomes, Shaffer said, while also utilizing the high-quality reference genomes from the VGP and other EBP projects.
While "super-high-quality genomes" are not always useful for conservation projects on their own, Shaffer explained, they provide a reference that, in combination with other genomic data, can yield information about a species' genomic diversity.
Blaxter mentioned that the high-quality genome of the golden eagle that was finished last year, for example, is already being used as part of a species conservation project.
Also, according to Lewin, long runs of homozygosity in a genome, which reflect how inbred an individual is, can only be assessed with the long reads used in the high-quality assemblies. Homozygosity runs may even serve as a measure of a species' extinction risk, he added, and could be "extremely helpful" as a conservation tool.
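To make the idea of homozygosity runs concrete, the following is a minimal sketch of how such runs and an inbreeding-style summary might be computed. It is a toy illustration, not the method used by the EBP or any project mentioned here. It assumes genotype calls have already been placed on a contiguous reference sequence, and the function names (`find_roh`, `froh`), the inputs, and the length threshold are all hypothetical. Real tools also tolerate genotyping errors and missing data, which this sketch does not.

```python
from typing import List, Tuple


def find_roh(positions: List[int], is_het: List[bool],
             min_length: int) -> List[Tuple[int, int]]:
    """Find runs of homozygosity along one chromosome.

    positions: sorted coordinates of genotyped sites (hypothetical input).
    is_het: True where the individual is heterozygous at that site.
    min_length: minimum span in base pairs for a run to be reported.
    Returns (start, end) coordinates of each qualifying run.
    """
    runs = []
    start = None  # first homozygous site of the current run
    prev = None   # most recent homozygous site of the current run
    for pos, het in zip(positions, is_het):
        if het:
            # A heterozygous site ends the current run, if any.
            if start is not None and prev - start >= min_length:
                runs.append((start, prev))
            start = None
        else:
            if start is None:
                start = pos
            prev = pos
    # Close a run that extends to the last genotyped site.
    if start is not None and prev - start >= min_length:
        runs.append((start, prev))
    return runs


def froh(runs: List[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of the genome covered by the reported runs."""
    return sum(end - start for start, end in runs) / genome_length


# Toy data: seven sites, two of them heterozygous.
positions = [100, 200, 300, 400, 500, 600, 700]
is_het = [False, False, False, True, False, False, True]
runs = find_roh(positions, is_het, min_length=150)
coverage = froh(runs, genome_length=1000)
```

In this toy input only the first stretch of homozygous sites spans at least 150 bases, so a single run is reported. The longer and more numerous such runs are across a real genome, the larger the covered fraction, which is the sense in which the measure reflects inbreeding.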
Oliver Ryder, director of conservation genetics at the San Diego Zoo, added that high-quality genomes may also help to develop genetic tests. The California condor population, for example, carries a mutation for a lethal autosomal recessive disorder, he said, and to develop a carrier screening test, high-quality genomes were needed.