NEW YORK (GenomeWeb) – Which parts of the genome are important for building a mammal? With new grant funding in hand, members of the 200 Mammals Project are continuing to work away at a genomics resource for answering this and many more questions.
Much of the mammal sequencing effort is undertaken with short-read sequencing and DISCOVAR de novo assembly, making it possible to sequence nanogram-scale DNA samples, though the researchers plan to include some representative genomes from each mammal family that have more contiguous assemblies.
The effort is intended to build on findings from a genome project that was focused on conserved sequence sites in 29 mammalian genomes, picking up sites of conservation down to a resolution of 12 base pairs via comparisons with a human reference genome. Results from the 29 mammals genome study appeared in Nature in 2011.
"In the years since then, it turned out that this measure of conservation is one of the most powerful measures that you can use for identifying which positions in the genome are actually important," explained 200 Mammals Project co-principal investigator Elinor Karlsson, director of vertebrate genomics at the Broad Institute.
"This has been fantastic," she said, "because when you do these big disease studies — genome-wide association studies and things — you tend to get these regions of the genome out, and you need to figure out what in that region is important and what isn't."
With the 200 Mammals Project, the goal is to creep closer to an understanding of conservation at individual sites in the genome by sequencing species from across the mammalian tree, focusing on species diversity and maximized branch lengths, or distances between different species.
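To see why added branch length sharpens resolution, consider a deliberately simplified model, offered here only as an illustration and not as the project's actual method. Assume substitutions at a neutral site arrive as a Poisson process, so a neutral site shows no change across the whole tree with probability exp(-T), where T is the total branch length in expected substitutions per site. Also assume sites evolve independently. The function names and the example values of T are hypothetical.

```python
import math

def chance_conservation(total_branch_length, window=1):
    """Probability that `window` adjacent neutral sites all show no
    substitution across the tree (simplified Poisson assumption)."""
    if total_branch_length <= 0 or window < 1:
        raise ValueError("branch length must be positive and window >= 1")
    return math.exp(-total_branch_length * window)

def min_window(total_branch_length, alpha=0.001):
    """Smallest window size (in bases) whose probability of looking
    conserved purely by chance is at or below `alpha`."""
    if total_branch_length <= 0 or not 0 < alpha < 1:
        raise ValueError("need positive branch length and 0 < alpha < 1")
    return math.ceil(-math.log(alpha) / total_branch_length)

# Illustrative branch lengths only; these are not values from the project.
for t in (1.0, 4.0, 8.0):
    print(t, min_window(t))
```

Under these assumptions, a longer tree lets a shorter window reach the same chance-conservation threshold, which is the intuition behind sampling widely across the mammalian tree.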
"The more branch length, the more capacity you have to resolve that conservation down to the single-base level," Karlsson said. Some species were selected to coincide with ongoing research by collaborators, she noted, while others were "particularly cool," such as the screaming hairy armadillo or a scorpion-eating mouse species.
Project collaborator Oliver Ryder, director of conservation genetics at San Diego Zoo Global, provided many of the samples used in the study. As Ryder's participation in the project might suggest, the research is expected to boost species conservation efforts, even though it initially sheds light only on genetic conservation.
"For species conservation, you can use genetics to do things like measure the diversity of the population and whether animals are able to move around an area the way they need to," Karlsson said. "A genome assembly is the first thing you need to have in order to be able to use many of these other tools."
Karlsson outlined the project in a poster presented at the annual Plant and Animal Genomes conference in San Diego in January, noting in the abstract that data from the 200 Mammals Project effort "will allow comparisons across lineages, analyses of the evolutionary history of different variants or binding motifs, and correlations between candidate functional variants and different constraint patterns and elements."
In mid-August, the National Human Genome Research Institute provided notice of a nearly $800,000 award to principal investigator Bruce Birren at the Broad Institute to support the effort from the beginning of this month through the end of August 2018. The project officially kicked off late last September, and NHGRI funded its first year to the tune of nearly $834,000.
Initially, the researchers set out to sequence 150 new genomes to analyze alongside 50 available genome sequences. So far, they have sequenced around 137 samples, and the collection available for studying sequence conservation will likely exceed 200, given the rate at which new mammalian genomes are appearing in the literature, Karlsson said.
For the newly sequenced genomes, the team typically produced one lane of sequence data on the Illumina HiSeq 2500 instrument and assembled the genome using DISCOVAR de novo assembly software developed by Broad researcher David Jaffe and colleagues.
The approach has two main advantages for the 200 Mammals Project, Karlsson explained: it's relatively inexpensive and requires only about a nanogram of DNA. The latter feature was particularly advantageous, since many of the samples were small or hard to come by.
"We were able to get genomes that had very good contig N50s — so they've got pretty good contiguity at the contig level, but they don't have the longer-range level assemblies that are available with some of the technologies that are out there," she said.
University of California at Santa Cruz biomolecular engineering researcher Benedict Paten noted in an email message that the DISCOVAR approach being used by the 200 Mammals team "seems to give us coverage of the large majority of the genomes at excellent base accuracy. We're finding that most genes are assembled into single contigs or scaffolds, making these assemblies pretty good for studying gene evolution."
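The contig N50 that Karlsson mentions is a standard contiguity summary: the length L such that contigs of length L or longer together hold at least half of the assembled bases. A minimal sketch of the calculation follows. The function name and the sample lengths are hypothetical and are not taken from the project's assemblies.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly defines the N50.
        if running >= half_total:
            return length

# Hypothetical contig lengths, for illustration only.
print(contig_n50([80, 70, 50, 40, 30, 20]))
```

Because N50 is computed on contigs here, it says nothing about how those contigs are ordered along chromosomes, which is why a good contig N50 can coexist with a lack of long-range scaffolding.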
But sequence assembly strategies have also jumped forward since investigators first planned the mammalian project. The investigators are now planning to collaborate with Harris Lewin at the University of California at Davis Genome Center to get long-range assemblies for one species per mammalian family using Dovetail technology to scaffold the data.
And despite the advantages of using DISCOVAR assembly for the broader mammal set, Paten pointed out that the approach alone does not produce chromosome or chromosome arm-level assemblies and may miss, or misrepresent, some repeat expansions or recent duplications. The resulting genomes are also monoploid, he added, muddying the view of the underlying diploid haplotypes they represent.
"All of these limitations will largely be solved by the [Genome 10K] approach," explained Paten, who is also involved with that project. The G10K team is focused on producing "a much higher quality set of genomes, albeit at significantly higher cost," he said.
Earlier this week, 10x Genomics announced that its Chromium de novo assembly approach had been selected for the G10K Project, which Karlsson said is "putting more emphasis on longer-range, higher contiguity genomes."
The 10x Genomics technology will reportedly contribute to genome scaffolding, phasing, and error correction in the upcoming phase of the G10K Vertebrate Genome Project.
While many of the 200 Mammals Project investigators are also involved in the G10K effort, the two projects remain independent, Karlsson explained. Still, the mammal-focused effort is "showing what you can do by getting this number of genomes from this number of species" and "laying the groundwork for what the next steps will be as we start upgrading the genomes to longer contiguity genomes."
As sequencing for the 200 Mammals Project starts wrapping up, the team is now turning its attention to data analysis. The search for sequence conservation will hinge on some of the same approaches used for the 29 mammals data, along with reference-free genome alignment software called Cactus that was developed by Paten, David Haussler, and colleagues at UCSC.
"In the 29 mammals project, we aligned everything against the human genome, which meant that we got conservation for any position in the human genome," Karlsson explained. "If there was a piece of the genome that was missing in humans but present in all 28 other species, we wouldn't actually know anything about it because we referenced everything on humans."
With Cactus, "we're going to be able to look at conservation from the perspective of any species in the project," she said.
While it's too soon to think about follow-up studies, Karlsson said she's keen to see more population genetic data for mammals included in the current analysis, particularly for at-risk species, to investigate everything from species relationships to potential avenues for conservation.
"Both in terms of the number of species we're looking at and how much information we're getting in each species, this is really just the first step," she said. "I'm really excited going forward to figure out how we can take these genomes and, within a species, what we can learn about that species. How can we help species conservation efforts?"