NEW YORK — Researchers from the Dog10K Consortium have generated a dataset of nearly 2,000 canids, including breed dogs, village dogs, wolves, and coyotes, to capture their genetic diversity and build a reference panel of that variation.
The international Dog10K project, first announced in 2015, aims to sequence thousands of canids to better understand dog domestication, breed formation, disease susceptibility, and more. As they reported Tuesday in Genome Biology, researchers from the consortium have now sequenced 1,987 canids to develop a catalog of 144,000 structural variants, 14.4 million indels, and 34 million SNVs, which they used to explore canine diversity and breed relationships and to create a panel to enable genotype imputation.
"We collected 2,000 samples from all over the world, engaging investigators from as many places as we could," co-senior author Elaine Ostrander, chief investigator at the Cancer Genetics and Comparative Genomics Branch at the US National Human Genome Research Institute, said in an email. "We specifically sought out investigators who had worked on various dog breeds or other canids of interest to join the consortium. We tried to get dogs of many different phenotypes and disease susceptibilities."
In all, the samples encompassed 1,649 breed dogs, 18 mixed or other dogs, 336 village dogs, 68 wolves, and four coyotes. The breed dogs represented 321 different breeds. The researchers sequenced 1,987 of these canids using the Illumina HiSeq X Ten platform and aligned the reads to a German shepherd dog genome assembly and three Y chromosome assemblies from a Labrador retriever.
In all, they identified more than 48 million SNVs, indels, and structural variations. They also annotated the SNVs they uncovered using estimates of evolutionary constraint, which they said would help infer function.
Based on the SNVs they detected, the researchers estimated the total portion of genetic variation they captured within the various dog breeds. For 22 breeds, they estimated that they identified more than 90 percent of the genetic variation that would have been found had they analyzed 100 dogs from that breed.
However, for 20 other breeds, less than 75 percent of the total predicted variation was identified. This wide range is likely due to how the breeds were formed, Ostrander noted.
She and her colleagues also examined relatedness among the dogs, finding that they formed 25 major clades based largely on shared occupation, morphology, or geographic origin. While dogs from the same breed were more likely to have higher levels of haplotype sharing, dogs from breeds that fall in the same clade also exhibited higher haplotype sharing. The researchers in particular noticed increased haplotype sharing between the terrier and mastiff clades, which they said could reflect breed development through admixture or recent ancestry involving multiple clades.
Because of the size and diversity of this dataset, the researchers also investigated whether it could be used for genotype imputation, finding it could enable the generation of high-confidence calls across different genotyping platform densities, even for dog breeds not included in the cohort.
"Imputation will allow investigators to ask many more questions, and I envision the dataset will be used in many ways," Ostrander said, adding that she could imagine studies into cancer susceptibility, canine behavior, canid evolution, and more. "The dataset is so rich … that I think Dog10K will serve as a reference for virtually any type of canine mapping study we can think of."
The next step for Dog10K, she added, will be to fold in dog genomes that have been generated by other groups as well as to expand sample collection, particularly of village dogs in Africa or India and of wild canids.
"Fortunately, there are many interested and motivated investigators — many more than when the project was originally conceived — that would be more than willing to participate in a version 2.0," Ostrander said.