NEW YORK (GenomeWeb) – Genome-wide association studies on non-human organisms have been difficult to conduct because haplotype reference panels are lacking for such organisms. To get around this problem, researchers at the University of Oxford, the Wellcome Trust Centre for Human Genetics, and the University of California, Los Angeles have developed a bioinformatics approach that uses low-coverage whole-genome sequence data to impute genotypes.
The team described its method this week in Nature Genetics, demonstrating it on genomic data from more than 2,000 outbred mice sequenced to a coverage of just 0.15-fold, as well as on genome sequence data from more than 11,000 Han Chinese individuals.
In a second study, published in the same issue of Nature Genetics, the researchers showed the utility of the method in the 2,000 outbred mice, linking genes to specific traits.
Robert Davies, a statistical geneticist at the Wellcome Trust Centre for Human Genetics and lead author of the study describing the method, told GenomeWeb that the approach, called Sequencing to Imputation Through Constructing Haplotypes, or STITCH, could help open up genome-wide association studies to a whole range of organisms for which no haplotype reference panels are currently available. Already, he said, his laboratory is testing it in wheat and pigs.
Davies said that the group developed the method out of a specific need. Colleagues had sequenced 2,000 mice to low coverage and had tried to use existing software to genotype the mice, but found that the methods were not working.
As a result, Davies said, they began developing an alternative method that could use the next-generation sequencing (NGS) data without additional microarray data or a haplotype reference panel.
The method is based on a standard hidden Markov model, Davies said, and is similar to a genotype imputation method previously developed for microarray data. The main difference is that it operates at the level of the sequencing read, whereas algorithms like Beagle that work on microarray data treat each SNP independently. With sequencing, however, a single read often covers several SNPs. "Reads can span four or five SNPs," Davies said. With STITCH, "we essentially changed the model to accommodate for the fact that the SNPs on the read are not independent of each other."
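The distinction Davies draws can be illustrated with a minimal sketch. This is not STITCH's code; it assumes a simplified diploid model in which each read is drawn from one of two haplotypes with equal probability, and the function names, the binary allele encoding, and the 1 percent error rate are all hypothetical choices made for illustration.

```python
from math import prod, isclose

def base_likelihood(observed_alt, hap_allele, error_rate=0.01):
    """P(observed base | haplotype allele), with a flat sequencing error rate (assumed)."""
    p_alt = hap_allele * (1 - error_rate) + (1 - hap_allele) * error_rate
    return p_alt if observed_alt else 1 - p_alt

def read_aware_likelihood(read, hap_a, hap_b, error_rate=0.01):
    """All SNPs on a read come from the SAME haplotype, so multiply within
    each haplotype first, then average over the two haplotypes."""
    like_a = prod(base_likelihood(obs, hap_a[s], error_rate) for s, obs in read)
    like_b = prod(base_likelihood(obs, hap_b[s], error_rate) for s, obs in read)
    return 0.5 * like_a + 0.5 * like_b

def read_unaware_likelihood(read, hap_a, hap_b, error_rate=0.01):
    """Each SNP is treated independently, as if it were its own read."""
    return prod(
        0.5 * base_likelihood(obs, hap_a[s], error_rate)
        + 0.5 * base_likelihood(obs, hap_b[s], error_rate)
        for s, obs in read
    )

# Two toy haplotypes over four SNPs: 1 = alternate allele, 0 = reference.
hap_a = [1, 1, 1, 1]
hap_b = [0, 0, 0, 0]

# A read is a list of (snp_index, observed_alt) pairs.
consistent_read = [(0, True), (1, True), (2, True), (3, True)]   # matches hap_a
mosaic_read = [(0, True), (1, False), (2, True), (3, False)]     # matches neither

aware_consistent = read_aware_likelihood(consistent_read, hap_a, hap_b)
aware_mosaic = read_aware_likelihood(mosaic_read, hap_a, hap_b)
unaware_consistent = read_unaware_likelihood(consistent_read, hap_a, hap_b)
unaware_mosaic = read_unaware_likelihood(mosaic_read, hap_a, hap_b)
```

In this toy setup the per-SNP calculation assigns the same likelihood to both reads, while the read-level calculation strongly favors the read that matches a single haplotype. That linkage information is what a per-SNP model discards.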
The researchers ran STITCH on the mouse sequence data and imputed genotypes at 7.1 million SNPs. They validated the accuracy of the results by comparing them to four mouse genomes that were sequenced to 10x coverage, as well as to 44 mouse genomes they had genotyped on an array platform of more than 21,000 SNPs.
After filtering the initial results, they found that their method was around 98 percent and 97 percent concordant with the microarray and high-coverage whole-genome sequence data, respectively.
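The article does not spell out how concordance was calculated, so the following is only a sketch of one common definition: the fraction of sites at which the imputed genotype matches the validation genotype, counting only sites called in both. The function name, the 0/1/2 genotype encoding, and the toy data are assumptions, not details from the study.

```python
def genotype_concordance(imputed, truth):
    """Fraction of matching genotype calls among sites called in both sets.

    Genotypes are coded as alternate-allele counts (0, 1, or 2);
    None marks a site with no call (for example, one removed by filtering).
    """
    compared = [(i, t) for i, t in zip(imputed, truth)
                if i is not None and t is not None]
    if not compared:
        raise ValueError("no sites called in both genotype sets")
    matches = sum(1 for i, t in compared if i == t)
    return matches / len(compared)

# Toy data: six sites, one uncalled in the imputed set, one mismatch.
imputed_calls = [0, 1, 2, 1, None, 0]
array_calls = [0, 1, 2, 2, 1, 0]
example_concordance = genotype_concordance(imputed_calls, array_calls)
```

Here five sites are compared and four agree, giving a concordance of 0.8.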
Davies said that when they ran STITCH in a "read unaware" mode, accuracy dropped to about 88 percent, demonstrating the importance of accounting for the fact that each read could contain multiple SNPs.
The team also tested Beagle, the algorithm developed for microarray data, on the NGS data and found that accuracy was greatly reduced. That algorithm was only 8 percent and 22 percent concordant with the higher-coverage WGS data and array data, respectively.
Next, the researchers tested STITCH on more than 11,000 genomes of Han Chinese individuals that were sequenced to 1.7x coverage, comparing STITCH results to genotypes from 72 array-genotyped individuals and nine individuals sequenced to 10x coverage, and again found their method to be highly concordant.
In the second study, researchers led by the labs of Jonathan Flint and Richard Mott at the Wellcome Trust Centre for Human Genetics described how they used STITCH to impute ancestral haplotypes for nearly 2,000 mice and were then able to link a number of genes to different phenotypes, including genes related to sleep, cage activity, startle reaction, bone mineral content, and wound healing.
The study is "the first to use extremely low-coverage sequence to generate accurate genotypes without a reference panel," the authors wrote.
Davies said that STITCH would be especially relevant in non-human genomes, for which there are no haplotype reference panels available, and could be more cost effective than microarrays. He estimated that low-coverage sequencing costs around £45 ($60) per mouse while a microarray would cost around £65 ($87) per mouse. In addition, sequencing provides much more data than microarrays and does not require any prior knowledge about the variants and how they segregate in a population.
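To put Davies' per-mouse estimates in context, the arithmetic below scales them to a cohort. The cohort size of 2,000 is an illustrative round number close to the size of the mouse study, and the calculation assumes the quoted per-sample figures are the only costs; any other costs are ignored.

```python
def cohort_cost(n_samples, cost_per_sample_gbp):
    """Total genotyping cost in pounds, assuming a flat per-sample price."""
    return n_samples * cost_per_sample_gbp

N_MICE = 2000              # illustrative cohort size (assumption)
SEQ_COST_GBP = 45          # low-coverage sequencing, per Davies' estimate
ARRAY_COST_GBP = 65        # microarray, per Davies' estimate

sequencing_total = cohort_cost(N_MICE, SEQ_COST_GBP)
array_total = cohort_cost(N_MICE, ARRAY_COST_GBP)
saving = array_total - sequencing_total
relative_saving = saving / array_total

print(f"Sequencing: £{sequencing_total:,}")
print(f"Microarray: £{array_total:,}")
print(f"Saving: £{saving:,} ({relative_saving:.0%})")
```

Under these assumptions, sequencing would cost £90,000 against £130,000 for arrays, a saving of £40,000, or roughly 31 percent.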
"Instead of needing population-specific arrays, no customization is necessary," Davies said. However, STITCH requires a high-quality reference genome and large numbers of samples. "If the reference genome is incorrect, the method will suffer," Davies said, adding that it is nonetheless "very promising."
STITCH is currently available for free to academic researchers. Going forward, Davies said, it would be interesting to see how it is used and in what species. Researchers in the ag-bio space might be especially interested in using the method, he said.