Mice have been used for decades to study the role that genes play in a range of diseases. But precise data on variation in gene sequences of mice used in the past has remained elusive. Now, thanks to a group of computer scientists at the University of California, Los Angeles, an international effort to map mouse genetic variation in 17 commonly used strains of mice is nearly complete.
The study, which was led by groups from the Wellcome Trust Sanger Institute and the Wellcome Trust Centre for Human Genetics in Oxford, England, needed some innovative computational assistance to measure and catalogue the full set of variants for all 17 sequences, just the first task on the road to identifying the disease-causing variants. Fortunately, UCLA associate professor Eleazar Eskin had already developed a computational technique, called imputation, that predicts variants where a sequencer fails to measure them.
"We were involved in this project due to our previous involvement in a large-scale sequencing study of mouse strains that was performed by Perlegen and the National Institute of Environmental Health Sciences," Eskin says. "All sequencing projects occasionally fail when trying to measure a variant. In the context of the Perlegen project, we developed an imputation, which predicts data that the sequencer was unable to collect. The idea behind imputation is that there is a lot of correlation in the genetic variation, and if a portion of it is missing, the correlation structure can be utilized to predict the missing data."
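The idea Eskin describes can be illustrated with a small sketch. The Python below is not the method used in the study; it is a deliberately simple stand-in that assumes each strain is a list of 0/1 allele calls, with None marking a call the sequencer missed. The function name, the window size, and the example strains are all hypothetical. To fill a gap, it looks at how closely every other strain matches at neighboring sites and lets the most similar strains vote on the missing value.

```python
from collections import Counter


def impute(genotypes, window=3):
    """Fill missing calls (None) using similarity at neighboring sites.

    genotypes: dict mapping strain name -> list of allele calls (0, 1, or None).
    Illustrative only; this is an assumed scheme, not the study's algorithm.
    """
    strains = list(genotypes)
    n_sites = len(next(iter(genotypes.values())))
    result = {s: list(calls) for s, calls in genotypes.items()}

    for strain in strains:
        for site in range(n_sites):
            if genotypes[strain][site] is not None:
                continue  # call was observed, nothing to predict
            lo, hi = max(0, site - window), min(n_sites, site + window + 1)
            votes = Counter()
            for other in strains:
                if other == strain or genotypes[other][site] is None:
                    continue
                # Similarity: number of matching observed calls near the gap.
                score = sum(
                    1
                    for j in range(lo, hi)
                    if j != site
                    and genotypes[strain][j] is not None
                    and genotypes[strain][j] == genotypes[other][j]
                )
                votes[genotypes[other][site]] += score
            # Only fill the gap if at least one neighbor supports a value.
            if votes and max(votes.values()) > 0:
                result[strain][site] = votes.most_common(1)[0][0]
    return result


# Hypothetical strains: B resembles A everywhere it was measured.
example = {
    "A": [0, 0, 1, 1, 0],
    "B": [0, 0, None, 1, 0],
    "C": [1, 1, 0, 0, 1],
}
imputed = impute(example)
print(imputed["B"])
```

Because strain B agrees with strain A at every measured site and with strain C at none, the missing call in B is filled with A's value. Real imputation methods model the correlation structure far more carefully, but the principle is the same.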
In a September Nature paper, the researchers report data confirming that mice have a complex and rich evolutionary history, and demonstrate how the new map can be used to identify allele-specific expression that describes the activity level of a specific gene.
"The study discovered a very large number of variants and the majority of the variants in the mouse strains," Eskin says. "The main advantage of this resource is that it will finally allow us to connect traits to actual genetic variants that cause the differences. Because previously, it was only possible to connect a trait to a region of the genome that contained the variant, but not the actual variant."