LOS ANGELES – Two studies out of Mexico highlight the contributions being made by that country to diversifying large population datasets and the benefits of doing so.
At the American Society of Human Genetics annual meeting here this week, Mashaal Sohail of the Universidad Nacional Autónoma de México in Cuernavaca showed how the Mexico Biobank could be used to generate genotype-phenotype data on underrepresented populations, revealing poorly known genetic histories and generating biomedically relevant findings.
Meanwhile, Andrey Ziyatdinov of Regeneron demonstrated scalable genotype and haplotype-based approaches to characterize fine-scale population structure and admixture through the Mexico City Prospective Study (MCPS).
Latin American populations remain underrepresented in genomic research even though people with those ancestries make up a sizeable share of humanity, approximately 8 percent of the global population by some estimates. Because diverse genetic histories shape the variation seen in complex traits and diseases, establishing detailed links between Mexicans' genotypes and phenotypes can both shed light on the country's genetic history and provide medically relevant insights.
Using data from the Mexico Biobank, Sohail and her colleagues developed a suite of methods for ancestry deconvolution and inference of identity-by-descent (IBD) segments, with which the team inferred detailed ancestral histories going back 200 generations in different Mesoamerican regions. She tied these histories to a range of complex traits, identifying significant genetic and environmental factors in the biobank data that predicted variation in traits such as height, body mass index (BMI), and triglycerides. In addition to the ASHG presentation, the team's work appears as a preprint on bioRxiv.
To chart the history of admixture in Mexico while ensuring diverse Indigenous and rural representation, Sohail enriched the sample for individuals who speak an Indigenous language and who came from rural localities, as reported during recruitment through the National Health Survey conducted in 2000.
Evaluating approximately 1.8 million single nucleotide polymorphisms (SNPs) from 6,057 individuals from all of Mexico's 32 states, Sohail and her team found that modern Mexicans' ancestry arises largely from ancestries that would have been found in Central America, Western Europe, and West Africa prior to the 15th century. They also found a smaller group with East Asian ancestry, located mainly in Guerrero state and linked to the Manila Galleon trade, which brought goods from China and the Philippines to Mexico. In addition, they noted a distinct population substructure distinguishing the Mayan region of southern Mexico from the rest of the country.
While Sohail's findings are of obvious anthropological interest, they also show patterns of genetic and complex trait variation.
Sohail and her team observed, for instance, that runs of homozygosity (contiguous regions of the genome where an individual is homozygous across all sites) reflected demographic histories as well as changes in allele frequency distributions, with fewer rare variants appearing among individuals with higher Central American ancestries relative to those with more Western European and West African ancestries.
These distributions corresponded to population bottlenecks, which appear to have led to differences in complex traits such as height, BMI, triglycerides, and cholesterol, among others.
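The definition of a run of homozygosity given above can be made concrete with a toy scan over one individual's genotypes. This is only a minimal sketch and not the method used in the preprint: the function name `find_roh`, the `min_sites` threshold, and the encoding of genotypes as alternate-allele counts (0, 1, or 2) are all assumptions made for illustration. Production tools typically measure runs in physical distance and tolerate occasional heterozygous or missing calls.

```python
from typing import List, Tuple


def find_roh(alt_counts: List[int], min_sites: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) site indices of runs of homozygous calls.

    alt_counts holds one value per site, in genomic order:
    0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
    Runs shorter than min_sites are ignored (hypothetical threshold).
    """
    runs = []
    start = None  # index where the current homozygous run began, if any
    for i, g in enumerate(alt_counts):
        if g in (0, 2):
            if start is None:
                start = i
        else:
            # A heterozygous site closes the current run.
            if start is not None and i - start >= min_sites:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last site.
    if start is not None and len(alt_counts) - start >= min_sites:
        runs.append((start, len(alt_counts) - 1))
    return runs


toy_genotypes = [0, 2, 2, 1, 0, 0, 1, 2, 2, 2, 0]
print(find_roh(toy_genotypes, min_sites=3))  # [(0, 2), (7, 10)]
```

Longer and more numerous runs in such a scan are what one would expect to see in populations that passed through a bottleneck, which is the pattern the article describes.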
By examining genetic ancestry, Sohail was also able to tease out likely environmental contributions to complex traits. For example, while individuals with more Central American ancestry tended to be significantly shorter than individuals of other ancestries, younger people with any range of Central American ancestries tended to be taller than older individuals with the same ancestries.
Similarly, Sohail found significantly lower cholesterol in individuals who speak an Indigenous language, while those who were older or who lived in urban environments or at high altitude tended to have higher cholesterol. Lower HDL and LDL levels were also associated with speaking an Indigenous language regardless of ancestry, pointing toward cultural and other factors, such as diet, that likely outweigh some genetic factors.
"Our work is a demonstration of the value of generating genotype-phenotype data on underrepresented populations to reveal lesser-known genetic histories and generate findings of biomedical relevance," Sohail's team wrote in the preprint.
Using the larger but more geographically constrained MCPS dataset, Ziyatdinov and a team of researchers from across industry and academia developed two tools to facilitate future research using ancestry-specific variants, work that was also described in a recent bioRxiv preprint.
While covering less geographic area than the Mexico Biobank, the MCPS is a rich dataset consisting of a prospective cohort of over 150,000 adults recruited from Mexico City's Coyoacán and Iztapalapa districts. The dataset includes genotype and exome sequencing data for all participants, with whole-genome sequencing for 10,000 selected individuals.
"While the Mexico Biobank tries to get the whole country," Ziyatdinov said, "ours is just Mexico City, but the dataset is large."
Their tools consist of the MCPS Variant Browser, an ancestry-specific allele frequency browser, and the MCPS10k panel, an imputation reference panel used to estimate population-specific allele frequencies by leveraging local ancestry and variant information within the MCPS whole-exome and whole-genome sequencing datasets. Their approach increased both the number of variants with ancestry-specific allele frequencies and the effective sample size of Indigenous Mexicans used for estimating allele frequencies from whole-exome sequencing (WES) data.
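The general idea of an ancestry-specific allele frequency can be illustrated with a simple count. The sketch below is not the MCPS pipeline. It assumes each haplotype at a single variant has already been assigned a local-ancestry label by some upstream tool, and the function name `ancestry_specific_af` and the labels "indigenous", "european", and "african" are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ancestry_specific_af(haplotypes: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Estimate the alternate-allele frequency separately for each local ancestry.

    Each item is (local_ancestry_label, allele) for one haplotype at one
    variant, where allele is 1 for the alternate allele and 0 for the reference.
    """
    alt_count = defaultdict(int)    # alternate alleles seen per ancestry
    hap_count = defaultdict(int)    # haplotypes seen per ancestry
    for ancestry, allele in haplotypes:
        hap_count[ancestry] += 1
        alt_count[ancestry] += allele
    return {a: alt_count[a] / hap_count[a] for a in hap_count}


# Toy data: eight haplotypes at one variant (labels are illustrative only).
toy = [
    ("indigenous", 1), ("indigenous", 0), ("indigenous", 0), ("indigenous", 0),
    ("european", 1), ("european", 1),
    ("african", 0), ("african", 0),
]
print(ancestry_specific_af(toy))
# {'indigenous': 0.25, 'european': 1.0, 'african': 0.0}
```

Because admixed individuals contribute haplotype segments to more than one ancestry, counting at the segment level in this way is one reason a large admixed cohort can increase the effective sample size for a given ancestry.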
Without reference datasets of population-specific allele frequencies, the diagnosis and interpretation of genomic variants, particularly in the context of rare disorders, is hampered by the difficulty of distinguishing previously unreported or undersampled population-specific variants from potentially pathogenic ones.
The Regeneron-led team wrote in the preprint that accounting for genetic ancestry and admixture is crucial in GWAS and can be used to boost power and to explore how well polygenic risk scores can be applied across populations.
In their study, the MCPS10k outperformed TOPMed for variants with minor allele frequencies greater than 0.1 percent and for individuals with higher amounts of Mesoamerican ancestry. Although there is not yet a specific date, both the MCPS Variant Browser and the MCPS10k panel will soon be available online, the latter via the Michigan Imputation Server.
"It's a unique dataset, and we try to share everything that can be shared," Ziyatdinov said.
The team has now made allele frequencies for over 141 million MCPS variants publicly available on the Regeneron Genetics Center website, which they say increases tenfold the number of allele frequencies resolved by local ancestry compared with the gnomAD browser.
"It's a good example of academic and industry collaboration," Ziyatdinov said, "similar to the UK Biobank."
In addition to providing an example of the value of genetic studies in populations with diverse ancestry, the imputation reference panel provides a useful resource for future genetic studies, such as investigations into the genetic bases of disease, both in Mexico and in the US, where the majority of the Hispanic/Latino population is of Mexican descent.