NEW YORK (GenomeWeb News) – In a study appearing online last night in PLoS Genetics, a Stanford University-led team described the "ethnicity-specific" reference genome approach it used to analyze whole genome sequences from four members of a single family.
By incorporating estimated allele frequency data from the 1000 Genomes Project into the existing human reference genome, the researchers came up with three synthetic human genome references containing the major alleles identified in European, African, or East Asian populations — a strategy that's intended to more accurately represent the genetic variation present in each of the major HapMap populations.
"There has been a large focus, at least in the genome-wide association study space, on Caucasian populations," first author Frederick Dewey, a researcher at Stanford University's Center for Inherited Cardiovascular Disease, told GenomeWeb Daily News. "What we hope to show is that ethnicity certainly matters — it begins at the point of genome assembly and carries all the way through variant interpretation and annotation."
"In terms of the medical interpretation of sequence variant accuracy, I think it's a really potentially important way to start thinking about this," senior author Euan Ashley, director of the Center for Inherited Cardiovascular Disease at Stanford, told GWDN.
The team demonstrated the utility of this major allele-adjusted reference approach by using it to analyze whole-genome sequence data for members of a family affected by a blood coagulation disorder called thrombophilia, tracking suspected disease-associated loci and looking for variants influencing treatment. The phasing and inheritance information in the genomes also proved useful for finding SNPs to tag different human leukocyte antigen types, which can be difficult to discern due to high recombination rates in the HLA region.
Because the existing human reference genome was generated using DNA from just a few donors, the researchers explained, it does not yet reflect the genetic variation present in human populations. Consequently, newly sequenced human genomes that are compared to this reference risk being misinterpreted.
"It doesn't, in any way, reflect the broad array of population variation and certainly does not reflect ethnicity-specific population variation," Dewey explained.
Indeed, the team found more than 4,000 disease-associated positions in the human reference genome where the allele differs from the predominant allele in the three main HapMap populations. They then turned to 1000 Genomes Project data to improve the reference, using allele frequencies to generate synthetic reference genomes representing individuals of European, African, and East Asian ancestry.
"We took the population variation data from the 1000 Genomes, estimated the major allele at every position for each of the three HapMap populations, and inserted that allele at every position in the reference sequence at which it differed from the base," Dewey said.
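The substitution Dewey describes can be illustrated with a minimal sketch. It is not the team's pipeline: the function name, the toy sequence, and the allele frequencies below are all hypothetical, and the sketch handles only single-base substitutions at zero-based positions.

```python
def build_major_allele_reference(reference, allele_freqs):
    """Return a copy of `reference` carrying the population's major allele.

    reference    -- reference sequence as a string (toy example, hypothetical)
    allele_freqs -- {zero-based position: {allele: estimated frequency}}
    Only single-base substitutions are handled in this sketch.
    """
    bases = list(reference)
    changed = []
    for pos, freqs in allele_freqs.items():
        # The major allele is the one with the highest estimated frequency.
        major = max(freqs, key=freqs.get)
        # Substitute only where the reference base differs from it.
        if bases[pos] != major:
            bases[pos] = major
            changed.append(pos)
    return "".join(bases), sorted(changed)


# Made-up frequencies for one population; not 1000 Genomes data.
reference = "ACGTACGT"
population_freqs = {
    1: {"C": 0.20, "T": 0.80},  # reference carries the minor allele
    4: {"A": 0.90, "G": 0.10},  # reference already carries the major allele
    6: {"G": 0.35, "A": 0.65},  # reference carries the minor allele
}

synthetic, changed = build_major_allele_reference(reference, population_freqs)
print(synthetic, changed)
```

Running the same function once per population, each with its own frequency table, would yield one synthetic reference per population, which is the shape of the three references described in the study.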
Once they had incorporated this major allele information into the reference genome, he explained, the researchers found that genotyping errors decreased by some 40 percent at the common, disease-associated variants they looked at.
From there, the team selected the most appropriate reference to help interpret genome sequence data from a family affected by thrombophilia: study co-author and former Solexa CEO John West, along with his wife, daughter, and son.
West has experienced two pulmonary embolisms, one while taking the anticoagulant warfarin. Data from consumer genomics firm 23andMe and whole-genome sequences hinted that West's daughter — who had identified haplotype patterns and possible blood clot risk factors from the family's genome sequence data for a high school project — might also carry risk variants for the clotting condition.
To get more in-depth genetic information, West joined forces with the other researchers to analyze the family members' genomes relative to the ethnicity-matched reference. The genomes were sequenced with the Illumina GAII to an average coverage of more than 39-fold across 92 percent of the genome.
Along with the major allele reference, the analysis relied on improved haplotype phasing algorithms and an interpretation pipeline developed to assess genome sequence data in a clinical context, Dewey explained. "The idea was really to marry those three approaches into something that might yield some useful information for the family."
Using this approach, the researchers not only tracked down blood clot-related mutations and variants implicated in conditions such as coronary artery disease, obesity, and psoriasis, but also teased apart recombination and haplotype patterns in the genomes.
The analysis highlighted the advantages of the modified reference sequence for picking up rare variants, Dewey noted, particularly in situations where individuals are homozygous for rare variants that are also present in the standard reference sequence.
In the West family, for example, two individuals carried a copy of the factor V Leiden mutation, one of several mutations implicated in blood clotting disease. The standard human reference genome contains this mutation, which would have made it challenging to detect the mutation if either affected family member carried two copies of the allele.
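Why a homozygous carrier would be hard to spot can be shown with a toy example. The sketch below assumes a simplified variant caller that reports a site only when a sample carries at least one non-reference allele; the function, the position, and the allele letters are placeholders rather than the actual bases at the locus.

```python
def call_variants(genotypes, reference):
    """Report sites where a sample carries at least one non-reference allele.

    genotypes -- {position: (allele_1, allele_2)} for one individual
    reference -- {position: reference allele}
    This is a simplified stand-in for a real variant caller.
    """
    return {
        pos: alleles
        for pos, alleles in genotypes.items()
        if any(allele != reference[pos] for allele in alleles)
    }


# Placeholder alleles: "A" stands for the risk allele, "G" for the major allele.
standard_reference = {0: "A"}      # reference happens to carry the risk allele
major_allele_reference = {0: "G"}  # reference corrected to the major allele

heterozygous_carrier = {0: ("A", "G")}
homozygous_carrier = {0: ("A", "A")}

het_vs_standard = call_variants(heterozygous_carrier, standard_reference)
hom_vs_standard = call_variants(homozygous_carrier, standard_reference)
hom_vs_major = call_variants(homozygous_carrier, major_allele_reference)

print(het_vs_standard)  # site is reported
print(hom_vs_standard)  # nothing reported: the sample matches the reference
print(hom_vs_major)     # site is reported against the corrected reference
```

Under these assumptions, the individual with two copies of the risk allele looks identical to the standard reference and produces no call, while the same individual is flagged once the reference carries the major allele.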
"For identification of rare, private mutations within a family, having the reference sequence correct at each one of those positions is extremely important," Dewey said.
Moreover, by bringing genome sequence data together with information from the most recent version of the Pharmacogenetics and Pharmacogenomics Knowledge Base, or PharmGKB, which is headed by co-author Russ Altman, the researchers predicted an anticoagulant dose for West that jibed with the empirically derived dose prescribed by his doctor. They also made predictions about anti-platelet therapies that might benefit family members down the road.
The team is continuing to work on ways to improve the accuracy and interpretation of genome data in a clinical setting and plans to update its ethnicity-specific references as more sequence data become available. Given their findings so far, the researchers are optimistic about finding relevant genetic variants using the major allele reference approach.
"The nice thing is that the sequences that we ultimately generated for each of the three HapMap populations are directly applicable in any variant identification pipeline," Dewey said.
For their part, Ashley, West, Altman, and their co-authors Atul Butte and Mike Snyder recently started a California-based company called Personalis. West has been named CEO of the firm, which will use some of the same approaches outlined in the current study to analyze genome data, initially focusing on the academic market and eventually targeting the clinical market.
"The company will work towards managing and analyzing human genomes in an ethically responsible way and doing it within mainstream medicine," Ashley said.