By Andrea Anderson
An international team is working on methods to reconstruct genome sequences for an ancestral population using sequence data from admixed individuals assessed through the 1000 Genomes Project.
"By being involved with 1000 Genomes, we get the benefit of the deluge of data they're generating," Jake Byrnes, who began working on the project as a post-doctoral researcher in Carlos Bustamante's Stanford University laboratory, told In Sequence.
Byrnes, now with the genealogical resource company Ancestry.com, presented data from the Taìno Genome Project at the combined International Congress of Human Genetics/American Society of Human Genetics meeting in Montreal last week.
Researchers involved with the project are relying on 1000 Genomes data and ancestry information to begin reconstructing the variation that was present in the genomes of Native American ancestors of present-day Puerto Ricans.
The Taìno population, native to the Bahamas, Greater Antilles, and Lesser Antilles, declined dramatically following the arrival of Spanish settlers in the 1500s. But in Puerto Rico today, individuals' genomes contain genetic sequences passed down from European, African, and Native American populations that met and mixed in the region.
By identifying sequences in each genome that correspond to Taìno, European, and African ancestry, researchers hope to not only learn more about the historical population patterns in the area, but also to get a picture of the genetic diversity that existed within the Taìno population when early Europeans arrived in the Caribbean.
"We believe that there are novel genetic variants that exist in these present-day admixed populations — the Puerto Ricans, Dominicans, and so on — that trace back to the indigenous population that were there at the time of contact [with Europeans]," Bustamante, one of the study leaders, told IS.
Given the genetic divergence that is thought to exist in Native American populations from the Caribbean, researchers are keen to document and understand that variation, both for historical reasons and to aid in designing more appropriate genetic studies in these populations in the future.
"If we find that there is this greater genetic divergence among the Caribbean Native population … then they really need to be studied as a separate entity," Bustamante said, since such population sub-structure can confound genetic studies.
"Indigenous populations of the Americas have historically been some of the most divergent, one from the other, even though they largely descend from populations that crossed the Bering Strait 15 thousand to 20 thousand years ago," he explained.
A Source of Ancestral Sequence
Thirty-five parent-child trios were sampled in Puerto Rico through the 1000 Genomes Project. As part of the international effort, all three family members for each trio were genotyped and all parents from these trios also had their whole genomes sequenced to low coverage and exomes sequenced to higher coverage.
Byrnes, Bustamante, and their collaborators from the University of Puerto Rico at Mayaguez, the University of California at San Francisco, and Cornell University are now starting to tap into that 1000 Genomes data as a source of ancestral genome sequences, first using genotyping data to determine which bits of each genome were inherited from ancestral Europeans, Africans, and Native Americans.
"Because this is an admixture of three fairly distinct populations, we can take modern-day reference panel individuals and try to tease apart the chunks of the genomes that come from each of the three ancestries," Byrnes explained.
Along with clues about the genetic patterns in ancestral populations, the length of chromosomal segments related to ancestry, or ancestry tracts, can be used to gauge the timing of admixture, he noted. Older ancestry tracts are thought to be shorter because they've been subjected to more rounds of recombination, leaving increasingly smaller chunks of ancestral sequence intact, while ancestry tracts inherited more recently are longer.
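As a rough illustration of that relationship, the sketch below uses a simplified single-pulse admixture model in which tract lengths are exponentially distributed with a mean of about 100 / ((1 - m) * g) centimorgans, where g is the number of generations since admixture and m is the genome-wide share of the ancestry in question. The model, the function name, and the tract lengths are assumptions made for illustration; they are not the method or the data the researchers described.

```python
def estimate_generations(tract_lengths_cm, ancestry_fraction=0.0):
    """Rough number of generations since a single admixture pulse.

    Assumes exponentially distributed tract lengths with mean
    100 / ((1 - ancestry_fraction) * g) centimorgans (a simplification).
    """
    if not tract_lengths_cm:
        raise ValueError("need at least one tract length")
    if not 0.0 <= ancestry_fraction < 1.0:
        raise ValueError("ancestry_fraction must be in [0, 1)")
    mean_cm = sum(tract_lengths_cm) / len(tract_lengths_cm)
    return 100.0 / ((1.0 - ancestry_fraction) * mean_cm)


# Hypothetical tract lengths in centimorgans, for illustration only.
older_tracts = [4.0, 6.0, 5.0, 5.0]       # short tracts: many rounds of recombination
recent_tracts = [20.0, 30.0, 25.0, 25.0]  # long tracts: few rounds of recombination

print(estimate_generations(older_tracts))
print(estimate_generations(recent_tracts))
```

Under these assumptions, tracts averaging 5 centimorgans point to admixture about 20 generations back, while tracts averaging 25 centimorgans point to about four.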
In preliminary analyses, for instance, the researchers used ancestry tract length and other genetic signals to look for clues about the African populations contributing to Puerto Rican ancestry. So far those data seem to be consistent with two waves of African ancestry corresponding to an initial influx of Africans from Senegal's Mandinka population, likely as a result of slave trading through the Cape Verde islands, and more recent ancestry from Africa that represents a broader distribution of the African populations tested.
For the ancestry tract identification stage of the study, researchers relied on reference panels for two HapMap populations — CEU (Utah residents with ancestry from northern and western Europe) and Yoruban — to identify European and African ancestry tracts, respectively.
The Native American reference was a bit trickier to select, Byrnes explained, since none of the populations tested so far were obvious candidates to serve as a Caribbean Native population reference.
So far, the team has been relying on a reference panel composed of samples from native populations in Mexico and from a subset of the native and/or admixed population samples that their collaborators have collected in Central and South America, particularly from the Andes region.
"We kind of knew, a priori, that we didn’t have a really close reference panel for the Taìno ancestry, so we thought we would tap into as diverse a panel of native ancestry as we had on hand," Byrnes explained.
Researchers then applied the ADMIXTURE algorithm developed by researchers at the University of California, Los Angeles, to do hierarchical clustering of admixture patterns across the autosomal portion of the genome using 1000 Genomes genotyping data. They also used a principal component-based algorithm developed in-house to do local ancestry mapping and call individual ancestry segments one by one.
In their initial analyses, the team detected roughly 10 percent to 15 percent Native American genome-wide ancestry overall. African ancestry typically makes up around 15 percent of the ancestry in the Puerto Rican samples tested, and European ancestry makes up the balance.
Byrnes noted that the researchers are finding lower proportions of Native American ancestry when they use local ancestry analysis to try to assign individual pieces of the genome to each ancestral population — a method that relies on comparisons between the reference datasets and smaller slices of each genome.
That suggests that sequences in the Native American reference panel used for the study may not be similar enough to the ancestral Taìno population to distinguish all of the Taìno patterns in the modern-day genomes from European sequences, he explained.
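For readers unfamiliar with how a local ancestry analysis yields a genome-wide percentage, one simple approach is to add up the lengths of the segments assigned to each ancestry and divide by the total length called. The sketch below does that for a toy set of calls; the function name, the ancestry labels, and the coordinates are hypothetical, and nothing here reflects the team's in-house algorithm.

```python
from collections import defaultdict


def ancestry_proportions(segments):
    """Share of the called sequence assigned to each ancestry.

    segments: iterable of (start_bp, end_bp, label) half-open intervals
    on one haploid genome copy.
    """
    totals = defaultdict(int)
    for start, end, label in segments:
        if end <= start:
            raise ValueError("segment end must be greater than start")
        totals[label] += end - start
    called = sum(totals.values())
    if called == 0:
        raise ValueError("no segments supplied")
    return {label: length / called for label, length in totals.items()}


# Toy local ancestry calls along 1,000 bases (illustrative values only).
calls = [
    (0, 600, "EUR"),
    (600, 650, "NAT"),
    (650, 800, "AFR"),
    (800, 1000, "EUR"),
]
proportions = ancestry_proportions(calls)
print(proportions)
```

A segment that truly derives from one ancestry but is assigned to another simply shifts length between totals, which is how a poorly matched reference panel can depress one ancestry's estimated share.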
"We're really only calling about five percent [Native American ancestry], which suggests that we're missing a lot and I think a lot of that has to do with not having a very good reference panel, due to the population history," Byrnes said.
"From the analyses that we've done in the past in trying to find similar populations in the Americas — say we look at populations from Peru or populations from Venezuela or populations from Mexico — we really find that the component of Native American ancestry that we see in present-day people from Puerto Rico, the Dominican Republic, and other Caribbean populations is actually pretty different," Bustamante added.
The admixture patterns detected to date may also reflect the population structure within Puerto Rico, since several of the samples sequenced and analyzed for the 1000 Genomes Project appear to have come from the western part of the island.
Nevertheless, the researchers believe they will be able to find enough Taìno ancestry tracts in the genomes of the 70 Puerto Rican participants sequenced to cover most of the genome, since these tracts have been randomly distributed in the genomes over several generations due to recombination.
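A back-of-the-envelope calculation suggests why 70 sequenced participants could be enough. The sketch below assumes, as simplifications not stated by the researchers, that each of the 140 haploid genome copies carries Native American ancestry at a given position independently and with the same probability of 0.05, the local ancestry figure Byrnes cited. It then uses the binomial distribution to ask how likely a position is to be covered by at least one copy, or by at least three. The function and variable names are illustrative.

```python
from math import comb


def prob_at_least(k, n_haplotypes, ancestry_fraction):
    """P(at least k of n haplotypes carry the ancestry at one position).

    Treats haplotypes as independent draws with equal probability,
    which ignores relatedness and uneven ancestry across the genome.
    """
    p = ancestry_fraction
    below_k = sum(
        comb(n_haplotypes, i) * p**i * (1 - p) ** (n_haplotypes - i)
        for i in range(k)
    )
    return 1.0 - below_k


n_haplotypes = 2 * 70   # two haploid copies per sequenced participant
fraction = 0.05         # Native American ancestry per haploid copy

expected_copies = n_haplotypes * fraction
covered_once = prob_at_least(1, n_haplotypes, fraction)
covered_thrice = prob_at_least(3, n_haplotypes, fraction)

print(f"expected copies per position: {expected_copies:.1f}")
print(f"P(at least 1 copy): {covered_once:.4f}")
print(f"P(at least 3 copies): {covered_thrice:.4f}")
```

Under these idealized assumptions, a typical position is expected to be covered by about seven Native American copies, and well over 99 percent of positions by at least one. Real tracts are clustered rather than independent, so actual coverage would be less even.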
"Although the average Taìno ancestry proportion is only five percent per haploid copy of the genome, that five percent can be uniformly distributed across the genome for each individual," Byrnes explained. "So that means we can get a very large proportion of the genome covered."
Assessing Population Variation
Still, while Byrnes said assembling a Taìno consensus genome sequence would be an "interesting exercise," he noted that more immediately useful information is expected to come from analyses of the overall genetic variation in ancestral Taìno genomes.
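One form such an analysis could take is an ancestry-specific allele frequency: at each site, count alleles only on the haplotypes whose local ancestry call at that site is Native American. The following is a minimal sketch under that assumption; the function name, the labels, and the data are hypothetical rather than taken from the project.

```python
def ancestry_specific_frequency(alleles, ancestries, target):
    """Alternate-allele frequency at one site within one ancestry.

    alleles:    0/1 allele carried by each phased haplotype at the site
    ancestries: local ancestry label of each haplotype at the same site
    Returns (frequency, haplotypes_used); frequency is None if no
    haplotype carries the target ancestry at this site.
    """
    if len(alleles) != len(ancestries):
        raise ValueError("alleles and ancestries must have equal length")
    selected = [a for a, anc in zip(alleles, ancestries) if anc == target]
    if not selected:
        return None, 0
    return sum(selected) / len(selected), len(selected)


# Eight hypothetical phased haplotypes at a single site.
alleles = [1, 0, 1, 0, 0, 1, 0, 0]
ancestries = ["NAT", "EUR", "NAT", "AFR", "EUR", "NAT", "NAT", "EUR"]

freq, n_used = ancestry_specific_frequency(alleles, ancestries, "NAT")
print(freq, n_used)
```

Because the number of haplotypes contributing varies from site to site, the count returned alongside the frequency matters: estimates from sites covered by only one or two copies are far noisier than those covered by many.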
"The idea is to look at overlapping pieces and get some sense of allele frequencies and haplotype structure to get some sense of population variation for this group," Byrnes said. "So the next step is to link what was done with the chip data to the available sequence data."
Byrnes said the integrated phase 1 data released by the 1000 Genomes Project last week highlight the additional types of information that will be included in the overall datasets for each individual sampled in the study, including genome and exome sequences, data on small indels and large structural variants, and some population-phasing information.
"Given that information, then, we can apply our ancestry tracts to these individuals, once we have phase-corrected, and then extract each piece of the sequence data that maps to our three distinct ancestries," he explained.
For example, the researchers are exploring strategies for incorporating 1000 Genomes population-phasing data, which is expected to offer good short-range phasing, with genotype-based phasing of all three family members from each Puerto Rican trio, which is predicted to produce better long-range phasing.
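The trio half of that strategy rests on Mendelian inheritance: at many sites, the parents' genotypes reveal which of a child's two alleles came from the mother and which from the father. A minimal sketch for a single biallelic site follows. The function name and the genotype encoding (0, 1, or 2 copies of the alternate allele) are illustrative assumptions, and real phasing tools do considerably more.

```python
def phase_child(child, mother, father):
    """Phase one biallelic site in a child from parental genotypes.

    Genotypes are alternate-allele counts (0, 1, or 2).
    Returns (maternal_allele, paternal_allele) when exactly one
    transmission pattern explains the child's genotype, else None
    (ambiguous site or Mendelian inconsistency).
    """
    can_transmit = {0: (0,), 1: (0, 1), 2: (1,)}
    for genotype in (child, mother, father):
        if genotype not in can_transmit:
            raise ValueError("genotypes must be 0, 1, or 2")
    options = [
        (m, f)
        for m in can_transmit[mother]
        for f in can_transmit[father]
        if m + f == child
    ]
    return options[0] if len(options) == 1 else None


print(phase_child(child=1, mother=0, father=1))  # alternate allele must be paternal
print(phase_child(child=1, mother=1, father=1))  # all heterozygous: unresolved
print(phase_child(child=2, mother=0, father=1))  # inconsistent with inheritance
```

Sites where all three family members are heterozygous remain unresolved under this rule, which is the kind of gap that population-based or read-based phasing would have to fill.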
Incorporating short- and long-range phasing information will likely rely on an algorithm being developed by Bustamante's group called SeqPhase, Byrnes noted, which folds paired-end sequence data into the FastPHASE algorithm.
The researchers have also been talking to University of Chicago post-doctoral researcher Bryan Howie, developer of the phasing algorithm IMPUTE, about strategies for doing haplotype phasing in situations where some haplotype information is already known, as is the case for Puerto Rican families phased from chip data.
Once such phasing has been done, the availability of higher-coverage exome sequence data is expected to aid in determining the site frequency spectrum and functional patterns of variation in the ancestral Taìno genome.
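The site frequency spectrum is a tally of how many variable sites carry the derived allele once, twice, three times, and so on in a sample. A minimal sketch follows, assuming every site has been observed in the same number of haplotypes. That assumption would not hold for ancestry tracts that cover different sites at different depths, so a real analysis would need to subsample or project down to a common size. The function name and the counts are hypothetical.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_haplotypes):
    """Unfolded site frequency spectrum for a fixed sample size.

    derived_counts: derived-allele count at each site (0..n_haplotypes)
    Returns a list whose entry i-1 is the number of sites where the
    derived allele appears exactly i times, for i = 1..n_haplotypes-1.
    Sites with 0 or n_haplotypes copies are not variable and are skipped.
    """
    for count in derived_counts:
        if not 0 <= count <= n_haplotypes:
            raise ValueError("derived count outside 0..n_haplotypes")
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n_haplotypes)]


# Hypothetical derived-allele counts at ten sites in six haplotypes.
counts = [1, 1, 2, 1, 3, 5, 1, 2, 0, 6]
sfs = site_frequency_spectrum(counts, n_haplotypes=6)
print(sfs)
```

In this toy example, four sites are singletons, two are doubletons, and the two non-variable sites drop out of the tally.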
Along with the 1000 Genomes data, Bustamante noted that Complete Genomics has also sequenced the genomes of a Puerto Rican trio and will likely generate high-coverage genome sequence for additional individuals from the island.
Though such additional genome sequence data from Puerto Rico will likely be needed to get a refined view of ancestral genome patterns, researchers involved in the Taìno Genome Project believe the information available from the 1000 Genomes effort will yield at least a basic view of the variation across most of the Taìno genome.
"Depending on the population genetic question of interest, you may need more than one copy or two copies that cover [the genome]," Byrnes noted, "so we'll have to look at a more reduced set of information."
"But for a number of questions that's OK," he added. "We can get a fairly accurate picture from the subset of the genome where we have many copies that cover it."
At this point, the team is most concerned with gauging autosomal variation, since several studies have been done on mitochondrial DNA and Y chromosome variation.
"We need to reconstruct and understand that diversity, both to empower medical genomic studies in these under-studied populations and to understand the genetic history — and we want to make the distinction between the genetic history and the cultural history — of the Caribbean, particularly pre-[European] contact," Bustamante said. "In fact, that was one of the arguments for [Puerto Rican] inclusion and sampling for the 1000 Genomes Project."
Along with their work on the Taìno Genome Project, the team is interested in doing similar ancestry analyses on other admixed populations from the Americas that have been or will be sequenced for the 1000 Genomes Project, including Mexican Americans from Los Angeles; African Americans from Norman, Okla., and Jackson, Miss.; and populations sampled in Colombia, Peru, and Barbados. The team is also collaborating with investigators in Mexico.
"These seven populations from the Americas we chose specifically because they were admixed," Bustamante said. "We wanted to study admixed populations and gain this insight into the ancestral populations that contributed."
"If we can align the paired-end read data and get reconstructed haplotypes from each individual, that would be the ultimate goal," he explained. "By aligning those across individuals we can get a sense of haplotype diversity, nucleotide diversity, and, if we can, structural variation diversity."