Investigators with the Rockville, Md., and San Diego, Calif., branches of the J. Craig Venter Institute, together with collaborators at the University of Toronto's McLaughlin Centre and the Hospital for Sick Children, have come up with a sperm cell-based scheme for determining chromosome-length haplotypes in an existing diploid genome.
"At the moment, how we're viewing this approach is to take a genome that's been sequenced conventionally — where the variants have been called from a regular diploid genome — and then layer this on top in order to haplotype," JCVI genomic medicine researcher Ewen Kirkness told In Sequence.
In a study appearing online last week in Genome Research, Kirkness and his colleagues demonstrated the feasibility of the method, which relies on a combination of genotyping and low-coverage sequencing on multiple single sperm cells collected from an already sequenced individual.
As they explained in the new paper, the researchers used genotyping data on 16 sperm cells to piece together a low-resolution, breakpoint-based haplotype map of the HuRef genome, the diploid sequence of Venter's DNA. They then filled in finer details of the map through low-coverage sequencing on 11 of the sperm cells, making it possible to place almost all of the known heterozygous variants in the diploid genome into chromosome-length haplotypes.
The ability to fully understand variation within and between chromosome pairs in the genome is expected to facilitate studies on everything from epigenetics to compound heterozygosity.
And there are hints that some sets of variants on the same chromosome may share closer ties with certain disease risks or drug response profiles than do individual SNPs, authors of the new study noted. So far, though, relatively few sequenced human genomes have been fully haplotyped, leaving much to be learned about the influence that various haplotype blocks have on gene expression or other aspects of an individual's biology.
"Although the sequencing of individual human genomes to reveal personal collections of sequence variants is now well established," Kirkness and his co-authors wrote, "there has been slower progress in the phasing of these variants into pairs of haplotypes along each pair of chromosomes."
Likewise, questions remain as to the most straightforward strategies for haplotyping a genome.
Some have tackled the haplotyping problem by using genome sequence data from parent-child trios, which makes it possible to track blocks of maternal and paternal sequence in a couple's offspring.
Other teams have used approaches that involve physically separating paired chromosomes — be it by flow cytometry, microdissection, or microfluidics (see IS 12/21/2010; IS 7/24/2012) — and still others have turned to large insert cloning coupled with sequencing to delineate haplotype profiles (IS 12/21/2010).
At the same time, researchers have started to tap into sperm as an easy-to-access source of multiple cells with haploid genome content. At the Biology of Genomes meeting last spring, for instance, researchers reported on the possibility of unraveling recombination events in the genome by sequencing multiple single sperm cells from the same individual (IS 5/15/2012).
And last week in Science, Harvard researchers outlined methods for more efficiently amplifying genomic DNA within single cells using "multiple annealing and looping-based amplification," or MALBAC — an approach that they applied not only to copy number analyses of cancer cells, but also to a sperm-based recombination study (IS 1/2/2013).
The sperm-based strategy employed by the JCVI-led team is somewhat similar to that described by the Harvard group, Kirkness explained, though the details of the recombination and haplotyping analyses differed.
"Basically, they derived the maternal and paternal haplotypes first and then used that to detect the breakpoints," he noted, "whereas in the method that we used, we detected the breakpoints first and then reconstructed the maternal and paternal haplotypes with knowledge of where the breakpoints were."
The work sprouted from ongoing efforts to haplotype the diploid HuRef genome, first described in PLoS Biology in 2007.
Though variants have been thoroughly cataloged in the genome — including roughly 1.95 million heterozygous SNPs — researchers had a tough time figuring out which heterozygous alleles fell on the same version of each chromosome.
Some haplotype information could be teased out of available Sanger sequence data generated for the genome, Kirkness noted, though the level of resolution afforded by the existing data left something to be desired.
"It was reasonable. It gave fairly large blocks of haplotype," he said, "but it wasn't perfect and it wasn't chromosome-length haplotypes."
"This idea of using sperm cells came up," he added. While the researchers "never really pursued it vigorously," they eventually decided that the haploid germ cells were the best available window into genome-wide haplotype patterns in the HuRef genome.
And what started as a preliminary, small-scale haplotyping analysis became increasingly complete as the team generated additional sperm sequence data during the paper's preparation and review process, Kirkness explained.
After using micromanipulation to isolate 96 single sperm cells, the researchers amplified genomic DNA within each cell by multiple displacement amplification, or MDA.
The MDA method is somewhat notorious for amplifying different parts of the genome with variable success, study authors explained. And this case was no different: quantitative PCR analyses at a dozen different loci in the genome picked up between four and 11 of the targeted sites in each amplified haploid genome after MDA.
Because the MDA biases are generally random, though, the team was able to put together recombination breakpoints across the individual's genome by bringing together genotyping data for multiple cells.
Of the 57 sperm cells with positive MDA reactions, the team used qPCR data to select 16 sperm cells to take forward for genotyping analyses with Illumina's HumanOmni-Quad v1.0 BeadChip.
On top of that, the team folded in low-coverage genome sequence information for 11 of the haploid cells, generating between 1.5- and 3.7-fold average coverage across each amplified sperm cell genome with the Illumina GAIIx instrument.
Again, the portion of each haploid genome represented by the sequence data varied owing to MDA-related biases. For some cells, just 28 percent of the individual's heterozygous SNPs were represented by sequence reads. In other cells, researchers had reads spanning as many as 43 percent of these variants.
Together, though, sequence data for all 11 cells made it possible to fill in missing SNPs in the first draft of the haplotype map.
The proportion of haplotyped SNPs is expected to inch up further as information becomes available from more sperm cells from the same individual. But with genotyping information on 16 sperm cells and sequence data on 11 of the cells, the researchers believe they have successfully placed some 94 percent of the heterozygous SNPs in the HuRef genome in an accurate haplotype context.
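A rough way to see why the yield climbs quickly and then flattens: if each amplified cell independently covered a fraction p of the heterozygous SNPs, the share seen in at least one of n cells would be 1 - (1 - p)^n. The Python sketch below evaluates that idealized model for the per-cell range reported here (28 to 43 percent). The independence assumption and the function name are illustrative only and are not taken from the paper. MDA dropout that recurs in the same regions, and the need for more than a single observation to phase a SNP confidently, would both pull real yields below these figures.

```python
def expected_fraction_covered(p, n_cells):
    """Expected share of SNPs seen in at least one of n_cells cells.

    Assumes (hypothetically) that each cell covers a random fraction p
    of the SNPs, independently of every other cell.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return 1.0 - (1.0 - p) ** n_cells


if __name__ == "__main__":
    # Per-cell coverage bounds quoted in the article: 28 and 43 percent.
    for p in (0.28, 0.43):
        for n in (1, 2, 3, 11):
            frac = expected_fraction_covered(p, n)
            print(f"p={p:.2f}, cells={n:2d}: {frac:.3f}")
```

Under this model, two cells at 28 percent per-cell coverage would reach about 48 percent of the SNPs, while 11 such cells would reach roughly 97 percent. That is somewhat above the 94 percent reported, which would be consistent with dropout not being fully independent between cells.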
"To get up to 100 percent is probably impractical," Kirkness said. "Two or three sperm will get you halfway there, but to get the other half you really have to do a lot of sperm cells."
The chromosome-length haplotypes cobbled together for the Genome Research study showed no obvious biases related to the chromosome considered, he added, though researchers did see a subtle under-representation of sequences rich in guanine and cytosine residues. Some tricky-to-haplotype areas also clustered in parts of the genome believed to contain segmental duplications.
"If you look at the consensus of sequence across multiple sperm cells, you get a reliable consensus," Kirkness said. "But if you then pick out individual sperm cells and compare them one-on-one, there's a significant percentage of what appear to be maternal SNPs on a paternal background — it's like they've been switched — and they appear to be clustered along the chromosome."
That is more or less consistent with findings from some past haplotyping studies, Kirkness said, adding that it remains to be seen whether such patterns have biological significance or whether they are technical artifacts related to the mapping or other methods used.
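A first step toward examining such clusters is to tell them apart from genuine crossovers. One simple heuristic is run length: a crossover changes the parental origin for a long stretch of SNPs, while switched calls of the kind Kirkness describes form short runs. The Python sketch below illustrates that idea only and is not the analysis used in the study; the M/P label string, the `min_run` threshold of five SNPs, and the function name are all assumptions.

```python
from itertools import groupby


def summarize_switches(origins, min_run=5):
    """Split one cell's parental-origin calls into crossovers and short switches.

    origins: string of 'M'/'P' labels, one per informative SNP, in
             chromosome order (a hypothetical input format).
    min_run: runs shorter than this are counted as short switched
             clusters rather than crossover segments (arbitrary value).
             Note that a short run at either chromosome end is also
             counted as switched.
    Returns (crossovers, switched_snps).
    """
    # Collapse the label string into (label, run_length) pairs.
    runs = [(label, len(list(group))) for label, group in groupby(origins)]
    long_labels = [label for label, length in runs if length >= min_run]
    switched_snps = sum(length for _, length in runs if length < min_run)
    # A crossover is a change of label between consecutive long runs.
    crossovers = sum(1 for a, b in zip(long_labels, long_labels[1:]) if a != b)
    return crossovers, switched_snps
```

For a cell whose calls read as ten paternal SNPs, two maternal, eight paternal, and then twelve maternal, the function reports one crossover and two switched SNPs. It says nothing about whether the short run is biological or a mapping artifact, which is the open question raised above.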
Although the investigators relied on both array- and sequencing-based data for the current study, Kirkness said it should be possible to haplotype existing diploid genomes by adding in only the sperm sequence data if enough of the haploid cells are included in a given analysis.
In contrast, the resolution available from sperm cell genotyping in the absence of auxiliary sequence data is expected to be more limited, since just a fraction of the SNPs assessed by the array provide information that can be used to place variants within one haplotype or another.
"[Genotyping data alone] will give you a low-resolution haplotype map of the genome, because only around 200,000 of the million SNPs that are on a chip are actually informative and heterozygous in the individual you're looking at," Kirkness explained.
"The low-pass sequencing then allows you to get much higher resolution, with knowledge of the breakpoints that genotyping has given you," he added. "But as you increase the number of sperm cells that you look at, the requirement for the genotyping actually goes away — you can work out where the recombination break points are directly from the sequence data."
In the case of the HuRef genome, the group already had a catalog of Venter's heterozygous SNPs and their positions going into the study.
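With the heterozygous SNP positions known and the breakpoints located first, the reconstruction step Kirkness describes can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the JCVI team's implementation. It assumes alleles are coded 0/1 at indexed heterozygous SNPs, that each cell's breakpoints are already known as SNP indices, and that the cells overlap enough to be oriented against one another. All function and variable names are hypothetical.

```python
def undo_crossovers(cell, breakpoints):
    """Flip alleles after each breakpoint so the cell reads as a single
    parental haplotype along its whole length.

    cell: dict mapping SNP index -> observed allele (0 or 1); SNPs with
          no data for this cell are simply absent.
    breakpoints: SNP indices at which parental origin switches
                 (assumed known, e.g. from array genotyping).
    """
    fixed = {}
    for pos, allele in cell.items():
        crossed = sum(1 for b in breakpoints if b <= pos)
        fixed[pos] = allele ^ (crossed % 2)
    return fixed


def phase(cells, breakpoints_per_cell):
    """Majority-vote two complementary haplotypes from haploid cells."""
    votes = {}  # SNP index -> [votes for allele 0, votes for allele 1]
    for cell, bps in zip(cells, breakpoints_per_cell):
        fixed = undo_crossovers(cell, bps)
        # Orient the cell: keep it as is if it mostly agrees with the
        # votes so far, otherwise treat it as the other haplotype.
        agree = sum(votes[p][a] for p, a in fixed.items() if p in votes)
        differ = sum(votes[p][1 - a] for p, a in fixed.items() if p in votes)
        flip = 1 if differ > agree else 0
        for p, a in fixed.items():
            votes.setdefault(p, [0, 0])[a ^ flip] += 1
    # The first cell fixes which haplotype is arbitrarily called "A".
    hap_a = {p: (1 if v[1] > v[0] else 0) for p, v in sorted(votes.items())}
    hap_b = {p: 1 - a for p, a in hap_a.items()}
    return hap_a, hap_b


# Toy example: three cells over six heterozygous SNPs.
cells = [
    {0: 0, 1: 1, 2: 1, 3: 0, 4: 1, 5: 0},  # no crossover
    {0: 1, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0},  # crossover before SNP 3
    {1: 0, 2: 0, 4: 0, 5: 1},              # no crossover, sparse coverage
]
hap_a, hap_b = phase(cells, [[], [3], []])
```

In the toy example, the second cell switches parental origin at SNP 3 and the third cell is missing two SNPs, yet all three cells vote for the same pair of complementary haplotypes once crossovers are undone and each cell is oriented. A real analysis would also need to handle genotyping errors, ties, and cells that do not overlap the running consensus.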
The availability of sperm sequence data could conceivably afford the opportunity for calling SNPs as well, though Kirkness cautioned that such an approach is expected to be "quite tricky" given the anticipated depth of coverage and cell numbers needed for variant detection genome-wide.
"If you were using sequencing data to actually call the SNPs — if you didn't know what the SNPs were and you were calling them for the first time — you would obviously require deeper sequence," he said. "We'd have to sequence deeply and we'd have to sequence a lot of sperm cells to have confidence in calling SNPs de novo."
Moreover, he and his co-authors explained that the MDA process that the team used for its current study is likely "too biased for comprehensive variant discovery across the genome."
"It is possible that shallower sequencing of more independent sperm cells could reduce this deficit," they added, "but that remains to be tested."
Kirkness said the haplotyping approach, in its existing form, should be platform-agnostic in terms of both the genotyping and the sequencing technologies used. Going forward, though, he explained that there may be an advantage to incorporating long-read data that can help in unraveling not only SNP haplotypes, but also structural variant patterns in human genomes.
"If we want to extend this further, which I think we really need to do, to incorporate structural variants into the haplotype … [it] is going to be more challenging," Kirkness said. "I can imagine that technologies that have much longer read lengths would be useful.
"We could use the SNPs as anchors to tell which haplotype a long read should lie on," he continued, "but then the long reads would allow us to look at structural variants in addition to the SNPs."
Indeed, the team's current focus is on extending the sperm-based approach to encompass other, more difficult-to-discern types of variation in a genome, including some forms of structural variation.
And what about women with sequenced genomes? Kirkness explained that the germ cell-based approach isn't feasible for haplotyping female genomes just yet, since egg harvesting is difficult and invasive. That may change down the road, however. In particular, he and his colleagues speculated that it may eventually be possible to do similar haplotyping with female sex cells derived through stem cell-related processes.
For now, though, Kirkness said the inability to apply germ cell haplotyping to women is "obviously the major limitation of this approach."
"There are other methods that can be used for haplotyping [female genomes], but obviously germ cells is not really a practical alternative for female genomes at the moment."