A team led by scientists from the San Diego branch of the Ludwig Institute for Cancer Research has adapted a sequencing method designed for mapping long-range interactions in the genome as a means of generating chromosome-scale haplotype maps.
As reported in Nature Biotechnology last weekend, the researchers relied on a so-called Hi-C sequencing method originally developed by the University of Massachusetts Medical School's Job Dekker and colleagues to glean sequence data from chromosome territories that could be used for phasing variants in each chromosome.
"Hi-C [sequencing] was originally done to study the three-dimensional structure of the genome," Siddarth Selvaraj, co-first author on the study and a graduate student in Bing Ren's Ludwig Institute lab, told In Sequence.
"We have kind of repurposed it [for] a new application," said Selvaraj, who is also affiliated with the University of California, San Diego.
In their proof-of-principle study, Selvaraj and colleagues demonstrated that their approach — dubbed "haplotyping using proximity ligation and sequencing," or HaploSeq — could discern chromosome-wide phasing patterns for roughly 95 percent of variants with more than 99 percent accuracy in a highly heterogeneous hybrid mouse genome sequenced to 30-fold coverage.
In a human cell line that contained fewer variants and was sequenced to a lower depth of coverage, the method produced a lower-resolution seed haplotype.
Starting from that framework, investigators ultimately phased around 81 percent of alleles at about 98 percent accuracy, with the help of local conditional phasing and linkage disequilibrium profiles garnered from 1000 Genomes Project data.
Such phasing information is expected to find favor in a wide range of future research applications, Selvaraj said, noting that the HaploSeq approach can provide haplotypes at relatively low cost using a sequencing method that provides several types of sequence and structural data.
"I think the usefulness of [HaploSeq] will be for really getting the haplotype blocks to span the entire chromosome," University of Washington genomics researcher Jay Shendure, who was not involved in the study, told IS.
Even so, Shendure noted that it is often important to get phasing information for rare variants that may be missed when using local conditional phasing to fill in haplotype gaps. In those instances, "you can also see the need to combine it with locally dense molecular phasing to capture the rare variants," he explained. "It's important to remember that phasing rare variants is part of the reason that we want phasing information."
Shendure and colleagues developed a fosmid-based haplotyping approach described in Nature Biotechnology in 2010 (IS 12/21/2010). More recently, he and his team have been developing a short-read assembly method that incorporates Hi-C sequence data.
Current phasing methods generally fall into two classes, Shendure noted: those providing dense phasing profiles over a relatively small block of sequence, such as the fosmid-based method or the long-fragment read scheme developed by Complete Genomics (IS 1/24/2012, IS 7/17/2012), and those offering chromosome-scale haplotypes that are somewhat sparser.
The latter class includes the newly described HaploSeq approach, he argued, as well as methods that involve mechanically separating chromosomes prior to sequencing (see IS 12/21/2010, IS 7/11/2012).
Both types of data are crucial for getting comprehensive, genome-wide phasing, Shendure said, "but you're still going to need the dense methods, otherwise you're missing a lot of stuff."
Other haplotyping methods exist as well, including sequencing studies of parent-child trios, large populations, and haploid germ cells. Earlier this year, for example, researchers from the J. Craig Venter Institute performed low-coverage sperm cell sequencing and genotyping to put together a haplotype map for Venter's previously sequenced diploid genome (IS 1/8/2013).
In coming up with their HaploSeq method, though, Selvaraj and colleagues aimed for a technically streamlined approach that could be applied to individuals from the general population in the absence of parental information, sperm samples, or specialized equipment for separating chromosomes.
Building on past haplotyping methods that bioinformatically reconstruct blocks of linked alleles, the group decided to tap sequences obtained by Hi-C sequencing as a source of very long fragment information.
As in other Hi-C applications, the researchers fix cells obtained from blood or cell lines using formaldehyde, which crosslinks DNA to the chromatin proteins it is wrapped around, such as histones, and freezes the strands in place. They then fragment the DNA and ligate spatially neighboring pieces together, forming molecules that represent the DNA's original spatial relationships, before removing the proteins.
Those steps produce an artificial fragment containing bits of DNA from sites that are linearly close to one another on a given chromosome, found near each other in the nucleus, or both.
"Sometimes two segments of DNA that are actually linearly far apart happen to be pretty close in space," Selvaraj said. "So that's the advantage we have in using this technology — we can try to look at which pieces of DNA are spatially closer to which other pieces of DNA."
For the HaploSeq application, Selvaraj noted that investigators are interested in exploiting these three-dimensional interactions to get sequence data for both long and short sequence fragments, which can be built up to produce haplotype blocks representing SNPs or small insertion and deletion variants.
Given the sorts of chromosome interactions Hi-C picks up, the sequence fragments obtained in this manner are typically much farther apart than those obtained by mate-pair or fosmid-based methods, Selvaraj said, "and you can use that information to build longer haplotypes."
The approach is generally expected to be effective as long as the inter-chromosomal interactions that occur are not between maternal and paternal copies of the same chromosome, he added.
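The idea of chaining linked variants into a haplotype block can be illustrated with a minimal Python sketch. This is not the modified HapCUT analysis used in the study, only a simplified union-find illustration under two assumptions: every fragment comes from a single homolog, and the reads are error-free. The function name and input format are hypothetical.

```python
def phase_fragments(fragments):
    """Group heterozygous variants into phased blocks.

    fragments: list of dicts mapping variant position -> allele (0 or 1)
    observed on one ligated fragment (hypothetical input format).
    Returns a list of blocks, each a dict of position -> allele carried by
    one of the two haplotypes (which one is arbitrary).
    """
    parent = {}  # union-find parent pointer per variant
    parity = {}  # phase of a variant relative to its parent (0 same, 1 flipped)

    def find(v):
        # Return (root, phase of v relative to root), compressing the path.
        if v not in parent:
            parent[v], parity[v] = v, 0
        if parent[v] == v:
            return v, 0
        root, parent_phase = find(parent[v])
        parent[v] = root
        parity[v] ^= parent_phase
        return root, parity[v]

    for frag in fragments:
        items = sorted(frag.items())
        if not items:
            continue
        anchor, anchor_allele = items[0]
        find(anchor)  # register variants seen on single-variant fragments
        for pos, allele in items[1:]:
            rel = anchor_allele ^ allele  # 0 if both ends show the same allele
            root_a, phase_a = find(anchor)
            root_b, phase_b = find(pos)
            if root_a != root_b:
                # Merge the two blocks, keeping relative phases consistent.
                parent[root_b] = root_a
                parity[root_b] = phase_a ^ phase_b ^ rel
            # Conflicting evidence within a block is ignored in this sketch.

    blocks = {}
    for v in list(parent):
        root, phase = find(v)
        blocks.setdefault(root, {})[v] = phase
    return list(blocks.values())


# Two fragments, one from each homolog, link variants 2 Mb apart.
fragments = [{100: 0, 5000: 1}, {5000: 0, 2_000_000: 1}]
blocks = phase_fragments(fragments)
```

Because a single Hi-C fragment can join sites millions of bases apart, even a handful of such links can merge distant variants into one block. Real data contain sequencing errors and occasional ligations between homologs, which is why production tools weigh conflicting evidence instead of accepting the first link they see.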
In preliminary experiments done for the current study, the team demonstrated that such interactions between homologous copies of the same chromosome occur relatively infrequently compared with intrachromosomal interactions.
Indeed, when the researchers applied the method to cells from an embryonic stem cell line from a hybrid mouse (the offspring of two inbred parental lines with well-characterized genome sequences and haplotype profiles), they found that most interactions occurred within chromosomes.
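A tally of this kind can be sketched in a few lines of Python. The example assumes each read end has already been assigned to a chromosome and to a parental haplotype, which is possible in a hybrid with known parental genomes. The tuple layout, category names, and toy counts are illustrative assumptions, not the study's data or pipeline.

```python
from collections import Counter


def classify_pairs(pairs):
    """Return the fraction of read pairs in each interaction class.

    pairs: iterable of (chrom_a, hap_a, chrom_b, hap_b) tuples, where hap is
    the parental haplotype label of each read end (hypothetical format).
    """
    counts = Counter()
    for chrom_a, hap_a, chrom_b, hap_b in pairs:
        if chrom_a != chrom_b:
            counts["inter-chromosomal"] += 1
        elif hap_a == hap_b:
            counts["cis"] += 1  # same chromosome, same parental copy
        else:
            counts["homologous trans"] += 1  # same chromosome, other copy
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Toy input: eight cis pairs, one between homologs, one between chromosomes.
pairs = (
    [("chr1", "A", "chr1", "A")] * 8
    + [("chr1", "A", "chr1", "B")]
    + [("chr1", "A", "chr2", "B")]
)
fractions = classify_pairs(pairs)
```

The smaller the share of pairs joining the two parental copies, the safer it is to treat every ligated fragment as coming from a single haplotype.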
"From a low-resolution perspective, each chromosome occupies its own space in the nucleus," Selvaraj noted. "So when you do proximal ligation, fragments within a particular chromosome are actually interacting — in other words, they're spatially closer.
"The probability of a particular fragment from one chromosome [ligating] to another chromosome is extremely small," he said. "Therefore, most of the data is actually within a particular chromosome."
Meanwhile, the team generated haplotype blocks stretching across mouse chromosomes by analyzing Hi-C sequence data for the same hybrid mouse line using a slightly modified version of HapCUT software designed to build haplotype blocks from mate-pair sequence data.
Along each of the mouse chromosomes, at least 95 percent of the heterozygous variants fell into that chromosome's main haplotype block, making it possible to phase the variants with more than 99 percent accuracy.
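The two figures quoted here, the share of heterozygous variants captured in the main block and the accuracy of their phase, can be computed along the following lines. This Python sketch is an illustration under stated assumptions and may differ from the exact definitions in the paper: blocks are dicts of position to allele, the truth set is a dict giving the allele on one known parental haplotype, and all names are hypothetical.

```python
def block_metrics(blocks, truth):
    """Compute (resolution, accuracy) for the largest haplotype block.

    blocks: list of dicts mapping variant position -> predicted allele (0/1)
    truth:  dict mapping every heterozygous variant position -> allele on one
            known parental haplotype
    """
    if not blocks or not truth:
        return 0.0, 0.0
    main = max(blocks, key=len)
    # Resolution: fraction of all heterozygous variants in the main block.
    resolution = len(main) / len(truth)
    shared = [pos for pos in main if pos in truth]
    if not shared:
        return resolution, 0.0
    agree = sum(main[pos] == truth[pos] for pos in shared)
    # Haplotype labels within a block are arbitrary, so score the better of
    # the two possible orientations.
    correct = max(agree, len(shared) - agree)
    return resolution, correct / len(shared)


blocks = [{1: 0, 2: 1, 3: 0, 4: 1}, {9: 0}]
truth = {1: 1, 2: 0, 3: 1, 4: 1, 9: 0}
resolution, accuracy = block_metrics(blocks, truth)
```

In this toy case the main block holds four of five variants, and three of those four match the truth set once the block is flipped.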
Even so, Selvaraj noted that the mouse genome in question contained a relatively high density of heterozygous SNPs, making it easier to predict haplotype patterns than it was in a human lymphoblastoid cell line that the team sequenced to 17-fold coverage with Hi-C libraries in subsequent experiments.
In an effort to flesh out the haplotype profiles for the human sample, he and his colleagues bolstered local phasing predictions with a tool called Beagle and linkage disequilibrium data from the 1000 Genomes Project.
With that approach, the researchers phased some 81 percent of alleles in the human genome with around 98 percent accuracy, according to comparisons with existing phasing information for the individual, who was sequenced as part of a parent-child trio for the 1000 Genomes Project.
"We use this [HaploSeq] technology to build an initial haplotype graph and then we try to use traditional linkage disequilibrium methods to fill in what we call gaps," Selvaraj said. "But essentially, we tried to predict the haplotypes for most SNPs in the genome."
Though the current proof-of-principle study was done using Illumina's HiSeq 2000, Selvaraj said the approach should be compatible with any high-throughput sequencing instrument that can perform paired-end sequencing.
In particular, he noted that longer reads spanning stretches of sequence between heterozygous SNPs in the human genome may be especially helpful when phasing variants.
In its current form, the HaploSeq approach appears to work quite well on genomes sequenced to between 25- and 30-fold depth, study authors noted. "[Twenty-five- to thirty-fold] usable coverage with 100 [base] paired-end reads is sufficient to achieve chromosome-spanning haplotypes with [around 20 to 30 percent] resolution … and allow accurate local conditional phasing using HaploSeq analysis," they wrote.
While the cost of the method is difficult to pin down precisely, Selvaraj estimated that it adds a couple thousand dollars to the cost of sequencing a human genome at that depth with typical Illumina sequencing — in part due to manual steps currently required for the Hi-C library preparation.
Going forward, he noted that it may be possible to streamline that protocol so that it requires fewer manual steps. At the moment, though, he said the technology is "working very reasonably at a very low cost."
Members of the team have filed a patent related to some aspects of the HaploSeq strategy.