By Julia Karow
While many researchers agree that resolving both haplotype sequences of a human genome will be essential to relate genetic variation to phenotype and disease, most large-scale genome sequencing projects currently do not resolve haplotypes because of the technical challenges of phasing.
To give a taste of what the future might look like, and to demonstrate the potential impact of haplotype on gene function and disease, researchers at the Max Planck Institute for Molecular Genetics in Berlin recently created what they say is the most complete haplotype-resolved genome to date, through next-generation sequencing of fosmid pools. Earlier this month, they published an analysis of the genome, dubbed "Max Planck One" or MP1, online in Genome Research.
They now plan to use their approach on a larger number of individuals from a German population cohort, as well as to characterize a cancer sample.
The published study represents "the first real workup of an individual molecular haplotype architecture, which should lay the foundation for a diploid biology," Margret Hoehe, a group leader at the MPI for Molecular Genetics and the senior author of the paper, told In Sequence. While previous efforts have attempted to phase individual genomes, they have been less comprehensive than this one.
"In the future, once even more powerful technologies will be developed, haplotype-resolved genomes will become the norm," she said.
For their study, the researchers analyzed the genome of a participant of PopGen, a German population genetics research study. At the time of sample collection, he was a healthy 51-year-old with no history of severe disease, and his sample is one of 100 for which fosmid libraries had been created. While PopGen participants were phenotyped initially, no follow-up data is available for them.
The researchers created pools of about 5,000 fosmids, each with haploid inserts of about 40 kilobases, and combined three of these into super-pools of 15,000 fosmids. They sequenced these to a genome coverage of 47x using the SOLiD platform, and called variants on data from combined fosmid pools and on 30x genomic DNA. Fosmids were tiled into contiguous haplotype sequences using an algorithm the researchers developed specifically for this task. Consumables costs for the project today would be on the order of €6,000 ($8,600), Hoehe said.
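To see why pooling a few thousand clones keeps most loci haploid within a pool, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not the researchers' method: the haploid genome size of 3.1 gigabases, the uniform random placement of clones, and the Poisson approximation are all assumptions introduced here, and the function names are hypothetical.

```python
import math

HAPLOID_GENOME_BP = 3.1e9  # assumption: approximate human haploid genome size
INSERT_BP = 40_000         # from the article: inserts of about 40 kilobases


def per_haplotype_depth(n_clones, insert_bp=INSERT_BP, genome_bp=HAPLOID_GENOME_BP):
    """Expected number of clones in one pool that cover a given locus on one haplotype.

    Assumes clones are drawn uniformly from the diploid genome (2 * genome_bp).
    """
    return n_clones * insert_bp / (2 * genome_bp)


def both_haplotypes_prob(n_clones):
    """Poisson estimate of the chance that a locus is covered by clones
    from both haplotypes within the same pool."""
    lam = per_haplotype_depth(n_clones)
    covered_one = 1 - math.exp(-lam)  # P(at least one clone from a given haplotype)
    return covered_one ** 2           # haplotypes treated as independent


if __name__ == "__main__":
    for label, n in [("pool", 5_000), ("super-pool", 15_000)]:
        print(f"{label}: depth per haplotype {per_haplotype_depth(n):.3f}, "
              f"P(both haplotypes at a locus) {both_haplotypes_prob(n):.4f}")
```

Under these assumptions, a locus is rarely represented by both parental copies in the same pool, which is what allows reads from one pool to be treated as coming from a single haplotype at most positions.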
In all, they were able to phase more than 90 percent of MP1's autosomal genome, assembling the two haploid genomes into about 6,300 contiguous sequences up to 6.25 megabases long, with an N50 length of almost 1 megabase.
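For readers unfamiliar with the N50 statistic, the following minimal sketch shows how it is commonly computed from a list of contig lengths. The contig lengths in the usage example are made up for illustration and are not MP1 data.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be a non-empty collection of positive numbers")


if __name__ == "__main__":
    # Hypothetical contig lengths in base pairs, for illustration only.
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contiguous blocks, which is why it serves as a rough measure of how far phase information extends along the genome.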
They determined the phase context for almost all SNPs, including more than 99 percent of heterozygous SNPs, and haplotype-resolved 132 large as well as about 80,000 small indels. Those results, they wrote, are expected to improve with further development of the phasing algorithm.
Almost 60 percent of genes – and three quarters of genes when including upstream regions – contained novel variants, which cannot be phased correctly using statistical approaches that infer haplotypes from population genetic data such as the 1000 Genomes Project, they showed.
Last year, a team at the University of Washington School of Medicine used an approach similar to that of the Max Planck researchers – sequencing pools of fosmids on the Illumina platform – to create the haplotype-resolved genome of an Indian individual. That analysis, published in Nature Biotechnology, was less comprehensive, phasing about 94 percent of heterozygous SNPs into haplotype blocks with an N50 length of about 390 kilobases (IS 12/21/2010).
Though the German group's assembly is "impressively long," according to Jacob Kitzman, a researcher at UW and the first author of the earlier paper, both studies suffer from the size limitations of 40-kilobase fosmids. Those clones, he said, are not long enough to bridge "unmappable" regions, such as segmental duplications, gaps, and centromeres. The MPI researchers "appear to have been more aggressive in assembling through these regions in their analyses, but there remains a need for experimental methods to bridge these gaps without introducing errors," Kitzman told In Sequence.
One Gene or Both?
In order to gauge the potential importance of phase with regard to disease, Hoehe and her colleagues looked at 171 genes that contained at least two deleterious mutations.
For 159 of these genes, they were able to phase the mutations, determining whether they are present on the same copy of the gene — leaving the other one intact — or affect both copies.
MP1, for example, has several mutations in the breast cancer predisposition gene BRCA1, but only one copy is affected, so he could still pass on a healthy copy to his children.
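The distinction between mutations on one copy and mutations on both copies can be sketched in code. The example below is a simplified illustration and not the pipeline used in the study: it assumes VCF-style phased genotypes ("0|1", "1|0", "1|1"), assumes that all variants in a gene belong to the same phase block, and uses hypothetical class and function names and made-up positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhasedVariant:
    gene: str
    position: int
    genotype: str  # VCF-style phased genotype, e.g. "0|1", "1|0" or "1|1"


def copies_affected(variants):
    """Return the set of haplotype copies (0 and/or 1) carrying a variant allele."""
    affected = set()
    for v in variants:
        alleles = v.genotype.split("|")
        if len(alleles) != 2:
            raise ValueError(f"genotype is not phased: {v.genotype!r}")
        for hap, allele in enumerate(alleles):
            if allele != "0":  # "0" denotes the reference allele
                affected.add(hap)
    return affected


def classify(variants):
    """Summarize whether the variants leave one gene copy intact."""
    affected = copies_affected(variants)
    if len(affected) == 2:
        return "both copies affected"
    if len(affected) == 1:
        return "one copy intact"
    return "no variant alleles"


if __name__ == "__main__":
    # Hypothetical positions; two variants on the same copy (cis).
    cis = [PhasedVariant("GENE_A", 100, "1|0"), PhasedVariant("GENE_A", 250, "1|0")]
    # Two variants on opposite copies (trans).
    trans = [PhasedVariant("GENE_B", 100, "1|0"), PhasedVariant("GENE_B", 250, "0|1")]
    print(classify(cis))
    print(classify(trans))
```

Without phase information, the cis and trans cases above look identical: each shows two heterozygous variants in one gene.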
Most clinical genetic tests today only test for the presence or absence of mutations, according to Hoehe. But examples like this show that "it could make a big difference whether the disease mutation is located on the same or different chromosomes."
For carriers of BRCA1 mutations, for instance, knowing whether one copy of the gene is free of mutations could change their assessment of breast cancer risk or affect treatment decisions, such as whether to undergo preventive surgery, according to the paper.
"It would be really worthwhile to conduct large cohort studies and test [the impact of phase], because there could be huge benefits to individuals," Hoehe said.
In another example, the researchers found that MP1 has two amino acid changes, one in each copy, of the CYP4F2 gene, which is known to influence warfarin metabolism. The mutations on the two copies may have "important implications for dosage" of warfarin treatment, the authors noted.
Haplotypes of the MHC region may play an important role in transplant medicine "and literally decide over health and disease," Hoehe said. For MP1, her team resolved haplotypes across the MHC region and identified more than 10,000 genetic differences, including 221 novel heterozygous SNPs, that are relevant in tissue matching or disease.
MP1 was healthy at the time his sample was taken, "but you can imagine that this sort of information is sorely needed in the disease studies now generating thousands of genomes and exomes," Kitzman said.
Sequencing pooled clones also helps to resolve "difficult" regions of the genome, he added, such as segmental duplications. "As we obtain whole-genome sequencing for more and more organisms, this sort of tool will be necessary to fully describe the genetic diversity that we find and understand how it connects to biological diversity."
Hoehe and her team now plan to sequence and analyze up to 20 additional genomes from the PopGen cohort using their clone-based sequencing approach, to test both whether it is scalable in practice and whether MP1's haplotype architecture is representative of the population.
In addition, they plan to sequence a breast cancer sample because there is "reason to believe that haplotype-specific effects may occur in breast cancer biology," she said. This project, for which they have already selected a patient, will also involve clinical data on treatment and follow-up.
In the future, Hoehe said, clone-based sequencing might be replaced by other methods to resolve haplotype sequences. Complete Genomics, for example, is working on a fragment-based approach, which does not involve cloning (IS 4/19/2011), and third-generation sequencing technologies also promise to provide haplotype information from long reads.