By Monica Heger
Researchers from the Leibniz Institute for Age Research at the Fritz Lipmann Institute in Germany have performed a proof-of-principle study demonstrating that 454 amplicon sequencing can determine haplotypes in the beta-defensin locus — a highly variable region of the human genome that harbors variations that have been implicated in a number of diseases.
The results, which were published this month in BMC Genomics, demonstrate that 454 amplicon sequencing could be an effective method for haplotyping copy number variable loci in cohorts, according to the authors.
"Copy number variation is an important feature of the human genome," Matthias Platzer, senior author of the study, told In Sequence. He noted that the beta-defensin gene cluster is "one of the most copy number variable regions of the human genome," and is generally present in between two and 12 copies per genome.
The roughly 87-kb region, located at chromosome 8p23.1, is known to be involved in inflammation and innate immunity and has been associated with psoriasis, Crohn's disease, and prostate cancer, "but these are still just the tip of the iceberg because the region is hard to tackle," Platzer said. "So we are trying to extend the range of technologies to get new insights into that region."
According to the paper, approaches such as PCR, cloning, and Sanger sequencing could be used to characterize multisite sequence variations and copy number in highly variable regions, but "these methods are labor and cost intensive as well as prone to methodological bias introduced by bacterial cloning." As an alternative, Platzer and his colleagues opted for amplicon sequencing of pooled individual PCR products by the 454 platform.
Platzer added that the region has been difficult to characterize because it is so variable. He said that next-generation sequencing technologies could be helpful because of their much larger sequencing output, and he chose 454 in particular because of its comparatively long read lengths.
The researchers evaluated six PCR products covering a total of 1,498 base pairs that spanned approximately 87 kb of the beta-defensin locus for 11 different HapMap samples. All samples were derived from lymphoblastoid cell lines and the amplified regions were selected to contain a high number of known multisite sequence variants.
They used PCR and fusion primers to generate eight amplicons specific to the beta-defensin locus. Then they pooled the samples and sequenced the amplicons on Roche's 454 GS FLX. They generated about 142,000 reads that could be assembled to the amplicon reference sequences, with average read lengths of 225 base pairs.
They identified 22 haplotypes, between two and seven per amplicon. Twenty-four known SNPs, or multisite variations, were identified, as well as two novel sequence variations.
They then verified their haplotypes with plasmid subcloning and Sanger sequencing, which confirmed all but two of the haplotypes. Additional 454 sequencing confirmed that the two rare haplotypes were real, indicating that 454 sequencing may be better able to find rare haplotypes due to its deeper coverage.
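The idea of reading haplotypes directly off long amplicon reads can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name, the toy reads, the variant positions, and the minimum-support cutoff are all assumptions made for illustration. Each read is assumed to be already aligned to its amplicon reference, so the alleles at the known variant positions on a single read form one haplotype string, and a haplotype is kept only if enough reads support it.

```python
from collections import Counter


def call_haplotypes(reads, variant_positions, min_support=2):
    """Count haplotypes observed across the reads of one amplicon.

    reads: equal-length strings, assumed already aligned to the reference.
    variant_positions: 0-based positions of known multisite variants.
    min_support: hypothetical cutoff; haplotypes seen in fewer reads are dropped.
    """
    # Alleles at the variant positions of one read make up one haplotype.
    counts = Counter(
        "".join(read[p] for p in variant_positions) for read in reads
    )
    return {hap: n for hap, n in counts.items() if n >= min_support}


# Toy reads, invented for illustration only (not data from the study).
toy_reads = ["ACGTA", "ACGTA", "ATGTC", "ATGTC", "ATGTC", "ACGTC"]
haplotypes = call_haplotypes(toy_reads, variant_positions=[1, 4])
```

In this toy case the haplotype seen in only one read falls below the cutoff. With deeper coverage, a genuinely rare haplotype would collect enough supporting reads to be kept, which is the advantage the study attributes to 454 sequencing.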
Platzer said that while his team identified 22 haplotypes, much more genetic variation in that region likely exists. "We've analyzed only a few samples. There is probably more genetic variation out there in different populations," he said.
He said that he and his colleagues wouldn't have been surprised had they found more variation in the samples they evaluated. "We didn't aim at a comprehensive catalog, just a proof of principle to show that 454 could determine such haplotypes."
Omer Gokcumen, a population geneticist at Brigham and Women's Hospital, said that the method is a good first step towards characterizing an extremely complex region, but that it was still "far from complete."
One aspect that would be important to improve is the primer set. According to the paper, two out of eight amplicons were underrepresented by sequence reads, with about 10 times fewer reads than the other six amplicons. Thus, the authors excluded data from those two amplicons.
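A coverage check of this kind can be sketched in a few lines of Python. The paper's actual exclusion rule is not described here, so the median-based threshold, the function name, and the read counts below are hypothetical and serve only to show the idea of flagging amplicons with far fewer reads than the rest.

```python
from statistics import median


def underrepresented(read_counts, fraction=0.2):
    """Return amplicons whose read count is far below the median.

    read_counts: mapping of amplicon name to number of reads.
    fraction: hypothetical cutoff, as a share of the median read count.
    """
    mid = median(read_counts.values())
    return sorted(name for name, n in read_counts.items() if n < fraction * mid)


# Toy counts, invented for illustration only (not data from the study).
toy_counts = {
    "amp1": 20000, "amp2": 22000, "amp3": 21000, "amp4": 23000,
    "amp5": 19000, "amp6": 24000, "amp7": 2000, "amp8": 2100,
}
flagged = underrepresented(toy_counts)
```

Using the median rather than the mean keeps the reference level from being pulled down by the low-coverage amplicons themselves.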
Gokcumen noted that the authors did not offer a good explanation for why the primers for those two amplicons failed to capture the targeted regions. It could be that complex repetitive regions complicated the sequencing or the assembly, he said.
"Because there's all this variation [in the beta-defensin locus and other highly variable loci], they may be more likely to have genotypes that have clinical outcomes. So, it's very crucial to study them correctly, and I don't think we have a good way of doing that yet," Gokcumen added.
Another important aspect of the study was to show that the method could determine both the copy number and the sequence variations in the region, said Platzer. "Both the copy number and the sequence variation in these copies determines the gene expression of that region," he said. "Only the knowledge of both will help to understand how they are related to the mechanisms of disease."
He said that in the future the data could be used to look for associations with disease, but that a more comprehensive catalog of haplotypes and copy number variation would first need to be assembled, which will require closer study of the beta-defensin locus. In the current study, the researchers looked at only about 1 percent of the region.
"They are only amplifying a very small portion of the 87-kilobase region where they know there are variants," Gokcumen said. "It's helpful, but at the end of the day, what we need in order to do [disease association studies] is copy number information for the entire region. And honestly, I don't know how to [get] that."
Gokcumen said that the beta-defensin gene cluster is so complex, with highly repetitive and homologous regions, that even within the same locus in the same individual, some areas could have different copy numbers.
Platzer acknowledged that characterizing the entire region will be a challenge. He said that one thing his team is considering is other types of enrichment methods.
In the BMC Genomics study, they used a PCR-based approach, but he said that targeted enrichment and capture methods would allow them to enrich a larger portion of the region. He is planning to evaluate both NimbleGen's and Agilent's reagents and protocols for targeted resequencing and will continue to use 454 for sequencing because of its long read lengths, though he added that he will also evaluate Illumina's sequencing technology.
Platzer said that for the foreseeable future, it will be difficult to characterize the entire region, but he said that with current technologies, it should be possible to accurately characterize between 30 percent and 50 percent of the region.
Gokcumen said that targeted enrichment plus sequencing could be a good approach, but that it would have limitations and challenges of its own. For instance, he said it would be difficult to design unique capture probes because of the large number of repeat elements in the region. And if the capture probes were not unique, biases would be introduced into the method.
"I think it's doable, but tricky," he said, adding that a hybrid approach combining an array capture method plus the technique described in the current paper could also help.