Researchers at Stanford University have demonstrated the use of Illumina's Moleculo technology for phasing human genomes and shown that it can determine allele-specific methylation patterns in a human genome and identify hundreds of previously unknown differentially methylated regions.
Moleculo was a Stanford University spinout, formed in 2012 by Steve Quake and purchased in early 2013 by Illumina. In essence, Moleculo's technology aims to generate long reads by fragmenting genomic DNA into roughly 10-kilobase pieces, tagging those pieces with unique barcodes, breaking them up further, sequencing them with short-read technology, and then using the barcodes to assemble the short reads back into the original long fragments.
Currently, Illumina offers the technology as a service, but plans to introduce Moleculo kits this year.
Stanford University was an early-access user of Moleculo technology, and researchers there detailed in a Nature Biotechnology publication last week the technology's ability to phase human genomes and determine allele-specific methylation patterns.
Volodymyr Kuleshov, a former consultant with Moleculo who is now a member of Mike Snyder's Stanford laboratory and the lead author of the paper, told In Sequence that the technology has two main components: the long-read component and an algorithm especially suited to dealing with those long reads. Kuleshov developed the algorithm while he was a consultant at Moleculo. The group dubbed the whole approach statistically aided long-read haplotyping (SLRH). The algorithm portion uses both the long reads and linkage information to piece together haplotype blocks, resulting in a five-fold improvement in accuracy over traditional statistical methods, Kuleshov said.
In the study, the researchers demonstrated that with an additional 30 gigabases of sequence, they could phase genotypes identified by 50x coverage whole-genome sequencing, and that by using SLRH, they could phase 99 percent of SNVs from three human genomes into haplotype blocks between 0.2 and 1 megabase in length.
Next, they showed that having haplotype information helped them to obtain a base-resolution map of methylation across the human genome.
Kuleshov said that the method is "similar in spirit to other methods" such as Complete Genomics' long-fragment read technology or fosmid-based approaches.
The first step is to create long DNA fragments, he said, in this case around 10 kb in length. Those fragments are then diluted into a 384-well plate such that each well has between 3,000 and 6,000 fragments. Within each well, the fragments are amplified, cut into shorter fragments, and barcoded.
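The dilution figures above lend themselves to a quick back-of-the-envelope check. The sketch below is illustrative only: it assumes a haploid human genome of roughly 3.1 gigabases (a figure not given in the article) and treats every fragment as exactly 10 kb. The function name `well_stats` is hypothetical.

```python
# Back-of-the-envelope dilution arithmetic for a 384-well plate.
GENOME_BP = 3.1e9      # assumption: approximate haploid human genome size
FRAGMENT_BP = 10_000   # from the article: fragments of around 10 kb
WELLS = 384            # from the article: 384-well plate


def well_stats(fragments_per_well):
    """Return how much DNA a well holds and what share of the genome that is."""
    dna_per_well = fragments_per_well * FRAGMENT_BP
    return {
        "dna_per_well_bp": dna_per_well,
        "fraction_of_haploid_genome": dna_per_well / GENOME_BP,
        "plate_physical_coverage": dna_per_well * WELLS / GENOME_BP,
    }


if __name__ == "__main__":
    for n in (3_000, 6_000):
        print(n, well_stats(n))
```

Under these assumptions, a well with 3,000 fragments holds 30 megabases of DNA, or about 1 percent of a haploid genome, so two fragments from the same locus but different parental chromosomes should only rarely land in the same well. Across the whole plate, the fragments add up to roughly 3.7x to 7.4x physical coverage.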
Where the Moleculo technique differs from, say, Complete Genomics' LFR approach, said Kuleshov, is that it uses PCR for amplification as opposed to MDA. "The advantage is we have less bias, but our long-read [fragments] are shorter than those generated from MDA," he said. The PCR-generated fragments are rarely longer than 10 kb, while MDA-generated fragments can be as long as 80 kb.
Next, the researchers created well-specific sequencing libraries using Nextera's DNA transposase library prep kit before pooling the wells together for sequencing on one lane of an Illumina HiSeq. They then aligned the reads to the reference genome and mapped them back to their original wells. Reads within each well were then clustered into groups thought to belong to the same fragment, and variants within each fragment were called based on the individual's genotype. In the Nature Biotechnology study, the researchers determined the genotype from 50x whole-genome sequencing. "Fragments called at this stage have N50 lengths of about 7 to 9 kb and cover the genome to a depth of about 4-8x," the researchers wrote.
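The article does not say how reads are grouped into fragments, so the following is only a minimal sketch of one plausible approach: within each well, aligned reads that sit close together on the same chromosome are merged into one inferred fragment. The gap cutoff `max_gap` and the function names are hypothetical, not taken from the paper. The `n50` helper shows how an N50 length like the one quoted above is conventionally computed.

```python
from collections import defaultdict


def cluster_reads(reads, max_gap=2000):
    """Group aligned reads into inferred long fragments.

    reads: iterable of (well, chrom, start, end) tuples.
    max_gap: hypothetical cutoff; reads in the same well and chromosome
             separated by more than this many bases start a new fragment.
    Returns a list of (well, chrom, start, end) fragments.
    """
    by_key = defaultdict(list)
    for well, chrom, start, end in reads:
        by_key[(well, chrom)].append((start, end))

    fragments = []
    for (well, chrom), spans in sorted(by_key.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end > max_gap:
                # Gap too large: close the current fragment, open a new one.
                fragments.append((well, chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        fragments.append((well, chrom, cur_start, cur_end))
    return fragments


def n50(lengths):
    """Length L such that pieces of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

Because each well holds only a small share of the genome, nearby reads carrying the same well barcode are likely to come from the same original molecule, which is what makes a simple distance-based grouping of this kind reasonable as a first approximation.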
"We sequence each fragment to a very low level," Kuleshov said. This provides a cost advantage: the method requires only about 30 gb of sequence data on top of whole-genome sequencing, compared with the 100 gb to 400 gb of additional data required by a fosmid-based approach or the LFR technique. But it also generates reads that are "short and very sparse," Kuleshov said.
As a result, "if we just use off-the-shelf phasing algorithms, we get haplotype blocks that are almost 10 times shorter than existing methods and we miss SNPs."
Thus, a key portion of the technique is the algorithm, dubbed Prism, which combines the long-read information with statistical methods. Other algorithms typically use either long-read information or statistical methods, but not both in combination, Kuleshov said. But using only long-read information would result in "many holes."
The Prism algorithm also has advantages over existing statistical methods because it can take into account the information generated from the long reads. Previous statistical methods were designed primarily for microarray data, Kuleshov said. But, "if you know that two positions are linked together by a long read, then you know something about their phase," he said. "We can now use this prior information and it improves our accuracy by five-fold over traditional statistical methods and it can bring the quality of our long read phasing to a level that's equal or superior to existing technologies," he said.
Prism works in two stages. First, it assembles fragments locally into haplotype blocks by connecting them at overlapping heterozygous SNVs. Next, it uses linkage information to piece together the local blocks into longer haplotype contigs. "Such contigs can phase up to 99 percent of heterozygous SNVs and up to 95 percent of heterozygous variants," the authors wrote. Prism also produces a confidence score for each junction between adjacent local blocks, reflecting the likelihood of introducing a phase-switch error due to statistical phasing. This lets users tune the algorithm for either higher accuracy with shorter haplotype blocks or longer haplotype blocks with lower accuracy. "The ability to make a trade-off between accuracy and completeness is a feature of Prism that, to our knowledge, is not provided by other phasing algorithms, and which we expect to be useful in applications that demand great precision," the authors wrote.
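To make the two stages concrete, here is a toy sketch. It is not the Prism implementation, whose internals the article does not describe. The first function greedily chains fragments that share heterozygous SNVs, and the second joins adjacent blocks only where a junction's confidence clears a user-chosen threshold. The data layout, the function names, and the per-junction `(confidence, switch)` inputs are all assumptions made for illustration; in Prism the junction information would come from statistical phasing against a reference panel.

```python
def assemble_local_blocks(fragments):
    """Stage 1 (toy version): chain fragments at shared heterozygous SNVs.

    fragments: list of non-empty dicts mapping SNV position -> allele (0 or 1).
    Returns a list of blocks, each mapping position -> allele on one
    haplotype; the other haplotype carries the complementary allele.
    Greedy single pass: a fragment is only compared with the current block.
    """
    blocks = []
    current = {}
    for frag in sorted(fragments, key=lambda f: min(f)):
        shared = [pos for pos in frag if pos in current]
        if not shared:
            # No overlap with the current block: close it and start a new one.
            if current:
                blocks.append(current)
            current = dict(frag)
            continue
        agree = sum(frag[pos] == current[pos] for pos in shared)
        # If most shared SNVs disagree, the fragment is from the other haplotype.
        flip = agree * 2 < len(shared)
        for pos, allele in frag.items():
            current.setdefault(pos, 1 - allele if flip else allele)
    if current:
        blocks.append(current)
    return blocks


def join_blocks(blocks, junctions, threshold):
    """Stage 2 (toy version): join adjacent blocks at confident junctions.

    junctions[i] is a hypothetical (confidence, switch) pair for the junction
    between blocks[i] and blocks[i + 1]; switch=True means the second block
    must be inverted to match the first.
    """
    if not blocks:
        return []
    contigs = [dict(blocks[0])]
    flipped = False  # orientation of the most recently added block
    for block, (confidence, switch) in zip(blocks[1:], junctions):
        if confidence >= threshold:
            flipped ^= switch
            contigs[-1].update(
                {pos: 1 - a if flipped else a for pos, a in block.items()}
            )
        else:
            # Not confident enough: leave a break and start a new contig.
            contigs.append(dict(block))
            flipped = False
    return contigs
```

Raising `threshold` leaves more junctions unjoined, which yields shorter contigs with fewer chances of a phase-switch error; lowering it does the opposite. That is the accuracy-versus-completeness trade-off the authors describe.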
The researchers demonstrated the ability of SLRH to phase a HapMap trio that had previously been phased using familial information. They prepared two phasing libraries for each member of the trio and evaluated SLRH at different accuracy thresholds. Running Prism on each library, the researchers demonstrated that at a 0.9 accuracy threshold, between 98 percent and 99 percent of all SNVs were phased in haplotype blocks with N50 lengths of 400 to 500 kb. Accuracy ranged from about 99.87 percent to 99.9 percent. Additionally, the replicate libraries were highly concordant with each other.
Finally, the team demonstrated that by obtaining a fully phased genome, they could analyze differential DNA methylation. On one of the HapMap samples, the researchers performed a MethylC-seq experiment and assigned the methylated short reads to their haplotypes. They identified 216,034 allele-specific methylation events in 992 differentially methylated regions that ranged in size from 6 bp to 3,181 bp. Ten of the differentially methylated regions were located in previously studied areas of the genome.
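The step of assigning methylated short reads to haplotypes can be illustrated with a small sketch. It assumes each read has already been reduced to the alleles it shows at known SNV positions plus its per-CpG methylation calls. That data layout and the function names are hypothetical. It also ignores a real complication: bisulfite conversion makes unmethylated cytosines read as thymines, so C/T SNVs cannot be used naively to tell haplotypes apart.

```python
from collections import defaultdict


def assign_haplotype(read_alleles, phased):
    """Vote on which haplotype a read came from.

    read_alleles: {snv_position: base observed in the read}
    phased: {snv_position: (base_on_haplotype_0, base_on_haplotype_1)}
    Returns 0 or 1, or None if the read is uninformative or tied.
    """
    votes = [0, 0]
    for pos, base in read_alleles.items():
        if pos in phased:
            for hap in (0, 1):
                if base == phased[pos][hap]:
                    votes[hap] += 1
    if votes[0] == votes[1]:
        return None
    return 0 if votes[0] > votes[1] else 1


def allele_methylation(reads, phased):
    """Tally methylation calls per CpG site, split by haplotype.

    reads: list of (read_alleles, cpg_calls), where cpg_calls maps
           CpG position -> True if methylated in that read.
    Returns {cpg_position: [[methylated, total] for hap 0, same for hap 1]}.
    """
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for read_alleles, cpg_calls in reads:
        hap = assign_haplotype(read_alleles, phased)
        if hap is None:
            continue  # read covers no informative SNV; skip it
        for cpg, methylated in cpg_calls.items():
            counts[cpg][hap][0] += int(methylated)
            counts[cpg][hap][1] += 1
    return dict(counts)
```

A CpG site whose methylated fraction differs sharply between the two haplotypes would be a candidate allele-specific methylation event. Any real analysis would add a statistical test and minimum coverage requirements on top of these raw counts.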
Researchers elsewhere have also developed haplotyping methods. Recently, a group from the University of California, San Diego, developed a method called HaploSeq that relies on Hi-C sequencing to generate chromosome-scale haplotype maps.
In general, the Moleculo approach described in the recent Nature Biotechnology study generates dense phasing profiles, but those phasing profiles tend to be contained within shorter haplotype blocks than methods like HaploSeq, which generates chromosome-scale haplotypes that tend to be sparser.
According to Siddarth Selvaraj from UCSD, who helped develop the HaploSeq method, the Moleculo method and Prism algorithm are "a very good advancement to the field."
Selvaraj added that it would be interesting to see how the Moleculo method could be applied to plant genomes. The Prism algorithm relies on using linkage information from a reference panel to generate haplotype blocks. "If you don't have a comprehensive reference panel, Prism can't make the longer haplotypes," he said.
Compared to HaploSeq, the technique generates somewhat shorter haplotype blocks, but at a "very high resolution," Selvaraj said. He said that he has not yet tested the Moleculo technology, so could not comment on its cost or ease of use, but said that from the study it "seems to be elegant and usable."
Additionally, he said that it could be complementary with HaploSeq, which generates chromosome-scale haplotype blocks but at a significantly lower resolution. "Combining the two could yield chromosome-sized haplotypes with much higher resolution," he said.
Kuleshov agreed that the two methods could be complementary. HaploSeq "gives you very long range information, but it's less accurate," he said. Additionally, the two methods yield different types of errors, he said. "In our case, we have very few times that a single SNP slips, but sometimes there's a switch between the mom and dad," he said.
One way to combine the two methods would be to "pre-phase using our blocks and then use the [HaploSeq] technology to connect our blocks," Kuleshov said. "That would combine the strengths of both."