By Monica Heger
A new sequencing technique developed by researchers at the University of Oregon could be useful for de novo genome assembly of short-read sequencing fragments and could also help researchers obtain haplotype information.
The technique, published last month in PLoS One, is based on paired-end sequencing of restriction-site associated DNA fragments, or RAD-PE. For each fragment, one end is the sequence of the restriction site, while the other is a unique sequence. The restriction site sequence serves as a marker by which to localize fragments to a specific region of the genome.
"The idea is that instead of trying to take all the sequence reads from whole-genome shotgun sequencing and assemble them all together, [there is a] way to extract out a subset of the sequences that belong to the same genomic region," said Eric Johnson, senior author of the paper and associate professor of biology at the University of Oregon.
RAD-PE simplifies assembly, Johnson said, because it enables a local assembly of the subset of reads that share the same cut site sequence. Haplotype information can also be determined by identifying which cut sites are heterozygous, he added.
In the proof-of-principle study, Johnson and his team tested several modifications of the method, optimizing it for various applications, and demonstrated that it can simultaneously assemble long contigs, identify SNPs between two stickleback fish individuals, and determine haplotype information.
The method is similar to a subassembly method developed by Jay Shendure's team at the University of Washington, in which tags identify short reads as being from the same longer fragment (IS 1/19/2010).
In the first iteration of the protocol, the team tested whether the method could identify SNPs between two threespine stickleback individuals from a phenotypically polymorphic population. The team created barcoded RAD libraries for paired-end sequencing on the Illumina Genome Analyzer with 60-base-pair reads and obtained around 4 million reads per sample.
To create the libraries, the team first digests the DNA with a restriction enzyme, and then adds a barcode and sequencing adapter. Next, they randomly shear the DNA and add a second adapter to the sheared end. After sequencing, the reads can be grouped by their specific RAD tags and assembled.
The reads associated with each RAD site were separated out, and only the RAD sites that had at least 30 reads but no more than 1,000 reads were assembled. The team determined that a site with more than 1,000 reads was likely to be a repetitive sequence, while those with fewer than 30 were unlikely to have sufficient coverage to call polymorphisms.
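The grouping and filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual pipeline: the function names are hypothetical, and it assumes that reads belong to the same RAD site when their restriction-site reads are identical in sequence, which the article does not specify. The thresholds of 30 and 1,000 reads are the ones reported above.

```python
from collections import defaultdict

# Thresholds reported in the article: sites with fewer than 30 reads lack
# coverage, and sites with more than 1,000 reads are likely repetitive.
MIN_READS = 30
MAX_READS = 1000


def group_by_rad_tag(read_pairs):
    """Group mate reads by the sequence of their RAD-site read.

    read_pairs is an iterable of (rad_read, mate_read) string tuples.
    Assumption: identical RAD-site read sequences mark the same site.
    """
    groups = defaultdict(list)
    for rad_read, mate_read in read_pairs:
        groups[rad_read].append(mate_read)
    return groups


def select_sites(groups, min_reads=MIN_READS, max_reads=MAX_READS):
    """Keep only the RAD sites whose read count is within the bounds."""
    return {
        tag: mates
        for tag, mates in groups.items()
        if min_reads <= len(mates) <= max_reads
    }
```

Each surviving group of mate reads would then be passed to a local assembler, one RAD site at a time.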
The team identified 40,441 high-quality SNPs between the two individuals in 15,132 contigs, with an average of 2.6 SNPs per contig. They selected 13 polymorphisms to be validated, all of which were confirmed by PCR.
The method also enabled them to identify haplotypes. In cases where the RAD site sequences were heterozygous, containing a polymorphism specific to one of the homologous chromosomes, the team was able to assemble contigs with haplotype information. They identified putative haplotypes in contigs from one of the fish, and then confirmed them with Sanger sequencing.
Johnson said the ability to do haplotyping across the entire genome would depend on the heterozygosity rate of the specific individual, and also the cut frequency of the restriction enzyme used. If you cut with a high frequency enzyme, "you could get long, overlapping contigs and you [would] find a heterozygous RAD tag within all regions," he said.
The team also modified the method to obtain high coverage of the whole genome of E. coli. To do this, they used a high-frequency restriction enzyme, which produced overlapping fragments several kilobases long.
The team obtained 2 million sequence reads, identifying 52,917 unique RAD sequences. Assembling the reads produced 70,319 contigs with an N50 of 649 nucleotides. Mapping the contigs back to the reference, they found that 54,189 of the contigs had no errors. Of the 13,850 contigs with a single error, most had the error near the end of the contig in a region with low coverage.
Over 99 percent of the genome was covered by at least one contig, while over 91 percent of the genome was covered by at least five contigs.
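For readers unfamiliar with the N50 statistic cited for the E. coli assembly, the following Python sketch shows one common way to compute it: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The function name and the toy input are illustrative and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length L or longer
    together contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("n50 requires at least one contig length")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Toy example: total is 54 bases, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```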
While the team demonstrated it is possible to do whole-genome sequencing using the method, Johnson said the ideal applications are more for population genomics, performing local assemblies to compare genetic markers, and finding haplotype information. The method could also be used to fill gaps in whole-genome sequencing studies.
Other groups have already begun using the method for population genomics. John Davey, a postdoc in Mark Blaxter's lab at the University of Edinburgh, and his colleagues recently published a paper in PLoS One demonstrating that the method is useful for comparative genomics and linkage mapping in non-model organisms.
"The advantage of RAD is that in one sequencing run, we can discover and genotype thousands or tens of thousands of markers in non-model organisms," he said.
The University of Edinburgh team recently tested the method on 24 individuals of the diamondback moth, a common crop pest, to see if they could identify insecticide resistance. Sequencing all 24 individuals on one lane of the Illumina GA with 51-base paired-end reads enabled them to identify 23 genes on the chromosome associated with resistance.
Additionally, the team was able to construct a linkage map, locating the involved genes on specific chromosomes. While the study was a proof-of-principle, because some markers of resistance were already known, Davey said the method allowed them to see the area of resistance at a much higher resolution.
"For each of these markers we were able to discover, we could assemble a 500 to 600 base pair contig around that marker," Davey said. "That's extremely useful for putting a species into genomic context."
He said that the method should be extremely useful for researchers working in ecology and evolutionary biology because it allows them to find genomic markers of non-model organisms without having to do whole-genome sequencing. "You only need the recombination spots," he said.
Eventually, whole-genome sequencing will be cheap enough that the method won't be necessary, but it's not there yet, he said. "This will be an important method for the next few years."