NEW YORK (GenomeWeb) – Plant scientists from the US and the Czech Republic have developed a targeted sequencing and genome skimming strategy aimed at assessing both low-copy plant genes and high-copy genetic elements such as sequences encoded in plant organelles.
The method, known as Hyb-Seq, scours transcriptome and genome sequences from a representative plant in a given lineage and uses information from these sequences to design probes for targeted capture and sequencing in related plants, Oregon State University botany and plant pathology researcher Aaron Liston told In Sequence.
The idea was to come up with a targeted sequencing approach that was "specific enough that it would pick up primarily low-copy genes," he explained, "but broad enough that it could work across a range [of plants] — across an entire genus or a group of related genera."
As they described in a protocol note appearing in the journal Applications in Plant Sciences, Liston and his co-authors used Hyb-Seq to look at sequences from plants in the same lineage as the milkweed plant Asclepias syriaca. Using milkweed transcriptome sequences and a draft version of the plant's genome, they designed 80- to 120-base-pair probes to capture and sequence thousands of exons from the genomes of a dozen related plant species or genera.
Through Illumina MiSeq sequencing on these enriched samples, the team successfully assembled sequences coinciding with the majority of the genes and exons targeted. From the off-target reads, meanwhile, it put together sequences representing the plants' plastomes and other high-copy sequences.
Although targeted sequencing has been used for humans and other animals for many years, the approach is more difficult to apply across plant species due to the enormous complexity of plant genomes, which are prone to duplication.
"For targeting, you want to target things that are 'single-copy' in your genome," Liston explained. "That's one of the big challenges: whole-genome duplication."
Plants also lack the sort of phylogenetically informative ultra-conserved elements that are often targeted for animal studies, he noted. Consequently, targeted capture techniques and related phylogenetic analyses have generally been developed on a species-by-species basis.
"The challenge was to make something that would work across a range of close relatives — different genera, for example," Liston said.
To that end, the team turned to its own genome and leaf/bud transcriptome data for the milkweed — a representative plant from its initial lineage of interest.
By mining this unpublished sequence data, the researchers designed targeted capture probes corresponding to 3,385 milkweed exons. These coding sequences, in turn, coincided with coding sequences for 768 genes suspected of being present in single copies in the milkweed genome.
Possible gene paralogs were weeded out of the probe design process by excluding targets within milkweed that share 90 percent sequence similarity or higher. Likewise, the team tossed potential single-copy targets that spanned fewer than 120 bases or so in an effort to enrich for plant sequences that were at least as long as the original probes.
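The two filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the function name, the exon IDs, and the idea of feeding in pre-computed pairwise identities are all assumptions made for the example, and only the 90 percent and 120-base thresholds come from the article.

```python
def filter_targets(exon_lengths, pairwise_identity, min_length=120, max_identity=0.90):
    """Return the sorted IDs of candidate exons that pass both filters.

    exon_lengths: dict mapping a (hypothetical) exon ID to its length in bases.
    pairwise_identity: dict mapping (exon_a, exon_b) to fractional identity,
        assumed to have been computed beforehand by an alignment tool.
    """
    # Any exon in a pair at or above the identity cutoff is treated as a
    # possible paralog, so both members of the pair are excluded.
    possible_paralogs = set()
    for (exon_a, exon_b), identity in pairwise_identity.items():
        if identity >= max_identity:
            possible_paralogs.update((exon_a, exon_b))

    # Keep only exons that are long enough and not flagged as paralogs.
    return sorted(
        exon_id
        for exon_id, length in exon_lengths.items()
        if length >= min_length and exon_id not in possible_paralogs
    )


# Made-up example data, for illustration only.
exon_lengths = {"exonA": 300, "exonB": 280, "exonC": 95, "exonD": 150}
pairwise_identity = {("exonA", "exonB"): 0.94, ("exonA", "exonD"): 0.61}
kept = filter_targets(exon_lengths, pairwise_identity)
print(kept)
```

In this toy input, exonA and exonB are dropped as a too-similar pair and exonC is dropped for being shorter than 120 bases, leaving only exonD.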
Through solution hybridization, the investigators used the final probe set to enrich for related sequences from 10 other Asclepias species and two plants from nearby genera, Calotropis procera and Matelea cynanchoides, before sequencing the resulting libraries with Illumina's MiSeq.
With the help of a reference guided assembly approach, they then aligned the targeted capture reads to the original milkweed sequences.
The approach made it possible to pick up some part of nearly 93 percent of the exons the team had targeted. Together, those sequences offered a look at 99.7 percent of the genes initially sought.
From the 760 or so loci that they began with, for instance, the investigators lost 60 to 70 percent of markers when looking at more distant members of the family, Liston noted. Nevertheless, enough information remained to begin looking at relationships within the milkweed lineage.
Between almost 2 percent and 13 percent of sequences spanning the original 768 genes varied amongst the plants included in the study, for example, offering clues to phylogenomic relationships in the milkweed lineage.
It remains to be seen whether there is an optimum number of Hyb-Seq markers for delineating relationships in this or other plant lineages, Liston noted. He and his colleagues are currently putting together a phylogenetic tree for this plant group that's built around sequences coinciding with roughly 1 percent of the milkweed's exome.
While that may be modest in a whole-genome context, the Hyb-Seq strategy provides far more resolution than that available from plant family trees built with data from just one or a few genes, Liston argued. "Just getting the sample size up to 1 percent is, I think, going to lead to much more robust phylogenies in the future."
Even if some markers cannot be detected in some or all of the plants tested, the remaining loci continue to provide phylogenetic clues, he explained, whereas existing PCR-based gene-by-gene approaches to looking at such plant relationships are "very onerous, very time-consuming."
The targeted sequencing technique currently has an efficiency of 50 percent or so, meaning that around half of the sequences coming out of Hyb-Seq experiments correspond to targeted regions of the plant genomes, Liston explained.
The remaining sequences typically stem from high-copy sequences, including ribosomal DNA and the genomes of organelles such as mitochondria and plastids. Plastid sequences tend to be particularly common, he noted, given that these genomes are between 10 and 100 times as prevalent as mitochondrial genomes.
Those high-copy sequences "can be readily assembled" from the sequence data, Liston said, noting that the analytical pipelines used to look at low-copy plant genes from the nucleus and high-copy organellar sequences are slightly different.
At the moment, there do not seem to be differences in the applicability of Hyb-Seq within different types of plants or plant lineages. Prior to the paper's publication, the authors shared details of the Hyb-Seq approach with investigators working on grasses, legumes, and a range of other land plants.
"You could do this for any plant of interest," Liston said. "With a transcriptome and a genome skim, you can basically have all your data to design a set of probes that will work across that entire genus."
For the current study, he and his colleagues generated their own draft genome sequence for milkweed. Genome quality is not especially important for this application, according to Liston, as long as the bulk of the plant's gene space is represented.
Nevertheless, the availability of both transcriptome and genome sequences from the plant used for probe design is important for accuracy, he explained, particularly given the variability that tends to spring up in the intronic sequences that fall between coding portions of related plant genes.
Introns "are too variable to reliably use as a probe when you want to go across species," Liston insisted, explaining that probes designed to match these variable sequences are less likely to effectively enrich for sequences across different plant species. By targeting the rapidly evolving introns, "you're going to lose the flanking exon too."
Instead, he recommends targeting adjacent coding sequences, since the exon-targeting probes will inevitably pick up at least some neighboring intron sequences. "It's part of the whole splash zone idea. You target the exon, but depending on your read length … you can easily pick up 250 base pairs on either side of your exon," Liston said.
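The "splash zone" idea amounts to simple interval arithmetic, sketched here in Python. The function name and the 0-based, half-open coordinate convention are assumptions for illustration; the 250-base flank is the figure Liston quotes, and the actual reach depends on read length and library fragment size.

```python
def splash_zone(exon_start, exon_end, flank=250, contig_length=None):
    """Return the region likely covered around a targeted exon.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The region is clipped so it never runs off either end of the contig.
    """
    start = max(0, exon_start - flank)
    end = exon_end + flank
    if contig_length is not None:
        end = min(end, contig_length)
    return start, end


# A 180-base exon in the middle of a contig gains 250 bases on each side.
print(splash_zone(1000, 1180))

# An exon near the contig edges is clipped at both ends.
print(splash_zone(100, 220, contig_length=400))
```

The extra sequence on either side of the exon is where the more variable intronic sequence would be recovered without ever being targeted directly.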
On the other hand, probes designed using transcriptome data alone are apt to inadvertently span sequences interrupted by introns in the genome.
"Computationally, you can predict where the introns are," Liston said. "But … the introns change rapidly, so having the genome is the best way to find the introns."
For the time being, Hyb-Seq probe design is expected to be most effective when both genome and transcriptome data are available, he argued, though that may change as more and more plant transcriptomes and genomes become available and intron predictions improve.
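The problem with transcriptome-only probe design can be illustrated with a short Python sketch. It assumes exon lengths have already been worked out by comparing the transcript with the genome; the function names, the coordinate convention (0-based, half-open transcript positions), and the example numbers are hypothetical and are not taken from the Hyb-Seq protocol.

```python
from itertools import accumulate


def exon_junctions(exon_lengths):
    """Transcript positions where one exon ends and the next begins."""
    # Cumulative exon lengths mark the junctions; the final value is the
    # end of the transcript, not a junction, so it is dropped.
    return list(accumulate(exon_lengths))[:-1]


def spans_junction(probe_start, probe_end, junctions):
    """True if an exon junction falls strictly inside the probe."""
    return any(probe_start < junction < probe_end for junction in junctions)


# A made-up transcript built from three exons of 150, 90, and 200 bases.
junctions = exon_junctions([150, 90, 200])
print(junctions)

# Three candidate 120-base probes in transcript coordinates.
for probe in [(10, 130), (40, 160), (240, 360)]:
    print(probe, spans_junction(probe[0], probe[1], junctions))
```

A probe flagged here would match the spliced transcript but would be interrupted by an intron in genomic DNA, which is the situation the genome sequence helps probe designers avoid.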
For their part, members of Liston's Oregon State University group are applying Hyb-Seq to a wide range of phylogenetic and targeted gene experiments. For instance, they hope to get a better look at not only the nature of plant relationships with one another, but also the extent to which certain gene duplications are shared or distinct between closely related species.
The team believes the technique has potential for other applications as well, including efforts aimed at exploring the biological features of the organelles producing the high-copy sequences detected by Hyb-Seq. Population genetic studies are another possibility, Liston pointed out, though the extent to which the method can be applied across large sample sizes may still be limited by cost.
The price of the approach changes over time and depends on the technology used. Generally speaking, though, the team has been able to do the Hyb-Seq analysis for between $50 and $100 per sample.
Most of the samples the researchers are considering at the moment are multiplexed at both the hybridization and sequencing levels.
So far, they have relied exclusively on Illumina sequencing technologies, though the same general approach is expected to be compatible with other high-throughput sequencing technologies such as Ion Torrent.
The team is pleased with the way the method itself is performing, though Liston noted that there may be room for improvement in curbing the cost of library preparation and/or making the analytic side of the pipeline more straightforward.
"The [analytical] methods are so much still in development," he said. "Making the methods more available to a broad audience would be good."