NEW YORK (GenomeWeb) – Researchers at Ohio State University recently published a paper that describes how they used a version of the MAKER pipeline to identify genetic markers in Penstemon, a genus of flowering plants from the Plantaginaceae family, from low-coverage whole genome shotgun sequencing data.
These markers could help researchers reconstruct the evolutionary relationships among the species within the genus, according to the study's authors.
The researchers published the details of their methodology in a recent issue of Applications in Plant Sciences, a journal published by the not-for-profit organization BioOne. MAKER is developed by researchers in the laboratory of Mark Yandell, a professor of human genetics and an adjunct professor of biomedical informatics at the University of Utah. Its developers describe it as "a portable and easily configurable genome annotation pipeline" intended for smaller eukaryotic and prokaryotic genome projects. The software, which comprises multiple analysis tools, "identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality values," according to its creators.
The OSU researchers used MAKER2, an updated version of the initial pipeline. They selected this particular pipeline because it offered a "practical framework for identifying contigs containing gene regions, even when the majority of the sequences are short (~400–500 bp) and genomic resources are only available from distant relatives of the target organism," they wrote. It also supports "direct characterization of genomic features, such as exon boundaries, which can be used to design primers for future PCR-based sequencing efforts."
Specifically, according to the paper, the researchers used MAKER2 to identify genetic markers useful for phylogenetic studies from six extremely low-coverage Penstemon genomes generated by 454 sequencing. Although this particular genus has been the subject of multiple studies, determining the phylogenetic relationships among its 280 species has been difficult because of its recent rapid evolutionary radiation, the researchers wrote. Some studies have successfully highlighted aspects of the genus' phylogeny but "relationships for taxa within and among subgenera, sections, and subsections were not always consistent with current taxonomy, and relationships within clades having strong support were largely unresolved," according to the paper.
Essentially, "[we were] trying to tease apart drivers of this radiation in a phylogenetic context," Paul Blischak, a doctoral student in Ohio State's department of evolution, ecology, and organismal biology and the lead author of the study, explained to GenomeWeb. To do that, "we needed molecular markers and sequencing loci to be able to infer a phylogeny."
Large multi-locus datasets can help researchers suss out these sorts of intra-genus relationships, and high-throughput sequencing techniques like low-coverage WGS, or genome skimming, can be used to generate the requisite marker sets. However, discovering low-copy nuclear loci is more difficult with this sequencing approach because it may produce only small fragments, the researchers wrote, and sifting through these fragments is challenging.
A potential solution is to use pipelines such as MAKER2 that combine functionality from several programs. MAKER2 includes tools for repeat masking, ab initio gene prediction, protein alignment, and more, according to the researchers. Moreover, the pipeline can "act as a wrapper program for the training of gene prediction algorithms, such as SNAP or Augustus," the paper states.
For the study, the researchers collected over 40,000 contigs from the six Penstemon species and trained a gene prediction algorithm, SNAP, within MAKER2 to make predictions specifically for the genus. In total, they fully annotated nearly 1,900 genes and predicted over 8,400 genes. They were also able to design primers for chloroplast, mitochondrial, and nuclear loci. The amount of useful data that the researchers were able to identify from relatively sparse datasets was one of the exciting findings, according to Blischak. He and his colleagues note in their paper that an alternative to using low-coverage data might simply be to sequence the genomes at a higher coverage. However, in cases where financial constraints are a concern, "it's nice to know that you can still get useful information from [low-coverage data]," he said.
The researchers also compared the MAKER2 pipeline's performance to standard approaches such as Blast, which rely on sequence similarity. They looked at how much potentially useful data each approach produced and at the length of the resulting annotations or alignments. They reported that, on average, MAKER2 identified longer gene regions and returned fewer sequences than Blast searches. They also reported an average of 7.14 percent sequence variation among the six species of Penstemon studied.
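For readers unfamiliar with how a figure such as average percent sequence variation can be derived, the following is a minimal Python sketch that averages pairwise percent differences across an alignment. It is an illustration only: the article does not describe how the authors calculated their 7.14 percent figure, and the function names, the rule for skipping gaps and ambiguous bases, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations

# Characters skipped when comparing columns (an assumption for this sketch).
SKIP = {"-", "N"}

def percent_difference(a: str, b: str) -> float:
    """Percent of comparable alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in SKIP or y in SKIP:
            continue  # ignore gaps and ambiguous bases
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * mismatches / compared

def mean_pairwise_variation(alignment: dict[str, str]) -> float:
    """Average percent difference over all pairs of sequences in the alignment."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_difference(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy alignment; these are not real Penstemon sequences.
toy = {
    "species_a": "ATGGCCTTAGAC",
    "species_b": "ATGGCTTTAGAC",
    "species_c": "ATGACCTTA-AC",
}
print(round(mean_pairwise_variation(toy), 2))
```

Averaging over pairs is only one of several reasonable definitions; a per-column measure, such as the proportion of variable sites, would generally give a different number for the same alignment.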
The researchers plan to use the markers identified in this study for additional phylogeny studies in Penstemon, Blischak said.
Moreover, "we now have a gene prediction model that has been designed specifically for Penstemon," they wrote. "Such a model will be useful for any future WGS or other NGS projects involving the genus, and has the capability of being continually updated as we gather more data from transcriptome sequencing and higher-coverage WGS efforts."