NEW YORK (GenomeWeb News) – In a new set of studies, members of a team led by investigators at the University of California, Davis, the University of Maryland, and Johns Hopkins University described efforts to crack the genome of the loblolly pine, the largest genome sequenced and assembled so far.
In a paper appearing online today in Genome Biology, the team outlined the innovative assembly approaches it used to tackle a haploid version of the conifer plant's massive genome, ultimately putting together a 23.2 billion base loblolly pine draft genome.
The researchers provided more details on the sequencing and assembly steps in a recent Genetics study. In another paper in that journal, they touched on the tens of thousands of predicted protein-coding genes annotated in the newly generated sequence, along with repetitive sequences and other genome features behind the plant's biology.
For instance, the work indicated that some 80 percent or more of the loblolly pine's large genome is composed of repeat sequences. It also offered a preliminary look at some of the genes contributing to wood formation, the ability to withstand abiotic stresses such as drought, and disease resistance.
In particular, the group identified a genetic variant that appears to contribute to resistance to the fusiform rust-causing pathogen Cronartium quercuum, which may augment future efforts to breed plants resistant to that economically important disease.
"It was known that there were genetic determinants to resistance," University of California, Davis plant sciences researcher David Neale, a leader of the loblolly pine genome project and co-author on all three papers, told GenomeWeb Daily News.
"But actually having those genetic determinants in hand … would accelerate the breeding process," Neale said, explaining that such factors provide the opportunity to develop genetic tools to screen for rust resistance, something that has traditionally been done phenotypically.
Genome sequences from the project have been publicly released incrementally over the past two years, but today's papers mark the first formal publications spelling out the process of developing the complete loblolly genome assembly as well as the initial findings from it.
Members of the team published a study in PLOS One last fall that characterized repeat sequences found using some of the assembled BAC sequences and fosmid scaffolds for loblolly pine.
The loblolly pine is commercially important as a paper product source in the US. But material from the plant has also been proposed for use in feedstock and biofuel development.
In the interest of meeting demand for the plant and finding ways to breed loblolly varieties with enhanced traits of interest, Neale and his colleagues began the arduous task of tackling the full loblolly pine genome — something that was not financially feasible using Sanger sequencing alone.
"There was never an opportunity to sequence these genomes in the Sanger era," Neale said. "Funding agencies would not support Sanger sequencing of such a large genome."
He and his colleagues used paired-end Illumina sequencing to sequence genomic DNA from haploid seed material from a tree used in past loblolly pine breeding programs. They also made long-insert linking libraries with genomic DNA from diploid pine needle material and used fosmid reads and existing BAC sequences to bolster the assembly.
But because the genome is so large, spanning roughly seven times as many nucleotides as the human genome, the team had to come up with creative approaches for stitching these sequences together.
The team accomplished this task by pre-assembling short reads into super reads before putting them into a roughly 22 billion base genome assembly containing 20.1 billion sequenced bases covered to an average depth of more than 63-fold.
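To give a rough sense of what collapsing short reads into longer "super reads" can mean, the toy Python sketch below extends each read one base at a time for as long as the next k-mer is unambiguous, so that overlapping reads merge into the same longer sequence. It is an illustration only, not the team's software: the function names, the example reads, and the choice of k are all hypothetical, and it ignores sequencing errors, reverse complements, and paired-end information.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads (reads assumed >= k long)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(seq, kmers, k, to_right):
    """Grow seq in one direction while exactly one neighbouring k-mer exists."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        if to_right:
            anchor = seq[-(k - 1):]
            candidates = [anchor + b for b in "ACGT" if anchor + b in kmers]
        else:
            anchor = seq[:k - 1]
            candidates = [b + anchor for b in "ACGT" if b + anchor in kmers]
        # Stop at a dead end, at a branch, or when a k-mer repeats (a cycle).
        if len(candidates) != 1 or candidates[0] in seen:
            return seq
        seen.add(candidates[0])
        seq = seq + candidates[0][-1] if to_right else candidates[0][0] + seq


def super_reads(reads, k):
    """Return the set of maximal unambiguous extensions of the reads."""
    kmers = kmer_counts(reads, k)
    return {
        extend(extend(read, kmers, k, to_right=True), kmers, k, to_right=False)
        for read in reads
    }


# Hypothetical example: three overlapping reads from one 13-base sequence.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
example_result = super_reads(example_reads, k=4)
```

In this toy case all three reads extend to the same 13-base sequence, so three inputs collapse to one. Reducing the number of sequences an assembler must handle in this way is the general idea behind pre-assembly, though a production tool has to handle far more complications.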
That assembly was improved using deep transcriptome sequence information that the researchers generated using RNA from a few dozen loblolly pine tissues taken at different developmental stages.
The transcriptome data also proved useful for annotating the genome, which contains some 50,172 gene models.
Comparisons with sequences from more than a dozen other plants pointed to the presence of some 20,646 plant gene families in the loblolly pine genome. Of those, more than 1,500 are believed to be specific to conifer plants.
The team's analysis confirmed that the majority of the loblolly pine genome — some 82 percent — comprises repeat sequences. Many of those could be traced back to retrotransposons, though other repetitive elements were identified as well.
"We've discovered lots of new element types and the number, quantity, copy number, and all those things is now better described than it was before," Neale noted.
The loblolly pine genome is also expected to serve as a reference for future efforts aimed at sequencing the 100 or more other plants in the Pinus genus, which often have genomes in the 20 billion to 40 billion base range.
More broadly, Neale explained, the loblolly pine represents a large plant group known as the gymnosperms — non-flowering plants that produce exposed seeds — that has been under-represented in plant sequencing studies published so far.
"Having sequenced, now, the first high-quality reference genome enables [additional] sequencing and reference-guided assembly for a very large and important group of plants that hereto now has been ignored or recalcitrant to sequencing," he said.
The team is continuing to improve the loblolly pine genome. It also has sequencing studies of the sugar pine and Douglas fir underway.