NEW YORK (GenomeWeb News) – Two new genome sequencing and analysis studies are providing insights into the genetic variation present in the model plant Arabidopsis thaliana — and adding to the genome tally for the 1001 Arabidopsis Genomes effort.
In the first of these studies, researchers from the UK, Germany, and the US sequenced, assembled, and annotated the genomes of 18 A. thaliana accessions that were selected, in part, for their worldwide distribution and phenotypic differences. By incorporating information from RNA-sequencing experiments of the plants, the researchers verified their SNP data and got clues that helped in annotating the genomes. The team described their findings online yesterday in Nature.
"Our project has a number of aspects which go beyond just genome sequencing," senior author Richard Mott, a bioinformatics and statistics researcher at the Wellcome Trust Centre for Human Genetics at the University of Oxford, told GenomeWeb Daily News. "It was a lot of annotation and transcriptome sequencing, so we got a very clear idea of what the consequences of [genetic] variations are."
Mott and his colleagues tackled the genomes of 18 Arabidopsis diploid accessions using Illumina GAII paired-end sequencing to get between about 27 and 63 times coverage of each genome. The team subsequently assembled the genomes using a combination of de novo assembly and iterative assembly to a 119 million base reference genome known as Col-0.
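For readers unfamiliar with the term, fold coverage is simply the total number of sequenced bases divided by the size of the genome. The sketch below is illustrative arithmetic only, not the team's pipeline: it uses the 119 million base Col-0 reference size and the 27- to 63-fold range quoted above, and the function names are hypothetical.

```python
# Illustrative arithmetic only; function names are hypothetical.
# Genome size is the 119 million base Col-0 reference cited in the article.
GENOME_SIZE_BP = 119_000_000


def bases_for_coverage(fold, genome_size_bp=GENOME_SIZE_BP):
    """Total sequenced bases needed to reach a given fold coverage."""
    return fold * genome_size_bp


def coverage_from_bases(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Fold coverage implied by a given number of sequenced bases."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    for fold in (27, 63):
        gigabases = bases_for_coverage(fold) / 1e9
        print(f"{fold}x coverage is roughly {gigabases:.1f} billion bases")
```

Under these assumptions, the reported range corresponds to roughly 3.2 billion to 7.5 billion sequenced bases per accession, before any filtering of low-quality or unmapped reads.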
Crosses between the 18 strains assessed have been used to create a set of more than 700 strains known as the Multiparent Advanced Generation Inter-Cross, or MAGIC, collection, Mott noted. And by characterizing the genomes of these parental strains, he explained, it should be possible to deduce information about those descendent strains as well.
"The 18 genomes we sequenced, along with the reference genome Columbia, are the progenitors of a large population of recombinant inbred lines called the MAGIC population," he said. "By sequencing the genomes in this Nature paper, we can now infer, effectively, the genome sequences of all of these other lines."
All told, the team found 1.2 million insertions and deletions and more than three million SNPs, with each of the 18 newly sequenced accessions harboring between 497,668 and 789,187 single base variants compared to the Col-0 reference.
The researchers verified almost all of the SNPs found in expressed protein-coding sequences using RNA-sequencing data generated from seedling material for each accession. They also did transcriptome sequencing using RNA from floral bud and root tissue for one of the accessions.
The RNA sequence data was useful not only for gauging gene expression in seedling tissue and verifying polymorphism patterns in coding sequences, Mott said, but also for helping to re-annotate each genome, pointing to places where splicing varied from one strain to the next.
For instance, whereas genome sequence and SNP data suggested that roughly one-third of Arabidopsis genes contained major changes in at least one of the accessions, the researchers' expression data showed that subtle splicing changes ameliorated the effects of most of these changes.
The team estimated that each of the accessions contains nearly 24,700 coding genes, on average, while RNA sequencing identified an average of more than 42,300 transcript sequences for a given accession. Overall, the new analyses uncovered 717 genes not found in Arabidopsis previously, including 496 sequences that do turn up in the reference genome but had not been annotated.
Most of the variations in gene sequence and expression levels from one strain to the next fell within genes that are predicted to contribute to processes such as disease resistance and environmental response, the researchers reported.
Bringing together genome sequence and gene expression data also provided an opportunity to find variants influencing gene expression differences, including some variants that appear to be causal expression quantitative trait loci.
In another study, appearing online yesterday in Nature Genetics, researchers from Germany and Spain sequenced 80 A. thaliana strains from eight regions of the world to get new insights into the role that geography plays in A. thaliana genetic variation. Their findings suggest that A. thaliana genetic variation is most pronounced in parts of the world where the plant has been around the longest, and lower in regions where it was introduced more recently.
"What we learned from this was that, similar to what you have in humans, the diversity in the places where the species has been around for a long time … is much greater than the diversity in regions which have only been recently populated," Max Planck Institute for Developmental Biology researcher Detlef Weigel, senior author on that study, told GWDN.
Weigel and his colleagues used the Illumina GAII to generate 10- to 20-fold coverage of 80 A. thaliana genomes, representing strains collected in Central Asia, the Caucasus, South Tyrol, the Swabia region of southwestern Germany, Spain's Iberian Peninsula and North Africa, Eastern Europe, Southern Italy, and Southern Russia. Between seven and 14 Arabidopsis accessions were tested from each area.
"Arabidopsis thaliana is native to Europe and Asia. You now find it in other regions as well, but it's very likely introduced," Weigel explained. "The way we chose the regions in Europe and Asia was that we wanted them to represent different aspects of the history of this species."
By comparing sequences from the 80 sequenced strains to an A. thaliana reference genome, the team found more than 4.9 million SNPs, nearly 810,500 small insertions and deletions, and 1,059 copy number variants in the new genomes.
The extent of the genetic diversity within the strains varied with geographic origin, Weigel explained. For instance, plants sampled from the Iberian Peninsula, North Africa, and southern Germany sites, where the oldest Arabidopsis populations are found, had much higher genetic diversity than those from the Alps and Central Asia, where the species was introduced relatively recently.
On the other hand, the rate of genetic changes predicted to be deleterious was higher in plants from these recently settled locales, the team reported, consistent with shorter exposure to selective forces that would remove these mutations from the genome.
"That's all consistent with migration," Weigel explained. "When you're a pioneer, you start with a small population and you have little diversity and selection hasn't yet removed new mutations that are deleterious."
By looking at the mutation patterns in the A. thaliana genomes and comparing them to those in the A. lyrata genome, published earlier this year, the researchers were also able to distinguish between old and new mutations in the Arabidopsis genome and their relative roles in adaptive processes in the plant.
"We've now been able, for the first time, to disentangle the effect of mutation in other processes, not just diversity," Weigel said. "What we find is that for new mutations, the spectrum is not too dissimilar from the one that we find in the laboratory and when we look at the old mutations the spectrum looks very different."
From these patterns, researchers concluded that processes such as biased gene conversion and recombination have had a larger role than mutations in recent genetic variation and adaptation in Arabidopsis.
Based on the genetic variations found so far, Weigel and his co-authors estimated that it should be possible to capture most common variation in the Arabidopsis genome by sequencing around 100 carefully chosen strains.
"When it comes to major variants — the vast majority of variants that are reasonably common throughout the populations — one already captures with 80 strains," Weigel said.
In addition to the roughly 100 Arabidopsis genomes described in published studies so far, he noted that research teams have released data on another 300 or so Arabidopsis genomes. With several teams already working to sequence additional genomes, including a team from Monsanto that plans to sequence up to 500 Arabidopsis genomes, Weigel said he's optimistic that the 1001 genome goal set out by contributors to the 1001 Arabidopsis Genomes project can be achieved by the end of the year.
Information about the 1001 Arabidopsis Genomes project and the genome sequence data released so far is available online.