1001 Arabidopsis Genomes Effort Hones Assembly, Annotation, Analysis Strategies to Capture Diversity


By Andrea Anderson

Research teams affiliated with the international 1001 Arabidopsis Genomes project are relying on whole-genome re-sequencing of carefully selected strains of Arabidopsis thaliana — combined with RNA sequencing, partial de novo assembly, and re-annotation, in some cases — to get a handle on the model plant's genetic and geographic diversity.

"It's very important to move beyond just a catalog of variants and trying to interpret what those things do relative to the reference," Richard Mott, senior author of one of two new Arabidopsis studies and a bioinformatics and statistics researcher at the Wellcome Trust Centre for Human Genetics at the University of Oxford, told In Sequence.

Two new studies published this week, including one by Mott and his colleagues, are also ratcheting up the Arabidopsis genome count, moving researchers a bit closer to the goal of sequencing 1001 Arabidopsis genomes (IS 10/7/08).

For their study, appearing online in Nature this week, Mott and colleagues from the UK, Germany, and US sequenced, assembled, and annotated the genomes of 18 A. thaliana accessions that have a worldwide distribution and a range of phenotypic features. RNA-sequencing data generated for seedlings from these accessions not only provided information on gene expression in the plants, but also helped in verifying coding SNPs, annotating the genomes, and identifying loci that influence gene expression in Arabidopsis.

"Using the genomes, seedling transcriptomes, and computational gene predictions we have characterized the ancestry, polymorphism, gene content, and expression profile of the accessions," the researchers wrote. "We show that the functional consequences of polymorphisms are often difficult to interpret in the absence of gene re-annotation and full sequence data."

Mott and his co-authors generated roughly 27- to 63-fold coverage of each genome, using the Illumina Genome Analyzer platform to sequence both 200-base pair and 400-base pair libraries for most of the strains.

In general, each of the new genomes was about one to two percent smaller than the 119-million base Col-0 reference genome, which represents an accession known as Columbia that was sequenced in 2000.
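As a rough illustration of what these figures imply, the sketch below converts fold coverage into total sequenced bases and computes the genome sizes corresponding to "one to two percent smaller" than the reference. It assumes that coverage is measured against the 119-million base reference size, and the read length passed to `read_pairs_needed` is a hypothetical value, since the article does not state one.

```python
REFERENCE_BASES = 119_000_000  # Col-0 reference size quoted in the article


def bases_for_coverage(genome_size, fold_coverage):
    """Total sequenced bases implied by an average fold coverage."""
    return genome_size * fold_coverage


def read_pairs_needed(genome_size, fold_coverage, read_length):
    """Paired-end read pairs needed to reach the coverage.

    read_length is a hypothetical input; the article gives no read length.
    """
    total = bases_for_coverage(genome_size, fold_coverage)
    pair_bases = 2 * read_length
    return -(-total // pair_bases)  # integer ceiling division


low_total = bases_for_coverage(REFERENCE_BASES, 27)
high_total = bases_for_coverage(REFERENCE_BASES, 63)

# Genome sizes one and two percent smaller than the reference.
one_percent_smaller = REFERENCE_BASES * 99 // 100
two_percent_smaller = REFERENCE_BASES * 98 // 100

print(f"27x coverage: {low_total:,} bases")
print(f"63x coverage: {high_total:,} bases")
print(f"1-2% smaller: {two_percent_smaller:,} to {one_percent_smaller:,} bases")
```

Under these assumptions, the lowest-coverage genome corresponds to a little over 3 billion sequenced bases and the highest to about 7.5 billion.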

The Columbia accession and the 18 accessions sequenced in the Nature study have been crossed to make more than 700 Arabidopsis lines in the Multiparent Advanced Generation Inter-Cross, or MAGIC, collection, Mott explained, and analyses of the parental strains are expected to help interpret data for MAGIC descendants in the future.

The team's analyses uncovered 1.2 million insertions and deletions, along with millions of SNPs in the new genomes. Compared to the Col-0 reference, each accession tested contained between 497,668 and 789,187 single-base variants. Of these, about 100,000 SNPs per strain turned up in coding sequences that were also interrogated by RNA-sequencing of seedling tissue.

"[RNA-sequencing] gave us an independent test of whether the polymorphisms we had discovered were correct or not," Mott said. "We were able to replicate 99.7 percent of SNPs — of those SNPs which were inside genes, for which we had RNA-seq data."

Together, the team's findings indicate that disease resistance and environmental response genes are most apt to exhibit variation and expression differences from one strain to the next. By bringing together their genome sequence and gene expression data, they also started looking for variants that act as expression quantitative trait loci, influencing gene expression in the plant.

"This differs somewhat from an association study, in that here we aim to have a very, very complete catalog of sequence variation so that we can actually test, in many cases, association with the causal variant rather than a tagging SNP," Mott said. "So we have quite a lot of power for doing that."

A Global View

Researchers from Germany and Spain, meanwhile, used the Illumina paired-end sequencing approach to get 10- to 20-fold coverage of another 80 A. thaliana genomes. For that study, which appeared online in Nature Genetics this week, the team focused on strains collected in eight different parts of the world.

Analyses of those genomes uncovered more than 4.9 million SNPs, nearly 810,500 small insertions and deletions, and 1,059 copy number variants. The work also illustrated how genetic variation differs in these plant populations, peaking in plants from areas where Arabidopsis has been found for a long time and waning in places where it was introduced more recently.

As such, the study provides a look at the genetic diversity of Arabidopsis plants across and within different geographic locations, yielding information on variation throughout the plant's range, which is one of the goals of the 1001 Arabidopsis Genomes project, explained Detlef Weigel, a researcher at the Max Planck Institute for Developmental Biology and senior author on the study.

The plan for the 1001 Arabidopsis Genomes effort originally called for a hierarchical approach, with initial efforts focused on deep sequencing and whole-genome assembly of a small number of strains, followed by deep sequencing of some strains without whole-genome assembly and, finally, more superficial sequencing of a very large strain set, Weigel explained.

"That plan was, of course, drawn up three years ago and … in the meantime, sequencing has become a lot cheaper, so it's questionable whether it still makes sense to sequence the strains at different coverages," he said.

Although 11 research institutes are contributing to the project, he added, most 1001 Arabidopsis Genomes-related research is being done by labs working more or less independently, primarily because the project does not have a single funding source.

"This initial idea to do it in a very coordinated way didn't come quite to pass," Weigel said.

"It's not the same as the [human] 1000 Genomes Project, in that it's not a singly funded entity," Mott added. "It's more of a kind of umbrella. Many groups around the world are sequencing Arabidopsis genomes and this is a way of bringing everything together."

Assembly and Annotation

Even without a centralized funding source, the project is continuing to move forward. And as more and more Arabidopsis genomes are generated, researchers are working on strategies to assemble and analyze the genomes in ways that accurately represent the range of the plant's diversity and coding potential.


Earlier this year, Weigel and his co-authors described their strategy for doing whole-genome assembly for four Arabidopsis strains. As they reported in the Proceedings of the National Academy of Sciences, they created these assemblies by aligning Illumina short-read sequences to an Arabidopsis reference genome first and then adding in de novo information using reads that didn't map to this reference.

That study built on earlier work by Weigel and colleagues at Max Planck and the University of Utah, published in Genome Research in 2008, showing that it was possible to capture SNPs and sequences not present in the reference genome using short-read sequence data. For that study, researchers re-sequenced the genomes of the Arabidopsis reference and two other strains.

In their new Nature study, Mott and his colleagues used a similar "hybrid" assembly strategy, using the algorithm Stampy to align short-read sequence data to the Arabidopsis reference, when possible, and then using the SOAPdenovo algorithm to perform de novo assembly for parts of the genome where the reference and re-sequenced strains differed from one another.

"We did align to the reference in a kind of iterative fashion," Mott explained. "So you would align to the reference, you would alter the reference genome wherever you were very confident about where differences were, and then you'd repeat this process about five times, until you didn't really get any more changes."

Even so, he explained, the iterative alignment approach alone misses chunks of the genome that differ dramatically from the reference. To try to capture these regions, the team did de novo assemblies in these parts of the genome, which produced "a lot of contigs, but varying lengths," Mott said.

The researchers then combined their iterative and de novo assembly approaches, mapping the newly assembled de novo contigs onto the iterative assemblies, he explained, generating genome assemblies that were better than those assembled by either iterative or de novo approaches alone.

When disagreements arose between the two assembly methods, the team used computational strategies to determine which assembly was more accurate. Overall, Mott said, the approach "worked pretty well when we compared it to bits of sequence which were determined by more classical methods."
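The iterative step Mott describes (align to the reference, rewrite the reference where differences are well supported, and repeat until nothing changes) can be illustrated with a toy sketch. This is not Stampy or SOAPdenovo and not the team's pipeline: the ungapped aligner, the one-mismatch limit, the depth threshold, and the short sequences are all assumptions chosen to show why a second pass can recover a variant that the first pass misses.

```python
from collections import Counter


def best_hit(read, ref, max_mismatches=1):
    """Return the start of the closest ungapped placement of read, or None."""
    best_pos, best_dist = None, max_mismatches + 1
    for start in range(len(ref) - len(read) + 1):
        dist = sum(a != b for a, b in zip(read, ref[start:start + len(read)]))
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos


def update_reference(ref, reads, min_depth=2):
    """One round: place reads, pile up bases, rewrite well-supported differences."""
    piles = [Counter() for _ in ref]
    for read in reads:
        start = best_hit(read, ref)
        if start is None:
            continue  # too divergent from the current reference to place
        for offset, base in enumerate(read):
            piles[start + offset][base] += 1
    new_ref, changes = list(ref), 0
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue
        base, count = pile.most_common(1)[0]
        if base != ref[pos] and count * 2 > depth:  # strict majority only
            new_ref[pos] = base
            changes += 1
    return "".join(new_ref), changes


def iterative_consensus(ref, reads, max_rounds=5):
    """Repeat alignment and reference updates until a round changes nothing."""
    for round_number in range(1, max_rounds + 1):
        ref, changes = update_reference(ref, reads)
        if changes == 0:
            break
    return ref, round_number


# Hypothetical toy data: a 40-base reference and a strain with three nearby SNPs.
reference = "ATGGCGTACCTTAGACGGATTCAAGTCCGATAGCTTGACA"
snps = {18: "G", 20: "G", 22: "C"}
sample = "".join(snps.get(i, base) for i, base in enumerate(reference))
reads = [sample[i:i + 10] for i in range(len(sample) - 9)]  # error-free 10-mers

one_pass, _ = iterative_consensus(reference, reads, max_rounds=1)
final, rounds = iterative_consensus(reference, reads)
print(one_pass == sample, final == sample, rounds)
```

In this toy case, the first pass fixes only the two outer SNPs, because every read covering the middle SNP carries at least two mismatches and cannot be placed. Once the reference is updated, those reads align and the middle SNP is called in the second pass. Regions that differ more dramatically would still fail to align, which is the gap the de novo contigs are meant to fill.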

These whole-genome assemblies are expected to serve as a resource for helping to detect even more of the variation present in Arabidopsis as more genomes are sequenced and analyzed down the road.

"One can exploit these whole-genome assemblies, using the sequences that are now being generated for hundreds of strains, to map the sequences from these other strains back to the whole-genome assemblies to capture more variants than you find by just comparing to the reference genome," Weigel explained.

In Genome Biology in 2009, for instance, he and co-authors proposed a strategy dubbed "GenomeMapper" to graph information from multiple Arabidopsis genomes into a single map containing individual representations of parts of the genome that are the same across sequenced Arabidopsis strains and multiple representations of areas of the genome where differences are found.

"The idea is relatively simple: that once you have many reference genomes, you could, of course, map not just against the original reference genome but you could map against all of the references," Weigel said.
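A minimal sketch of the idea Weigel describes, mapping a read against several references rather than one, appears below. It is not GenomeMapper, which the article says builds a single graph-like map; here each reference is simply scanned in turn with an ungapped comparison, and the reference names and sequences are hypothetical.

```python
def mismatches(read, ref, start):
    """Count mismatching bases for an ungapped placement of read at start."""
    return sum(a != b for a, b in zip(read, ref[start:start + len(read)]))


def best_placement(read, references):
    """Return (mismatch_count, reference_name, start) for the closest placement."""
    candidates = [
        (mismatches(read, seq, start), name, start)
        for name, seq in references.items()
        for start in range(len(seq) - len(read) + 1)
    ]
    return min(candidates)


# Hypothetical sequences: an original reference and one assembled strain.
references = {
    "original_reference": "ATGGCGTACCTTAGACGGATTCAAGTCCGATAGCTTGACA",
    "assembled_strain": "ATGGCGTACCTTAGACGGGTGCCAGTCCGATAGCTTGACA",
}
read = "CGGGTGCCAG"  # a read from a strain resembling assembled_strain

only_original = best_placement(
    read, {"original_reference": references["original_reference"]}
)
with_all = best_placement(read, references)
print(only_original, with_all)
```

Against the original reference alone, the read has no exact placement. With the assembled strain included, it matches perfectly, so the variants it carries are interpreted in the context of a closer genome.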

The next step will be to do whole-genome assemblies for Arabidopsis that don't rely on the reference, he said. "Once we have true de novo assemblies, a few dozen, we can actually go back and use the sequences generated already to extract even more information," he said.

Along with these assembly strategies, careful annotation of these new Arabidopsis genomes is helping to make sense of how genetic variation affects gene content in the plant.

For example, by creating new gene models for each of the 18 lines they sequenced, Mott and his colleagues found that each accession codes for an estimated 24,700 genes, on average. Among them: 717 genes that had not been annotated in Arabidopsis previously. Their RNA-sequencing data, meanwhile, identified more than 42,300 transcript sequences per accession, on average, of which around 2,300 were novel.

This transcriptome data also proved useful for annotating the genomes, by offering clues to alternative splicing models for genes. For example, while genome sequence data in the Nature study suggested that about one-third of Arabidopsis genes contained protein-disrupting changes in at least one accession, adding in RNA sequencing data from these strains showed that variations in splicing patterns actually corrected most of the predicted protein problems.

"We can find a surprising number of changes affecting a single gene," co-corresponding author Gunnar Rätsch and his group at the Max Planck Society Friedrich Miescher Laboratory in Tübingen, Germany, said in a statement. "However, they are often compensated for and therefore often have no significant effect on the gene products."

En Route to 1001

Together, the new studies bring the tally of published Arabidopsis genomes to around 100, Weigel said, though he noted that data has been released for about 300 more genomes sequenced by teams led by investigators at the Gregor Mendel Institute in Vienna and the Salk Institute in California.

A research team at Monsanto plans to sequence as many as 500 more Arabidopsis strains using Illumina's HiSeq 2000 platform. Monsanto researcher Todd Michael told In Sequence that the team may also use the Pacific Biosciences platform for some aspects of that work.

University of Chicago researcher Joy Bergelson provided the seeds for the Monsanto Arabidopsis sequencing effort, Michael explained, and the company plans to provide raw sequence data to Bergelson and her team as strains are sequenced.

Monsanto will also make Arabidopsis genome sequence data available to other members of the research community, he said. "We want to contribute to enable more Arabidopsis research."

With such efforts underway, Weigel said he is "quite hopeful" that all 1001 Arabidopsis genomes will be sequenced by the end of the year.

He and his colleagues are currently turning their attention to two related plants, A. lyrata and Capsella rubella. The team published a study on the A. lyrata genome in Nature Genetics earlier this year.

Mott and his team, meanwhile, are continuing to glean genome information for A. thaliana plants in the MAGIC population, doing low-coverage sequencing on these lines.
