A consortium of researchers led by scientists at the Max Planck Institute for Developmental Biology plans to sequence at least 1,001 Arabidopsis thaliana plants from around the world in order to enable genome-wide association studies that would link phenotypic differences with genotypic variation in the plant.
As a proof-of-concept for the project, planning for which began last year, the organizers recently published the genomes of three Arabidopsis strains, sequenced on Illumina’s Genome Analyzer platform.
The idea for the one- to two-year project followed a microarray-based genotyping study of 20 diverse Arabidopsis strains that was published in Science last year. That study was a collaboration among Detlef Weigel, a plant scientist and director of the MPI in Tübingen, Germany; Joe Ecker at the Salk Institute in La Jolla, Calif.; Magnus Nordborg at the University of Southern California; Perlegen Sciences; and others.
“We realized that the SNP chip is great, but it’s not going to get us directly to the causal changes,” Weigel told In Sequence last week.
Early last summer, his lab first tested sequencing Arabidopsis on Illumina’s Genome Analyzer. The results were promising: Because of the genome’s relatively small 120-megabase size and its lack of repeats relative to the human genome, “even with the short reads produced by the Illumina machine, one can actually see a lot of differences” between genomes, Weigel said.
“From that, the idea [for the project] was born. And seeing that the human sequencers were aiming at 1,000 genomes, we thought that was a pretty good number, and we came up with 1,001,” he said.
The plan is to take 10 Arabidopsis populations from each of 10 regions across Europe and Asia and sequence 10 individual plants from each population, for 1,000 plants in total. In addition, the researchers want to sequence at least one line originating from North Africa.
“Now, we are trying to get the [broader scientific] community behind this and get the funding for this,” Weigel said.
In a grant proposal that Weigel and Richard Mott, a mouse geneticist at the Wellcome Trust Center for Human Genetics in Oxford, UK, submitted to the European Research Council, Weigel estimated the materials cost for the entire project at €1 million ($1.35 million). Weigel said he expects a funding decision in the near future.
With funding from the Max Planck Society and the German Research Council, Weigel’s lab has committed to sequencing 80 strains within the next three months, some of which have already been completed.
“And then we will see how quickly we are going to get to the 1,001,” he said.
His group currently owns a single Illumina Genome Analyzer, “but we must have gotten the best machine that Illumina has ever produced,” he said, noting that none of the almost 60 runs on the instrument, which have produced more than 100 gigabases of data, have failed so far. If additional funding for the project becomes available, he said, he will acquire another Illumina sequencer.
Besides Weigel’s and Mott’s labs, the project currently includes researchers at the Department of Energy’s Joint Genome Institute, the Salk Institute, the Sainsbury Laboratory in the UK, the University of Lausanne, and the University of Southern California.
According to its website, the project has already sequenced 11 strains on Illumina’s platform, one with 454’s technology, and one with a combination of Illumina’s GA and ABI’s SOLiD system. All other strains chosen for sequencing so far are slated to be sequenced on Illumina’s sequencer.
“I assume it’s going to be distributed over maybe three to four [Illumina] instruments, so that it can be done in a year,” Weigel said. “Even with a single instrument, it would be possible to finish it within two years.”
But Weigel said he hopes other technologies will also be used in the project.
“We would not be unhappy if others would do 454 sequencing,” he said, explaining that de novo assemblies from 454 reads would increase the number of reference genomes against which to map the Illumina reads.
“Hopefully, there is going to be a smaller number of genomes — a few dozen — which are going to be sequenced with longer reads, and then we are going to do this much larger number with the shorter reads,” he said.
Ecker, in collaboration with Roche/454 Life Sciences, JGI, and Yale University, has already sequenced one Arabidopsis strain de novo using the 454 technology.
In a proof-of-concept study, which was published online in Genome Research last month, Weigel and his colleagues sequenced three Arabidopsis strains — including the Columbia reference strain — to 15- to 25-fold coverage, using single reads from Illumina’s Genome Analyzer.
For their analysis, they built a pipeline for aligning reads and predicting SNPs and indels up to three base pairs in size. That pipeline, called Shore, will soon be available for download from the project’s website.
They also developed a targeted de novo assembly method that uses the Velvet short read assembler developed by the European Bioinformatics Institute (see In Sequence 3/18/2008) to assemble unmapped reads together with reads that frame an uncovered region of the genome.
“Doing this, we have been able to generate thousands of what we call ‘targeted assemblies,’ which we then map back to the genome, and that are able to bridge deletions or insertions,” Weigel said.
He said his lab can now also generate paired-end reads on the Illumina platform, which offers a choice between short 200-base-pair inserts and long 3-kilobase inserts. He and his colleagues currently plan to sequence the first 80 genomes with short paired-end reads, “and at the current technology, this is how we would do the 1,000 genomes,” he said. Velvet and other assemblers, he noted, can now also handle paired-end data.
Because Arabidopsis is homozygous, a low coverage on the order of four-fold will be sufficient, he said, enabling the scientists to sequence eight strains in a single Illumina run. The researchers then hope to figure out from common SNPs which haplotype blocks are shared, and to combine this information across strains.
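The figures Weigel cites imply a rough sequencing budget, which can be worked out in a few lines. The sketch below is illustrative arithmetic only: it uses the 120-megabase genome size, four-fold coverage, and eight strains per run mentioned in the article, and it assumes for simplicity that every strain is sequenced the same way and that all reads contribute to coverage. The function and variable names are invented for this example.

```python
import math

# Figures quoted in the article
GENOME_SIZE_MB = 120     # approximate Arabidopsis genome size, in megabases
COVERAGE = 4             # target fold coverage per strain
STRAINS_PER_RUN = 8      # strains expected to fit in a single Illumina run
TOTAL_STRAINS = 1001     # project target

def bases_needed_mb(genome_mb, coverage, strains=1):
    """Raw sequence (in megabases) needed to cover `strains` genomes at `coverage`-fold."""
    return genome_mb * coverage * strains

per_strain_mb = bases_needed_mb(GENOME_SIZE_MB, COVERAGE)
per_run_mb = bases_needed_mb(GENOME_SIZE_MB, COVERAGE, STRAINS_PER_RUN)

# Simplifying assumption: every strain is sequenced at the same low coverage
runs_needed = math.ceil(TOTAL_STRAINS / STRAINS_PER_RUN)

print(per_strain_mb)  # 480
print(per_run_mb)     # 3840
print(runs_needed)    # 126
```

Under these assumptions, each run would need to yield just under 4 gigabases of usable sequence, and the full set would take about 126 runs.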
In addition, they plan to sequence a small number of the lines with 3-kilobase mate pairs to get at larger structural variation, and use this information to infer structural variation in other strains, according to Weigel.
Mapping the reads and predicting SNPs and small indels will take approximately a day for several genomes, he estimated, which is no longer than producing the data. The downstream analysis, on the other hand, will likely take more time.
Eventually, the sequence data will be available on a website on which Arabidopsis researchers will be able to enter their phenotypic data and automatically run association scans.
“They come back a few hours later and it tells them, ‘These are your best candidates for alleles that affect the trait of interest,’” Weigel said.
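To give a sense of what such an automated scan does, here is a deliberately simplified sketch. It is not the project’s method, which the article does not describe. It assumes homozygous lines with one of two alleles (coded 0 or 1) at each SNP, and it ranks SNPs by the absolute difference in mean trait value between the two allele groups. A real association scan would use a formal statistical test and account for population structure. The data and names are hypothetical.

```python
from statistics import mean

def association_scan(genotypes, phenotypes):
    """Rank SNPs by absolute difference in mean phenotype between the two alleles.

    genotypes:  dict mapping SNP name -> list of alleles (0 or 1), one per strain
    phenotypes: list of trait values, one per strain, in the same strain order
    """
    scores = {}
    for snp, alleles in genotypes.items():
        ref = [p for a, p in zip(alleles, phenotypes) if a == 0]
        alt = [p for a, p in zip(alleles, phenotypes) if a == 1]
        if not ref or not alt:
            continue  # SNP does not vary in this sample, so there is nothing to compare
        scores[snp] = abs(mean(alt) - mean(ref))
    # Largest difference first: these are the "best candidates"
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: six strains, three SNPs, one trait (say, days to flowering)
genotypes = {
    "snp_a": [0, 0, 0, 1, 1, 1],
    "snp_b": [0, 1, 0, 1, 0, 1],
    "snp_c": [0, 0, 0, 0, 0, 0],
}
phenotypes = [20.0, 22.0, 21.0, 30.0, 31.0, 29.0]

ranked = association_scan(genotypes, phenotypes)
print(ranked[0][0])  # snp_a
```

In this toy data set, snp_a separates the early- and late-flowering strains cleanly and ranks first, while snp_c is skipped because it does not vary among the strains.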
Weigel said he hopes the project could also inspire other plant researchers to take a similar route with their own species of choice. “Hopefully, in a year or two, we are going to have the 1,001 rice genomes, corn genomes, wheat genomes,” he said.