This article was originally published July 17.
A University of Arizona research team has been awarded a $1.5 million grant from the National Science Foundation to create a reference genome sequence of West African cultivated rice using pooled-BAC sequencing with the Roche/454 Titanium platform.
The approach, described by principal investigator Rod Wing as "old school meets new school," is expected to overcome a number of challenges that have limited the use of second-generation technologies in de novo sequencing projects for large, repetitive genomes.
Wing, a professor of plant sciences at UA, was a member of the international consortium that sequenced the first rice genome, the 389-megabase Oryza sativa L. ssp. japonica cv. Nipponbare — a project that took six years and cost $200 million.
Wing told In Sequence last week that he expects the new method, developed with UA colleagues and collaborators at 454, to generate the de novo sequence of the 809-megabase O. glaberrima genome in six months for around $500,000.
The effort is part of a larger project called the International Oryza Map Alignment Project, or I-OMAP, which aims to create reference genomes for 10 distinct genome types within the Oryza genus and ultimately sequence all 23 rice species.
"There are a lot of groups out there that think, 'Okay, rice is done. We'll just use that as the template for everything,' but our philosophy is we really need to go back and generate these reference genomes for the other species," Wing said.
The key reason, he said, is that there is tremendous genomic variation within the genus. Genome size, for example, ranges from 350 megabases for O. brachyantha to nearly 1,700 megabases for O. minuta. Furthermore, Wing said, "we've found that even the [genomes of the] progenitors of cultivated rice differ by as much as 20 to 30 percent."
Because this difference is so great, "if you wanted to capture allelic diversity from a wild relative compared to a cultivated species, or let's say glaberrima versus … japonica, [without reference genomes for each species], you'll only capture what's the same, but for everything that's different you're not going to be able to define that because you don't know where it goes in the genome," he said.
While second-generation platforms promise to place the de novo sequencing of 10 reference genomes within reach, "technically it's not there yet," Wing said. "There might be some people who would argue it is, but I haven't seen any evidence of that yet. And even if it was possible to make an assembly, you have to look at the quality of the assembly."
He noted that the team decided to take a "conservative" approach, "instead of just resequencing glaberrima using Illumina, for example, and mapping it all back to Nipponbare. We definitely don't think that's going to be that productive. The idea is to generate reference genomes with as high a quality as possible — everything is mapped properly, all the repeats are there, and so on and so forth — and then you can resequence until the cows come home."
In order to find middle ground between the need for a high-quality assembly and the pressure to keep costs as low as possible, Wing and his colleagues developed an approach that takes advantage of two things: the relatively long reads of the 454 Titanium technology and BAC-based physical maps for all the Oryza genomes that have been generated as part of I-OMAP.
"These are BAC libraries with fingerprints and BAC-end sequences," Wing said. "We call this more of an old-school approach, but we have essentially physical frameworks for all the genomes of cultivated rice species as well as the wild rice species."
The method, described earlier this year in the journal Rice, sequences pools of BAC clones from a minimum tiling path across a region of the genome.
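For readers unfamiliar with the term, a minimum tiling path is the smallest set of overlapping clones that still spans a region end to end. The sketch below illustrates the idea with a standard greedy interval-cover routine; it is not the team's actual pipeline, and the clone names and physical-map coordinates are invented for illustration.

```python
from typing import List, Tuple

# (clone name, start, end) on a physical map; all values here are hypothetical.
Clone = Tuple[str, int, int]


def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedily pick the fewest clones whose intervals cover [region_start, region_end]."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start at or before the covered position,
        # keep the one that reaches farthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


# Hypothetical BAC clones along a 320-kilobase stretch (coordinates in kilobases).
example_clones = [
    ("bac_a", 0, 100),
    ("bac_b", 50, 180),
    ("bac_c", 90, 200),
    ("bac_d", 150, 300),
    ("bac_e", 190, 320),
]
tiling = minimum_tiling_path(example_clones, 0, 320)
pool = [name for name, _, _ in tiling]  # clones that would go into one sequencing pool
```

In this toy case three of the five clones are enough to span the region, so only those three would be pooled for sequencing.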
"We essentially take BACs from a chromosome arm, pool those together, and do a combination of 454 Titanium long-read runs and paired-end runs and then we assemble that all together and we get a very nice chromosome arm with very high fidelity," Wing said.
In the Rice paper, which described a pilot project using the short arm of chromosome 3 of O. barthii, the researchers reported 2.2 errors for every 10 kilobases sequenced, which they said is "very close" to the standard of 1 error per 10 kilobases established for the Human Genome Project.
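To put the two error rates on a common scale, they can be converted to per-base accuracy and to Phred-style quality scores, where Q = -10 log10(error probability). The short calculation below is only an illustration of that arithmetic; the function name is hypothetical and the paper itself may not report its results this way.

```python
import math


def phred_quality(errors: float, bases: float) -> float:
    """Convert an error count over a number of bases into a Phred-scaled quality score."""
    return -10 * math.log10(errors / bases)


pilot_q = phred_quality(2.2, 10_000)     # pilot project: 2.2 errors per 10 kilobases
standard_q = phred_quality(1.0, 10_000)  # Human Genome Project standard: 1 error per 10 kilobases

pilot_accuracy = 1 - 2.2 / 10_000        # fraction of bases called correctly
standard_accuracy = 1 - 1.0 / 10_000

print(f"pilot:    Q{pilot_q:.1f}, {pilot_accuracy:.3%} accurate")
print(f"standard: Q{standard_q:.1f}, {standard_accuracy:.3%} accurate")
```

On this scale the pilot assembly comes out at roughly Q36.6 against Q40 for the standard, a gap of about 3.4 quality units.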
The authors acknowledge in the paper that the availability of high-quality physical map contigs, from which they could select contiguous pools of BAC clones, was a key factor that "contributed to the success of this strategy."
Indeed, 454 recently announced that it had sequenced the highly repetitive oil palm genome with a combination of shotgun and BAC pool sequencing, but Michael Egholm, 454's vice president of research and development, told In Sequence at the time that the project was "not cheap" and required very high coverage in order to generate the shotgun and BAC pool data (see In Sequence 05/19/09). That genome has not been published.
A Roche spokesperson told In Sequence via e-mail last week that the pooled BAC approach being used for the rice genome project "is very similar" to the approach used for the oil palm genome. "The primary difference is that the rice genome is substantially smaller than the oil palm genome, making it more manageable to sequence and assemble."
The spokesperson said that the company is seeing a "significant" number of projects that involve de novo sequencing of plant genomes and noted that the assembly challenges are "not always directly proportional to the size of the genome."
Wing and his colleagues also credited the Titanium's long read lengths — an average of 367 base pairs for the O. barthii project — as being "invaluable in producing a high-quality assembly — particularly when combined with paired-end reads from the GS FLX platform."
In the paper, the authors acknowledged the drawbacks that the Atlantic salmon genome sequencing consortium recently reported in "pooling as few as eight BAC clones from the salmon genome for sequencing with the 454 GS FLX platform" — challenges that ultimately led to the decision to use Sanger sequencing for that project (see In Sequence 06/23/2008).
Wing and his colleagues posited, however, that "the major factor in the difference between our contrasting experiences with BAC pooling is the increased read lengths we were able to obtain with the newer platform."
The Roche spokesperson said that the company "continue[s] to push the read lengths of the system up to 1,000 base pairs" and believes that "the combination of long high-quality reads and the appropriate mix of our long span paired-end reads will have a significant impact on alleviating the assembly problems posed by the presence of many types of smaller repeats and duplicated regions."
Wing noted that even though the strategy for generating reference genomes relies on the 454 technology, he and his colleagues have not ruled out other platforms for additional sequencing.
"That doesn't mean that we won't use Illumina to fill in some gaps," he said. "We're definitely going to use Illumina RNA-seq to help us with the annotation."
In addition, he said, "when we're done with glaberrima, we can then take a number of glaberrima accessions … and resequence all those using Illumina or something. But then it's all mapped to the correct reference genome."
Wing said that other members of the I-OMAP consortium "have grants pending" to use the pooled BAC approach to sequence other Oryza genomes, but the timing of the overall project "depends on funding."
The first priority, he said, is to generate reference genomes for the eight "AA" genome types, which are most similar to the cultivated rice species. "The crossing is much easier to do with these species, so that would be the primary gene pool that you could use to integrate into cultivated rice," he said.
Ultimately, the researchers hope to identify genes in wild species that could enable cultivated species to grow in environments with poor soil, drought, a lack of pesticides, and other conditions that are not amenable to current crop plants.
Wing said he expects to complete the O. glaberrima reference genome by the end of the year and that he's "optimistically hoping" that the eight AA reference genomes are completed within two years.