Rice is a staple crop for more than half the world’s population, so it’s understandable that more than one research group would want to sequence its genome. The April 5 issue of Science illustrated that with the publication of two papers on the draft sequences of two strains of rice.
Jun Yu and colleagues at the Beijing Genomics Institute, the University of Washington Genome Center, and 11 Chinese institutions detailed their draft sequence of the indica subspecies, the most common variety in China; while Stephen Goff and colleagues at Syngenta, Myriad Genetics, and several universities reported on the japonica subspecies, which is more popular in Japan.
These efforts add to projects already underway throughout the world under the auspices of the International Rice Genome Sequencing Project, as well as a rice sequencing project at Monsanto, but mark the first published results of the rice genome.
Back-to-back publication of two rice genomes in the same journal invites comparison, and for the most part the BGI and Syngenta papers were in agreement (see table, p. 5). “I think it’s remarkable that our two groups came to such similar conclusions when one considers how massive an analysis project this is,” Goff told BioInform.
The most glaring difference between the two groups’ findings was in their estimates of the total number of genes. BGI estimated between 46,000 and 55,615 genes, while Syngenta hit upon a more conservative 32,000-50,000. Each team attributed this discrepancy to one of the more persistent unsolved problems in bioinformatics — the inherent shortcomings of gene prediction programs.
Assembling the Bioinformatics Toolbox
Both BGI and Syngenta used a whole-genome shotgun approach, as opposed to the clone-by-clone tack of IRGSP and Monsanto, and diverged only in their selection of bioinformatics tools for assembly and gene prediction.
BGI developed its own assembly program, called RePS (Repeat-masked Phrap with Scaffolding), to identify and mask repetitive sequences and assemble the sequence reads into contigs and scaffolds. The highly repetitive nature of cereal genomes presented a particular hurdle in the assembly process, according to BGI researcher Gane Ka-Shu Wong, who said that approximately 42 percent of the indica genome is composed of exact repeats, but other cereals such as corn or wheat may be up to 90 percent repetitive. Traditional assembly tools like Phrap are unable to deal with such large numbers of repeats, Wong said. “Phrap doesn’t scale to the whole genome because it tries to examine all the repeats and then you wind up blowing out even the big supercomputers.”
RePS is based on Phrap, Wong explained, “then we use the clone end information like Celera described [in its paper] and we build on top of it.”
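The general idea of masking repeats before assembly can be sketched in a few lines of Python. This is only an illustration of the technique, not RePS itself: the k-mer length, the count threshold, and the function names below are assumptions chosen for the example, and the article does not describe how RePS actually detects repeats.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (illustrative, not RePS's method)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_repeats(read, counts, k, max_count):
    """Replace bases with 'N' wherever a k-mer occurs more than max_count times.

    max_count is a hypothetical threshold; a real pipeline would tune it
    to the sequencing depth.
    """
    masked = list(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > max_count:
            for j in range(i, i + k):
                masked[j] = "N"
    return "".join(masked)
```

With the reads "ACGTACGTTT", "ACGTGGGG", and "CCACGTCC", a k-mer length of 4, and a threshold of 2, only "ACGT" is over-represented, so the third read would be masked to "CCNNNNCC". An assembler that ignores masked bases when seeding overlaps avoids comparing every copy of a repeat against every other copy, which is the scaling problem Wong describes.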
Wong said that a paper on RePS has been accepted for publication in an upcoming issue of Genome Research. The software is freely available by e-mail from developer Jun Wang, but it was written to run on Sun equipment and has not been ported to other platforms.
Syngenta used a proprietary tool developed by Myriad to assemble its version of the genome. Goff said Syngenta also created some of its own tools to align random fragment sequence contigs to the physical map.
The Big Problem
But from a bioinformatics standpoint, assembly was a piece of cake compared to the gene prediction process, according to Wong. “There’s a lot of hype about assembly algorithms but it’s just not that challenging a problem,” he said.
Gene prediction was difficult because the compositional properties of the rice genome don’t mesh with current gene prediction programs, which assume that the codon distribution is consistent within each gene. “What happens in rice that’s unusual is that the statistical properties have a gradient. The 5’ end of the gene is completely different than the 3’ end. And this was never built into the model of any of the gene prediction programs,” Wong explained.
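One simple way to look for a gradient of the kind Wong describes is to split a coding sequence into windows from the 5’ end to the 3’ end and compare a statistic across them. The sketch below uses GC content as that statistic; this is an assumption made for illustration, since the article does not say which properties BGI measured, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_profile(cds, n_windows=4):
    """GC fraction in n_windows consecutive windows, ordered 5' to 3'."""
    if len(cds) < n_windows:
        raise ValueError("sequence is shorter than the number of windows")
    bounds = [i * len(cds) // n_windows for i in range(n_windows + 1)]
    return [gc_fraction(cds[bounds[i]:bounds[i + 1]]) for i in range(n_windows)]
```

For a toy sequence such as "GCGCGCGCATGCATGCATATATAT" split into three windows, the profile falls from 1.0 to 0.5 to 0.0. A gene predictor that models one fixed composition per gene would fit at most one end of such a sequence well.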
Both BGI and Syngenta drew from a medley of gene prediction programs to annotate the genome, including Fgenesh, Genemark, Genscan, GlimmerM, and Rice HMM. Noting in its paper that “no single gene-prediction program was found to be highly accurate,” Syngenta chose to rank its predictions based on the fraction of the length with homology to known genes, predicted genes from other species, Prosite motifs, or Pfam domains. Syngenta also broke down its three confidence levels by gene length, finding that more than 78 percent of low-confidence genes were shorter than 500 bp, but only 42 percent of medium-confidence genes and 28 percent of high-confidence genes were shorter than 500 bp.
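Ranking predictions by the fraction of their length supported by homology can be sketched as follows. The cutoffs, the function names, and the half-open interval convention are all assumptions made for illustration; the article does not give the thresholds Syngenta used.

```python
def covered_fraction(gene_length, hits):
    """Fraction of a predicted gene covered by at least one homology hit.

    hits is a list of (start, end) half-open intervals in gene coordinates;
    overlapping hits are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(hits):
        start = max(start, current_end, 0)
        end = min(end, gene_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / gene_length


def confidence(fraction, high=0.5, medium=0.2):
    """Assign a confidence level; the two cutoffs are hypothetical."""
    if fraction >= high:
        return "high"
    if fraction >= medium:
        return "medium"
    return "low"
```

A 400 bp prediction with hits at (0, 100) and (50, 200) is covered over half its length and would be labeled "high" under these made-up cutoffs, while a prediction with no hits would be labeled "low".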
“If you also add in size constraints, you can get a large number of genes by allowing them to be smaller and smaller. If you assume the average protein size is something like 30 kilodaltons, that’s going to take a gene size of 1-1.5 kbp. So the larger the number you predict, the smaller the average,” said Goff. Syngenta didn’t estimate the average gene size in its paper, but Goff said it may range from 4 kbp to 5 kbp, “or even larger.”
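Goff’s back-of-the-envelope figure can be checked with a short calculation. The sketch below assumes an average amino-acid residue mass of about 110 daltons, a common rule of thumb that does not come from the article, and counts three bases per residue.

```python
AVG_RESIDUE_MASS_DA = 110.0  # assumed average mass of one amino-acid residue
BASES_PER_CODON = 3


def coding_length_bp(protein_kda):
    """Approximate coding-sequence length for a protein of the given mass."""
    residues = protein_kda * 1000 / AVG_RESIDUE_MASS_DA
    return residues * BASES_PER_CODON
```

Under that assumption, a 30-kilodalton protein corresponds to roughly 270 residues and a little over 800 bp of coding sequence. That sits just below the 1-1.5 kbp Goff cites, which would be reached once some non-coding sequence is counted as part of the gene.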
BGI, meanwhile, evaluated each of the available prediction programs and settled on Fgenesh as the most effective tool for its annotation. Other programs “are either very good at the 5’ end or very good at the 3’ end,” making them fine for genes with consistent properties, but unable to cope with the variable character of rice genes, Wong said. “If the statistical properties of the genes vary from the 3’ to the 5’ end, depending on the nature of the algorithm it could either focus on doing a good job on the 3’ end or the 5’ end. … What’s interesting about Fgenesh is it does really well for both ends.”
Of course, Fgenesh presented its own problems. “It only does really well for half the gene,” Wong said. “So it’s sort of like squeezing a balloon: You can either do really well at the 5’ end and screw up the 3’ ends of all the genes; you can do the reverse; or you could do well at both ends for half the genes and screw up the other half, and that’s Fgenesh.”
Taking these caveats into account, Wong said the BGI and Syngenta papers are in “surprisingly good agreement,” particularly in their comparisons to the Arabidopsis thaliana genome. Both projects found that more than 80 percent of annotated genes in A. thaliana are also found in rice. Each group also agreed that almost half of the genes in the rice genome are not found in A. thaliana or any other known genome.
While BGI took a stab at estimating an average gene size of 4.5 kbp, “we were referring to the combined size of all the exons and introns,” said Wong. Syngenta’s range of 300 bp to 1 kbp referred only to the software-predicted genes, which don’t include introns, Goff said. “It may seem like a big difference, but it’s due to the various uses of the word gene rather than a difference in the actual results, analysis, or beliefs,” he added.