Candymaker Mars and the United States Department of Agriculture have enlisted IBM Research’s pattern-recognition experts to help analyze the genome of the cocoa tree, Theobroma cacao L.
Isidore Rigoutsos, who manages the bioinformatics and pattern-discovery group at IBM Research’s Thomas J. Watson Research Center, told BioInform that the project partners plan to sequence many different cocoa cultivars in an effort to identify the genetic basis of desirable properties, such as flavor or disease resistance.
Given the anticipated complexity of the data analysis, the project is expected to take up to five years.
“What makes the problem interesting in the computational and the biological sense [is that] you might be looking at thousands to tens of thousands of plants and thousands of properties,” Rigoutsos said, noting that when the numbers reach those heights, the association discovery problem becomes “very demanding.”
However, he added, “we have developed technologies that allow us to do these kinds of things very quickly.” Since the mid-1990s, his group has been developing pattern-discovery algorithms to analyze a range of molecular biology data.
He said that the cocoa project fits a broader “portfolio” of projects that IBM’s computational biology research group has taken on because of their societal impact. In the case of cocoa, the genome sequence is expected to enable farmers to plant better-quality crops that deliver higher yields and better withstand pests and disease.
In a joint statement, the project partners said the goal is to use genomic analysis to “help eliminate some of the guesswork of traditional breeding.”
In addition, the partners said that the initiative could help protect the economy in Africa, where 70 percent of the world’s cocoa is produced. The partners plan to make their results freely available through the Public Intellectual Property Resource for Agriculture, a non-profit humanitarian group that supports agricultural research.
While the USDA’s Agricultural Research Service has collaborated with Mars in the past, and IBM and Mars have also worked together, the cocoa genome effort marks the first time that all three parties are collaborating on the same project.
Mars is backing the project with $10 million, and the sequence information will be posted on the PIPRA website.
Informatics for Traditional Plant Breeding
Rigoutsos said that his role in the project will be to make sense of the data that his USDA colleagues generate, and then to return that information to the field to help guide the next research steps.
He explained that the project will study many different cultivars, or plant groups with distinct properties, such as differing numbers of beans, flavors, or resistance to a particular fungal or bacterial pest. It has not yet been established how many cultivars will be sequenced.
“What you want to do is figure out what subset of plants shares what subset of properties and use that to drive traditional plant breeding, by telling the breeders, say, ‘Plant 25, plant 310, and plant 715 are the ones you should be focusing on for the next breeding iteration.’”
One plant will be used as a reference sequence and then the team will identify specific loci in cultivars linked to properties of interest in an attempt to link genetic sequences to the observed qualities. “For that we go back to association discovery and we are familiar with that,” he said.
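As a rough illustration of the question Rigoutsos describes, namely which subset of plants shares which subset of properties, the following minimal sketch enumerates trait combinations and lists the plants that exhibit all of them. It is not the IBM group's method; the function name `shared_property_groups`, the plant labels, and the trait names are hypothetical and chosen only for this example, and a brute-force enumeration like this would not scale to the thousands of plants and properties the project anticipates.

```python
from itertools import combinations


def shared_property_groups(plant_traits, subset_size=2, min_plants=2):
    """Map each trait combination to the plants that exhibit all of its traits.

    plant_traits: dict of plant label -> set of observed traits (hypothetical data).
    Only combinations shared by at least `min_plants` plants are kept.
    """
    all_traits = sorted(set().union(*plant_traits.values()))
    groups = {}
    for combo in combinations(all_traits, subset_size):
        wanted = set(combo)
        # A plant belongs to the group if it has every trait in the combination.
        members = sorted(p for p, traits in plant_traits.items() if wanted <= traits)
        if len(members) >= min_plants:
            groups[combo] = members
    return groups


# Toy, made-up observations; real field data would be far larger.
plant_traits = {
    "plant_25": {"high_yield", "fungus_resistant", "mild_flavor"},
    "plant_310": {"high_yield", "fungus_resistant"},
    "plant_715": {"high_yield", "fungus_resistant", "strong_flavor"},
    "plant_42": {"mild_flavor"},
}

groups = shared_property_groups(plant_traits)
for traits, plants in groups.items():
    print(traits, "->", plants)
```

In this toy data set, the only pair of traits shared by at least two plants is high yield together with fungus resistance, which singles out the three plants a breeder might carry into the next iteration.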
Goals of interest might include increasing the yield per unit of farmed land, which would benefit farmers financially, or making the plant resistant to the pests that threaten it, Rigoutsos said. While people may first think of genetic engineering in this context, “this is not one of those cases,” he said.
“We basically want to use traditional plant breeding, but guide it through the computational analysis that will be taking place.”
Rigoutsos stressed that the logistics of the project have not been finalized. He said he expects it will rely at least in part on second-generation sequencing technologies, but did not provide further details.
USDA officials could not be reached for comment on the experimental aspects of the project.
Data will be sent to the IBM group on an ongoing basis from the field and from the USDA’s lab in Miami, and the analysis will propel the project and breeding forward with new questions, Rigoutsos said.
The Theobroma genome is small compared to those of other plants, at approximately 400 million base pairs. By comparison, the corn genome is about the same size as the human genome, at around 3 billion base pairs, and the wheat genome is approximately 16 billion base pairs. Rigoutsos noted, however, that it is unwise to associate simplicity or complexity with genome size. “These are not always related,” he said.
“I have spent a little over five years of my life analyzing the human herpes virus 5, which is only 230,000 bases,” he said, “and we still don’t understand how that genome works.” Likewise, he added, HIV has fewer than 10,000 bases, yet it still poses serious problems more than two decades after it was discovered.
While the Theobroma cacao L. genome may be smaller than the human genome, the plant is a complex organism, so it is hard to forecast how the analysis of the genome will unfold. “It wouldn’t be prudent to underestimate the difficulty,” he said.
Rigoutsos’ group has recently been exploring emerging ideas in gene regulation that hint at some of the complexity the project may encounter as it tackles the cocoa genome.
For example, in a study that was published in May in Nucleic Acids Research, Rigoutsos and his colleagues compared mouse and human genomic data to find that intronic regions play a much more functionally active role than some researchers had previously assumed.
This work foreshadows “the difficulties we will face as we move forward with the analysis of biological data,” he said.
“Intronic sequences from human and mouse are linked to the same functions in the absence of sequence conservation,” he said, noting that introns, peculiar and long ignored, appear to be linked to organism-specific regulatory functions in ways not previously associated with these genomic regions. “Basically you have different sequences doing the same thing,” a finding that runs counter to the conventional wisdom that functional motifs are conserved across orthologous sequences.
The NAR study built on a 2006 PNAS paper in which the IBM team used an automated pattern-discovery method to look at intergenic and intronic regions and found “unique functional connections between coding and noncoding parts of the human genome.”
In that study, the IBM researchers used Rigoutsos’ Teiresias pattern-discovery algorithm to identify motifs they dubbed “pyknons,” which are found in both the untranslated and coding regions of genes. “They are the first glimpse of organism-specific regulatory motifs,” he said. “We suspect they regulate genes, but they are not genes themselves, they are short RNAs.”
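To give a flavor of what pattern discovery across noncoding and coding sequence involves, the sketch below counts fixed-length substrings (k-mers) that recur in a set of noncoding sequences and also occur in coding sequences. This is a deliberate simplification and not the Teiresias algorithm, which the article does not describe in detail; the function names, the sequences, and the parameters `k` and `min_count` are all assumptions made up for this example.

```python
from collections import Counter


def recurring_kmers(sequences, k, min_count):
    """Return the set of k-mers seen at least `min_count` times across `sequences`."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


def shared_motifs(noncoding, coding, k=4, min_count=2):
    """K-mers that recur in noncoding sequences and also appear in coding sequences."""
    candidates = recurring_kmers(noncoding, k, min_count)
    coding_kmers = recurring_kmers(coding, k, 1)
    return sorted(candidates & coding_kmers)


# Made-up toy sequences, for illustration only.
noncoding = ["ACGTACGTTT", "GGACGTAA"]
coding = ["TTACGTCC", "CCCCGGGG"]

motifs = shared_motifs(noncoding, coding)
print(motifs)
```

Note that nothing here compares one genome against another: the candidates come from repetition within a single organism's sequence, which is the general idea behind looking for motifs without relying on cross-species conservation.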
The underlying story of this work, he said, is that “much of the regulation we try to understand is driven by organism-specific motifs, and therefore having results from one organism doesn’t necessarily mean that we have results we can apply to a different organism.”
The positive news, he said, is that “we have ways of getting to the motifs nonetheless, even without comparative analysis,” essentially, he explained, using a “different kind of flashlight” to explore the genome.
This prior work “gives glimpses into a reality where cross-genome conservation is a limiting step and suggests that basically the complexity of the problems we have been trying to address may be substantially more complicated,” he said.
For the cocoa project, he said, there is not one particular tool that will be brought to bear, but rather “a general approach of doing things.”
Pattern discovery “liberates us in many ways,” he said. It also eliminates the need for cross-genome comparisons based on sequence conservation because these can “prevent you from seeing things.”
“Several years ago we decided we are going to attempt to tackle these kinds of questions without doing sequence conservation as a prerequisite,” he said.
“The protein-centric view of cell-process regulation is about to take back seat to RNA-driven regulation,” he predicted.