Some projects just won’t stay small: Pleased with the success of a comparative FlyBase-human genome project that revealed several novel genes, researchers at Organon wanted to “extend” the approach to other organisms, according to Ton Rullmann, project manager for bioinformatics at the Dutch pharmaceutical firm. So the company, together with Gene-IT, assembled a diverse team of collaborators, including the European Bioinformatics Institute, the Netherlands Organization for Scientific Research (NWO), and the University of Nijmegen, to tackle the task.
Now, 82 organisms, 48 billion comparisons, and 520,000 CPU-hours later, the project is complete. Organon and its collaborators will unveil the resulting data set, called Protein World, at the Genomics Momentum 2002 conference in The Hague on Dec. 4.
The all-against-all comparison of 82 proteome sequences in the Swiss-Prot/Trembl database is the most comprehensive comparative genomics project ever undertaken, according to its participants. The resulting data set maps conserved regions across nine eukaryotes, 15 archaea, and 58 bacterial species, serving as an “orthology index” for Swiss-Prot and Trembl, Rullmann said.
So, how did they do it? A 1,024-CPU SGI machine at the SARA computing center in Amsterdam ran Gene-IT’s parallelized Biofacet comparative genomics software for three months, crunching through approximately 70 million alignments. Biofacet adds a proprietary “Z-score” method to the Smith-Waterman alignment algorithm to assess the statistical significance of hits, resulting in clusters with a “high degree of biological relevance,” according to Ron Ranauro, executive vice president at Gene-IT.
In July, the company completed a similar comparison of 70 genomes in collaboration with French sequencing center Infobiogen and the Atomic Energy Commission (CEA). The results of that comparison, called Teraprot, are available through Infobiogen at www.infobiogen.fr/services/Teraprot/.
For the Protein World project, Gene-IT donated use of its software and managed the workflow and computation at SARA.
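For readers unfamiliar with the approach mentioned above, the following is a minimal sketch of a Smith-Waterman local alignment score combined with a shuffle-based Z-score. Biofacet's Z-score method is proprietary and is not described in this article, so the sketch assumes a generic Monte Carlo version: one sequence is shuffled repeatedly, and the real score is compared against the distribution of scores from the shuffled copies. The function names, the scoring values (match, mismatch, gap), and the number of shuffles are all illustrative assumptions, not Gene-IT's parameters.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a simple match/mismatch scheme with a linear gap penalty
    (assumed values, chosen only for illustration).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_score(a, b, shuffles=100, seed=0):
    """Estimate how far the real score sits above shuffled-sequence scores.

    Generic Monte Carlo Z-score (an assumption, not Biofacet's method):
    shuffle b, rescore, and standardize the observed score.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(smith_waterman_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    stdev = statistics.pstdev(null_scores)
    if stdev == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / stdev
```

Shuffling preserves the length and residue composition of a sequence, so a high Z-score suggests that a hit reflects the order of the residues rather than composition alone.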
How to Get to Protein World
The data will be made available to the scientific community in several different formats. EBI will use the raw results as part of its Temblor project, which is building an integration layer called Integr8 to link biological data from several sources. Rullmann said that EBI would also include the data in its CluSTr database, which classifies protein families in Swiss-Prot and Trembl.
In addition, the data will be accessible through BioASP, a new organization created by the Dutch government to support bioinformatics in the Netherlands (see sidebar, below). Researchers will be able to access the data via BioASP’s portal, www.bioasp.nl, but terms of access for commercial users have yet to be finalized, said Jan Willem Tellegen, general manager of BioASP.
Ranauro said that Gene-IT also plans to package the Protein World data in the form of a service offering that will combine the data with customers’ in-house data sources. The advantage of this approach, Ranauro said, is that once new sequences become available, the company can easily add them to the collection. In addition, he said, Gene-IT can work with clients to “customize the way the data is used.” While the raw Protein World database offers a comprehensive view of the entire pairwise comparison of all 82 organisms, researchers will likely want to focus on only a few select comparisons, Ranauro said, and that’s where Gene-IT comes in: “A researcher might want to compare human against mouse, chimp against rat, human against rat, and chimp against mouse. The [raw] data will give you any two of those, but Biofacet will let you very quickly look at all four and determine what are the orthologous genes.”
This is where Organon is heading: The company has used Gene-IT’s software since 1999 and will now use it to focus in more detail on comparisons of particular organisms, such as human against mouse, Rullmann said. By initiating the project, Organon served as the “linking point” between the academic and commercial partners and the computer center, a role that Rullmann said was well worth the effort: The result is a useful source of raw data for the scientific community that will deliver further benefits for Organon upon downstream analysis. “We can mine the data ourselves,” he said. “Any conclusions we draw we can keep to ourselves.”
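One widely used way to pull putative orthologs out of pairwise comparison data of this kind is the reciprocal best hit test: two proteins are paired only if each is the other's top-scoring match. The article does not say how Biofacet or Organon identify orthologs, so the sketch below is an assumption about a common technique, and the protein identifiers and scores in the usage example are invented toy values.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of (query, subject) -> alignment score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy data: hypothetical identifiers and scores, for illustration only.
human_vs_mouse = {
    ("hsA", "mmA"): 250,
    ("hsA", "mmB"): 90,
    ("hsB", "mmB"): 180,
    ("hsC", "mmB"): 120,
}
mouse_vs_human = {
    ("mmA", "hsA"): 245,
    ("mmB", "hsB"): 175,
    ("mmB", "hsC"): 118,
}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

In the toy data, hsC's best mouse hit is mmB, but mmB's best human hit is hsB, so hsC is left unpaired. Extending the test to the four-way comparison Ranauro describes would mean repeating it for each pair of organisms.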
The raw results file from Biofacet takes up “a couple of gigabytes” using Gene-IT’s proprietary database technology, Rullmann said, adding, “what you build from that depends on how you want to use the data.” Choices about annotations, alignments, thresholds, and the number of organisms can drastically change the size of the resulting data set.
With the comparison of 82 proteomes readily on hand, Rullmann and his team at Organon ought to have enough data to keep them occupied for a while, but apparently, that’s not the case. What’s still missing? “It would help to have phenotypic information about these organisms, as we did with the FlyBase approach,” Rullmann said.
Anybody up for annotating 82 organisms?