Sander and Friends Establish ‘Cashew Prize’ to Encourage Deposition of Knowledge
In an effort to stimulate development of software tools that will encourage authors of scientific papers to deposit biological knowledge into public repositories, Chris Sander of Memorial Sloan Kettering Cancer Center and a number of other bioinformatics scientists have created the “Cashew Prize.”
The prize is named for the nut, which is native to Northeastern Brazil and ubiquitous in Fortaleza, where this year’s Intelligent Systems for Molecular Biology conference was held.
Sander said during a session at ISMB that a jury of scientists will award the prize to a prototype software package that will “aid authors in submitting facts in computable form.” Sander defined facts as “A activates B when C,” or similar strings of biological entities and relationships.
The goal, he said, is to enable authors to deposit knowledge about pathways and other biological mechanisms in a manner similar to the way sequence data is deposited in GenBank. Software to enable this process should go a long way toward encouraging the practice, he said.
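As a rough illustration of what a fact “in computable form” might look like, the sketch below encodes a statement of the kind Sander described, “A activates B when C,” as a small record that can be serialized for deposition. The `Fact` class, its field names, and the JSON layout are assumptions made for this example; the prize announcement does not specify any format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical record for one statement such as 'A activates B when C'."""
    subject: str                      # the acting entity ("A")
    relation: str                     # the relationship ("activates")
    obj: str                          # the entity acted upon ("B")
    condition: Optional[str] = None   # optional context ("C")

    def to_sentence(self) -> str:
        # Rebuild the human-readable form from the structured fields.
        base = f"{self.subject} {self.relation} {self.obj}"
        return f"{base} when {self.condition}" if self.condition else base


def to_json(fact: Fact) -> str:
    # Serialize with sorted keys so identical facts yield identical text.
    return json.dumps(asdict(fact), sort_keys=True)


fact = Fact("A", "activates", "B", condition="C")
print(fact.to_sentence())
print(to_json(fact))
```

Keeping the entities and the relationship in separate fields, rather than as free text, is what would let a repository index and query such statements.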
First prize will be one pound of cashews and $1,500; second prize is two pounds of cashews and $1,000; and third prize is three pounds of cashews and $500. All software should be available under an open source license.
Sander will chair the effort and the jury includes Amos Bairoch of the Swiss Institute of Bioinformatics, Dietrich Rebholz-Schuhmann of the European Bioinformatics Institute, and John Moult of the University of Maryland Biotechnology Institute.
Sander said that the initiative will be formally announced on Oct. 1. Submissions will be due in April 2007, and a report on the results will be presented at ISMB 2007, in Vienna, Austria.
Synamatix Collaborates with Wash U, Other Genome Centers on Next-Gen Genome Assembly
Synamatix is partnering with the Genome Sequencing Center at Washington University and two other undisclosed US genome centers to develop a number of new applications for both Sanger and next-generation sequencing technology.
Arif Anwar, vice president of Synamatix, said that the firm, based in Kuala Lumpur, Malaysia, is working with these centers to build a suite of new applications that run on SynaBase, the company’s flagship system for storing large amounts of genomic data based on patterns rather than files.
In the last 12 months, Anwar said, Synamatix has developed SynaMer, which can quickly find overlapping regions in sequence fragments of 100-mers or more; SynaSearch, which maps reads from a newly sequenced genome to the human genome 219 times faster than BlastZ; SXProbe, which was initially designed for checking the placement of microarray probes, but has applications in mapping reads from Solexa’s sequencer onto a reference genome; and SXPat, a system for identifying overrepresented patterns in genome data to help identify contaminated sequences.
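Synamatix did not describe how its tools work internally. For readers unfamiliar with overlap detection in general, the following is a minimal sketch of one common textbook approach: index every k-mer in each read, then report pairs of reads that share at least one k-mer. The function names, read names, and sequences are invented for this example and are not taken from SynaMer or SynaBase.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_pairs(reads, k):
    """Map each pair of read names to the k-mers the two reads share.

    reads: dict of read name -> sequence string (hypothetical input format).
    """
    # Build an index from each k-mer to the reads that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    # Any k-mer seen in two or more reads marks a candidate overlap.
    pairs = defaultdict(set)
    for kmer, names in index.items():
        for a, b in combinations(sorted(names), 2):
            pairs[(a, b)].add(kmer)
    return dict(pairs)


# Toy reads, far shorter than the 100-mers mentioned in the article.
reads = {"r1": "ACGTACGGA", "r2": "CGGATTTC", "r3": "GGGGGGGG"}
print(shared_kmer_pairs(reads, k=4))
```

A production tool would also need to verify and extend each candidate overlap; this sketch only finds the candidates.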
The company is also co-developing a de novo assembler for 454 sequence data in collaboration with Wash U, Anwar said.
The new products will be released in several months.
Next Release of SciTegic’s Pipeline Pilot to Integrate GCG, Lucidyx Tools
The next release of SciTegic’s Pipeline Pilot, scheduled for this fall, will include a number of new modules of interest to the bioinformatics community, according to Scott Markel, senior bioinformatics architect at the Accelrys subsidiary.
Markel said that SciTegic has collaborated with Lucidyx of Cleveland to create the BioMining module for Pipeline Pilot, which will add data-integration capabilities to the platform’s application-integration features. The module will provide access to around 50 data resources, Markel said.
The Lucidyx technology uses a gene ID as the “point of federation” for different resources, Markel said, and relies on an indexing scheme that provides rapid access between data sources.
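The idea of using a gene ID as the “point of federation” can be sketched as a simple index keyed by that ID, so that one lookup returns the matching records from every source. This is an illustration under stated assumptions only: the source names, gene IDs, and record contents below are placeholders, and the code does not reflect the actual Lucidyx indexing scheme, which Markel did not detail.

```python
from collections import defaultdict


def federate(sources):
    """Merge several data sources into one index keyed by gene ID.

    sources: dict of source name -> {gene_id: record} (hypothetical layout).
    Returns: dict of gene_id -> {source name: record}.
    """
    index = defaultdict(dict)
    for source_name, records in sources.items():
        for gene_id, record in records.items():
            index[gene_id][source_name] = record
    return dict(index)


# Placeholder sources and IDs, invented for this example.
sources = {
    "expression": {"GENE1": {"level": "high"}, "GENE2": {"level": "low"}},
    "pathways": {"GENE1": {"pathway": "P1"}},
}
index = federate(sources)
print(index["GENE1"])
```

Building the index once up front means that moving from one resource to another for a given gene is a dictionary lookup rather than a fresh search of each source.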
The module is available through the BioMining Server, an in-house system that is updated nightly and can be integrated with proprietary data sources, or as an offsite data service, which does not provide access to in-house resources.
In addition, Markel said that the Accelrys bioinformatics R&D team in Bangalore, India, is currently wrapping the GCG package so that it can be implemented in Pipeline Pilot.
The sequence analysis module for Pipeline Pilot currently includes around 100 utilities based on a number of open source and publicly available tools. Markel said that SciTegic is considering several new capabilities for long-term development, including gene expression, proteomics, pathway analysis, and chemogenomics.
IBM’s Genographic Project Results in New Clustering Method
Researchers at IBM have developed a new classification method with applications in cancer diagnostics as part of the Genographic Project. IBM kicked off the initiative last year in collaboration with National Geographic, with the goal of collecting more than 100,000 DNA samples from indigenous populations around the world in order to map global human migratory history [BioInform 04-18-05].
Gyan Bhanot, a research staff member at IBM’s TJ Watson Research Center, told BioInform that the company has developed an approach called unsupervised consensus ensemble clustering that is able to classify large data sets into very distinct categories.
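IBM has not yet published the details of its method. As general background, consensus clustering is often explained with a co-association matrix: run several clusterings, record how often each pair of samples lands in the same cluster, and group the samples that agree in most runs. The sketch below follows that generic recipe; the function names, the 0.5 threshold, and the toy labelings are assumptions for illustration and should not be read as IBM's algorithm.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of runs in which each pair of samples shares a cluster.

    labelings: list of label lists, one per clustering run, all of equal
    length. The diagonal is left at 0.0 because only pairs are compared.
    """
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Group samples that co-cluster in more than `threshold` of the runs."""
    agree = co_association(labelings)
    n = len(agree)
    parent = list(range(n))  # union-find forest, one node per sample

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if agree[i][j] > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three toy runs over five samples; label names differ between runs.
labelings = [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]]
print(consensus_clusters(labelings))
```

Because only pairwise agreement is counted, the arbitrary label names used by each run do not matter, which is what allows results from different runs to be combined.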
The method was initially developed to build a phylogenetic tree using data gathered from the Genographic Project, and was able to improve upon the phylogenetic tree that is currently used, which has misclassified two Eurasian groups, Bhanot said. The IBM clustering algorithm was able to analyze all the available data to reclassify the two groups in a “more geographically appropriate location,” he said.
The approach also indicated that the human population that migrated from Africa to Australia did so before the population that migrated to Eurasia and Europe.
The IBM researchers are currently preparing a paper on their findings that they plan to submit for publication.
The team has also successfully applied the method to gene expression data from breast cancer patients, Bhanot said. The approach was able to distinguish seven distinct subtypes in a set of normal and diseased samples: one normal cluster, two separate “low-grade” clusters with a good prognosis, and four “high-grade” clusters of patients with cancers that would be more difficult to treat.
In a validation test using a data set of samples from patients who had undergone a 150-month follow-up after treatment, the predicted subtypes matched the patient survival rates with a high degree of agreement, Bhanot said.
IBM is also validating the approach in a collaboration with Yale University, he added.
PubGene Enhances Public Online Service
PubGene has released a beta version of the free online implementation of its literature-searching service (http://www.pubgene.org/).
The PubGene service uses text mining to automatically identify networks of genes that are mentioned together in the scientific literature. The resource currently includes hundreds of thousands of cross-references.
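One simple way to build a literature network of this kind is to count how often two gene symbols appear in the same document and treat each count as a weighted edge. The sketch below shows that idea with plain token matching; it is a simplified illustration, not PubGene's implementation, and the gene symbols and document texts are invented placeholders.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, gene_symbols):
    """Count documents in which each pair of gene symbols appears together.

    documents: iterable of text strings (for example, abstracts).
    gene_symbols: iterable of symbols to look for (exact, case-sensitive).
    Returns a Counter mapping (symbol_a, symbol_b) -> number of documents.
    """
    symbols = set(gene_symbols)
    edges = Counter()
    for text in documents:
        # Split on anything that is not a letter or digit.
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        found = sorted(symbols & tokens)
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return edges


# Placeholder texts and symbols, invented for this example.
documents = [
    "GENE1 interacts with GENE2 in yeast.",
    "GENE2 and GENE3 are co-expressed; GENE1 is not.",
    "No genes are named here.",
]
network = cooccurrence_network(documents, ["GENE1", "GENE2", "GENE3"])
print(network.most_common())
```

Real gene names are far messier than these placeholders (synonyms, symbols that double as ordinary words), so a practical system needs a curated dictionary and disambiguation on top of the counting step.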
John Erik Stacy, applications specialist at PubGene, said that the free service offers access to the same number of references as the subscription service, but includes a limited number of analytical tools. Users of the free service will get the same number of hits per query as paying customers, but will not be able to analyze the results as easily.
The new service also includes a “smart” Google search that limits the results for a search to scientific references.