Microsoft Research is developing an open source toolkit that will serve as a "standard library" for bioinformatics tools that developers within Microsoft and elsewhere can build upon.
At the Association for Biomolecular Resource Facilities conference held this week in Sacramento, Calif., Simon Mercer, director of health and well-being external research at Microsoft, discussed the toolkit, called the Microsoft Biology Foundation. It includes parsers for common bioinformatics file formats; various algorithms for manipulating DNA, RNA, and protein sequences; and a set of connectors to key bioinformatics web services such as the National Center for Biotechnology Information's Blast.
A beta release of MBF is currently available on Microsoft's CodePlex site. Mercer said that a version one release is scheduled for the summer.
The software is released under the Microsoft Public License, which grants users a non-exclusive, worldwide, royalty-free license to reproduce the software and to prepare and distribute derivative works.
While Microsoft is seeding MBF with its own tools, the company's intention is for the framework to ultimately become a "community-curated" resource, Mercer said. Microsoft will remain a contributor to the project, but will "not be the only face," he said.
The company has already lined up around six "large code contributions" and is also in the process of setting up a technical advisory board for MBF that will include developers within Microsoft as well as external members, Mercer said.
Collaborators to date include the Computational Biology Service Unit at Cornell University, Queensland University, the University of Virginia, and the University of Texas at Austin. Mercer told BioInform after his talk that Microsoft has also signed on several commercial partners, but that he did not have permission to disclose their names.
At a separate talk at the conference, Jaroslaw Pillardy of the Cornell CBSU said that his team plans to contribute its BioHPC package to MBF. BioHPC is a web-based system that provides users with access to 37 bioinformatics applications that run on CBSU's clusters. It is also available as downloadable open-source software that researchers can install on their own IT systems.
Mercer said that the impetus for MBF grew out of the desire to collect a number of disparate projects underway within Microsoft Research that fall within the life sciences and healthcare umbrella. For example, researchers at the company have developed a suite of computational biology tools for phylogeny-based association analysis, epitope prediction, and human leukocyte antigen analysis.
Microsoft has also been collaborating with Phil Bourne of the University of California, San Diego, on an ontology add-in for Word 2007 that is targeted at the biological research community. The add-in includes controlled vocabularies from the National Center for Biomedical Ontology and identifiers for GenBank, the Protein Data Bank, UniProt, and other databases. It automatically associates text in a Word document with the appropriate ontology term in order to mark up journal articles for improved data-mining, Mercer said.
The company has also been developing a drag-and-drop workflow tool called the Trident Scientific Workflow Workbench, which was initially designed for oceanography but has been modified for bioinformatics tools. It has also developed the Research Information Center, a scientific front end for SharePoint that was built for the British Library and serves as a general-purpose scientific project-management tool.
Mercer said that all of these tools have been built separately, which has led developers to "reinvent the wheel" several times because there has been no "standard library" of shared components to draw from. MBF is intended to address that issue by serving as a standardized bioinformatics toolkit that developers within Microsoft and elsewhere can build upon.
Initial areas of focus for MBF will be sequence analysis and annotation, phylogenetics, genome-wide association studies, and haplotype reconstruction. Mercer said that, in the longer term, MBF will also include more visualization tools.
In addition, Microsoft has developed an add-in for Excel that allows researchers to add MBF functionality to the spreadsheet program. The add-in enables users to perform all their genomic analyses directly within Excel without having to move data in and out of the application.
All MBF functions will be enabled to run on Azure, Microsoft's cloud-computing architecture, Mercer said.
Mercer said that all the MBF tools will be open source and freely available to commercial and non-commercial users.
"We're building these tools to show the scientific community that Microsoft can work in this space," he said.
Microsoft's commercial arm has also been targeting the life science community through its Amalga Life Sciences product. The company acquired Rosetta Biosoftware last year with the aim of incorporating its genetic, genomic, metabolomic, and proteomics data-management software into the Amalga platform, but has been fairly quiet since then about its plans for the product.
Mercer said that there will likely be some overlap between the capabilities in MBF and those available in Amalga Life Sciences, but he noted that the two systems are targeting different ends of the life-science informatics spectrum.
MBF is modeled after academic tools, which are usually more "cutting edge" and "agile" than commercial tools because they are developed quickly to meet the demands of rapidly changing experimental environments, he said. On the other hand, they tend to lack the stability, scalability, documentation, and support that commercial tools like Amalga offer.
"The two ends have to meet," Mercer said, which is where the company sees MBF fitting in. The platform has the same "philosophy" as academic bioinformatics tools, but has been developed along very stringent guidelines to ensure a high level of stability and documentation, he said.
At some point, Microsoft might build a "bridge" between MBF and Amalga, but Mercer said that the company is waiting to see if there is demand for that capability from its customers before deciding to take that step.