Machine-learning approaches are gaining ground in the bioinformatics world — a fact that microprocessor giant Intel is hoping to turn into an opportunity by developing a new machine-learning software package called OpenML.
At the annual Neural Information Processing Systems conference (NIPS 2003) in Vancouver in early December, Intel released a beta version of the first OpenML component, called OpenPNL (Open-Source Probabilistic Network Library), under a BSD-style license. The software, written in C and C++ and freely available at http://www.intel.com/research/mrl/pnl/, adds to similar libraries the company previously released for computer vision and speech recognition.
Later this year, according to Gary Bradski, manager of Intel’s machine learning group, the company will release OpenSL, which will include statistical machine-learning features such as decision trees, support vector machines, K-means, and other clustering and classification approaches.
OpenPNL should be of particular interest to the bioinformatics community because it supports graphical modeling using Bayesian networks and other probabilistic networks that are commonly used in genomic sequence analysis, gene prediction, motif finding, and other bioinformatics tasks. Intel has already taken steps to optimize the library for genomics in a collaboration with Nir Friedman at Hebrew University in Israel and researchers from the University of Delaware and Tsinghua University in China. At NIPS 2003, the collaborators demonstrated features of the genomics application, among other application areas, using a “mini cluster” of five laptop computers, Bradski said.
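For readers unfamiliar with the technique, the sketch below shows what inference in a very small Bayesian network looks like. It is plain Python and does not use OpenPNL or reflect its API; the three-node network (a transcription factor influencing two genes), the variable names, and every probability are hypothetical values chosen only for illustration.

```python
from itertools import product

# Hypothetical network: TF -> GeneA, TF -> GeneB.
# All probabilities are made-up illustration values, not measured data.
P_TF = {True: 0.3, False: 0.7}            # P(TF active)
P_A_GIVEN_TF = {True: 0.9, False: 0.2}    # P(GeneA expressed | TF state)
P_B_GIVEN_TF = {True: 0.8, False: 0.1}    # P(GeneB expressed | TF state)


def joint(tf, a, b):
    """Joint probability of one full assignment, factored along the graph."""
    pa = P_A_GIVEN_TF[tf] if a else 1 - P_A_GIVEN_TF[tf]
    pb = P_B_GIVEN_TF[tf] if b else 1 - P_B_GIVEN_TF[tf]
    return P_TF[tf] * pa * pb


def posterior_tf(evidence):
    """P(TF active | evidence) by brute-force enumeration.

    `evidence` maps 'a' and/or 'b' to an observed True/False value.
    """
    weights = {True: 0.0, False: 0.0}
    for tf, a, b in product([True, False], repeat=3):
        if "a" in evidence and evidence["a"] != a:
            continue
        if "b" in evidence and evidence["b"] != b:
            continue
        weights[tf] += joint(tf, a, b)
    return weights[True] / (weights[True] + weights[False])


if __name__ == "__main__":
    print(posterior_tf({}))                       # prior belief
    print(posterior_tf({"a": True}))              # after seeing GeneA expressed
    print(posterior_tf({"a": True, "b": True}))   # after seeing both expressed
```

Enumeration of this kind grows exponentially with the number of variables, which is one reason general-purpose libraries rely on more efficient inference and learning algorithms for networks of realistic size.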
The genomics project, which is using structured learning to predict control genes in the genome, has been underway for about a year, Bradski said. “It’s not folded into the library yet, but it will be,” he added.
So why is a chip-maker releasing software at all? According to Bradski, Intel is hoping the release of a “standard” library will help speed the adoption of machine learning, which currently requires high-end processors, but which will eventually run on desktop PCs. “Moore’s law will probably continue going on longer than people think, maybe to 2020, but what hasn’t been scaling as fast, and has actually been falling off, is the increase in actual processing that you get with all these extra transistors,” Bradski said. As the company looks into designing new computing architectures, “we’re studying these future algorithms in machine learning, optimization, simulation, and some more advanced graphic areas” that will likely be running on next-generation chips. Intel’s R&D team is using the code for its in-house research, but is releasing it to the broader community so that others can build on it and ultimately provide more code for Intel to test on its new chips, Bradski said.
Selling the machine learning library might have brought in “a couple of million dollars” for Intel, Bradski said, “but that’s nothing compared to if we can get machine learning going that will particularly run well on future architectures. So you start with the optimized library now, and the hardware is going to come up under it that will make the stuff really fast.”
Bradski acknowledged that there are other machine learning software libraries available today, but said that Intel’s is “more comprehensive” than other options, and has the least restrictive licensing model. The BSD license “allows free use in commercial or research [settings], and doesn’t force the user’s code to be open, so it doesn’t cause some of the same problems that the GNU GPL does,” he said. “This is really meant to be used commercially, in any way, open or closed, that you want … A lot of the open libraries that are free have licensing encumbrances that a commercial entity might steer clear of,” he said.
Within a month of OpenPNL’s release, nearly 1,400 users had downloaded the software. Bradski said he expects growth of the project to follow a pattern similar to that of the computer vision library, which was released in 2000, and now includes more than 500 algorithms and boasts more than half a million downloads. More than 450 academic organizations and 350 commercial entities are using the computer vision library today.
Those who download OpenPNL, however, shouldn’t expect it to solve their bioinformatics analysis challenges immediately. Bradski noted that the general-purpose engine “takes a lot of work” to modify for specific domain areas: “It’s an engine, so it’s sort of like, ‘We have a hammer, so how much work does it take to make a house?’” Intel is encouraging developers to build on top of the libraries to create specific toolkits for genomics, manufacturing, or other domain areas.
Bradski said he expects some results from the genomics collaboration with Hebrew University, the University of Delaware, and Tsinghua University by the second quarter of the year.