BOSTON — Launched in 2006 with 15 members, the Microsoft-led BioIT Alliance has grown to 77 members and plans to expand its scope in a world in which the road to interoperability for users or vendors is not yet fully mapped.
Rudy Potenzone, director of the BioIT Alliance and industry technology strategist for pharmaceuticals at Microsoft, said at a talk at this week's Bio-IT World Conference that one way the initiative plans to expand in the year ahead is by launching a SharePoint portal that will let users build and try out web components.
Microsoft officials said that they are seeing some evidence that the alliance, created to use Microsoft tools in order to foster interoperability in the life-science market, is delivering on its goals.
As an example, Les Jordan, Microsoft’s life-science industry technology strategist, pointed out during a conference session that Thermo Fisher Scientific recently announced it was starting to enable results from its lab equipment to be output into the Open XML format.
With Open XML, he said, scientists can pull data off an instrument via a web service and import it directly into the “scientist’s favorite tool,” Excel. The data could then travel on to a high-performance computing cluster via a web service call, while users can access it in a portal where it can be viewed, searched, and shared. “It is sitting in an open standard and people can access it through a Web service,” said Jordan. “This is going to allow people to innovate.”
Richard LeDuc, a bioinformaticist and co-director of Washington University’s School of Medicine’s proteomics and mass-spectrometry core facility, voiced concerns about applying this idea in his facility because the conversion from Thermo’s .RAW mass spectrometry files to Open XML files dramatically increases their size. He described himself as a heavy Microsoft user who also develops in the Microsoft environment.
Although he has not transformed files to Open XML, in the past he and his colleagues have written code to pull mass-spectrometry data off instruments and to transform the proprietary .RAW binary files into XML. “They tend to explode in my experience on the order of threefold,” he said. “Once you put [the data from a run] in an XML file, you need to wrap a tag on the beginning [and] a tag on the ending … [and] if you have multiple hierarchies to this, which you usually do, you are going to have to wrap that tag in a complete set of tags.”
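LeDuc's point about tag overhead can be illustrated with a minimal Python sketch. It does not reproduce Thermo's .RAW layout or the Open XML schema; the element names (`run`, `scan`, `peak`, `mz`, `intensity`) and the 32-bit float packing are assumptions chosen only to show how nested tags inflate numeric data relative to a packed binary encoding.

```python
import struct
import xml.etree.ElementTree as ET


def binary_size(peaks):
    """Bytes needed to store (m/z, intensity) pairs as packed 32-bit floats."""
    flat = [value for peak in peaks for value in peak]
    return len(struct.pack(f"<{len(flat)}f", *flat))


def xml_size(peaks):
    """Bytes needed to store the same pairs in a nested, hypothetical XML layout."""
    run = ET.Element("run")            # outermost wrapper for the whole run
    scan = ET.SubElement(run, "scan")  # one level of hierarchy inside the run
    for mz, intensity in peaks:
        peak = ET.SubElement(scan, "peak")  # every peak gets its own tag pair
        ET.SubElement(peak, "mz").text = f"{mz:.4f}"
        ET.SubElement(peak, "intensity").text = f"{intensity:.1f}"
    return len(ET.tostring(run))


if __name__ == "__main__":
    # Synthetic data for illustration only, not real instrument output.
    peaks = [(400.0 + 0.5 * i, 1000.0 + i) for i in range(1000)]
    b, x = binary_size(peaks), xml_size(peaks)
    print(f"binary: {b} bytes, XML: {x} bytes, ratio: {x / b:.1f}x")
```

The ratio this prints depends on the schema, the numeric precision, and any compression applied afterward, so it should not be read as a measurement of real instrument files.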
LeDuc said he is worried that the same challenge might occur with Open XML, and cautioned that such file ballooning could present a storage problem for core facilities. At any given moment his facility has an average of 20 large projects going on, he said. He also has a 10-terabyte RAID array with a similarly sized backup for the data he and his colleagues generate and process. Although LeDuc does not feel he has been hit with a data avalanche just yet, he said he is concerned about a process that could create one.
A typical mass-spec scan might yield half a gigabyte of data, while the processed data he delivers to the end-user is much smaller, usually on the order of megabytes or even less, he said. What his facility sends back to the scientists “is at its heart an Excel spreadsheet of, ‘here are the proteins, here is information about what proteins are present.’ That is really what most of our end-users want,” said LeDuc.
While his facility keeps the .RAW files, particularly since many of the samples at his biomedical facility are precious, he said he is hesitant to increase the size of those files by threefold.
In the session at Bio-IT World, Jordan responded to LeDuc’s concern by saying “if it’s only in a format that only you understand, then it is only useful to you,” so the idea of the BioIT Alliance is to help find ways to get data in a common format so that other scientists can access and use it.
“The issue isn't so much of .RAW files to Open XML as it is .RAW to any XML format,” Jordan wrote in an e-mail to BioInform after the event. “That is a problem inherent with many of the standards based on XML.”
Jordan suggested that the answer to this problem lies in applying standards not to all the high-throughput data that scientists pass on, but rather “the metadata, or the more important parts of the data.”
Juggling massive amounts of data is typical in an era in which scientists now have the ability to park a computer cluster under their desks and wield, as Jordan said, an “amazing advance in computer power” with data that is not easily transportable to other analysis software. There is no standard with which to move data from point A to point B, and multiple users, software, and instruments get scientists “locked in” to a vendor’s analysis software.
“I don’t care if you are running a Linux box, a Mac, a Microsoft box, you should be able to take the data and move it from one point to another,” which is the point of the alliance, Jordan said in his talk.
Potenzone explained that he is glad to see a broad spectrum of organizations join the BioIT Alliance, including bioinformatics companies, large hardware and software firms, off-shore companies, lab equipment vendors, systems integrators, and text-mining firms. “In terms of breadth we have absolutely met our wildest dreams,” he said.
He is particularly happy about the International Union of Pure and Applied Chemistry’s participation in the alliance. IUPAC, which joined last fall, develops recommendations for the names of chemicals and developed the machine-readable International Chemical Identifier, or InChI, standard.
IUPAC “is the first in the class and other standards groups have queried us about what does it mean to belong,” said Potenzone. “It may be another aspect to the alliance: if we can get some of the standards groups interested it helps us foster standards that we all think are vital.”
Web services are becoming increasingly common as firms expose their client or program on the Web, but Jordan noted that the “part that’s missing” from web services is the standards. “That is why the standards, the chemistry, the biology standards, are important.”
If, for example, a researcher in drug discovery has a new chemical compound and wants to know whether similar chemicals exist, “the structure can be packaged and sent with InChI with a request,” he said. The request triggers a search whose results are returned to the scientist, with web services providing data transport and search of the data. “Data transport is the noun, the web service is the verb,” said Potenzone.
“As people build their cool tools, they don’t have to give you all kinds of specifications, all they have to tell you is what their web service is and how you formulate the syntax as you send it,” he said. “You don’t need to know anything about the application. It can be running on your machine, on their machine, it could be a cloud.”
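As a rough illustration of what "formulating the syntax" of such a request might look like, the Python sketch below packages an InChI string into a web service call without sending it. The endpoint URL, the JSON field names, and the `build_similarity_request` helper are all hypothetical; the article does not describe the actual interface of any alliance member's service. The InChI shown is the standard identifier for ethanol.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration; not a real alliance or vendor URL.
SEARCH_URL = "https://example.org/chem/similarity"


def build_similarity_request(inchi, max_hits=10):
    """Package an InChI string as a POST request for a similarity search.

    The payload field names are assumptions made for this sketch.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("expected an InChI string beginning with 'InChI='")
    payload = {"structure": inchi, "format": "inchi", "max_hits": max_hits}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Build the request only; nothing is sent over the network here.
    req = build_similarity_request("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The caller needs to know only the address and the payload format, which is the point Potenzone makes: nothing about the application behind the service, or where it runs, appears in the request.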
Potenzone said in his talk that the alliance plans to build a SharePoint site through what Microsoft is calling an Office Business Application Composition Reference Toolkit, which will let users build and try out web components. SharePoint is not just a place to share files, he said; it also provides a collaborative workspace and lets users build portals and workflows, with a search environment and business-intelligence capabilities as well.
The toolkit can be downloaded here.
As an example of a BioIT Alliance success story, Potenzone noted that ChemZoo has built a website called ChemSpider, which offers 20 million free, searchable chemical compounds and is “a bit like a Wikipedia for chemistry.”
“If you type a compound name into Microsoft Word … you can right click on that and query the ChemSpider database,” said Jordan. That functionality could also involve querying a proprietary database. This is an application for a rather underutilized feature in Word called Smart Tags, which Novartis has implemented for its own proprietary searches, said Potenzone.
In his talk Jordan explained to his audience that when data is shared it might help drug discovery firms to find the next blockbuster in a “discarded” portfolio. “The only way to know that is by making the data widely available between applications, and between collaborators,” said Jordan.
Given the fact that much scientific data generated in labs travels through the Microsoft tools in analysis, “we feel we have a responsibility to the community to make sure these tools are really able to handle what is needed,” said Potenzone.
Further information about the BioIT Alliance is available on its website.