Following several weeks of e-mails and conference calls between core members of the open source Bioconductor project and key Affymetrix informatics staff, Affy has decided to make its GeneChip file formats publicly available so software developers can access them directly.
Affymetrix has not previously supported public efforts to access its DAT, CEL, CHP, CDF, or EXP file formats directly, but instead has provided application programming interfaces for developers to access the files. This approach, which ensures that developers don’t have to rewrite their software every time the company changes its platform, has worked fine for third-party commercial software developers and others able to license the API.
But the Bioconductor microarray analysis software package is built on the R statistics package, an open source project under the GPL, and is not able to use the compiled code for Affy’s APIs. So far, this hasn’t posed too much of a problem, since many of the file formats are in ASCII. “So you could just open them with [Microsoft’s] Notepad,” said Rafael Irizarry, a Bioconductor developer at Johns Hopkins University.
However, earlier this year, Affy announced plans to convert the format for its CEL files, which contain summarized probe-level data, from ASCII to a proprietary binary format in order to reduce file size and speed data access.
Although the company planned to provide an API to read the format, as well as a MAGE-ML exporter to convert the data back into ASCII, neither of these options was suitable for the Bioconductor project, which found itself faced with the prospect of discontinuing support for Affy arrays when the ASCII format is dropped at the end of the year.
Aside from legal or IP issues surrounding use of the non-open API, if Bioconductor were to write to the interface, the project’s volunteers would have to redistribute the code as precompiled libraries — something they don’t have the manpower or resources to do for multiple platforms. In addition, the MAGE-ML exporter option, while technically feasible, would impose “significant information restrictions and performance costs,” according to an open letter issued by the Bioconductor core team on June 27.
The letter, signed by 21 Bioconductor developers, called for Affy to “open the new file format to support and encourage research and development in the microarray analysis domain.”
Within a week, the company did just that.
Opting to open the file formats “is not a decision without risk,” said Scott Jokerst, senior product manager for data management products and head of Affy’s external developers’ program, but surprisingly, the risk doesn’t involve intellectual property. “I know people consider that Affymetrix has a large IP estate, but file formats isn’t one of those areas,” he said. “It’s more about how do you grow a robust software community?”
Jokerst said that Affy encourages “creative alternatives” such as Bioconductor in the microarray analysis sector, and put a lot of thought into providing enough information for third-party developers to write software for the Affy platform while ensuring those packages won’t break every time the company changes its formats.
The shift to the binary CEL format is expected to greatly reduce file size and speed data access as future, smaller versions of the GeneChip pump out ever more data, but the company initially didn’t foresee any difficulty in keeping the format proprietary.
Now, in addition to the options of an API license or MAGE-ML access, developers can request direct access to any of Affy’s file formats. Affy will release a new version of its software using the new CEL format by August.
The company has a new message board (http://www.affymetrix.com/support/developer/index.affx) for developers.