A New Champion for Open Source

One of the driving forces for technological innovation in life sciences informatics has been the backing of large, public-sector funders. In North America, this pretty much means the US National Institutes of Health. The direct result of NIH funding for informatics projects has been dramatic improvements in the application of advanced statistics and software design in biomedical science. This funding has come either from the technologically focused centers, such as the National Center for Bioinformatics and the National Center for Research Resources, or from the more disease-focused institutes, such as the National Cancer Institute.

Informatics research funding has been largely free of restrictions: the use of the products of the research was left to the discretion of the individual researcher. This was a reasonable state of affairs when the products developed were the goal of the research; for example, the implementation of an algorithm for predicting trends in molecular evolution. However, with the advent of large-scale proteomics and transcriptomics projects, much of the informatics being developed is designed specifically to enable laboratory research efforts, rather than being an end unto itself. By "enabling," I mean that while you can repeat someone else’s experimental protocol based on a published description, you can’t interpret the results without either direct access to the original software or developing your own equivalent code.

This situation has had a negative impact on the broader value of publicly funded efforts. Developing your own code to replace that developed by another group places you in a Catch-22 situation: you can only obtain funding to develop novel bioinformatics, but if you are reproducing someone else’s existing code base, your work is by definition not novel. This fact of funding has had the effect of giving the original group a considerable practical advantage, encouraging what Richard Stallman referred to as “software hoarding.” By retaining the software or only releasing difficult-to-use binary versions, the originating laboratory can build on its original success while making sure that competitors can never quite catch up.

I don’t mean to imply that this situation is always caused by some Machiavellian plan on the part of an evil genius. Often the reason for not releasing the original code is that it was written in such haste and with so many kluges to accommodate the changing requirements of lab scientists that the final result is not something that the authors want their own colleagues to see. The necessary polishing required to create something they can be proud of ends up being a low priority for the lab-based collaborators, for whom such time-consuming efforts hold no benefit.

The NIH has recognized that the failure to fully release code developed for larger scale projects has become a problem. In an effort to fix the situation, NIH RFA documents have begun to include sections like this:

“A software dissemination plan, with appropriate timelines, must be included in the application … NIH does have goals for software dissemination, and reviewers will be instructed to evaluate the dissemination plan relative to these goals.”

What are these goals? The software must be freely available to the nonprofit sector, but licensed in such a way as to allow customized versions to be included in commercial packages. The licensing must allow access to the source code, so that other groups may modify it and share those modified versions. The software must also be transferable, so that if the group that holds the “official” version decides to discontinue supporting it, another group may take over the project. Being in charge of the “official” version also requires that group to provide some way to distribute other versions and modular extensions to the original.

One can only assume that applications complying with this new requirement will be more likely to be funded than those that do not. It is also fairly safe to assume that successful applications will have the necessary budget items, milestones, and timelines so that this type of work can be completed. What isn’t clear is just how research groups will adjust to carrying out these activities and engage in the larger world of open source software development.

It may be a hard row to hoe for the biologists who typically write and administer large NIH grants. Even experienced informatics researchers might be tempted to simply rely on buzzwords from the old free software movement to attempt to meet these goals. Unfortunately, many of these older (often idealistic) concepts don’t really apply in this case. Examples of well-known, but inappropriate, open source ideas are the “GNU General Public License” and “copyleft.” These concepts limit commercial use of software in such a way as to make it practically impossible to use any sections of code covered by this type of license. On the other side are very simple documents, such as the “MIT License,” which are little more than copyright statements. These give the developer very little legal protection and leave the user without any clear idea of the responsibilities that the developer is willing to assume.

One of the few really good things that came out of the Internet bubble was the development of legal documents and ideas that can be applied to the development of open source software intended for use in a for-profit environment. The Open Source Initiative keeps a good repository of template documents that can be used to educate yourself about what is covered by open source licenses. Many of them, such as the “IBM Public License” and the “Intel Open Source License,” are clearly meant to give access to the code while preserving its usefulness for commercial use. You are free to use any of these templates — under the “Open Software License,” of course.  

Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait’s group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.
