CHICAGO – Nearly seven years after it first appeared online as a beta release, the Ontology Development Kit (ODK) has matured into an open-source software package used by the developers and keepers of dozens of biomedical ontologies.
The ODK is meant to implement the principles of the Open Biological and Biomedical Ontologies (OBO) Foundry, a community of ontology developers that seeks to streamline and standardize processes and promote interoperability.
"It gives you confidence that the ontology has been engineered in a standard, consistent way," said Christopher Mungall, head of biosystems data science at Lawrence Berkeley National Laboratory in Berkeley, California.
Mungall is among a group of developers, including David Osumi-Sutherland and Nicolas Matentzoglu of the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), who have been involved in ODK development since the beginning. "[We are] trying to loosely coordinate efforts of different people developing ontologies for a number of years" as members of OBO, Mungall said.
Mungall was the lead author and Osumi-Sutherland the corresponding author of a paper published last month in Database: The Journal of Biological Databases and Curation, describing the development and utility of the Ontology Development Kit.
Mungall and Osumi-Sutherland have collaborated on several ontologies in the past. Mungall wrote the original version of Uberon, which Osumi-Sutherland now helps lead. They also worked together on the Cell Ontology for single-cell sequencing.
The ODK is made up of two basic components: a series of workflows for building and managing ontologies, and a toolbox to execute the workflows. "The ODK simplifies the process of maintaining an ontology, allowing [ontology developers] to focus on content rather than technical aspects of maintenance," according to the paper.
The workflows follow best practices recommended by the OBO Foundry for building scripts, releasing updates, running quality checks, and importing terms from other ontologies. The toolbox is distributed as a Docker container.
"Historically, everyone had their own ad hoc processes for building their ontology," said Mungall, a principal investigator for many ontology-related projects, including the Gene Ontology Consortium, the Monarch Initiative, and Phenomics First. They might all use the same ontology development tool, called Protégé, but they would manage files in very different ways.
"We wouldn't have good practices for version control of the ontology files that they would make," Mungall said. GitHub allows for version control, but each GitHub repository is laid out differently and has different workflows and procedures for checking the accuracy of ontologies.
"We realized there was a need to essentially standardize these workflows and allow people to come up with a common GitHub project structure such that it's very easy for people to go from one ontology to another," Mungall said.
The creators said that ODK is now involved in the maintenance of more than 70 mostly biomedical ontologies, including the Human Phenotype Ontology, the Cell Ontology, the cross-species Uberon, the Phenotype and Trait Ontology (PATO), the Brain Data Standards Ontology, and the Provisional Cell Ontology (PCL).
The champions and developers of various ontologies have discovered and adopted the ODK mostly by word of mouth, and the vast majority are in translational research. "Even though it can be used for any ontology, we're mostly interested in supporting ontologies that intend to become part of OBO," Mungall said. However, he added that "broader clinical terminologies are welcome to adopt" the ODK.
It can be difficult for existing ontologies to migrate to the ODK retrospectively, so some users have only partially adopted it, according to Mungall. "They'll use the Docker container or something like that [so they don't] change their entire structure to be completely ODK-compliant," he said.
"The main thing that we care about is new ontologies," Mungall said.
One key objective of this toolkit is to standardize quality control. For example, the ODK looks for data that is inconsistent. "A sample that is annotated to be both a T-cell and a neuron at the same time is probably going to be erroneous," Mungall noted.
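The kind of inconsistency Mungall describes can be illustrated with a toy check. The following is a minimal sketch in Python, not the ODK's own implementation: the function name, the sample identifiers, and the assumption that "T cell" and "neuron" have been declared mutually exclusive are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical declaration: these two cell types are assumed to share no members.
DISJOINT_CLASSES = {frozenset({"T cell", "neuron"})}


def find_disjointness_violations(annotations, disjoint_pairs):
    """Return (sample, class_a, class_b) for each sample annotated to two disjoint classes."""
    violations = []
    for sample, classes in sorted(annotations.items()):
        # Compare every pair of classes assigned to the same sample.
        for class_a, class_b in combinations(sorted(classes), 2):
            if frozenset({class_a, class_b}) in disjoint_pairs:
                violations.append((sample, class_a, class_b))
    return violations


# Hypothetical annotations: sample_2 carries two mutually exclusive cell types.
annotations = {
    "sample_1": {"T cell"},
    "sample_2": {"T cell", "neuron"},
}

for sample, class_a, class_b in find_disjointness_violations(annotations, DISJOINT_CLASSES):
    print(f"{sample}: annotated as both {class_a} and {class_b}")
```

Running a check like this automatically on every change, rather than relying on a curator to spot the conflict by eye, is the general idea behind catching such errors early.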
"The message I try and get across is just bringing what we've learned from robust software development over to ontology development as well," Mungall said. "We're just trying to bring across everything that we've learned from best practices into ontology development."
Shawn Tan co-led development of the Provisional Cell Ontology along with Osumi-Sutherland and Huseyin Kir at EMBL-EBI, though that ontology actually grew out of single-cell transcriptomics work in the lab of Richard Scheuermann, director of informatics at the US-based J. Craig Venter Institute.
"The idea was that these cell types were not quite ready for the cell ontology … as a temporary place to store these," Tan explained.
Tan said that the ODK allows ontologies like PCL to be more flexible and collaborative, and allows ontology developers to easily perform tasks like dynamic imports and ensure that their ontologies remain up to date and compatible with other ontologies.
"ODK is a community-driven tool," Tan said, not funded by membership fees or direct grants. Researchers contribute to the development and maintenance of the ODK and of specific ontologies because these resources help them do their jobs better.
The Provisional Cell Ontology is involved in the US National Institutes of Health's new $500 million BRAIN Initiative Cell Atlas Network (BICAN) project that is attempting to map the approximately 200 billion neurons and other cells in the brain through single-cell sequencing, noninvasive medical imaging, and advanced bioinformatic analysis. The group also participated in the earlier BRAIN Initiative Cell Census Network (BICCN).
Much of EMBL-EBI's work with BICCN, which looked at the primary motor cortex in mice and humans, involved single-cell transcriptomics with cell types that still needed annotation.
Tan said that his group at EMBL-EBI is "really, really heavily" working on this provisional ontology in preparation for BICAN. "The dataset is going to be huge compared to even the BICCN, which was really huge," he said.
Tan said that he and his colleagues got involved with ODK when PCL developers were trying to integrate their work with Cell Ontology, which had already adopted the kit. "Our aim was also to get [PCL] into the Open Biomedical Ontologies Foundry," he said.
Tan also said that the ODK is an easy way for someone like himself, who is not trained in software engineering, to participate in ontology development and maintenance.
"Having something like that allows me to use a whole wide range of tools that I personally would not be able to handle by myself," he said. "Having this centralized thing that we can work together [on] as a community, that's the other big one for me."
"Problem-solving, troubleshooting, it all just becomes a lot easier when we know what tools we are using," Tan said.
The ODK also supports "social coding," a collaborative approach to open-source software development.
"We all contribute to this ontology and this ontology is a community tool," Tan said. "We think it helps the community get more involved with ontology building and hopefully that means the ontology gets closer to what biologists want."
The most recent update of the ODK, version 1.3.1, was released in June.
The paper's authors said that they have "already observed significantly lower error rates in many of the ontologies that use the ODK, thanks to the ability of the automated testing system provided by the ODK to catch errors early on," but they were light on details.
The ODK will be regularly updated with features including new quality-control tests and other tweaks that the creators said would adhere to the FAIR principles of data being findable, accessible, interoperable, and reusable.
The current release is not built to prevent "bad ontology modeling," according to the paper. "We hope to be able to make stronger use of design pattern-based validation and advanced semantic validation techniques," including the Linked Data Modeling Language (LinkML), that will help prevent human error, the authors wrote.
However, the ODK creators said in the paper that a future version will reconcile workflows with those in other frameworks, specifically naming OntoAnimals and a derivative called Ontofox, a tool for fetching ontology terms and axioms.