A team led by researchers from the European Bioinformatics Institute has developed a suite of five open source software tools called Investigation/Study/Assay, or ISA, which they claim is the first general-purpose open source suite of tools that lets researchers curate their experimental metadata using standard ontologies and convert the data into a format suitable for submission to public databases.
In a paper published earlier this month in Bioinformatics, the researchers describe how ISA addresses two recurring bottlenecks researchers face when they attempt to share the results of their experiments with the larger scientific community in public databases, such as the Gene Expression Omnibus and the European Nucleotide Archive, as required by many journals and funding agencies.
The authors write that on one hand, researchers have to contend with the varying 'omics data formats, models, and terminologies used by these public repositories to describe different types of assays. On the other hand, after the data has been submitted, there aren’t enough curators to annotate the datasets, a dilemma the authors suggest can be solved by improving “annotation at the source.”
To tackle these issues head on, the team developed ISA to “regularize local management of experimental metadata by enabling curation at source, supporting community-defined standards, and preparing studies for submission to public repositories.”
The suite includes tools like ISAcreator, which lets users create reports and edit experimental metadata; the BioInvestigation Index, a relational database for storing and querying the data; and ISAconverter, which lets users convert their data into formats accepted for submission to public repositories.
BioInform spoke to EBI’s Susanna-Assunta Sansone, one of ISA’s developers, about the software, including the team’s current efforts to train researchers to use the suite, to modify the tools to support other data formats required by public databases, and to create tools that are generic enough to meet the varying needs of researchers. Following is an edited transcript of the interview.
Can you provide some background on how the tool was developed?
We partnered in the carcinoGENOMICS project, [which] was the first [project] that provided funds, use cases, and requirements for why this software should be developed. We have since [partnered] with a wide variety of [researchers in the] community.
We started with the project in 2007, released a beta version [of the suite] in 2008, and the first version in early 2009.
It is very difficult to [develop] a tool that suits everybody's needs. Each experimentalist, lab, and institute has their own specific needs in terms of what tools would help them. Some [groups] have tools in house and they want our tools to work with their tools. What we have done is work with these [groups] to develop core functionalities so that [the researchers] can take the tools and develop them further according to their in-house requirements.
What niche does the ISA suite of tools fill?
The ISA software is a tool for managing experimental metadata, [by which] I mean all the contextual information that usually accompanies the data, [such as] the sample characteristics, the technology used, the measurement type, the parameters, the instruments, and how you relate the sample to the data that results from your experiment. It's really about describing the experimental metadata and [then] trying to standardize the descriptors that the experimentalists use. It’s a suite of software for local use in small to medium [sized] labs, which otherwise have no way to capture this information and most of the time store it in Excel sheets.
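The kind of structured metadata Sansone describes can be illustrated with a small ISA-Tab-style study table; the specific column values below are invented for illustration and are not drawn from a real submission:

```
Source Name    Characteristics[organism]    Protocol REF    Sample Name
mouse_01       Mus musculus                 liver biopsy    sample_A
mouse_02       Mus musculus                 liver biopsy    sample_B
```

Each row links a source material to the sample derived from it, with the descriptive fields (here, the organism and the protocol applied) captured in named columns rather than free text.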
The suite has three specific functions. It helps both experimentalists [and] curators, because nowadays curators are the people that annotate the experiment, and experimentalists can be curators themselves. The tool is aimed at labs that need to record and store information locally in a very consistent way, but also want to submit this information to public repositories.
Public repositories tend to have their own standards that they require the experimentalist to [meet]. Standards here are specific descriptions that [data repositories] want to see when the submission happens. For example, they want all the characteristics of the samples, and some databases want the experimentalist to use specific controlled terminology or ontology. The tools enable the experimentalist, at the source, to describe the experiment in a way that meets the requirements of the public repository.
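The controlled-terminology requirement Sansone mentions can be sketched in a few lines of Python. This is a toy illustration of what "curation at source" means, not ISA's actual implementation or API; the vocabulary, field names, and helper function are all invented for this example:

```python
# Toy check of the kind a curation-at-source tool performs: validating
# free-text sample annotations against a small controlled vocabulary.
# The vocabulary and field names below are assumptions for illustration.

ALLOWED_ORGANISMS = {"Mus musculus", "Homo sapiens", "Rattus norvegicus"}

def validate_sample(sample: dict) -> list:
    """Return a list of annotation problems found in one sample record."""
    problems = []
    if not sample.get("sample_name"):
        problems.append("missing sample_name")
    if sample.get("organism") not in ALLOWED_ORGANISMS:
        problems.append(f"organism {sample.get('organism')!r} is not a controlled term")
    return problems

samples = [
    {"sample_name": "sample_A", "organism": "Mus musculus"},
    {"sample_name": "sample_B", "organism": "mouse"},  # free text, not a controlled term
]

for s in samples:
    issues = validate_sample(s)
    print(s["sample_name"], "OK" if not issues else issues)
```

Catching "mouse" at entry time, rather than leaving it for a repository curator to reconcile later, is exactly the bottleneck the paper argues annotation at the source removes.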
How can researchers access and use the suite?
You download what you need. For example, you might have your own database which does a better job for your local needs and you are only going to use one or more components which [help] you to do the proper standardized description of your experiment and then submit it to the public repository.
We are going to work on a single bundle for those [who] want to download all the tools. We developed the tool in a modular way because each tool has a specific target user. Certain components are more for power users — for example [users who] are able to select which ontologies are relevant for the experimentalist — and some tools are directed toward the experimentalist who needs to be guided in the description process.
Can you give me an example of an experiment that used one or all of ISA's tools?
The example I could use is the carcinoGENOMICS project that guided the development of the tools. We designed these tools for them because it is a multi-site project [involving several research groups] that did parts of the experiment in different countries. Ultimately, they needed to collect this information and submit it to the public repository.
They were looking at the effects of certain chemical compounds on animal models. They produced microarray data, metabolomics data, and a lot of phenotypical information. We [used] a common set of terminology to describe the animal models, a common set of terms to describe the microarray experiments, the metabolomics experiments, [and] the phenotypical information. We collect[ed] this data using the tools, a central person did the curation, and now the data is being processed and submitted to the relevant public repositories.
Do the tools use a lot of the computer's memory?
Not at all; the largest component is the database, but everything else is very lightweight.
Have you received any feedback from users?
We collect [both] positive and negative experiences because both are important. The positives are that [researchers] are able to collect the data locally and send it to the public repositories. The main limitation is the curation: [it takes] time to properly annotate these experiments because some public databases require a lot of information to be submitted.
What are the next steps?
This is a collaborative process, so we collect requests and we are trying to [share] the workload with our collaborators. We are trying to make this tool [work] with the other tools that [researchers] have in house. We are also working with new communities who have found the tool useful and are approaching us. We have to do training and demos, usually web meetings and occasionally visits, [which take] a lot of work and time. We [also] have workshops to bring users together [for] further training.
We are going to support other [data] formats so that we can support other public repositories on top of the ones we already support. We have to improve the distribution mechanism and we also have to improve the packaging, so that the database component can be easily explored, even if you don't have a lot of specialists in house.