Q&A: SIB's Pascale Gaudet on New Guidelines to Enable Data Exchange between Biological Databases


An international group of researchers is spearheading an initiative that aims to develop and propagate a list of guidelines that will ensure that database providers supply enough information about their databases to make data exchange with other resources possible.

The consortium published an editorial in the recent database issue of Nucleic Acids Research that includes a preliminary checklist of 17 guidelines, dubbed BioDBcore. The current list is by no means complete, however, and the group is seeking the community's input and participation in the effort.

In the paper, the authors wrote that adopting BioDBcore, which they describe as "a community-defined, uniform, generic description of the core attributes of biological databases," will "encourage consistency and interoperability between resources" as well as "promote the use of semantic and syntactic standards," among other goals.

The proposed guidelines are similar to the more than 30 checklists that comprise the Minimum Information for Biological and Biomedical Investigations, or MIBBI, portal. In fact, several MIBBI representatives are members of the BioDBcore group, along with editors from journals like NAR and Database, members of the Asia-Pacific Bioinformatics Network and the European Life Science Infrastructure for Biological Information, and others. The creation of the checklist is overseen by the International Society for Biocuration and the BioSharing Forum.

Currently the group plans to implement BioDBcore in three phases. Phase one involves consulting and encouraging participation from interested parties, while phase two will see the joint development of a comprehensive checklist.

During implementation — planned for the third phase of the project — the group plans to create a "public submission website" that will make it easy for database developers to both enter and update their data.

This week, BioInform spoke with Pascale Gaudet, who is the neXtProt scientific manager at the SIB Swiss Institute of Bioinformatics and one of the authors of the BioDBcore paper in NAR. What follows is an edited version of the conversation.

Let's start with some background on how the International Society for Biocuration got started.

The International Society for Biocuration was formed to promote the work of biocurators. In the past 10 years there has been a big explosion of data and databases that [are] still quite disorganized and many don't have secure funding. Essentially we are trying to be better organized as a group of professionals. Our major activities are international conferences where we meet and interact with each other and [we are] also trying to [make] funding agencies aware about the work that we do.

Legally we are incorporated in Switzerland but, as [is the case with] every database, we live in the webspace so we don't have real offices. The ISB is still quite young and our only revenue comes from membership fees.

Tell me about the group's involvement in setting up and maintaining BioDBcore.

Before the biocuration society was founded, an ongoing goal from this [and other] groups has been to have better data flow from researchers into the journals into the databases. Right now there are problems with standardization of data types, data exchange, websites that come and go, and supplementary material [containing] a lot of data that isn't in a uniform format. The volume of data curated suffers because we [have to] take valuable time to determine what protein we are looking at, or convert tables into a format that databases will be able to read before this data can be annotated, [for example].

A few databases have been working with publishers and journal editors to encourage better practices in data sharing and how people describe the biological objects. For example, nucleotide sequences and protein structure IDs are routinely requested by journals and that facilitates integration of the data into public databases. However, this is difficult to extend to the large number of databases available today, and when we requested that publishers [provide] the types of identifiers and annotations they ask from researchers, we realized that we needed to specify the format through which the data can be exchanged. BioDBcore's goal is one step in that direction.

Is there a data exchange structure in place?

The minimal information [guidelines] about different types of experiments are grouped under [the] Minimum Information for Biological and Biomedical Investigations, or MIBBI. So for different types of experiments, there are guidelines about how to exchange the data for specific data types, but there is no centralized format to comprehensively describe a database, which holds different kinds of data, and then to describe that data, such as whether it applies any of those minimal guidelines, and so on.

A lot of databases have this [information] on their sites in one or several readme files, but it's not present anywhere in a centralized way. Several groups have tried to compile lists of databases, the most stable one probably being the NAR list of databases, but it's been very difficult to maintain those lists because not everyone is aware of [them], and people [do] not have incentives to contribute to those lists.

We are trying to set up a flexible format by which anyone providing those types of lists [containing] meta information about databases could just export it, including the databases themselves. This way we could exchange it and everyone can more easily maintain it. This work is also done in collaboration with BioSharing, an organization whose goal is to interconnect journals, funders, and researchers to implement good data sharing practices.

You already have a list of 17 descriptors on your website. Why did you select these?

It's typical information that one needs to access a database. I don’t think there are any surprises there. This is the basic information describing how to access a database, its terms of licensing, and the data represented.
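As a rough illustration of what such a core descriptor record might look like, the sketch below captures a database's access, licensing, and content metadata as a simple record and checks it for completeness before export. The field names here are hypothetical placeholders, not the actual BioDBcore descriptor list:

```python
# Illustrative sketch only: the field names below are hypothetical, not the
# official BioDBcore descriptors. The idea is that core metadata about a
# database can be captured uniformly and checked for completeness.

REQUIRED_FIELDS = {
    "name", "url", "contact_email", "license",
    "data_types", "release_frequency", "standards_used",
}

def missing_descriptors(record: dict) -> set:
    """Return the required descriptors absent from a metadata record."""
    return REQUIRED_FIELDS - record.keys()

# A hypothetical resource, described with the placeholder fields above.
example_db = {
    "name": "ExampleProteinDB",
    "url": "https://example.org/db",
    "contact_email": "curators@example.org",
    "license": "CC BY 4.0",
    "data_types": ["protein-protein interactions"],
    "release_frequency": "quarterly",
}

print(sorted(missing_descriptors(example_db)))  # → ['standards_used']
```

A centralized registry could run a check like this on submission, so that every listed database exposes the same minimal set of descriptors in a machine-readable way.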

Following up on that, isn't that information normally included with the database or in papers describing it?

The information is not systematically made available and it's not made available in any one specific way. For example, if you take any database and you want to know the data release frequency, you might have to search through the readme file [although] you may or may not find it. You can find the URL by googling it but then you might need to search for the e-mail in [another location] if you want to contact the [database provider]. You can find all this [information] but it takes some work.

The other thing that is not captured systematically, which is where we are trying to bring a lot of value, [relates to] the database scope and standards used. Usually that's known but often someone will publish a paper about a specific aspect of the database, for example protein interaction data and relevant data standards; however, for a single database, that information may be found across several papers.

It seems to me that there are a lot of small specialized databases that contain information that could be lumped into much larger databases. Why do researchers take that route rather than submitting and ultimately accessing their data from these more comprehensive resources?

I think it's just a matter of practicality rather than a preference. It's not trivial to develop a database, but right now the informatics tools that exist make it pretty easy to set up something on an as-needed basis, about mitochondrial proteins in zebrafish for example. A researcher can just parse a list, organize it with lab data, and create a webpage. But to have a complex relational database where you have biocuration and quality control, and you keep the data up to date, that's a huge operation. [These] specialized databases usually serve purposes not accomplished by the larger databases [and] therefore stay in operation to fill that niche.

What are some challenges of creating a resource like this?

One challenge is the fact that biology is so diverse [and] there are a lot of different types of data to represent. We are working on providing the vocabulary we need to describe the scope of the different groups [but it's] difficult to have something that is understandable at the human scale and has all the complexity of all the different databases and so we need to have the right balance there.

Another challenge is encouraging people to make the effort and provide the data about their database. We are trying to make BioDBcore attractive in that it would give a database more visibility and the opportunity to be presented on one or several websites that could present all the databases in this uniform way. For example, I imagine we could have a website where one could search for which databases would accept, say, yeast protein-protein interaction data in a certain format, and know where that data could be stored. This also has implications for data-sharing plans in grant applications. [Researchers] could use this centralized resource to discover new databases that have specific functionalities.

Won't the guidelines mean a lot of extra work for database developers and make it less likely that they will be adopted?

Yes, it is a little bit of work but we are hoping that the cost will be exceeded by the value brought by the visibility from having meta information about your database available in one or many centralized places. That is the incentive for people to participate.

The goal is to have an interface where the data can be provided and converted into the [required] format so I imagine the extra work would be perhaps one day a year, maybe less, because once you have this filled, you can just maintain it once a year if there are updates and changes.

In the paper, you mention that BioDBcore will be implemented in three phases. Which phase are you on now?

We are between Phase 1 and Phase 2. Phase 1 is ongoing, as we will continue to [encourage] people to participate and now we are starting to consult with the group and find exactly what information we are going to collect.

The implementation will be a little bit later when we all agree what the descriptors will be and how we will capture them.
