European Project Aims to Develop Tools, Guidelines to Improve Data Sharing, Access

NEW YORK (GenomeWeb) – A new European project aims to change the way life sciences data is managed and shared. The effort, called FAIRplus, commenced earlier this year with a budget of €8.2 million ($9.5 million). It is being funded by the Innovative Medicines Initiative, a public-private partnership that supports European health research.

ELIXIR, the European infrastructure for life science data, and Janssen are leading the effort. Altogether, FAIRplus involves 22 pharmaceutical companies, small- and medium-sized enterprises, and academic partners. FAIR stands for findable, accessible, interoperable, and reusable. The FAIR principles were first articulated following a meeting of European stakeholders in 2014. The FAIRplus project evolved out of the perceived failure of large-scale European projects to make the most use of the data they generated.

Though the IMI, which has a total budget of €5.3 billion, has supported over 100 projects to date, organizers felt that the large amounts of data generated in these projects have not been fully utilized.

Serena Scollen, project coordinator on the FAIRplus project and head of human genomics and translational data at ELIXIR at its headquarters in Hinxton, UK, noted that data generated by IMI-funded projects is often stored in "isolated silos," with inconsistent annotation or in incompatible formats. "At this moment, data interoperability has been identified as a major blocker for drug discovery in the pharmaceutical industry at large," Scollen noted.

By making data interoperable and searchable through applying the FAIR principles, project organizers hope to support drug discovery and development efforts and to create a fertile ground for applying new computational methods to the data as well as combining the data with third-party research outputs.

As such, Scollen called FAIRplus a "huge opportunity" for European science.

It will also have an impact on the genomics community, she underscored. While FAIRplus is focused on setting standards for improving all life science data sharing and accessibility, its principles were developed with genomics in mind. Scollen noted that one of the first papers to discuss the FAIR principles specifically referred to their application in genomics.

The paper, which was authored by an international research team led by investigators at the Center for Plant Biotechnology and Genomics at Universidad Politécnica de Madrid in Spain, appeared in the Nature Research journal Scientific Data in 2016.

"The aspect that distinguishes FAIR principles from other open data initiatives is their focus on the ability of machines to automatically process and use data," Scollen said. "This goes beyond the ability of researchers to find and exchange data."

Even for highly sensitive data, there are "clear benefits" of using FAIR data, she added, such as when reusing the data internally or applying new machine learning methods. "To do this effectively, the data and metadata need to be machine-readable so that the machines can act on it," Scollen said.
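As a rough illustration of what machine-readable metadata makes possible, the minimal Python sketch below filters a small set of dataset descriptions without any human reading them. The record IDs, field names, and values are hypothetical and do not come from FAIRplus or any IMI project; they only show the general idea of software acting on structured metadata.

```python
import json

# Hypothetical metadata records; all field names and values are illustrative only.
RECORDS_JSON = """
[
  {"id": "dataset-001", "data_type": "RNA-seq", "organism": "Homo sapiens", "format": "FASTQ"},
  {"id": "dataset-002", "data_type": "proteomics", "organism": "Homo sapiens", "format": "mzML"},
  {"id": "dataset-003", "data_type": "RNA-seq", "organism": "Mus musculus", "format": "FASTQ"}
]
"""


def find_datasets(records, **criteria):
    """Return the IDs of records whose fields match every criterion exactly."""
    return [
        record["id"]
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


records = json.loads(RECORDS_JSON)
matches = find_datasets(records, data_type="RNA-seq", organism="Homo sapiens")
print(matches)
```

Because every record uses the same field names and vocabulary, the same query works across all of them; inconsistent annotation would defeat this kind of exact matching.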

She noted that the paper provided a user scenario in gene regulation and gene expression where having FAIR data would be "hugely beneficial to researchers."

IMI now wants to see the FAIR principles applied to the projects it has funded and beyond. The IMI is a partnership between the EU and the European Federation of Pharmaceutical Industries and Associations (EFPIA), which represents the European pharmaceutical industry. The IMI has pledged €4 million to the effort, while EFPIA kicked in an additional €4.2 million to support it.

The work plan of FAIRplus includes identifying data sources, defining standards and processes to apply the FAIR principles or "FAIRify" the data, and developing an infrastructure to host data. There are five work packages. The first is focused on identifying data sources for FAIRification, with four pilot projects to be FAIRified by this June, followed by an additional 15 IMI datasets to undergo the process by the end of 2020.

The second work package will define standards for describing and linking elements of the datasets, as well as metrics for determining the FAIRness of the datasets. Among its deliverables will be a "FAIR Cookbook" containing guidance on how to FAIRify datasets. The guidance is expected to be ready by the end of 2021. FAIRplus also has work packages on implementation and infrastructure; communication and outreach; and project management, coordination, dissemination, and sustainability.
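The article does not describe how FAIRplus will measure FAIRness, so the following Python sketch is only one conceivable shape for such a metric: a checklist with one simple test per FAIR principle, summarized as a fraction. The check definitions and metadata field names are assumptions made for illustration, not the project's actual criteria.

```python
# Hypothetical checklist: one simple test per FAIR principle.
# These checks and field names are assumptions, not FAIRplus's actual metrics.
CHECKS = {
    "findable": lambda meta: bool(meta.get("identifier")),
    "accessible": lambda meta: bool(meta.get("access_url")),
    "interoperable": lambda meta: bool(meta.get("ontology_terms")),
    "reusable": lambda meta: bool(meta.get("license")),
}


def fairness_report(metadata):
    """Run every check and return (per-check results, fraction of checks passed)."""
    results = {name: check(metadata) for name, check in CHECKS.items()}
    score = sum(results.values()) / len(results)
    return results, score


# Example record that has an identifier and a license but nothing else.
example = {"identifier": "example-id-0001", "license": "example-license"}
results, score = fairness_report(example)
print(results, score)
```

A real metric would likely weight the checks and inspect the content of each field rather than its mere presence; the sketch only shows how a per-dataset score could be computed automatically.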

Pilot projects

One of the four pilot projects is OncoTrack, an omics-heavy, IMI-backed effort that could serve as a case study for applying the FAIR principles to other genomics projects. 

OncoTrack ran from 2011 to 2016 with the aim of establishing new ways for assessing biomarkers obtained from colon cancer patients via liquid biopsy. Samples underwent characterization by genomic sequencing — whole-genome in some cases, whole-exome in others — as well as transcript sequencing and array-based methylation analysis. Confirmatory genome sequencing and transcriptome analyses were performed on xenograft and cell-culture models. Drug response data for a panel of 16 therapeutic agents in those models was also generated, as well as proteomic data using multiplex mass spectrometry and other methods.

The result was a large body of data that could undergo FAIRification under FAIRplus: genomic, transcriptomic, and proteomic data on 261 patients, totaling about 60 terabytes, 40 terabytes of which still require archiving.

David Henderson, a principal scientist at Bayer in Berlin who is also the principal investigator of OncoTrack, said the hope of FAIRplus is to "propagate a set of standards so that future projects can already initiate their data collection and archiving activities using these standards and don't have to rework everything at a later date."

He added that the genomics community would benefit from having a "unified set of standards for data … for metadata, and for searching for information online that should make it easier to locate and reuse data."

Henderson said that genomics researchers already apply certain standards to the data they generate, but that the integration of other kinds of data necessitated the application of the FAIR principles in such projects.

"In the field of genomics, it's fairly straightforward," said Henderson. "Anyone who is doing whole-genome sequencing is working with similar systems. The problems start coming up more when you try to combine this data with information from clinical trial records or from patient records in large clinical centers. There is much more heterogeneity."

Henderson added that most new EU Horizon 2020 or IMI projects are required to have data management plans, and that application of the FAIR standards could be a component. "The idea of common standards, or FAIRification, can easily be integrated into that, and then life would be a lot easier for everybody," he said.

According to Scollen, the OncoTrack data was initially stored in a database for project partners, and researchers outside the project could not access it. FAIRifying the data will involve identifying, describing, and linking elements of the datasets to make them accessible to external researchers.

"The scientific and societal impact from the systematic reuse of the OncoTrack data includes the potential for development of new diagnostic procedures and the wider implementation of the developed biological models in the translational research process," noted Scollen.

She noted that OncoTrack represents a "retrospective FAIRification," and that in the future other projects could implement the principles at the point of data collection.

Another of the four projects selected to pilot the implementation of the FAIR principles is Research Empowerment on Solute Carriers, or RESOLUTE, which commenced last year and involves 13 partners in academia and industry. The project aims to establish solute carriers, a group of membrane proteins, as a target class for medical R&D. Data types will include RNA-seq and quantitative gene expression analysis.

Scollen noted that as a newly commenced project, RESOLUTE, through its cooperation with FAIRplus, will seek to build the FAIR methods into its data management approach from the "earliest stages of its operation."

FAIRplus is currently working on formal criteria and guidelines for selecting projects for FAIRification going forward, Scollen said. The criteria should be available by the end of the year. Ultimately, FAIRplus aims to implement the FAIR principles in around 20 such projects.

Project organizers also hope that any tools or guidelines developed will be adopted widely by other research projects and organizations, not just those supported by IMI, Scollen noted.