PHILADELPHIA — Data integration has long been a challenge for life science informatics, but a number of new tools and methodologies are coming online with hopes of finally solving this problem, according to speakers at the fourth annual Data Integration in Life Sciences conference, held here this week at the University of Pennsylvania.
The three-day workshop, sponsored by Penn’s engineering department and genomics institute in collaboration with Microsoft Research, centered on several key themes that are expected to help researchers combine and analyze disparate data sets and tools: namely workflows, provenance, and the importance of semantics and controlled vocabularies for data annotation.
Topics ranged from neuroscience to proteomics to SNP data analysis and beyond, but all centered around data integration.
Workflows on the Agenda
Katy Wolstencroft of the University of Manchester School of Computer Science provided an update on the myGrid/Taverna project and a related effort called UTOPIA (User friendly Toolkits for Operating Informatics Applications).
With in silico experiments traditionally requiring ad hoc collections of scripts and programs to process and visualize biological data, the group is developing two software frameworks to address the limitations of current solutions, which Wolstencroft described as “architecturally fragile” and poorly scalable.
The myGrid/Taverna framework is a well-established workflow design and enactment environment for building experiments, while UTOPIA is a flexible visualization system to help interpret experimental results.
Wolstencroft also discussed a new project underway called MyExperiment, a workflow system derived from the University of Manchester’s existing solutions.
In an interview with BioInform following her talk, Wolstencroft said that MyExperiment is somewhat similar to the MySpace social networking site, except that rather than having specifically social goals, her site is designed for scientists looking to share their work with peers.
“We want it to be a place for [scientists] to be able to share their workflows and share their experiments, and share their experiences of those experiments as well. But we also, on the other hand, want it to be … a portal for people who don’t want to build workflows or run workflows,” Wolstencroft said.
At the moment, she said, people who build workflows in Taverna tend to be bioinformaticians or people with some computing experience. “But there are a lot of biologists out there who want to do workflow experiments but they don’t really want to [invest the time it takes to use] Taverna. So the portal should be a place for them to find workflows that other people have already built,” she added.
A beta version of MyExperiment is slated to go live in late July, Wolstencroft said.
The Transparency of Provenance
If workflows were the gravy for the conference, the concept of provenance — tracking the origins of data in order to better assess its reliability — was the meat.
Several speakers noted that IP issues and the need for transparency in biological research are driving demand for biologists, bioinformaticists, and computer scientists to more carefully account for their work. Researchers can’t just present the results of an experiment, but must be able to describe where the underlying data came from and who has had access to it.
Can Türker of the Swiss Federal Institute of Technology’s Functional Genomics Center Zurich told BioInform that while provenance was a definite theme of the conference, the idea of collaboration and getting the user involved was also key.
Türker, who leads a 30-person data-integration team at the center, developed a system called B-Fabric that integrates and queries all data generated at FGCZ.
“The idea of a fabric is that you have a number of data sources that you put together, that are linked together,” he said. “What is important is that all these things are done in a transparent way so the user doesn’t have to care about where the data lies, how the data is processed internally.”
He said that users can link into B-Fabric using any web browser. “What’s really important is you have a transparency layer not allowing you to take care of the technical issues. … The user just knows he wants to provide some data, to have to search for some data, and he does not have to take care of where the data lies, and which structures [in which it lies]. He just has a simple database to work with.”
Why the ‘B’? “That’s just for biology,” Türker said.
Semantics, Ontologies, and Annotations — Oh, My!
Another important theme of the conference was the use of semantics to help map and model life science data.
Penn researchers described one approach to meet this need: the Annotation Grammar and Extraction Language (AnGEL), which the group developed to support its Transcription Element Search System (TESS). While AnGEL is not yet available as a standalone system, TESS users will soon be able to download it for their own purposes.
The Penn team said that AnGEL defines grammar rules and can organize annotations into a hierarchical structure with Distributed Annotation System-like attributes. It also includes plug-ins that adapt it to different data-source parsers.