Sarah Cohen-Boulakia, a French native, is now a post-doctoral researcher in the computer and information sciences department at the University of Pennsylvania. Before Penn, she was a PhD student in the LRI bioinformatics group at Paris-South University. She has extensive experience in workflow/dataflow design as well as in integrating and querying biological and biomedical databases.
Along with her husband, Olivier Biton, and Susan Davidson of the University of Pennsylvania, Cohen-Boulakia presented a poster at the 4th annual Data Integration in Life Sciences conference. In addition, Cohen-Boulakia wrote the preface of the proceedings guide that was published and distributed in handbook form prior to the June 27-29, 2007 workshop.
Her group’s subject, querying provenance via scientific workflows with Zoom*UserViews, is viewable at http://db.cis.upenn.edu/research/ and inspired the following e-mail discussion with BioInform.
Please describe the history of your involvement with creating scientific workflow systems.
I have been working on scientific workflows since graduate school. Part of my PhD work was done in the context of the European HKIS project, which aims to develop an integration platform to help oncologists manage and integrate their clinical data with data coming from public sources.
I had to understand how biologists designed their experiments and try to find ways to help them automate experimental processes by representing each experiment as a workflow. Those workflows were then implemented in the HKIS platform, which made it possible for biologists to share their analyses and exchange their expertise among partners.
Since becoming a postdoc in the database group at the University of Pennsylvania, I have been involved in several projects where biologists perform complex scientific analyses, such as the SHARQ and pPOD projects.
Specifically, what is the workflow system created at U-Penn that uses Zoom*UserViews?
Actually, the great thing about Zoom*UserViews is that it is generic: it can be used with several workflow systems that have been developed by other groups (I describe the relationship between Zoom and other workflow systems below). We are thus not developing any new workflow system in our group.
You are adamant about the need for provenance in addressing the piles of data that stream in during bioinformatics research. Why do you feel provenance is currently so important?
There are many application domains which can benefit from a provenance system in science. First, interpreting and understanding the result of a scientific experiment necessitates knowing the provenance (or context) in which the data has been produced.
Second, provenance plays a crucial role in evaluating the quality of the data: By providing the source data and transformations, provenance may help to estimate data quality and data reliability.
Third, provenance can be used to detect errors in data generation.
Fourth, reproducing a local experiment or an experiment described in a research paper may only be possible if precise provenance information is provided. Last but not least, the copyright and ownership of data can be determined using provenance information. Provenance also enables the data to be cited and may help determine liability in case of erroneous data.
Please explain how your system works in tandem with existing workflow systems such as Kepler, myGrid, and Chimera.
Current scientific workflow systems, like myGrid or Kepler, provide files describing the data and steps used to produce a given data product.
Zoom*UserViews aims to use those files and extract from them the information that matters to users, allowing them to focus on the most relevant information first and then get progressively more detail to better understand the result they obtained.
It sounds like Zoom*UserViews is highly interactive. Is that right? Please elaborate on how the user participates.
Yes, it is. Zoom*UserViews allows users to define which steps of the workflow are particularly important to them. It then automatically takes the user's choice into account and hides any data produced by a step considered not relevant. Usually, users first consider only a few steps to be important, and then they want finer results and mark more steps as being of interest. In Zoom*UserViews, they can interactively refine their ‘view’ of the workflow by choosing more and more steps of interest; the system automatically adapts the answers to their provenance queries.
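The interaction described above can be illustrated with a minimal Python sketch. It is not the Zoom*UserViews implementation: the step names, the `INPUTS` table, and the `provenance` function are all hypothetical, and the sketch assumes that a workflow run can be reduced to a table of which steps fed which. It only shows the general idea that the same provenance query returns more detail as the user marks more steps as relevant.

```python
from typing import Dict, List, Set

# Hypothetical workflow run: each step maps to the steps whose output it consumed.
# The step names are illustrative and are not taken from any real system.
INPUTS: Dict[str, List[str]] = {
    "fetch_sequences": [],
    "reformat_fasta": ["fetch_sequences"],
    "align": ["reformat_fasta"],
    "reformat_alignment": ["align"],
    "build_tree": ["reformat_alignment"],
}


def provenance(step: str, relevant: Set[str]) -> List[str]:
    """Return the relevant upstream steps of `step`, in traversal order.

    Steps outside `relevant` are still traversed, but hidden from the answer.
    """
    seen: Set[str] = set()
    answer: List[str] = []
    stack = list(INPUTS[step])
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in relevant:
            answer.append(current)
        stack.extend(INPUTS[current])
    return answer


# A coarse view first, then a refined view with one more step of interest.
coarse = provenance("build_tree", {"align"})
fine = provenance("build_tree", {"align", "fetch_sequences"})
print(coarse)
print(fine)
```

In this sketch the formatting steps never appear in an answer unless the user explicitly adds them to the set of relevant steps.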
What is ‘the Provenance Challenge’? You mention one of them on your site; but how many other ‘challenges’ do you currently face/tout with regard to ZOOM?
The provenance challenge is a workshop:
The idea of the first provenance challenge was to compare the various provenance systems and be able to understand the capabilities of those systems. Basically, the aim was to allow a user to decide which system to use depending on the kind of provenance problem he or she wanted to solve. Seventeen international teams participated in this challenge, including myGrid/Taverna, Kepler, VisTrails, and a lot of other very interesting workflow systems. Each team had to execute a given workflow using its system and show how the system was able to answer a given list of provenance questions, such as ‘How has this result been produced?’ or ‘What is the difference between those two executions?’ Using that framework, it was possible to better highlight the similarities and differences among systems.
As for what are the challenges we face ... I would say that a huge remaining challenge is to be able to compare two experiments and give the user an interpretable result: a result that s/he will be able to exploit/understand. As scientific experiments are very complicated and as there are a lot of alternative ways to perform a given analysis, helping scientists to compare their experiments [represented as workflows] is really a crucial need and a difficult problem to solve.
In your abstract, you say that:
“… There are several reasons why composite step-classes are useful in workflows. First, they can be used to hide complexity and allow users to focus on a higher level of abstraction. For example, users may wish to focus on biologically relevant step-classes and hide step-classes which focus on formatting. Second, composite step-classes can represent authorization levels; users without the appropriate clearance level would not be allowed to see the details of a composite step-class.”
What exactly, then, is a step-class, and how important is this concept to the workflow as a whole? How does it work in everyday practice? Can you provide an example?
A step-class is a module, a task, or a process which will be executed. It is a part of the workflow considered as a unit. A workflow is composed of several step-classes that are chained together. For example, a workflow which aims to build a phylogenetic tree will be composed of several step-classes describing the steps of the experiment: finding genomic sequences, building an alignment of those sequences, and eventually computing a tree. However, some step-classes in a workflow might describe formatting tasks. Those step-classes are not interesting for the user. In Zoom*UserViews, we help users focus on the relevant information by hiding information coming from irrelevant step-classes.
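As a rough illustration of the phylogenetic example, the Python sketch below groups a linear chain of step-classes into composite step-classes, each built around one relevant step. The grouping rule, the function name `composite_view`, and the step names are assumptions made for this illustration; they are not the algorithm used by Zoom*UserViews, and real workflows are generally graphs rather than simple chains.

```python
from typing import List, Set


def composite_view(steps: List[str], relevant: Set[str]) -> List[List[str]]:
    """Partition a linear chain of steps into composite step-classes.

    Each composite ends at a relevant step and absorbs the non-relevant
    steps directly before it. Any trailing non-relevant steps form a
    final group of their own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for step in steps:
        current.append(step)
        if step in relevant:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Hypothetical step names for the tree-building workflow described above.
chain = [
    "find_sequences",
    "format_fasta",
    "build_alignment",
    "format_phylip",
    "compute_tree",
]
view = composite_view(chain, {"find_sequences", "build_alignment", "compute_tree"})
for group in view:
    # Show only the relevant step that closes each composite.
    print(group[-1], "hides", group[:-1])
```

Here the two formatting steps are folded into the composites for alignment and tree computation, so a user looking at the view sees three biologically meaningful units instead of five steps.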
What else is your department working on that could be of interest to readers of BioInform?
The Database group at the University of Pennsylvania is involved in several bioinformatics projects in collaboration with the Children's Hospital of Philadelphia, the Penn Center for Bioinformatics, and several groups of phylogeneticists in the USA. It is one of the very first groups to have worked on the problem of information integration for biological data.
In the past five years, I have also been working with Susan Davidson and Christine Froidevaux at the University of Paris 11 on querying guides for biologists in the BioGuide project: http://bioguide-project.net/
Members of the DB group have also been working on the problem of keeping biological sources up to date in the Orchestra project led by Zack Ives. Other members have been working on designing integration systems for phylogenetic information in the pPOD project under the lead of Val Tannen. In all those projects, provenance takes a special place.