The following summaries do not strictly follow publication order but rather the lineage of the underlying models developed or extended by the respective works. The article summarized last is an exception and is not part of that lineage.

(Zhong et al. 2013) want to improve the modelling accuracy of brain data – as well as its integration from various sources – and, to this end, develop a brain informatics provenance model. The model extends the Open Provenance Model (OPM) (Moreau et al. 2011): it keeps the OPM’s basic elements (Artefact – immutable state, Process – series of actions, Agent – facilitates a process) and adds two elements: (1) an attribute, which describes a characteristic of an artefact, process or agent, and (2) a process-set. These elements are then used to create general frameworks for data provenance and analysis provenance. Specific datasets in turn need corresponding frameworks with specific elements describing them; e.g. a brain analysis provenance describes the analysis performed on a dataset, the tasks performed and their in- and outputs. The model’s feasibility and usability are evaluated with a case study on thinking-centric systematic investigations, which uses specific provenance frameworks.
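
The following is a minimal sketch of how this extended element set could be represented, assuming plain Python dataclasses; the names, fields and example values are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Attribute:          # extension (1): characteristic of an artefact, process or agent
    name: str
    value: str

@dataclass
class Artefact:           # OPM: immutable state
    id: str
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class Agent:              # OPM: facilitates a process
    id: str
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class Process:            # OPM: series of actions, linked to its in-/outputs and agents
    id: str
    used: List[Artefact] = field(default_factory=list)
    generated: List[Artefact] = field(default_factory=list)
    controlled_by: List[Agent] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class ProcessSet:         # extension (2): groups related processes
    id: str
    processes: List[Process] = field(default_factory=list)

# Hypothetical analysis step of a brain informatics dataset
raw = Artefact("fmri-scan-01", [Attribute("modality", "fMRI")])
clean = Artefact("fmri-scan-01-preprocessed")
analyst = Agent("lab-member-a")
step = Process("preprocessing", used=[raw], generated=[clean], controlled_by=[analyst])
pipeline = ProcessSet("analysis-pipeline", [step])
```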

By making the causal dependencies in a provenance model explicit, (Ma et al. 2015) seek to enable features that depend on such information, notably access control. Using the Open Provenance Model (OPM) (Moreau et al. 2011) as a basis, the work builds its enhancements on a list of requirements stating that the model has to (1) be fine-grained (able to capture different levels of detail), (2) provide provenance security, and (3) support various types of provenance queries and views. Data provenance is defined as the documented history of actors, communication, access control and other user preferences leading to a given data object. The entities in this definition are largely mirrored in the proposed provenance model, where actors carry out operations on data and communicate data between different operations. Access-control and granularity policies, directed by actors, are additionally included to enable the desired access control. Causal relationships are modelled by role-specific variants of the OPM relationships ‘was controlled by’, ‘used’ and ‘was generated by’. Each datum is associated with a record of uniquely identified data objects directly used in its creation – forming a directed acyclic graph – whereas data in the environment is not explicitly part of an operation’s input. In practice, provenance records are represented in a relational DBMS. The feasibility of the proposed model is demonstrated by applying it to an actual healthcare use case (Diabetes Quality Improvement Program workflow).
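
As a rough illustration of such a relational representation of a provenance DAG (not the paper’s schema; table and column names, as well as the example records, are assumptions), a lineage query over derivation edges could look as follows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data_object (id TEXT PRIMARY KEY, created_by_op TEXT, actor TEXT);
CREATE TABLE derived_from (        -- one edge of the provenance DAG
    target TEXT REFERENCES data_object(id),
    source TEXT REFERENCES data_object(id),
    PRIMARY KEY (target, source)
);
""")
con.executemany("INSERT INTO data_object VALUES (?, ?, ?)", [
    ("lab-result-7", "measure-hba1c", "lab-technician"),
    ("care-plan-7",  "review-case",   "physician"),
])
con.execute("INSERT INTO derived_from VALUES (?, ?)", ("care-plan-7", "lab-result-7"))

# Recursive query: all data objects a datum directly or indirectly depends on.
ancestors = con.execute("""
WITH RECURSIVE lineage(id) AS (
    SELECT source FROM derived_from WHERE target = ?
    UNION
    SELECT d.source FROM derived_from d JOIN lineage l ON d.target = l.id
)
SELECT id FROM lineage;
""", ("care-plan-7",)).fetchall()
print(ancestors)   # [('lab-result-7',)]
```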

To enable the provenance documentation of complex software systems, (Groth, Miles, and Moreau 2009) propose a process documentation model that should work independently of the number of software parts, institutions and application domains. Building on the authors’ previous work, non-functional requirements for the model are stated, namely factuality, attributability and autonomous creation. In developing the model, an actor-centric view is adopted, where each actor represents certain functionality and can interact with other actors through message passing. The elements necessary to represent a process are contained within the p-assertions introduced by the authors. Actors create p-assertions for events and data directly accessible to them. E.g., a p-assertion for an interaction includes the asserter’s identity (attributability), an event id, and a representation of the message together with a description of its generation. Further p-assertions represent relationships, with cause(s) and effect, for actor-internal data and control flow, as well as internal information, which models message reception and is useful for abstracting away details of data creation. P-assertions can be organized and understood by grouping them by a common event into p-structures. The provenance of an event can then be described by a causal graph created from the p-structure. The model’s feasibility (regarding the stated requirements) is qualitatively evaluated by means of a bioinformatics use case and the query mechanisms of the PReServ provenance store. The authors show that the requirements are satisfied by the design of the p-structure, provided users adhere to the specification.
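
A small sketch of interaction and relationship p-assertions being grouped by a shared interaction key into a p-structure is given below; the dictionary layout and identifiers are assumptions made for illustration, not the authors’ concrete data format.

```python
interaction_passertion = {
    "kind": "interaction",
    "asserter": "actor:blast-service",   # attributability
    "interaction_key": "msg-0042",       # event id
    "message": {"content": "<sequence ...>", "generated_by": "actor:workflow-engine"},
}
relationship_passertion = {
    "kind": "relationship",
    "asserter": "actor:blast-service",
    "interaction_key": "msg-0042",       # grouped with the interaction it refers to
    "effect": "msg-0043",                # output message
    "causes": ["msg-0042"],              # input message(s) it was derived from
    "relation": "produced-from",
}

def build_p_structure(passertions):
    """Group p-assertions by the event (interaction key) they document."""
    p_structure = {}
    for pa in passertions:
        p_structure.setdefault(pa["interaction_key"], []).append(pa)
    return p_structure

p_structure = build_p_structure([interaction_passertion, relationship_passertion])
print(list(p_structure))   # ['msg-0042']
```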

(Miles et al. 2011) present PrIMe, a method that enables the recording of data provenance for existing applications that were possibly designed without provenance in mind. The method should be easy to use, and all of its steps, as well as the recorded data, should derive from use case requirements. The proposed provenance architecture follows a service-oriented style: state is only created by actors, which can communicate by message passing. Aside from these interactions, the main components of the model are relationships between in- and output data, and internal actor state. In applying PrIMe, processes are usually reshaped to conform to the actor model; the main steps are: (1) Determine the relevant data and associated provenance (scope) by means of use cases. (2) Reveal the information flow by decomposing the application into actors and their interactions; find knowledgeable actors with access to provenance information; repeat at finer granularity if there are still inaccessible data items; note actors without data access. (3) Adapt the application to record the provenance and answer the initial use case questions. A bioinformatics use case illustrates the method and demonstrates its feasibility, usability and traceability (back to the use cases). The service-oriented approach is also compared to an aspect-oriented solution (Jacobson and Ng 2004). The authors argue that the assumptions made in their approach are advantageous, as they enable reuse through the causal connections of provenance data.
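
As a hedged illustration of step (3) – adapting an existing application so that actors record the relationship between their in- and outputs – one could wrap existing functions as shown below; the decorator, store and function names are assumptions, not part of PrIMe itself.

```python
import functools
import uuid

PROVENANCE_STORE = []   # stand-in for a provenance store such as PReServ

def records_provenance(actor_name):
    """Wrap an existing function ("actor") so its inputs and output are recorded."""
    def wrap(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            PROVENANCE_STORE.append({
                "actor": actor_name,
                "operation": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output_id": str(uuid.uuid4()),
                "output": result,
            })
            return result
        return wrapper
    return wrap

@records_provenance("actor:sequence-aligner")
def align(sequence: str) -> str:
    return sequence.upper()   # placeholder for the actual application logic

align("acgt")
print(PROVENANCE_STORE[0]["actor"])   # actor:sequence-aligner
```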

As part of the goal to understand the creation of quantified self (QS) data, and to enable trust in it, (Schreiber 2016) proposes a provenance data model (and a corresponding ontology) for such data – to specify what data will be stored. The provenance model is based on the W3C PROV-DM standard as well as on an analysis of the requirements of QS workflows. In order to constrain provenance recording to relevant subprocesses, the author raises 10 questions about a user’s QS data, each concerning one of the basic model constituents (Entity, Agent, Activity). The author extends the PROV-DM model classes by ‘QS-developer’ and ‘Self’ Agents; ‘User’, ‘Device’, ‘Application’ and ‘Service’ Activities; as well as ‘UserData’, ‘Miscellaneous’ and ‘Record’ Entities. One provenance (sub-)model is created for each workflow activity resulting from the requirements analysis: Input, Export, Request (from web services), Aggregation (multiple sources), and Visualization. The model’s feasibility is shown by means of a fitness tracker use case using Fitbit steps data, resulting in a provenance visualization that shows how Fitbit data leads to a graphical representation of the steps the user walked.
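
A small sketch of how the Fitbit use case could be expressed in W3C PROV terms is shown below, assuming the Python ‘prov’ package (PyPI); the ‘ex’ namespace and identifiers are illustrative assumptions, and the model’s extended classes are encoded here simply as prov:type values rather than as the paper’s ontology.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/qs#")

# Entities, agent and activity for the steps-visualization workflow
doc.entity("ex:fitbit-steps", {"prov:type": "ex:UserData"})
doc.entity("ex:steps-chart", {"prov:type": "ex:Record"})
doc.agent("ex:self", {"prov:type": "ex:Self"})
doc.activity("ex:visualization")

# How the chart came to be: it used the Fitbit data and was produced for the user
doc.used("ex:visualization", "ex:fitbit-steps")
doc.wasGeneratedBy("ex:steps-chart", "ex:visualization")
doc.wasAssociatedWith("ex:visualization", "ex:self")
doc.wasDerivedFrom("ex:steps-chart", "ex:fitbit-steps")

print(doc.get_provn())   # PROV-N serialization of the small provenance graph
```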

The goal of this work by researchers at Ghent University’s Data Science Lab and the University of São Paulo’s ICMC (Amanqui et al. 2016) is to develop a conceptual provenance model for (the process of) species identification as a biodiversity research task. The model is required to be interoperable in heterogeneous environments such as the web, and to be evaluated with a use case. As a basis for the model, use case questions (e.g. what was collected, where, how, why, and by whom) were put to five biodiversity scientists in structured interviews. The use cases are then modelled with the W3C PROV model’s Entities, Agents and Activities (nodes in a graph) and the relationships between them (edges in that same graph). To model the naming and renaming of species (by possibly different researchers), the authors extended the W3C PROV model by three Entity subtypes representing the species name at different stages of the identification process. The feasibility of the model was demonstrated by applying it to the genetic identification of an alga species: it was possible to query the implemented datastore for the species’ lineage.
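
A sketch of such a naming chain in PROV terms follows, again assuming the Python ‘prov’ package; the three subtype names (ProvisionalName, RevisedName, AcceptedName), the namespace and all identifiers are placeholders, since the summary does not name the paper’s actual subtypes.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("bio", "http://example.org/biodiv#")

# The species name at different stages of the identification process
doc.entity("bio:name-field", {"prov:type": "bio:ProvisionalName"})
doc.entity("bio:name-lab", {"prov:type": "bio:RevisedName"})
doc.entity("bio:name-accepted", {"prov:type": "bio:AcceptedName"})
doc.activity("bio:genetic-identification")
doc.agent("bio:researcher-1")

# Who did what, and how the accepted name derives from the earlier ones
doc.wasAssociatedWith("bio:genetic-identification", "bio:researcher-1")
doc.used("bio:genetic-identification", "bio:name-field")
doc.wasGeneratedBy("bio:name-accepted", "bio:genetic-identification")
doc.wasDerivedFrom("bio:name-lab", "bio:name-field")
doc.wasDerivedFrom("bio:name-accepted", "bio:name-lab")

print(doc.get_provn())   # the derivation chain can then be queried as the name's lineage
```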

In an interdisciplinary effort, São Paulo researchers (Almeida et al. 2016) from the city’s university and blood centre aimed to improve blood donation data quality with the help of a provenance model. The model should enable finding (1) the donors at risk of iron deficiency anaemia and (2) the donation number/interval associated with an increase of that risk. Relevant donor groups are determined by a step-by-step, question-based selection and distinction process modelling specialist knowledge. Experts helped to select suitable attributes according to usefulness (>80% valid entries), objectivity and donor age (18–69). The final model contains successive inclusion criteria which separate the donors according to their (1) risk of developing anaemia, (2) number of donations and (3) deferral due to low Hct (at first donation). Additional data filters perform pre-processing (cf. above) and normalization, insert (derivative) data for analysis, and enforce the studied time period (1996–2006). Data resulting from the model’s application underwent a descriptive statistical analysis (mean plots, life table estimator, multivariate analysis) to show the resulting groups’ suitability and reliability. Results show the probability of anaemia after donation to be inversely related to previous Hct levels for both men and women; however, young and female donors face a higher and earlier anaemia risk. The feasibility of the model is shown by the above use case, but no comparison is made with existing models, and no formal measure of fitness is given.
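
A rough sketch of the successive filters and inclusion criteria, expressed as simple predicates, is given below; it is not the paper’s implementation. The thresholds mirror the summary (>80% valid entries, donor age 18–69, study period 1996–2006), while the field names and example records are assumptions.

```python
from datetime import date

donations = [
    {"donor_id": 1, "age": 25, "sex": "F", "n_donations": 4,
     "low_hct_at_first_donation": False, "date": date(2001, 5, 3), "valid_fraction": 0.95},
    {"donor_id": 2, "age": 71, "sex": "M", "n_donations": 1,
     "low_hct_at_first_donation": True, "date": date(2009, 1, 9), "valid_fraction": 0.60},
]

def in_scope(rec):
    """Pre-processing filters: attribute quality, donor age and study period."""
    return (rec["valid_fraction"] > 0.80
            and 18 <= rec["age"] <= 69
            and date(1996, 1, 1) <= rec["date"] <= date(2006, 12, 31))

def group(rec):
    """Stand-in for the inclusion criteria separating donors by deferral and donation count."""
    if rec["low_hct_at_first_donation"]:
        return "deferred-low-Hct"
    return "repeat-donor" if rec["n_donations"] > 1 else "first-time-donor"

selected = [dict(rec, group=group(rec)) for rec in donations if in_scope(rec)]
print(selected)   # only donor 1 passes the filters
```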