Introduction

In distributed data storage and big data analysis, it is important for users to be able to trace the persons and processes that originated, used or changed data. Without this information, data can neither be processed or shared in a legally and ethically sound way nor be meaningfully interpreted. In many domains, however, the data's provenance is not sufficiently documented and is thus not available for structuring and querying. By data provenance we mean information about the data's origins, including the entities and transformations involved in the data's derivation up to its current state.

For a more formal definition of data provenance, one could view a datum as a node in a causal Bayesian network (assuming causal data is available; see Pearl 2009 and its context). The directed edges coming from that node's ancestors then correspond to the transformations, modelled as mathematical relations, that led to that node's datum. The transformation edges together with the ancestor data nodes form the datum's provenance. This implies that transformation edges and data nodes contain the metadata necessary for a given application.
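The graph view of provenance described above can be illustrated with a minimal sketch. It assumes that the derivation history is available as a list of (source, transformation, target) triples forming a directed acyclic graph; the function name `provenance` and the example node names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def provenance(edges, datum):
    """Return the ancestor nodes and transformation edges leading to `datum`.

    `edges` is an iterable of (source, transformation, target) triples that
    is assumed to describe a directed acyclic graph.
    """
    # Index the edges by their target so that we can walk backwards.
    incoming = defaultdict(list)
    for source, transformation, target in edges:
        incoming[target].append((source, transformation, target))

    ancestors, used_edges = set(), set()
    stack = [datum]
    while stack:
        node = stack.pop()
        for edge in incoming[node]:
            used_edges.add(edge)
            source = edge[0]
            if source not in ancestors:
                ancestors.add(source)
                stack.append(source)
    return ancestors, used_edges


# Hypothetical example: a lab value is converted, then aggregated into a report.
edges = [
    ("raw_glucose", "unit_conversion", "clean_glucose"),
    ("clean_glucose", "aggregation", "report"),
    ("patient_record", "aggregation", "report"),
    ("report", "export", "registry_entry"),
]
ancestors, used_edges = provenance(edges, "report")
```

In this sketch the provenance of `report` consists of its three ancestor data nodes and the three edges leading to it; the later `export` step is not part of it because it lies downstream of the datum.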

Data provenance has been identified as a key feature in several data-related activities, including sharing and reuse. Recognising and documenting data provenance is required to adequately address data security issues and to ensure privacy. Information on data provenance is also a key element of data quality assessment and has been used successfully to automatically find causes of defects in big data computing (Gulzar et al. 2017).

For these reasons, the importance of data provenance has been noted in different domains of medical routine and research data use.

Note that actual data provenance documentation is closely related to, but different from, provenance as a concept: actual documentation can never capture all causes of a datum and reflects the priorities of a particular use case.

While several models, like the Open Provenance and PROV models, have been described for the documentation of data provenance, existing provenance models have not, to our knowledge, been systematically collected and analysed for structural similarities.

Because various actors exchange data in a medical setting, interoperability, and therefore standardisation, is highly desirable. Detailed provenance of medical data is not yet routinely recorded, let alone in a standardised form. One study, for instance, found that a lack of glucose test data provenance led to delays in the diagnosis of diabetes (McGovern et al. 2017).

The objective of this work is to systematically collect models of data provenance developed for biomedical applications.

We do not claim that this objective is particularly difficult, but we expect it to be a small yet useful step towards widespread and interoperable data provenance systems in medical applications. Such systems pose the same challenges as other big data tasks. Especially pronounced in medicine are the continuing growth of (mostly unstructured) data, the tension between accessibility and privacy, and an advanced division of labour with the accompanying diversity and distribution of data systems and users. While a requirements analysis and evaluation are beyond the scope of this work, a data provenance model should allow efficient implementations under these circumstances, in addition to being general and adaptable enough to serve the users and applications mentioned above.

Provenance modelling, tracking, querying and other provenance applications are all within the review’s initial scope, yet we confine further analysis here to the references dealing primarily with provenance modelling aspects.

While this work was done as a master’s project, it also took place in the context of the Medical Informatics Initiative of Germany’s Federal Ministry of Education and Research (BMBF), specifically the interoperability working group and the MIRACUM consortium.

A detailed description of the reviewing process, with inclusion and exclusion criteria, is provided in the following section.

Results

Following PRISMA, this review’s article inclusion and exclusion process is depicted as a PRISMA diagram:

PRISMA diagram showing article search, merging, retrieval, as well as inclusion/exclusion steps.

Starting with the articles resulting from searches within the Web of Science and OvidSP databases (query: provenance AND medic*; resulting CITAVI database files: Web of Science, OvidSP, merged), the diagram shows each inclusion and exclusion step as an oval. The numbers of resulting articles are shown in rectangles. The final number of 16 included references comprises articles dealing primarily with provenance modelling (7 articles; 2 further articles, a brief review and a very brief letter-style article, were excluded during data extraction), tracking (6) and querying (3, including one article recategorised from modelling after data extraction). The 58 references excluded after full-text screening include books (4) and articles primarily concerned with datasets (1), data analysis (3), data quality (7), e-health security (5), e-health systems (5) and various applications not strictly focused on provenance tracking and querying (33).
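As a quick consistency check, the category counts reported above can be tallied against the stated totals. The snippet below is only a sketch: the numbers are copied from the text, while the dictionary keys are shorthand labels introduced here for illustration.

```python
# Counts as reported in the text above; the keys are shorthand labels.
included = {"modelling": 7, "tracking": 6, "querying": 3}
excluded_full_text = {
    "books": 4,
    "datasets": 1,
    "data analysis": 3,
    "data quality": 7,
    "e-health security": 5,
    "e-health systems": 5,
    "other applications": 33,
}

included_total = sum(included.values())
excluded_total = sum(excluded_full_text.values())

# Both sums should match the totals stated in the text (16 and 58).
totals_match = (included_total, excluded_total) == (16, 58)
```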

Raw- and summary data-sheets

Raw data extraction was performed for all 16 included articles (raw data-sheet).

For the 7 articles focused on provenance modelling, a summary data-sheet was created.

The tabular summary data-sheet should enable the reader to quickly gain an overview of each work as well as the ability to easily compare individual aspects or whole articles.

Because the model presentations used in the original articles often differ considerably from each other and use nonstandard diagrams, we found it necessary to unify, and sometimes abstract, them in a final table focused solely on the models themselves. To this end we created a standardised representation for each base model, using the textual and graphical information in the summary table.

Models data-sheet

The following models data-sheet is based on the summary data-sheet linked above and contains only the 7 references dealing mostly with provenance modelling. While the summary sheet gives an overview of each whole article, the models data-sheet allows a focused look at the models used in each work.

The content of the models data-sheet stems from the following sources in the summary data-sheet, and may have been reworded and clarified. The provenance definition comes from the methods column of the summary table; the model types and representations, as well as their relationships and entities, mainly come from the results and methods columns. We created UML diagrams based on figures or text in the articles and abstracted them where necessary.

In addition to the article reference, each row contains a provenance definition from the article, because it lays the groundwork for the models that follow. A unified visual representation of the model or its key aspect shows the model elements and their interaction, and allows comparison between models. The types of models and the representations used in the article are stated explicitly. Finally, the base model elements and their relationships are listed (details included in the full, formal model specifications are omitted for brevity and clarity). This information should give a good overview of the models devised and used in the reviewed works. Listing model aspects separately also facilitates comparison between the approaches.

Zhong et al. 2013 (textual summary)

Base model diagram: model summary of Zhong et al. (2013), who used and extended the Open Provenance Model (OPM).

Provenance definition: According to the authors, “provenance information describes the origins and the history of data in its life cycle”.
Model representations: Acyclic graph.
Objects and dependencies: Objects and dependency relationships correspond to (parts of) the OPM, where objects are (instances of sub-classes of) artifacts, agents, or processes. Additionally, each of the main object instances can have attributes, and processes can be combined into process sets. The OPM’s dependency relationships are: Control (process by agent); Generation (artifact by process); Trigger (process by process); Derivation (artifact from artifact); Use (artifact by process). Additionally, there can be (unnamed) dependencies of agents, artifacts and processes on their attributes.
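The typed dependencies listed for the OPM can be read as constraints on which kinds of objects an edge may connect. The following sketch encodes the five dependencies named above as (dependent type, depended-on type) pairs and checks candidate edges against them. The names `OPM_DEPENDENCIES` and `is_valid_edge`, as well as the example objects, are assumptions made for illustration and are not taken from the reviewed articles or from an OPM implementation.

```python
# Each dependency is read as "the first type depends on the second type".
OPM_DEPENDENCIES = {
    "Control": ("process", "agent"),        # process controlled by agent
    "Generation": ("artifact", "process"),  # artifact generated by process
    "Trigger": ("process", "process"),      # process triggered by process
    "Derivation": ("artifact", "artifact"), # artifact derived from artifact
    "Use": ("process", "artifact"),         # artifact used by process
}


def is_valid_edge(object_types, dependency, dependent, depended_on):
    """Check that an edge connects objects of the types its dependency expects."""
    expected = OPM_DEPENDENCIES.get(dependency)
    if expected is None:
        return False
    actual = (object_types.get(dependent), object_types.get(depended_on))
    return actual == expected


# Hypothetical objects from an imaging workflow.
object_types = {
    "mri_scan": "artifact",
    "brain_mask": "artifact",
    "segmentation": "process",
    "researcher": "agent",
}
use_ok = is_valid_edge(object_types, "Use", "segmentation", "mri_scan")
control_ok = is_valid_edge(object_types, "Control", "mri_scan", "researcher")
```

Here the Use edge is accepted, whereas the Control edge is rejected because an artifact, rather than a process, appears in the controlled position.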

Ma et al. 2015 (textual summary)

Base model diagram: model summary of Ma et al. (2015), whose model can be represented using and extending the Open Provenance Model (OPM).

Provenance definition: The authors describe the provenance of a data object as the documented history of the actors, communication, environment, access control and other user preferences that led to that data object.
Model representations: Directed acyclic graph (DAG).
Objects and dependencies: Objects and dependency relationships correspond to (parts of) the OPM, with additional access-control and granularity policies. Objects are (instances of sub-classes of) artifacts, agents, or processes. The dependency relationships are: Control (process by agent); Generation (artifact by process); Trigger (process by process). Additionally, there can be (unnamed) dependencies of access-control and granularity policies on processes or agents.

Groth, Miles, and Moreau 2009 (textual summary)

Base model diagram: model summary of Groth, Miles, and Moreau (2009), which may be seen as groundwork for PROV.

Provenance definition: The authors define the provenance of a result as the process which led to that result.
Model representations: Directed acyclic graph (DAG). The nodes in a provenance-representing graph are occurrences (events and data at an event) in the role of causes, effects or both. The (hyper-)edges in such a graph represent the causal connections.
Objects and dependencies: Interactions between actors (with internal state) by sending and receiving messages; causal relationships between incoming and outgoing message data.

Miles et al. 2011 (textual summary)

Base model diagram: model summary of Miles et al. (2011), which may be seen as groundwork for PROV.

Provenance definition: The authors describe the provenance of a data item as the process that led to that item.
Model representations: Actor model, leading to a process documentation model (of the same application). The recorded process documentation shows the application’s execution as a directed acyclic graph (DAG) of the data’s causal dependencies.
Objects and dependencies: Interactions between actors (with internal state) by sending and receiving messages; causal relationships between incoming and outgoing message data.

Schreiber 2016 (textual summary)

Base data model: model summary of Schreiber (2016), who used the then-current version of PROV.

Provenance definition: The author cites the W3C’s definition of provenance as “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing [..]”.
Model representations: Directed acyclic graph (DAG).
Objects and dependencies: The same as in W3C’s PROV model. Objects are entities, agents, or activities. The dependency relationships are: Acting on behalf of someone (agent on behalf of agent); Attribution (entity to agent); Association (activity with agent); Derivation (entity from entity); Use (entity by activity); Generation (entity by activity); Information (activity by activity).

Amanqui et al. 2016 (textual summary)

Base model diagram: model summary of Amanqui et al. (2016), who used and extended the then-current version of PROV.

Provenance definition: The authors describe provenance in the context of species identification as the history of the species, meaning the process of identification, which typically involves different persons who may be far apart in space and time.
Model representations: Directed acyclic graph (DAG).
Objects and dependencies: Mostly the same as in W3C’s PROV model. Objects are entities (including custom sub-classes), agents, or activities. The dependency relationships are: Acting on behalf of someone (agent on behalf of agent); Attribution (entity to agent); Association (activity with agent); Derivation (entity from entity); Use (entity by activity); Generation (entity by activity); Information (activity by activity).

Almeida et al. 2016 (textual summary)

Base model diagram: model summary of Almeida et al. (2016), whose concrete model may be abstracted and represented with PROV (see base model diagram).

Provenance definition: The authors mention a definition of provenance as documentation of the history of data, including each transformation step.
Model representations: To create study groups for statistical analyses, inclusion criteria and transformation processes were applied to the input data. Provenance information could be obtained for the intermediate and final data. Workflow execution is represented as a directed acyclic graph (DAG). The work shows the creation of a specific model instance for a particular purpose rather than a framework or meta-model, but the approach seems sufficiently general to be applied to similar problems at other institutions.
Objects and dependencies: Transformation processes, inclusion criteria, data (input, transitional, output); relationships of consumption and generation.
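The consumption and generation relationships in such a workflow can be illustrated with a small sketch. It does not reproduce the authors' system: the helper `run_step`, the record fields and the donor data are hypothetical, and the sketch only shows how applying an inclusion criterion and a transformation can leave behind a log of what each step consumed and generated.

```python
def run_step(name, func, inputs, data, log):
    """Apply `func` to the named inputs, store the output and log provenance."""
    output_name = f"{name}_out"
    data[output_name] = func(*(data[key] for key in inputs))
    log.append({"process": name, "consumed": list(inputs), "generated": output_name})
    return output_name


# Hypothetical input data: blood donors with age and haemoglobin value.
data = {
    "donors": [
        {"id": 1, "age": 17, "hb": 13.0},
        {"id": 2, "age": 30, "hb": 14.0},
        {"id": 3, "age": 45, "hb": 15.0},
    ]
}
log = []

# Inclusion criterion: keep adult donors only.
adults = run_step(
    "include_adults",
    lambda rows: [row for row in rows if row["age"] >= 18],
    ["donors"], data, log,
)
# Transformation: mean haemoglobin of the study group.
mean_hb = run_step(
    "mean_hb",
    lambda rows: sum(row["hb"] for row in rows) / len(rows),
    [adults], data, log,
)
```

After both steps, `log` holds one record per process, so that the intermediate study group and the final statistic can each be traced back to the data they were derived from.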

The standardised models data-sheet enables the conclusion that all but one of the reviewed models are part of the same lineage. We describe the models in the order of that lineage.

The first two modelling works, Zhong et al. (2013) and Ma et al. (2015), extend the Open Provenance Model (OPM; Moreau et al. 2011), the lineage’s starting point for our purposes. While the extension is straightforward for Zhong et al. (2013), the base model diagram for Ma et al. (2015) is more complicated because the authors did not directly base their model on the OPM and used their own terminology. The diagram illustrates that the model can be represented as an extension of the OPM. In all diagrams, extensions to the reference base model are shown in blue.

The next group of modelling articles contains Groth, Miles, and Moreau (2009) and Miles et al. (2011). They are special because they fall in between the OPM and its successor, PROV, and form part of PROV’s development work. As such, these works are simultaneously the most detailed and the most conceptual of our modelling review. While the OPM’s Agent/Actor and Process types are readily recognised, the Data type is more hidden in this transitional model. Since the model incorporates message passing between actors (cf. Hewitt, Bishop, and Steiger 1973) as a central concept, the data is implicit in the messages, with the possibility of making it explicit again as internal actor state. On the other hand, the transformative relationship between incoming and outgoing messages (data) is more explicit in this model diagram than in those of both OPM and PROV, where it is hidden as the combination of the Used and GeneratedBy dependencies. The final aspect emphasised by this group of models is the relationship between causes and effects, an important aspect in the development of the PROV model but perhaps more implicit than explicit in its use so far.

We end our overview of the models data-sheet with the group of models based on PROV, namely Schreiber (2016) and Amanqui et al. (2016). The former uses PROV in its plain form; the latter extends it with domain-specific sub-types of the Entity type. We also group the model used by Almeida et al. (2016) into this category. Even though it is not explicitly based on PROV, the elements of this specific model can be abstracted and then represented by the more general PROV base model, as done in our diagram.

The similarity between the OPM and PROV base models should be apparent, as their three main types Agent, Artifact/Entity, Process/Activity directly correspond, and the depicted dependencies of PROV are a super-set of those of OPM if we equate TriggeredBy with WasInformedBy and ControlledBy with WasAssociatedWith. PROV adds a recursive dependency for the Agent type as well as a dependency of Entity on Agent.
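The correspondence described above can be spelled out as a small sketch. It assumes the dependency signatures shown in the models data-sheet, written as (dependent type, depended-on type), and the two equations of names stated in the text; the dictionary names and the exact spelling of the relation names are choices made for this illustration rather than a normative encoding of either specification.

```python
# Type correspondence between the OPM and PROV base models.
TYPE_MAP = {"Agent": "Agent", "Artifact": "Entity", "Process": "Activity"}

# Relation correspondence; only the last two names differ.
RELATION_MAP = {
    "Used": "Used",
    "WasGeneratedBy": "WasGeneratedBy",
    "WasDerivedFrom": "WasDerivedFrom",
    "WasTriggeredBy": "WasInformedBy",
    "WasControlledBy": "WasAssociatedWith",
}

OPM = {
    "Used": ("Process", "Artifact"),
    "WasGeneratedBy": ("Artifact", "Process"),
    "WasDerivedFrom": ("Artifact", "Artifact"),
    "WasTriggeredBy": ("Process", "Process"),
    "WasControlledBy": ("Process", "Agent"),
}

PROV = {
    "Used": ("Activity", "Entity"),
    "WasGeneratedBy": ("Entity", "Activity"),
    "WasDerivedFrom": ("Entity", "Entity"),
    "WasInformedBy": ("Activity", "Activity"),
    "WasAssociatedWith": ("Activity", "Agent"),
    "WasAttributedTo": ("Entity", "Agent"),
    "ActedOnBehalfOf": ("Agent", "Agent"),
}

# Translate every OPM dependency into PROV terms.
translated = {
    RELATION_MAP[name]: tuple(TYPE_MAP[t] for t in signature)
    for name, signature in OPM.items()
}

# Every translated OPM dependency should appear unchanged in PROV.
opm_is_subset = all(PROV.get(name) == sig for name, sig in translated.items())
# Dependencies that exist only in PROV.
prov_only = sorted(set(PROV) - set(translated))
```

Under these assumptions, the translated OPM dependencies form a subset of the PROV dependencies, and the two remaining PROV dependencies are the Agent-to-Agent and Entity-to-Agent relations mentioned above.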

In summary, we showed (not strictly formally) that all reviewed base models used for medical applications can be represented in terms of the base PROV model. The models are then part of a model lineage leading from the OPM to PROV, with transitional models in between.

Since the reviewed models were designed or extended for, and applied in, actual biomedical applications, we conclude that there exist general and standardised provenance models for this use case. This does not necessarily mean that all biomedical use cases are covered by PROV, the most general model we found. Even though its extensibility was shown on a use-case basis in the reviewed works, each new and specific use case should perform its own requirements analysis and evaluation, preferably including a quantitative one.

Discussion

When looking at the figures shown in the raw-data and summary tables, the likely impression is one of heterogeneity, apart from the directed acyclic graph that underlies all but one of the model representations. The differences lie in the specific application, the (visual) structure and even the terminology used to describe the models.

However, this impression of heterogeneity proves misleading on further examination. A closer look already reveals elements that are shared by most models: entities, agents (or actors), activities and the relations between them. This similarity is not accidental. Two of the modelling articles base their work on the Open Provenance Model (OPM; Moreau et al. 2011): Zhong et al. (2013) and Ma et al. (2015). The OPM was first released in 2007 and served as a basis for developing PROV.

Groth, Miles, and Moreau (2009) and Miles et al. (2011) also appear to lead up to the development of the PROV data model. Not only do they cover aspects of said model, whose proposed documentation status in 2013 they precede, but the authors common to both papers are all contributors to the W3C model.

Next, Schreiber (2016) and Amanqui et al. (2016) directly use and extend the PROV data model.

It emerges that six out of the seven provenance modelling articles considered in the review base their models on the same lineage of models, which are all part of the provenance of PROV. The remaining work is Almeida et al. (2016), which aims to create a model instance for a particular purpose and does not use the existing OPM or PROV data model, but can be represented using PROV as shown above.

Since most models share a common basis, there are also similarities in the development process. Mostly, model design is informed by a requirements analysis, often in the form of use cases (also called provenance questions), and evaluation is done with respect to those requirements, albeit largely in a qualitative fashion that shows little more than the model’s feasibility.

Summary

In this scoping review, we captured works which examine the modelling of data provenance in the context of biomedical applications. While provenance modelling, tracking, querying and other provenance applications are all within the review’s initial scope, we confined further analysis in this work to the references dealing primarily with provenance modelling aspects. To our knowledge, this is the first review on this topic to use systematic literature retrieval methods, which aim to minimise the risk of bias.

Despite their heterogeneous presentation, involving non-standard diagrams and different levels of abstraction, we could reduce all base models found for biomedical applications to the PROV data model.

Our result follows from traceable, systematic search, simplification, and standardisation steps. We depicted the biomedical provenance models found through systematic literature search in a standard representation, simplified and abstracted where possible, without changing the underlying base model. The final representation enables comparisons that would hardly be comprehensible using the original data extracted from the reviewed articles.

References

Almeida, Fernanda Nascimento, Gisela Tunes, Julio Cezar Brettas da Costa, Ester Cerdeira Sabino, Alfredo Mendrone-Junior, and Joao Eduardo Ferreira. 2016. ‘A Provenance Model Based on Declarative Specifications for Intensive Data Analyses in Hemotherapy Information Systems’. Future Generation Computer Systems: The International Journal of eScience 59: 105–113. https://doi.org/10.1016/j.future.2015.09.019.
Amanqui, Flor K., Tom de Nies, Anastasia Dimou, Ruben Verborgh, Erik Mannens, Rik van de Walle, and Dilvan Moreira. 2016. ‘A Model of Provenance Applied to Biodiversity Datasets’. 2016 IEEE 25th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 235–240. https://doi.org/10.1109/WETICE.2016.59.
Groth, Paul, Simon Miles, and Luc Moreau. 2009. ‘A Model of Process Documentation to Determine Provenance in Mash-Ups’. ACM Transactions on Internet Technology 9 (1): 3. https://doi.org/10.1145/1462159.1462162.
Ma, Taotao, Hua Wang, Jianming Yong, and Yueai Zhao. 2015. ‘Causal Dependencies of Provenance Data in Healthcare Environment’. Proceedings of the 2015 IEEE 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 643–648. https://doi.org/10.1109/CSCWD.2015.7231033.
Miles, Simon, Paul Groth, Steve Munroe, and Luc Moreau. 2011. ‘PrIMe: A Methodology for Developing Provenance-Aware Applications’. ACM Transactions on Software Engineering and Methodology 20 (3): 8. https://doi.org/10.1145/2000791.2000792.
Peroni, Silvio, Francesco Osborne, Angelo Di Iorio, Andrea Giovanni Nuzzolese, Francesco Poggi, Fabio Vitali, and Enrico Motta. 2017. ‘Research Articles in Simplified HTML: A Web-First Format for HTML-Based Scholarly Articles’. PeerJ Computer Science 3 (October): e132. https://doi.org/10.7717/peerj-cs.132.
Ruiz-Olazar, Margarita, Evandro S. Rocha, Sueli S. Rabaca, Carlos Eduardo Ribas, Amanda S. Nascimento, and Kelly R. Braghetto. 2016. ‘A Review of Guidelines and Models for Representation of Provenance Information from Neuroscience Experiments’. https://doi.org/10.1007/978-3-319-40593-3_26.
Schreiber, Andreas. 2016. ‘A Provenance Model for Quantified Self Data’. https://doi.org/10.1007/978-3-319-40250-5_37.
Yildiz, Ustun, Khalid Belhajjame, and Daniela Grigori. 2015. ‘Modeling Evidence-Based Medicine Applications with Provenance Data in Pathways’. https://doi.org/10.4108/icst.pervasivehealth.2015.260251.
Zhong, Han, Jianhui Chen, Taihei Kotake, Jian Han, Ning Zhong, and Zhisheng Huang. 2013. ‘Developing a Brain Informatics Provenance Model’. https://doi.org/10.1007/978-3-319-02753-1_44.
Duke, Clifford S., and John H. Porter. 2013. ‘The Ethics of Data Sharing and Reuse in Biology’. BioScience 63 (6): 483–89. https://doi.org/10.1525/bio.2013.63.6.10.
Vayena, Effy, Marcel Salathé, Lawrence C. Madoff, and John S. Brownstein. 2015. ‘Ethical Challenges of Big Data in Public Health’. PLOS Computational Biology 11 (2): e1003904. https://doi.org/10.1371/journal.pcbi.1003904.
Simmhan, Yogesh L., Beth Plale, and Dennis Gannon. 2005. ‘A Survey of Data Provenance in E-Science’. SIGMOD Rec. 34 (3): 31–36. https://doi.org/10.1145/1084805.1084812.
Madnick, Stuart E., Richard Y. Wang, Yang W. Lee, and Hongwei Zhu. 2009. ‘Overview and Framework for Data and Information Quality Research’. J. Data and Information Quality 1 (1): 2:1–2:22. https://doi.org/10.1145/1515693.1516680.
Curcin, Vasa, Elliot Fairweather, Roxana Danger, and Derek Corrigan. 2017. ‘Templates as a Method for Implementing Data Provenance in Decision Support Systems’. Journal of Biomedical Informatics 65 (January): 1–21. https://doi.org/10.1016/j.jbi.2016.10.022.
Curcin, Vasa. 2017. ‘Embedding Data Provenance into the Learning Health System to Facilitate Reproducible Research’. Learning Health Systems 1 (2): n/a-n/a. https://doi.org/10.1002/lrh2.10019.
Wiles, Louise K., Peter D. Hibbert, Jacqueline H. Stephens, Enrico Coiera, Johanna Westbrook, Jeffrey Braithwaite, Ric O. Day, Ken M. Hillman, and William B. Runciman. 2017. ‘STANDING Collaboration: A Study Protocol for Developing Clinical Standards’. BMJ Open 7 (10): e014048. https://doi.org/10.1136/bmjopen-2016-014048.
Shang, Ning, Chunhua Weng, and George Hripcsak. 2017. ‘A Conceptual Framework for Evaluating Data Suitability for Observational Studies’. Journal of the American Medical Informatics Association. https://doi.org/10.1093/jamia/ocx095.
Wojno, Kirk, John Hornberger, Paul Schellhammer, Minghan Dai, and Travis Morgan. 2015. ‘The Clinical and Economic Implications of Specimen Provenance Complications in Diagnostic Prostate Biopsies’. The Journal of Urology 193 (4): 1170–77. https://doi.org/10.1016/j.juro.2014.11.019.
‘PROV-DM: The PROV Data Model’. n.d. Accessed 28 August 2017. https://www.w3.org/TR/2013/REC-prov-dm-20130430/.
Moreau, Luc, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, et al. 2011. ‘The Open Provenance Model Core Specification (v1.1)’. Future Generation Computer Systems 27 (6): 743–56. https://doi.org/10.1016/j.future.2010.07.005.
‘Shared Data, Shared Benefits | Medical Informatics Initiative’. http://www.medizininformatik-initiative.de/en/start.
‘MIRACUM – Medical Informatics in Research and Medicine’. http://www.miracum.org/.
Hewitt, Carl, Peter Bishop, and Richard Steiger. 1973. ‘A Universal Modular ACTOR Formalism for Artificial Intelligence’. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence, 235–245. IJCAI’73. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1624775.1624804.
Liberati, Alessandro, Douglas G. Altman, Jennifer Tetzlaff, Cynthia Mulrow, Peter C. Gøtzsche, John P. A. Ioannidis, Mike Clarke, P. J. Devereaux, Jos Kleijnen, and David Moher. 2009. ‘The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Health Care Interventions: Explanation and Elaboration’. PLOS Medicine 6 (7): e1000100. https://doi.org/10.1371/journal.pmed.1000100.
Moher, David, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and The PRISMA Group. 2009. ‘Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement’. PLoS Med 6 (7): e1000097. https://doi.org/10.1371/journal.pmed.1000097.
Lefebvre, Carol, Eric Manheimer, and Julie Glanville. 2008. ‘Searching for Studies’. In Cochrane Handbook for Systematic Reviews of Interventions, 95–150. Wiley-Blackwell. https://doi.org/10.1002/9780470712184.ch6.
Pearl, Judea. 2009. ‘Causal Bayesian Networks’ in ‘Causality by Judea Pearl’, 21–26. Cambridge Core. September 2009. https://doi.org/10.1017/CBO9780511803161.
Gulzar, Muhammad Ali, Matteo Interlandi, Xueyuan Han, Mingda Li, Tyson Condie, and Miryung Kim. 2017. ‘Automated Debugging in Data-Intensive Scalable Computing’. In Proceedings of the 2017 Symposium on Cloud Computing, 520–534. SoCC ’17. New York, NY, USA: ACM. https://doi.org/10.1145/3127479.3131624.
McGovern, A. P., H. Fieldhouse, Z. Tippu, S. Jones, N. Munro, and S. de Lusignan. 2017. ‘Glucose Test Provenance Recording in UK Primary Care: Was That Fasted or Random?’ Diabetic Medicine 34 (1): 93–98. https://doi.org/10.1111/dme.13067.
Zhao, Jun, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, and Mark Greenwood. 2004. ‘Using Semantic Web Technologies for Representing E-Science Provenance’. In The Semantic Web – ISWC 2004, 92–106. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30475-3_8.
Auffray, Charles, Rudi Balling, Inês Barroso, László Bencze, Mikael Benson, Jay Bergeron, Enrique Bernal-Delgado, et al. 2016. ‘Making Sense of Big Data in Health Research: Towards an EU Action Plan’. Genome Medicine 8 (1): 71. https://doi.org/10.1186/s13073-016-0323-y.
