W3C

PROV-DM Part 1: The Provenance Data Model

Working Draft WD4 (internal release)

W3C Editor's Draft 09 March 2012

This version:
http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-dm.html
Latest published version:
http://www.w3.org/TR/prov-dm/
Latest editor's draft:
http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-dm.html
Previous version:
http://www.w3.org/TR/2012/WD-prov-dm-20120202/
Editors:
Luc Moreau, University of Southampton
Paolo Missier, Newcastle University
Contributors:
Khalid Belhajjame, University of Manchester
Stephen Cresswell, legislation.gov.uk
Yolanda Gil, Invited Expert
Reza B'Far, Oracle Corporation
Paul Groth, VU University of Amsterdam
Graham Klyne, University of Oxford
Jim McCusker, Rensselaer Polytechnic Institute
Simon Miles, Invited Expert
James Myers, Rensselaer Polytechnic Institute
Satya Sahoo, Case Western Reserve University

Abstract

PROV-DM is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but is equipped with extensibility points allowing further domain-specific and application-specific extensions to be defined. PROV-DM is accompanied by PROV-N, a technology-independent notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates the mapping of PROV-DM to concrete syntax, and which is used as the basis for a formal semantics of PROV-DM.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is released internally by the Provenance Working Group.
This document is part of the PROV family of specifications, a set of specifications aiming to define the various aspects that are necessary to achieve the vision of interoperable interchange of provenance information in heterogeneous environments such as the Web. This document defines the PROV-DM data model for provenance, accompanied by a notation to express instances of that data model for human consumption. Other documents are:

This document was published by the Provenance Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.

Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction

For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, where users find information that is often contradictory or questionable, provenance can help those users to make trust judgements.

The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain-specific and application-specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.

Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.

A set of specifications, referred to as the PROV family of specifications, define the various aspects that are necessary to achieve this vision in an interoperable way:

The PROV-DM data model for provenance consists of a set of core concepts, and a few common relations, based on these core concepts. PROV-DM is a domain-agnostic model, but with clear extensibility points allowing further domain-specific and application-specific extensions to be defined.

This specification intentionally presents the key concepts of the PROV Data Model, without drilling down into all its subtleties. Using these key concepts, it becomes possible to write useful provenance assertions very quickly, and publish or embed them alongside the data they relate to.

However, if data changes, it is challenging to express its provenance precisely, as it would be for any other form of metadata. To address this challenge, an upgrade path is proposed that enriches simple provenance with extra descriptions, in the form of attributes and intervals, that help qualify the specific subject of provenance and the provenance itself, and that are intended to satisfy a comprehensive set of constraints. These aspects are covered in the companion specification [PROV-DM-CONSTRAINTS].

1.1 Structure of this Document

Section 2 provides an overview of PROV-DM listing its core types and their relations.

In section 3, PROV-DM is applied to a short scenario, encoded in PROV-N, and illustrated graphically.

Section 4 provides the definition of PROV-DM constructs.

Section 5 introduces further relations offered by PROV-DM, including relations for data collections and domain-independent common relations.

Section 6 summarizes PROV-DM extensibility points.

Section 7 introduces constraints that can be applied to the PROV data model and that are covered in [PROV-DM-CONSTRAINTS].

1.2 PROV-DM Namespace

The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).

All the elements, relations, reserved names and attributes introduced in this specification belong to the PROV-DM namespace.

There is a desire to use a single namespace that all specifications of the PROV family can share to refer to common provenance terms. This is ISSUE-224.

1.3 Conventions

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].

2. Overview

This section provides an overview of the main concepts found in the PROV data model.

2.1 Entity, Activity, Agent

PROV-DM is a data model for describing the provenance of Entities, that is, of things in the world. The term "things" encompasses a broad diversity of concepts, including digital objects such as a file or web page, physical things such as a building, a printed book, or a car, as well as abstract concepts and ideas. One can regard any Web resource as an example of Entity in this context.

Entities are things in the world one wants to provide provenance for. For the purpose of this specification, things can be physical, digital, conceptual, or otherwise; the world may be real or imaginary.

An entity may be the document at URI http://www.w3.org/TR/prov-dm/, a file in a file system, a car or an idea.

An activity is anything that acts upon or with entities. This action can take multiple forms: consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities. Activities that operate on digital entities may for example move, copy, or duplicate them.

An activity may be the publishing of a document on the web, sending a twitter message, extracting metadata embedded in a file, driving a car from Boston to Cambridge, assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, or editing a file.

An agent is a type of entity that bears some form of responsibility for an activity taking place.

The motivation for introducing agents in the model is to denote the agent's responsibility for activities. The definition of agent intentionally stays away from using concepts such as enabling, causing, initiating, affecting, etc., because many entities also enable, cause, initiate, and affect in some way the activities. Concepts such as initiating are themselves defined as relations between agents and activities. So the notion of having some degree of responsibility is really what makes an agent.

An agent is a particular type of Entity. This means that the model can be used to express provenance of the agents themselves.

Software for checking the use of grammar in a document may be defined as an agent of a document preparation activity, and at the same time one can describe its provenance, including for instance the vendor and the version history.

2.2 Generation, Usage, Derivation

Activities and entities are associated with each other in two different ways: activities are consumers of entities and activities are producers of entities. The act of producing or consuming an entity may have a duration. The term 'generation' refers to the completion of the act of producing; likewise, the term 'usage' refers to the beginning of the act of consuming entities. Thus, we define the following notions of generation and usage.

Generation is the completed production of a new entity by an activity. This entity becomes available for usage after this generation. This entity did not exist before generation.

Usage is the beginning of an entity being consumed by an activity. Before usage, the activity had not begun to consume or use this entity and could not have been affected by the entity.

Examples of generation are the completed creation of a file by a program, the completed creation of a linked data set, and the completed publication of a new version of a document.

Usage examples include a procedure beginning to consume an argument, a service starting to read a value on a port, a program beginning to read a configuration file, or the point at which an ingredient, such as eggs, is being added in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix); alternatively, a same entity may be used multiple times, possibly by different activities (e.g. a file on a file system can be read indefinitely).

Activities are consumers of entities and producers of entities. In some cases, the consumption of an entity influences the creation of another in some way. This notion is captured by derivations, defined as follows.

A derivation is a transformation of an entity into another, a construction of an entity into another, or an update of an entity, resulting in a new one.

Examples of derivation include the transformation of a relational table into a linked data set, the transformation of a canvas into a painting, the transportation of a work of art from London to New York, and a physical transformation such as the melting of ice into water.

2.3 Types of Entities and Agents

There are some useful types of entities and agents that are commonly encountered in applications making data and documents available on the Web; we introduce them in this section.

A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals. PROV-DM is not prescriptive about the nature of plans, their representation, the actions or steps they consist of, or their intended goals. Since plans may evolve over time, it may become necessary to track their provenance, so plans themselves are entities. Representing the plan explicitly in the provenance can be useful for various tasks: for example, to validate the execution as represented in the provenance record, to manage expectation failures, or to provide explanations.

A plan can be a blog post tutorial for how to set up a web server, a list of instructions for a micro-processor execution, a cook's written recipe for a chocolate cake, or a workflow for a scientific experiment.

A collection is an entity that provides structure to some constituents, which are themselves entities. This concept allows for the provenance of the collection, but also of its constituents, to be expressed. Such a notion of collection corresponds to a wide variety of concrete data structures, such as maps, dictionaries or associative arrays.

An example of collection is an archive of documents. Each document has its own provenance, but the archive itself also has some provenance: who maintained it, which documents it contained at which point in time, how it was assembled, etc.

An AccountEntity is an entity that contains a bundle of provenance assertions.

Having found a resource, a user may want to retrieve its provenance. For users to decide whether they can place their trust in that resource, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated. Hence, in the PROV-DM data model, provenance itself is regarded as an entity, an AccountEntity, for which provenance can be sought.

Three types of agents are recognized by PROV-DM because they are commonly encountered in applications making data and documents available on the Web: persons, software agents, and organizations.

Even software agents can be assigned some responsibility for the effects they have in the world. For example, if one is using a Text Editor and one's laptop crashes, then one would say that the Text Editor was responsible for crashing the laptop. If one invokes a service to buy a book, that service can be considered responsible for drawing funds from one's bank to make the purchase. The company that runs the service and the web site would also be responsible, but the point here is that some measure of responsibility is assigned to software as well.

So when someone models software as an agent for an activity in the PROV-DM model, they mean the agent has some responsibility for that activity.

2.4 Activity Association and Responsibility

Agents are defined as having some kind of responsibility for activities. However, one may want to be more specific about the nature of an agent's responsibility. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow while it would matter a lot which researcher told the programmer to do so. So there is some notion of responsibility that needs to be captured.

Provenance reflects activities that have occurred. In some cases, those activities reflect the execution of a plan that was designed in advance to guide the execution. PROV-DM allows associating a plan to an activity, which represents what was intended to happen.

An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.

Examples of association between an activity and agent are:

  • creation of a web page under the guidance of a designer;
  • various forms of participation in a panel discussion, including audience member, panelist, or panel chair;
  • a public event, sponsored by a company, and hosted by a museum;
  • an XSLT transform initiated by a user.

For an agent, responsibility is the fact of being accountable for the actions of a "subordinate" agent, in the context of an activity. The nature of this relation is intended to be broad, including delegation or a contractual relation.

A student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity, and it may not matter which student published a web page but it matters a lot that the department told the student to put up the web page.

2.5 Overview Diagram

The following diagram summarizes the elements and relations just described.

TODO: short text required to explain the overview diagram

add a sentence saying that it is not complete coverage of the dm in diagram.

The text should say that we introduce a few relations based on the concepts introduced in section 2.1-2.4, that these relations are used in the example of section 3, and are fully defined in section 4-5.

The note should also say why relations are in past tense (we had something in previous version of prov-dm)

I have the impression that the diagram presented in Section 2.5 would be more useful if placed at the beginning of Section 2 [KB]

There are some comments that the picture does not print well. We need to check.

Add links in the svg so that we can click on the figure.

PROV-DM overview

3. Example

The World Wide Web Consortium publishes many technical reports. In this example, we consider a technical report, and describe its provenance.

Specifically, we consider the second version of the PROV-DM document http://www.w3.org/TR/2011/WD-prov-dm-20111215. Its provenance can be expressed from several perspectives, which we present. In the first one, provenance is concerned with the W3C process, whereas in the second one, it takes the authors' viewpoint.

3.1 The Process View

Description: The World Wide Web Consortium publishes technical reports according to its publication policy. Working drafts are published regularly to reflect the work accomplished by working groups. Every publication of a working draft must be preceded by a "publication request" to the Webmaster. The very first version of a technical report must also be preceded by a "transition request" to be approved by the W3C director. All working drafts are made available at a unique URI. In this scenario, we consider two successive versions of a given report, the policy according to which they were published, and the associated requests.

Concretely, in this section, we describe the kind of provenance record that the WWW Consortium could keep for auditors to check that due processes are followed. All entities involved in this example are Web resources, with well-defined URIs (some of which locate archived email messages, available to W3C Members).

We now paraphrase some PROV-DM descriptions, and illustrate them with the PROV-N notation, a notation for PROV-DM aimed at human consumption. We then follow them with a graphical illustration. Full details of the provenance record can be found here.

Provenance descriptions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance statements. Therefore, it should not be seen as an alternate notation for expressing provenance.

The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and octagonal shapes, respectively. Usage, Generation, Derivation, and Activity Association are represented as directed edges.

Entities are laid out according to the ordering of their generation event. We endeavor to show time progressing from top to bottom. This means that edges for Usage, Generation and Derivation typically point upwards.

Provenance of a Tech Report
Illustration to be hand crafted instead of being generated automatically. It's important to adopt a common style for all illustrations across all PROV documents.

CG: It would be helpful to see the properties labelled in the figure.

This simple example has shown a variety of PROV-DM constructs, such as Entity, Agent, Activity, Usage, Generation, Derivation, and ActivityAssociation. In this example, it happens that all entities were already Web resources, with readily available URIs, which we used. We note that some of the resources are public, whereas others have restricted access: provenance statements only make use of their identifiers. If identifiers do not pre-exist, e.g. for activities, then they can be generated, for instance ex:act2, occurring in the namespace identified by prefix ex. We note that the URI scheme developed by W3C is particularly suited for expressing provenance of these reports, since each URI denotes a specific version of a report. It then becomes very easy to relate the various versions, with PROV-DM constructs.

3.2 The Authors' View

Description: A technical report is edited by some editor, using contributions from various contributors.

Here, we consider another perspective on technical report http://www.w3.org/TR/2011/WD-prov-dm-20111215. Provenance is concerned with the document editing activity, as perceived by authors. This kind of information could be used by authors in their CV or in a narrative about this document.

Again, we paraphrase some PROV-DM assertions, and illustrate them with the PROV-N notation. Full details of the provenance record can be found here.

Provenance of a Tech Report (b)
Illustration to be hand crafted instead of being generated automatically. It's important to adopt a common style for all illustrations across all PROV documents.

CG: It would be helpful to see the properties labelled in the figure.

simplify the figure (leave just 2 authors (as in the example), or the editors), and label the edges as well.

3.3 Attribution of Provenance

The two previous sections provide two different perspectives on the provenance of a technical report. By design, the PROV approach allows for the provenance of a subject to be provided by multiple sources. For users to decide whether they can place their trust in the technical report, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated, etc. In other words, we need to be able to express the provenance of provenance.

No new mechanism is required to support this requirement. PROV-DM makes the assumption that provenance statements have been bundled up, and named, by some mechanism outside the scope of PROV-DM. For instance, in this case, provenance statements were put in files and exposed on the Web, at ex:prov1 and ex:prov3 respectively. To express their respective provenance, these resources must be seen as entities, and all the constructs of PROV-DM are now available to characterize their provenance. In the example below, ex:prov1 is attributed to the agent w3:Consortium, whereas ex:prov3 is attributed to ex:Simon.

entity(ex:prov1, [prov:type="prov:AccountEntity" %% xsd:QName ])
wasAttributedTo(ex:prov1,w3:Consortium)

entity(ex:prov3, [prov:type="prov:AccountEntity" %% xsd:QName ])
wasAttributedTo(ex:prov3,ex:Simon)

4. PROV-DM Core

In this section, we revisit each concept introduced in Section 2, and provide its detailed definition in the PROV data model, in terms of its various constituents.

In PROV-DM, we distinguish elements from relations, which are respectively discussed in Section 4.1 and Section 4.2.

4.1 Element

4.1.1 Entity

Entities are things in the world one wants to provide provenance for. For the purpose of this specification, things can be physical, digital, conceptual, or otherwise; the world may be real or imaginary.

An entity, written entity(id, [ attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an entity;
  • attributes: an optional set of attribute-value pairs representing this entity's situation in the world.

The following expression

entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])

states the existence of an entity, denoted by identifier tr:WD-prov-dm-20111215, with type document and version number 2. The attribute ex:version is application specific, whereas the attribute type is reserved in the PROV-DM namespace.
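
As a non-normative illustration, the two constituents of an entity can be held in a small record and written back out in the notation used above. The sketch below is in Python; the class name Entity, the method to_provn, and the bare form entity(id) for an entity without attributes are assumptions made for this example and are not defined by PROV-DM or PROV-N.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical in-memory form of an entity: an id plus optional attributes."""
    id: str
    attributes: dict = field(default_factory=dict)

    def to_provn(self) -> str:
        # Assumption: an entity without attributes is written as entity(id).
        if not self.attributes:
            return f"entity({self.id})"
        # String values are rendered in double quotes, as in the expression above.
        pairs = ", ".join(f'{name}="{value}"' for name, value in self.attributes.items())
        return f"entity({self.id}, [ {pairs} ])"


report = Entity("tr:WD-prov-dm-20111215", {"prov:type": "document", "ex:version": "2"})
print(report.to_provn())
```

The sketch handles string-valued attributes only; typed literals such as the %% xsd:QName annotation used elsewhere in this document are left out.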

Further considerations:

  • The sets of Activities and Entities are disjoint, as described below.
The characterization interval of an entity is currently implicit. Making it explicit would allow us to define wasComplementOf more precisely. Beginning and end of characterization interval could be expressed by attributes (similarly to activities). How do we define the end of an entity? This is ISSUE-204.

4.1.2 Activity

An activity is anything that acts upon or with entities. This action can take multiple forms: consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities.

An activity, written activity(id, st, et, [ attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an activity;
  • startTime: an optional time for the start of the activity;
  • endTime: an optional time for the end of the activity;
  • attributes: an optional set of attribute-value pairs for this activity.

The following expression

activity(a1,2011-11-16T16:05:00,2011-11-16T16:06:00,
        [ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])

states the existence of an activity with identifier a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit. The attribute host is application specific (declared in some namespace with prefix ex). The attribute type is a reserved attribute of PROV-DM, allowing for sub-typing to be expressed.
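
Because both times are optional, an application holding activities in memory has to cope with either one being absent. The following Python sketch, which is illustrative only, stores the constituents listed above and derives the elapsed time when both times are known. The class name Activity and the method duration are assumptions of this example, not part of PROV-DM.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Activity:
    """Hypothetical in-memory form of an activity."""
    id: str
    start_time: Optional[datetime] = None  # optional start of the activity
    end_time: Optional[datetime] = None    # optional end of the activity
    attributes: dict = field(default_factory=dict)

    def duration(self) -> Optional[timedelta]:
        # The elapsed time is only known when both times were asserted.
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time


a1 = Activity(
    "a1",
    datetime.fromisoformat("2011-11-16T16:05:00"),
    datetime.fromisoformat("2011-11-16T16:06:00"),
    {"ex:host": "server.example.org", "prov:type": "ex:edit"},
)
print(a1.duration())  # 0:01:00
```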

Further considerations:

  • An activity is not an entity. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

4.1.3 Agent

An agent is a type of entity that bears some form of responsibility for an activity taking place.

An agent, written agent(id, [ attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an agent;
  • attributes: a set of attribute-value pairs representing this agent's situation in the world.

From an interoperability perspective, it is useful to define some basic categories of agents since it will improve the use of provenance by applications. There should be very few of these basic categories to keep the model simple and accessible. There are three types of agents in the model since they are common across most anticipated domains of use:

  • Person: agents of type Person are people.
  • Organization: agents of type Organization are social institutions such as companies, societies etc.
  • SoftwareAgent: a software agent is a piece of software.

These types are mutually exclusive, though they do not cover all kinds of agent.

The following expression is about an agent identified by e1, which is a person, named Alice, with employee number 1234.

agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])

It is optional to specify the type of an agent. When present, it is expressed using the prov:type attribute.
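
A consumer of provenance may want to check that an agent does not carry two of the mutually exclusive types. The Python sketch below shows one possible check. The function name agent_category is hypothetical, and the qualified names prov:Organization and prov:SoftwareAgent are assumed by analogy with prov:Person in the expression above.

```python
# Assumption: the three basic categories are spelled as prov: qualified names.
AGENT_TYPES = frozenset({"prov:Person", "prov:Organization", "prov:SoftwareAgent"})


def agent_category(declared_types):
    """Return the single basic agent category among declared_types, or None.

    Raises ValueError if more than one of the mutually exclusive types is present.
    """
    found = AGENT_TYPES.intersection(declared_types)
    if len(found) > 1:
        raise ValueError(f"mutually exclusive agent types: {sorted(found)}")
    # Types outside the three basic categories are allowed and simply ignored here.
    return next(iter(found), None)


print(agent_category(["prov:Person"]))
print(agent_category(["ex:Robot"]))
```

Returning None for an untyped agent, or for an agent of some other kind, reflects that the type is optional and that the three categories do not cover all kinds of agent.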

Shouldn't we allow for entities (not agent) to be associated with an activity? Should we drop the inference association-agent? ISSUE-203.

4.1.4 Note

As provenance descriptions are exchanged between systems, it may be useful to add extra-information to what they are describing. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the entities or agents involved. Likewise, an interactive visualization component may want to enrich a set of provenance descriptions with information helping reproduce their visual representation. To help with interoperability, PROV-DM introduces a simple annotation mechanism allowing anything that is identifiable to be associated with notes.

A note, written note(id, [ attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for a note;
  • attributes: a set of attribute-value pairs, whose meaning is application specific.

A separate PROV-DM relation is used to associate a note with something that is identifiable (see Section on annotation). A given note may be associated with multiple identifiable things.

The following note consists of a set of application-specific attribute-value pairs, intended to help the rendering of what it is associated with, by specifying its color and its position on the screen.

note(ex2:n1,[ex2:color="blue", ex2:screenX=20, ex2:screenY=30])
hasAnnotation(tr:WD-prov-dm-20111215,ex2:n1)

The note is associated with the entity tr:WD-prov-dm-20111215 previously introduced (hasAnnotation is discussed in Section Annotation). The note's identifier and attributes are declared in a separate namespace denoted by prefix ex2.

Alternatively, a reputation service may enrich a provenance record with notes providing reputation ratings about agents. In the following fragment, both agents ex:Simon and ex:Paolo are rated "excellent".

note(ex3:n2,[ex3:reputation="excellent"])
hasAnnotation(ex:Simon,ex3:n2)
hasAnnotation(ex:Paolo,ex3:n2)

The note's identifier and attributes are declared in a separate namespace denoted by prefix ex3.
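
Since a given note may be associated with several identifiable things, an implementation would typically store notes once and keep the associations separately. The Python sketch below illustrates this with the reputation fragment above. The variable and function names are hypothetical and the data layout is only one of many possible choices.

```python
# Notes are stored once, keyed by identifier.
notes = {"ex3:n2": {"ex3:reputation": "excellent"}}

# Each pair mirrors one hasAnnotation(thing, note) statement.
annotations = [("ex:Simon", "ex3:n2"), ("ex:Paolo", "ex3:n2")]


def notes_for(thing, annotations, notes):
    """Return the attribute sets of all notes associated with the given thing."""
    return [notes[note_id] for subject, note_id in annotations if subject == thing]


def annotated_by(note_id, annotations):
    """Return every identifiable thing associated with the given note."""
    return [subject for subject, nid in annotations if nid == note_id]


print(notes_for("ex:Simon", annotations, notes))
print(annotated_by("ex3:n2", annotations))
```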

4.2 Relation

This section describes all the PROV-DM relations between the elements introduced in Section Element. While these relations are not binary, they all involve two primary elements. They can be summarized as follows.

PROV-DM Core Relation Summary

          Entity            Activity        Agent              Note
Entity    wasDerivedFrom    wasGeneratedBy  -                  hasAnnotation
          alternateOf
          specializationOf
Activity  used              -               wasStartedBy       hasAnnotation
                                            wasEndedBy
                                            wasAssociatedWith
Agent     -                 -               actedOnBehalfOf    hasAnnotation
Note      -                 -               -                  hasAnnotation

4.2.1 Activity-Entity Relation

4.2.1.1 Generation
Generation is the completed production of a new entity by an activity. This entity becomes available for usage after this generation. This entity did not exist before generation.

Generation, written wasGeneratedBy(id,e,a,t,attrs) in PROV-N, has the following components:
  • id: an optional identifier for a generation;
  • entity: an identifier for a created entity;
  • activity: an optional identifier for the activity that creates the entity;
  • time: an optional "generation time", the time at which the entity was completely created;
  • attributes: an optional set of attribute-value pairs that describes the modalities of generation of this entity by this activity.

While each of the components activity, time, and attributes is optional, at least one of them must be present.
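
The requirement that at least one optional component be present can be enforced when a generation is constructed. The Python sketch below is one way to do so. The class name Generation and the choice of raising ValueError are assumptions of this example and are not prescribed by PROV-DM.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Generation:
    """Hypothetical in-memory form of a generation."""
    entity: str                       # identifier of the created entity
    activity: Optional[str] = None    # optional generating activity
    time: Optional[str] = None        # optional generation time
    attributes: dict = field(default_factory=dict)
    id: Optional[str] = None          # optional identifier of the generation

    def __post_init__(self):
        # At least one of activity, time, and attributes must be present.
        if self.activity is None and self.time is None and not self.attributes:
            raise ValueError("generation needs an activity, a time, or attributes")


# A generation with a time but no named activity is acceptable.
g = Generation("e", time="2001-10-26T21:32:52")
print(g)
```

A generation that names only the entity, such as Generation("e"), is rejected by this check.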

The following expressions

  wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1", ex:order=1])
  wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p1", ex:order=2])

state the existence of two generations (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, identified by e1 and e2, are created by an activity, identified by a1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order are application specific.

In some cases, we may want to record the time at which an entity was generated without having to specify the activity that generated it. To support this requirement, the activity component in generation is optional. Hence, the following expression indicates the time at which an entity is generated, without naming the activity that did it.

  wasGeneratedBy(e,,2001-10-26T21:32:52)
4.2.1.2 Usage
Usage is the beginning of an entity being consumed by an activity. Before usage, the activity had not begun to consume or use this entity and could not have been affected by the entity.

Usage, written used(id,a,e,t,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for a usage;
  • activity: an identifier for the consuming activity;
  • entity: an identifier for the consumed entity;
  • time: an optional "usage time", the time at which the entity started to be used;
  • attributes: an optional set of attribute-value pairs that describe the modalities of usage of this entity by this activity.

A reference to a given entity may appear in multiple usages that share a given activity identifier.

The following usages

  used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"])
  used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])

state that the activity identified by a1 consumed two entities identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as the value of parameter p2. The semantics of parameter is application-specific.

A usage record's id is optional. It must be present when annotating usage records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation).

4.2.2 Activity-Agent Relation

4.2.2.1 Activity Association
An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.

As far as responsibility is concerned, PROV-DM offers two kinds of constructs. The first, introduced in this section, is a relation between an agent, a plan, and an activity; the second, introduced in Section Responsibility, is a relation between agents expressing that an agent was acting on behalf of another, in the context of an activity.

An activity association, written wasAssociatedWith(id,a,ag,pl,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for the association between an activity and an agent;
  • activity: an identifier for the activity;
  • agent: an identifier for the agent associated with the activity;
  • plan: an optional identifier for the plan adopted by the agent in the context of this activity;
  • attributes: an optional set of attribute-value pairs that describe the modalities of association of this activity with this agent.
In the following example, a designer agent and an operator agent are associated with an activity. The designer's goals are achieved by a workflow ex:wf.
activity(ex:a,[prov:type="workflow execution"])
agent(ex:ag1,[prov:type="operator"])
agent(ex:ag2,[prov:type="designer"])
wasAssociatedWith(ex:a,ex:ag1,[prov:role="loggedInUser", ex:how="webapp"])
wasAssociatedWith(ex:a,ex:ag2,ex:wf,[prov:role="designer", ex:context="project1"])
entity(ex:wf,[prov:type="prov:Plan" %% xsd:QName, ex:label="Workflow 1",
              ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI])
Since the workflow ex:wf is itself an entity, its provenance can also be expressed in PROV-DM: it can be generated by some activity and derived from other entities, for instance.
The activity association record does not allow for a plan to be asserted without an agent. This seems over-restrictive. Discussed in the context of ISSUE-203.
Agents should not be inferred. WasAssociatedWith should also work with entities. This is ISSUE-206.
4.2.2.2 Activity Start and Activity End

An activity start is a representation of an agent starting an activity. An activity end is a representation of an agent ending an activity. Both relations are specialized forms of wasAssociatedWith. They contain attributes describing the modalities of starting/ending activities.

An activity start, written wasStartedBy(id,a,ag,attrs) in PROV-N, contains:

  • id: an optional identifier for the activity start;
  • activity: an identifier for the started activity;
  • agent: an identifier for the agent starting the activity;
  • attributes: an optional set of attribute-value pairs describing modalities according to which the agent started the activity.

An activity end, written wasEndedBy(id,a,ag,attrs) in PROV-N, contains:

  • id: an optional identifier for the activity end;
  • activity: an identifier for the ended activity;
  • agent: an identifier for the agent ending the activity;
  • attributes: an optional set of attribute-value pairs describing modalities according to which the agent ended the activity.

In the following example,

wasStartedBy(a,ag,[ex:mode="manual"])
wasEndedBy(a,ag,[ex:mode="manual"])

there is an activity denoted by a that was started and ended by an agent denoted by ag, in "manual" mode, an application-specific characterization of these relations.

Should we define start/end records as representations of activity start/end events? Should time be associated with these events rather than with activities? This would be similar to what we do for entities. This is ISSUE-207.

4.2.3 Entity-Entity or Agent-Agent Relation

4.2.3.1 Responsibility Chain

PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program is an agent of that activity and the person is also an agent of the activity, but we would also add that the mail software agent is running on the person's behalf. In the other example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university, and all those agents are responsible in some way for the activity to take place but we do not say explicitly who bears responsibility and to what degree.

We could also say that an agent can act on behalf of several other agents (a group of agents). This would also make it possible to indirectly reflect chains of responsibility. This also indirectly reflects control without requiring that control is explicitly indicated. In some contexts there will be a need to represent responsibility explicitly, for example to indicate legal responsibility, and that could be added as an extension to this core model. The same applies to control, since in particular contexts there might be a need to define specific aspects of control that various agents exert over a given activity.

A responsibility chain, written actedOnBehalfOf(id,ag2,ag1,a,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for the responsibility chain;
  • subordinate: an identifier for the agent associated with an activity, acting on behalf of the responsible agent;
  • responsible: an identifier for the agent, on behalf of which the subordinate agent acted;
  • activity: an optional identifier of an activity for which the responsibility chain holds;
  • attributes: an optional set of attribute-value pairs that describe the modalities of this relation.
In the following example, a programmer agent, a researcher agent, and a funder agent are described. The programmer and researcher are associated with a workflow activity. The programmer acts on behalf of the researcher (delegation), encoding the commands specified by the researcher; the researcher acts on behalf of the funder, who has a contractual agreement with the researcher. The terms 'delegation' and 'contract' used in this example are domain-specific.
activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
agent(ag3,[prov:type="funder"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser"])
wasAssociatedWith(a,ag2)
actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"])
actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])

Further considerations:

  • If an activity is not specified, then the subordinate agent is considered to act on behalf of the responsible agent, in all the activities the subordinate agent is associated with.
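
The consideration above can be pictured with a small, informal sketch (not part of PROV-DM). It assumes that activity associations are available as (activity, agent) pairs; the function name delegation_scope is hypothetical.

```python
def delegation_scope(subordinate, activity, associations):
    """Return the set of activities covered by a responsibility chain.

    subordinate  -- identifier of the agent acting on behalf of another
    activity     -- identifier of the activity, or None if unspecified
    associations -- iterable of (activity, agent) pairs, one per
                    wasAssociatedWith description (assumed representation)
    """
    if activity is not None:
        # The chain holds for the named activity only.
        return {activity}
    # No activity given: the chain holds for every activity the
    # subordinate agent is associated with.
    return {a for (a, ag) in associations if ag == subordinate}


associations = [("a", "ag1"), ("b", "ag1"), ("a", "ag2")]
all_activities = delegation_scope("ag1", None, associations)
one_activity = delegation_scope("ag1", "a", associations)
```

With these sample associations, omitting the activity makes the chain cover both a and b for agent ag1, while naming activity a restricts it to a.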
4.2.3.2 Derivation
A derivation is a transformation of an entity into another, a construction of an entity into another, or an update of an entity, resulting in a new one.

According to Section Overview, for an entity to be transformed from, created from, or resulting from an update to another, there must be some underpinning activities performing the necessary actions resulting in such a derivation. A derivation can be described at various levels of precision. In its simplest form, derivation relates two entities. Optionally, attributes can be added to describe modalities of derivation. If the derivation is the result of a single known activity, then this activity can also be optionally expressed. And to provide a completely accurate description of derivation, the generation and usage of the generated and used entities, respectively, can be provided. The reason for optional information such as activity, generation, and usage to be linked to derivations is to aid analysis of provenance and to facilitate provenance-based reproducibility.

A derivation, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-N, contains:
  • id: an optional identifier for a derivation;
  • generatedEntity: the identifier of the entity generated by the derivation;
  • usedEntity: the identifier of the entity used by the derivation;
  • activity: an optional identifier for the activity using and generating the above entities;
  • generation: an optional identifier for the generation involving the generated entity and activity;
  • usage: an optional identifier for the usage involving the used entity and activity;
  • attributes: an optional set of attribute-value pairs that describe the modalities of this derivation.

Derivation is not defined to be transitive. Domain-specific specializations of derivation may be defined in such a way that the transitivity property holds.

The following descriptions state the existence of derivations.

wasDerivedFrom(e2,e1)
wasDerivedFrom(e2,e1,[prov:type="physical transform"])
wasDerivedFrom(e2,e1,a,g2,u1)
  wasGeneratedBy(g2,e2,a)
  used(u1,a,e1)

The first and second lines are about derivations between e2 and e1, but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.

The third description expresses that activity a, using the entity e1 according to usage u1, derived the entity e2 and generated it according to generation g2. It is followed by descriptions for generation g2 and usage u1. With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation, how the original entity e1 was used by the activity (for instance, which argument it was passed as, if the activity is the result of a function invocation), and which output the derived entity e2 was obtained from (say, for a function returning multiple results).
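
As an informal sketch of the kind of analysis such a description enables (an illustration only, not a PROV-DM constraint), a program could check that the generation and usage referred to by a derivation involve the same entities and activity. The dictionary layout and the function name derivation_is_consistent are assumptions made for this example.

```python
def derivation_is_consistent(derivation, generations, usages):
    """Check a derivation against the generation and usage it refers to.

    generations maps a generation id to (entity, activity);
    usages maps a usage id to (activity, entity).
    Both layouts are assumptions of this sketch.
    """
    generation = generations.get(derivation["generation"])
    usage = usages.get(derivation["usage"])
    return (
        generation == (derivation["generatedEntity"], derivation["activity"])
        and usage == (derivation["activity"], derivation["usedEntity"])
    )


# wasGeneratedBy(g2,e2,a) and used(u1,a,e1)
generations = {"g2": ("e2", "a")}
usages = {"u1": ("a", "e1")}

# wasDerivedFrom(e2,e1,a,g2,u1)
derivation = {
    "generatedEntity": "e2",
    "usedEntity": "e1",
    "activity": "a",
    "generation": "g2",
    "usage": "u1",
}
consistent = derivation_is_consistent(derivation, generations, usages)
```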

Emphasize the notion of 'affected by' ISSUE-133.
4.2.3.3 Alternate and Specialization

The purpose of this section is to introduce relations between two entities that refer to the same thing in the world. Consider for example three entities:

  • e1 denoting "Bob, the holder of Facebook account ABC",
  • e2 denoting "Bob, the holder of Twitter account XYZ",
  • e3 denoting "Bob, the person".

These entities refer to the same real person Bob, either in different contexts, or at different levels of abstraction. Specifically:

  1. e1 and e2 refer to Bob in two contexts (as Facebook and Twitter users, respectively);
  2. both e1 and e2 are more detailed than e3.

The following two relations are introduced for expressing alternative or specialized entities.

An alternate relation, written alternateOf(alt1, alt2) in PROV-N, addresses case (1). It has the following constituents:
  • firstAlternate: an identifier of the first of the two entities;
  • secondAlternate: an identifier of the second of the two entities.

The following expressions describe two persons, respectively holder of a Facebook account and a Twitter account, and their relation as alternate.

entity(facebook:ABC, [ prov:type="person with Facebook account" ])
entity(twitter:XYZ, [ prov:type="person with Twitter account" ])
alternateOf(facebook:ABC, twitter:XYZ)

A specialization relation, written specializationOf(sub, super) in PROV-N, addresses case (2). It has the following constituents:
  • specializedEntity: an identifier of the specialized entity;
  • generalEntity: an identifier of the entity that is being specialized.

The following expressions describe two persons, the second of which is holder of a Twitter account. The second entity is a specialization of the first.

entity(ex:Bob, [ prov:type="person", ex:name="Bob" ])
entity(twitter:XYZ, [ prov:type="person with Twitter account" ])
specializationOf(twitter:XYZ, ex:Bob)
A discussion on alternative definitions of these relations has not yet reached a satisfactory conclusion. This is ISSUE-29. Also ISSUE-96.

4.2.4 Annotation

An annotation is a link between something that is identifiable and a note referred to by its identifier.

Multiple notes can be associated with a given identified object; symmetrically, multiple objects can be associated with a given note. Since notes have identifiers, they can also be annotated. The annotation mechanism (with note and annotation) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).

An annotation relation, written hasAnnotation(r,n) in PROV-N, has the following constituents:

  • something: the identifier of something being annotated;
  • note: an identifier of a note.

The following expressions

entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
activity(a,t1,t2)
used(u1,a,e1,[ex:file="stdin"])
wasGeneratedBy(e2, a, [ex:file="stdout"])

note(n1,[ex:icon="doc.png"])
hasAnnotation(e1,n1)
hasAnnotation(e2,n1)

note(n2,[ex:style="dotted"])
hasAnnotation(u1,n2)

describe two documents (attribute-value pair: prov:type="document") identified by e1 and e2, and their annotation with a note indicating that the icon (an application specific way of rendering provenance) is doc.png. The example also includes an activity, its usage of the first entity, and its generation of the second entity. The usage is annotated with a style (an application specific way of rendering this edge graphically). To be able to express this annotation, the usage was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).

4.3 Further Elements of PROV-DM

This section introduces further elements of PROV-DM.

4.3.1 Namespace Declaration

A PROV-DM namespace is identified by an IRI reference [IRI]. In PROV-DM, attributes, identifiers, and literals with qualified names as data type can be placed in a namespace using the mechanisms described in this specification.

A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every un-prefixed qualified name in the scope of this default namespace declaration refers to this namespace.

The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).

4.3.2 Identifier

An identifier is a qualified name.

A qualified name is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name.

PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.

A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in a namespace declaration. In the absence of prefix, the qualified name refers to the default namespace.
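
The mapping from qualified names to IRIs described above can be sketched informally as follows. The function name expand and the ex prefix binding are hypothetical; the prov namespace is the tentative one given in this document.

```python
def expand(qualified_name, prefixes, default_namespace=None):
    """Map a qualified name to an IRI by concatenating the namespace IRI
    and the local part.

    prefixes maps each declared prefix to its namespace IRI;
    default_namespace is the IRI of the default namespace, if declared.
    """
    prefix, separator, local = qualified_name.partition(":")
    if not separator:
        # Un-prefixed name: it refers to the default namespace.
        if default_namespace is None:
            raise ValueError("no default namespace declared")
        return default_namespace + qualified_name
    if prefix not in prefixes:
        raise KeyError("undeclared prefix: " + prefix)
    return prefixes[prefix] + local


prefixes = {
    "prov": "http://www.w3.org/ns/prov-dm/",  # tentative namespace (TBC)
    "ex": "http://example.org/",              # hypothetical binding
}
iri_wf = expand("ex:wf", prefixes)
iri_type = expand("prov:type", prefixes)
iri_default = expand("report", prefixes, default_namespace="http://example.org/")
```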

4.3.3 Attribute

An attribute is a qualified name.

The PROV data model introduces a pre-defined set of attributes in the PROV-DM namespace, which we define below. The interpretation of any attribute declared in another namespace is out of scope.

4.3.3.1 prov:role

The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, activity association, activity start, and activity end. The attribute prov:role is allowed to occur multiple times in a list of attribute-value pairs. The value associated with a prov:role attribute must be a PROV-DM Literal.

The following activity start describes the role of the agent identified by ag in this start relation with activity a.

   wasStartedBy(a,ag, [prov:role="program-operator"])
4.3.3.2 prov:type

The attribute prov:type provides further typing information for an element or relation. PROV-DM liberally defines a type as a category of things having common characteristics. PROV-DM is agnostic about the representation of types, and only states that the value associated with a prov:type attribute must be a PROV-DM Literal. The attribute prov:type is allowed to occur multiple times.

The following describes an agent of type software agent.

   agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])
4.3.3.3 prov:label

The attribute prov:label provides a human-readable representation of a PROV-DM element or relation. The value associated with the attribute prov:label must be a string.

This is ISSUE-219.
4.3.3.4 prov:location

A location can be an identifiable geographic place (ISO 19112), but it can also be a non-geographic place such as a directory, row, or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations, by means of attributes.

The attribute prov:location is an optional attribute of entity and activity. The value associated with the attribute prov:location must be a PROV-DM Literal, expected to denote a location.

The following expression describes entity Mona Lisa, a painting, with a location attribute.

 entity(ex:MonaLisa, [prov:location="Le Louvre, Paris", prov:type="StillImage"])

4.4 Literal

Usually, in programming languages, Literals are a notation for values. So, Literals should probably be moved to the serialization. Here, instead, we should define the types of values. Thoughts?

A PROV-DM Literal represents a data value such as a particular string or number. A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM.

The following examples are, respectively, the string "abc", the integer number 1, and the IRI "http://example.org/foo".

  "abc"
  1
  "http://example.org/foo" %% xsd:anyURI

The following example shows a literal of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in a namespace declaration.

  "ex:value" %% xsd:QName

4.5 Time

It's a legacy of the charter that time is a top level section. Time is a specific kind of value, and should be folded into the "value" section.

Time instants are defined according to xsd:dateTime [XMLSCHEMA-2].
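
As an informal illustration, time instants such as those in the generation examples can be parsed and compared with a general-purpose date-time library. The sketch below uses Python's standard datetime module, which accepts this simple form; it is an assumption of this example that the values carry no time zone, and the module does not cover every lexical form allowed by xsd:dateTime.

```python
from datetime import datetime


def parse_instant(text):
    # Parses the simple "YYYY-MM-DDThh:mm:ss" form used in this document.
    return datetime.fromisoformat(text)


t1 = parse_instant("2001-10-26T21:32:52")
t2 = parse_instant("2001-10-26T10:00:00")
earliest = min(t1, t2)  # the instant that comes first
```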

Time is optional in usage, generation, and activity.

5. PROV-DM Common Relations

The following figure summarizes the additional relations described in this section.

[Figure: PROV-DM Common Relations]

5.1 Revision

A revision is the result of revising an entity into a revised version. Deciding whether something is made available as a revision of something else usually involves an agent who takes responsibility for approving that the former is a due variant of the latter. The agent who is responsible for the revision may optionally be specified. Revision is a particular case of derivation of an entity into its revised version.

A revision relation, written wasRevisionOf(id,e2,e1,ag,attrs) in PROV-N, contains:

Revisiting the example of Section 3.1, we can now state that the report tr:WD-prov-dm-20111215 is a revision of the report tr:WD-prov-dm-20111018, approved by agent w3:Consortium.

entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
entity(tr:WD-prov-dm-20111018, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasRevisionOf(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018, w3:Consortium)

5.2 Attribution

Attribution is the ascribing of an entity to an agent. More precisely, when an entity e is attributed to agent ag, entity e was generated by some activity a, which in turn was associated with agent ag. Thus, this relation is useful when the activity is not known, or irrelevant.

An attribution relation, written wasAttributedTo(id,e,ag,attr) in PROV-N, contains the following elements:

Revisiting the example of Section 3.2, we can ascribe tr:WD-prov-dm-20111215 to some agents without having to make an activity explicit.

agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Paolo, [prov:role="editor"])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Simon, [prov:role="contributor"])

5.3 Activity Ordering

The following relations express dependencies amongst activities.

An information flow ordering relation, written as wasInformedBy(id,a2,a1,attrs) in PROV-N, contains:

Relation wasInformedBy is not transitive.

Consider two long running services, which we represent by activities s1 and s2.

activity(s1,,,[prov:type="service"])
activity(s2,,,[prov:type="service"])
wasInformedBy(s2,s1)
The last line indicates that some entity was generated by s1 and used by s2.
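
Read in the opposite direction, this interpretation suggests how information flow between activities could be computed when generations and usages are described explicitly. The following is an informal sketch under that assumption; the pair-based representation and the function name informed_by are hypothetical.

```python
def informed_by(generations, usages):
    """Return the set of (a2, a1) pairs such that activity a2 used some
    entity that was generated by activity a1.

    generations -- iterable of (entity, activity) pairs
    usages      -- iterable of (activity, entity) pairs
    """
    generated_by = {}
    for entity, activity in generations:
        generated_by.setdefault(entity, set()).add(activity)
    return {
        (a2, a1)
        for a2, entity in usages
        for a1 in generated_by.get(entity, ())
    }


generations = [("e", "s1")]   # wasGeneratedBy(e,s1)
usages = [("s2", "e")]        # used(s2,e)
flows = informed_by(generations, usages)
```

With one entity e generated by s1 and used by s2, the only pair computed is (s2, s1), matching wasInformedBy(s2,s1).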

A control ordering relation, written as wasStartedBy(id, a2, a1, attrs) in PROV-N, contains:

Suppose activities a1 and a2 are computer processes that are executed on different hosts, and that a1 started a2. This can be expressed as in the following fragment:

activity(a1,t1,t2,[ex:host="server1.example.org",prov:type="workflow"])
activity(a2,t3,t4,[ex:host="server2.example.org",prov:type="subworkflow"])
wasStartedBy(a2,a1)

5.4 Traceability

A traceability relation between two entities e2 and e1 is a generic dependency of e2 on e1 that indicates either that e1 was necessary for e2 to be created, or that e1 bears some responsibility for e2's existence.

A traceability relation, written tracedTo(id,e2,e1,attrs) in PROV-N, contains:

We note that the ancestor is allowed to be an agent since agents are entities.

We refer to the example of Section 3.1, and specifically to Figure prov-tech-report. We can see that there is a path from tr:WD-prov-dm-20111215 to w3:Consortium or to pr:rec-advance. This is expressed as follows.

 tracedTo(tr:WD-prov-dm-20111215,w3:Consortium)
 tracedTo(tr:WD-prov-dm-20111215,pr:rec-advance)

Derivation and association are particular cases of traceability.

5.5 Quotation

I find that quotation is really a misnomer. This expands into derivation with attribution; in what sense is the derived entity a "quote" of the original? The agent that is quoted is particularly obscure. It does not seem to be involved in the quoting at all. Why isn't quoting an activity with the quoting agent associated with it? [PM]. Need example [DG].

A quotation is the repeat of an entity (such as text or image) by someone other than its original author. Quotation is a particular case of derivation in which entity e2 is derived from entity e1 by copying, or "quoting", parts of it.

A quotation relation, written wasQuotedFrom(id,e2,e1,ag2,ag1,attrs) in PROV-N, contains:

5.6 Original Source

I find this relation confusing. Please add an example. I wouldn't really know when to use this. [PM]. Need example [DG]

An original source relation is a particular case of derivation that states that an entity e2 (derived) was originally part of some other entity e1 (the original source).

An original source relation, written hadOriginalSource(id,e2,e1,attrs), contains:

5.7 Collections

Collection relations address the need to describe the evolution of entities that have a collection structure, that is, which may contain other entities. Specifically, this section exploits the built-in type for entities, called collection, and two relations to describe the effect of adding elements to, and removing elements from, a collection entity. The intent of these relations and entity types is to capture the history of changes that occurred to a collection.

A collection is an entity that has a logical internal structure consisting of key-value pairs, often referred to as a map. More precisely, the following entity types are introduced:

The following relations relate a collection c1 with a collection c2 obtained after adding or removing a new pair to (resp. from) c1:
  entity(c, [prov:type="EmptyCollection"])    // c is an empty collection
  entity(v1)
  entity(v2)
  entity(c1, [prov:type="Collection"])
  entity(c2, [prov:type="Collection"])

  CollectionAfterInsertion(c1, c, "k1", v1)       // c1 = { ("k1",v1) }
  CollectionAfterInsertion(c2, c1, "k2", v2)      // c2 = { ("k1",v1), ("k2", v2) }
  CollectionAfterDeletion(c3, c2, "k1")           // c3 = { ("k2",v2) }

A relation CollectionAfterInsertion, written CollectionAfterInsertion(collAfter, collBefore, key, value), contains:

A relation CollectionAfterDeletion, written CollectionAfterDeletion(collAfter, collBefore, key), contains:

I propose to call them afterInsertion instead of CollectionAfterInsertion (likewise, for deletion). What about attributes and optional Id?

Further considerations:

Deleted further items. Some of them are constraints which belong to part 2.

6. PROV-DM Extensibility Points

The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:

The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure interoperability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in the PROV-DM documents (parts 1 to 3).

7. Data Model Constraints

A. References

A.1 Normative references

[IRI]
M. Duerst, M. Suignard. Internationalized Resource Identifiers (IRI). January 2005. Internet RFC 3987. URL: http://www.ietf.org/rfc/rfc3987.txt
[OWL2-SYNTAX]
Boris Motik; Peter F. Patel-Schneider; Bijan Parsia. OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax. 27 October 2009. W3C Recommendation. URL: http://www.w3.org/TR/2009/REC-owl2-syntax-20091027/
[PROV-O]
Satya Sahoo and Deborah McGuinness (eds.), Khalid Belhajjame, James Cheney, Daniel Garijo, Timothy Lebo, Stian Soiland-Reyes, and Stephan Zednik. Provenance Formal Model. 2011, Working Draft. URL: http://www.w3.org/TR/prov-o/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt
[XMLSCHEMA-2]
Paul V. Biron; Ashok Malhotra. XML Schema Part 2: Datatypes Second Edition. 28 October 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

A.2 Informative references

[Logic]
W. E. Johnson. Logic: Part III. 1924. URL: http://www.ditext.com/johnson/intro-3.html
[PROV-AQ]
Graham Klyne and Paul Groth (eds.), Luc Moreau, Olaf Hartig, Yogesh Simmhan, James Myers, Timothy Lebo, Khalid Belhajjame, and Simon Miles. Provenance Access and Query. 2011, Working Draft. URL: http://www.w3.org/TR/prov-aq/
[PROV-DM-CONSTRAINTS]
Luc Moreau and Paolo Missier (eds.) ... PROV-DM Constraints. 2011, Working Draft. URL: http://www.w3.org/TR/prov-dm-constraints/
[PROV-N]
Luc Moreau and Paolo Missier (eds.) ... PROV-N ..... 2011, Working Draft. URL: http://www.w3.org/TR/prov-n/
[PROV-PRIMER]
Yolanda Gil and Simon Miles (eds.), Khalid Belhajjame, Helena Deus, Daniel Garijo, Graham Klyne, Paolo Missier, Stian Soiland-Reyes, and Stephan Zednik. Prov Model Primer. 2011, Working Draft. URL: http://www.w3.org/TR/prov-primer/
[PROV-SEM]
James Cheney Formal Semantics Strawman. 2011, Work in progress. URL: http://www.w3.org/2011/prov/wiki/FormalSemanticsStrawman