PROV-DM, the PROV data model, is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing. PROV-DM distinguishes core structures, forming the essence of provenance descriptions, from extended structures catering for more advanced uses of provenance. PROV-DM is organized in six components, respectively dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and activities that happened; (3) derivations of entities from entities; (4) properties to link entities that refer to the same thing; (5) the notion of bundle, a mechanism to support provenance of provenance; (6) collections forming a logical structure for their members.

This document introduces the provenance concepts found in PROV and defines PROV-DM types and relations. The PROV data model is domain-agnostic, but is equipped with extensibility points allowing domain-specific information to be included.

Two further documents complete the specification of PROV-DM. First, a companion document specifies the set of constraints that provenance descriptions should follow. Second, a separate document describes a provenance notation for expressing instances of provenance for human consumption; this notation is used in examples in this document.

PROV Family of Specifications

This document is part of the PROV family of specifications, a set of specifications defining various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. The specifications are:

How to read the PROV Family of Specifications

Fourth Public Working Draft

This is the fourth public release of the PROV-DM document. Following feedback, the Working Group has decided to reorganize this document substantially, separating the data model from its constraints and the notation used to illustrate it. The PROV-DM release is synchronized with the release of the PROV-O, PROV-PRIMER, PROV-N, and PROV-CONSTRAINTS documents. We are now clarifying the entry path to the PROV family of specifications.

Collections and Contents

The interpretation of Collection statements is defined by the following axioms. The function Contents: C → ℘(E) maps a collection entity c ∈ C to a finite set {e1, …, en} ⊆ E of entities, where C is the set of all entities of type Collection and E is the set of all entities. Within each axiom, E denotes the set of entities named in the statement.
  1. entity(c, [prov:type='prov:EmptyCollection']) ⇒ Contents(c) = ∅
  2. derivedByInsertionFrom(c2, c1, E) ⇒ Contents(c2) = Contents(c1) ∪ E
  3. derivedByRemovalFrom(c2, c1, E) ⇒ Contents(c2) = Contents(c1) \ E
  4. memberOf(c, E) ⇒ Contents(c) ⊇ E
  5. memberOf(c, E, true) ⇒ Contents(c) = E
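
As a rough illustration, the axioms can be read as rules that compute what is known about a collection's contents. The Python sketch below is not part of PROV-DM: the `Known` class and the helper function names are hypothetical, and the sketch assumes that knowledge about a collection can be summarised as a set of known members plus a flag saying whether that set is the exact contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Known:
    """What a set of statements entails about Contents(c) (hypothetical helper)."""
    members: frozenset  # entities known to belong to the collection
    exact: bool         # True when members is exactly Contents(c)


def empty_collection() -> Known:
    # Axiom 1: an EmptyCollection has no contents.
    return Known(frozenset(), True)


def derived_by_insertion(base: Known, inserted) -> Known:
    # Axiom 2: the new contents are the old contents plus the inserted set.
    return Known(base.members | frozenset(inserted), base.exact)


def derived_by_removal(base: Known, removed) -> Known:
    # Axiom 3: the new contents are the old contents minus the removed set.
    return Known(base.members - frozenset(removed), base.exact)


def member_of(members, complete: bool = False) -> Known:
    # Axiom 4 gives a lower bound; axiom 5 (complete=True) gives exact contents.
    return Known(frozenset(members), complete)
```

When the starting point is only a lower bound, insertion and removal preserve that status: the result is again a lower bound on the derived collection's contents.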

Some consequences of these axioms

The following examples illustrate how these axioms can be used and, in particular, how one can decide whether or not a set of statements is consistent.
A chain of insertions and removals that starts from statements of the form (1) or (5) leads to a complete characterisation of the contents of the final collection.
 entity(c, [prov:type='prov:EmptyCollection']),
 derivedByInsertionFrom(c1, c, E1),
 derivedByInsertionFrom(c2, c1, E2) 
From these statements, one entails: Contents(c2) = E1 ∪ E2

Similarly:

 entity(c, [prov:type='prov:Collection'])
 memberOf(c, E, true)
 derivedByInsertionFrom(c1, c, E1)
 derivedByInsertionFrom(c2, c1, E2) 
From these statements, one entails: Contents(c2) = E ∪ E1 ∪ E2
Incomplete characterisation of the contents of the final collection:
 entity(c, [prov:type='prov:Collection'])
 memberOf(c, E)
 derivedByInsertionFrom(c1, c, E1)
This entails: Contents(c1) ⊇ E ∪ E1
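
The two kinds of chain above can be replayed mechanically. The following Python sketch is an assumption-laden illustration, not a normative algorithm: the tuple encoding of statements, the `replay` function, and the entity names `e`, `e1`, `e2` are all hypothetical, and each statement simply overwrites what is known about its target collection.

```python
def replay(statements):
    """Apply statements in order; return {collection: (members, exact)}.

    Simplification: each statement overwrites what is known about its
    target collection instead of merging with earlier knowledge.
    """
    known = {}
    for kind, target, *rest in statements:
        if kind == "empty":
            known[target] = (frozenset(), True)
        elif kind == "memberOf":
            members, complete = rest
            known[target] = (frozenset(members), complete)
        elif kind in ("insert", "remove"):
            source, changed = rest
            # An unknown source collection contributes an empty lower bound.
            base, exact = known.get(source, (frozenset(), False))
            changed = frozenset(changed)
            members = base | changed if kind == "insert" else base - changed
            known[target] = (members, exact)
        else:
            raise ValueError(f"unknown statement kind: {kind}")
    return known


# Complete characterisation: the chain starts from an empty collection.
complete_chain = replay([
    ("empty", "c"),
    ("insert", "c1", "c", {"e1"}),
    ("insert", "c2", "c1", {"e2"}),
])

# Incomplete characterisation: the chain starts from a plain memberOf.
incomplete_chain = replay([
    ("memberOf", "c", {"e"}, False),
    ("insert", "c1", "c", {"e1"}),
])
```

In the first chain the final entry for `c2` is flagged as exact, while in the second the entry for `c1` is only a lower bound, mirroring the distinction between `=` and `⊇` in the entailments.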
Use of multiple memberOf statements, with no complete flag:
memberOf(c, E1)
memberOf(c, E2)
These entail: Contents(c) ⊇ E1 ∪ E2
Use of multiple memberOf statements, with one or more complete flags:
(1) memberOf(c, E1, true)
(2) memberOf(c, E2)
From (1): Contents(c) = E1
From (2): Contents(c) ⊇ E2

This is inconsistent unless E2 ⊆ E1. In other words, the members listed in any memberOf statement about a collection must be contained in the set given by a "complete" memberOf statement about the same collection.
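
The containment rule lends itself to a simple check. The sketch below is a hypothetical helper, not part of the specification: it assumes all statements concern the same collection and encodes each memberOf statement as a `(members, complete)` pair.

```python
def member_of_consistent(statements):
    """Check memberOf statements about one collection for consistency.

    statements: iterable of (members, complete) pairs (hypothetical encoding).
    """
    statements = [(frozenset(m), bool(complete)) for m, complete in statements]
    exact_sets = {m for m, complete in statements if complete}
    if len(exact_sets) > 1:
        return False  # two "complete" statements give different contents
    if not exact_sets:
        return True   # lower bounds alone never conflict
    (exact,) = exact_sets
    # Every listed member must lie within the exact contents.
    return all(m <= exact for m, _ in statements)
```

For instance, `member_of_consistent([({"a", "b"}, True), ({"a"}, False)])` holds because the second set is contained in the first, whereas replacing `{"a"}` with `{"c"}` makes the statements inconsistent.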

Example:

memberOf(todays-us-supreme-court, {})   # todays-us-supreme-court contains at least JGR Jr
memberOf(todays-us-supreme-court, {Paolo}, true)   # todays-us-supreme-court contains exactly Paolo
This is inconsistent.

(1) memberOf(todays-us-supreme-court, {})                # todays-us-supreme-court contains at least JGR Jr
(2) memberOf(todays-us-supreme-court, {, Paolo}, true)   # todays-us-supreme-court contains exactly Paolo and one other member
This is consistent because the set in (2) contains the set in (1).

Multiple derivation statements regarding the same collection.

Case 1: the deriving collection is the same in both statements. This leads to an inconsistency (except in the trivial case in which the inserted sets are identical).

(1) derivedByInsertionFrom(c1, c, E1)
(2) derivedByInsertionFrom(c1, c, E2) 
From (1): Contents(c1) = Contents(c) ∪ E1
From (2): Contents(c1) = Contents(c) ∪ E2

Case 2: two different deriving collections:

(1) derivedByInsertionFrom(c3, c1, E1)
(2) derivedByInsertionFrom(c3, c2, E2) 
From (1): Contents(c3) = Contents(c1) ∪ E1
From (2): Contents(c3) = Contents(c2) ∪ E2

This is not necessarily an inconsistency: both statements hold whenever Contents(c1) ∪ E1 = Contents(c2) ∪ E2.
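
Both cases reduce to comparing two unions. The Python sketch below illustrates this under the assumption that the contents of the deriving collections are fully known; the function name `insertions_agree` and the entity names are hypothetical.

```python
def insertions_agree(base1, inserted1, base2, inserted2):
    """True when the two insertions yield the same derived contents.

    base1 and base2 are the (assumed fully known) contents of the
    deriving collections; inserted1 and inserted2 are the inserted sets.
    """
    return (frozenset(base1) | frozenset(inserted1)
            == frozenset(base2) | frozenset(inserted2))


# Case 1: same deriving collection, different inserted sets.
case1 = insertions_agree({"a"}, {"b"}, {"a"}, {"c"})

# Case 2: different deriving collections whose unions coincide.
case2 = insertions_agree({"a"}, {"b"}, {"b"}, {"a"})
```

Here `case1` is false, since the two statements assign different contents to the derived collection, while `case2` is true, since both derivations produce the same set.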

Acknowledgements

WG membership to be listed here.