Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness. PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM distinguishes core structures, forming the essence of provenance information, from extended structures catering for more specific uses of provenance. PROV-DM is organized in six components, respectively dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) derivations of entities from entities; (3) agents bearing responsibility for entities that were generated and activities that happened; (4) a notion of bundle, a mechanism to support provenance of provenance; (5) properties to link entities that refer to the same thing; and (6) collections forming a logical structure for their members.

This document introduces the provenance concepts found in PROV and defines PROV-DM types and relations. The PROV data model is domain-agnostic, but is equipped with extensibility points allowing domain-specific information to be included.

Two further documents complete the specification of PROV-DM. First, a companion document specifies the set of constraints that provenance should follow. Second, a separate document describes a provenance notation for expressing instances of provenance for human consumption; this notation is used in examples in this document.

Intended to be Last Call (TBC)

This is the fifth public release of the PROV-DM document. Publication as Last Call working draft means that the Working Group believes that it has satisfied the relevant technical requirements outlined in its charter on this document. The design is not expected to change significantly, going forward, and now is the key time for external review, before the implementation phase.

Please Comment By (date TBD)

The PROV Working Group seeks public feedback on this Working Draft. The end date of the Last Call review period is TBD, and we would appreciate comments by that date to [email protected]

PROV Family of Specifications

This document is part of the PROV family of specifications, a set of specifications defining various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. The specifications are:

How to read the PROV Family of Specifications

Introduction

PROV-DM Types and Relations

Provenance concepts, expressed as PROV-DM types and relations, are organized according to six components that are defined in this section. The components and their dependencies are illustrated in Figure 4. A component that relies on concepts defined in another is displayed above it in the figure. So, for example, component 6 (collections) depends on concepts defined in component 2 (derivation), itself dependent on concepts defined in component 1 (entity and activity).

Component 6: Collections

The sixth component of PROV-DM is concerned with the notion of collections. A collection is an entity that has some members. The members are themselves entities, and therefore their provenance can be expressed. Some applications need to be able to express the provenance of the collection itself: e.g. who maintains the collection (attribution), which members it contains as it evolves, and how it was assembled. The purpose of Component 6 is to define the types and relations that are useful to express the provenance of collections. In PROV, the concept of Collection is implemented by means of dictionaries, which we introduce in this section.

Figure 10 depicts the sixth component with four new classes (Collection, Dictionary, EmptyDictionary, and Pair) and three associations (insertion, removal, and memberOf).

Figure 10: Collections Component Overview

The intent of these relations and types is to express the history of changes that occurred to a collection. Changes to a collection consist of the insertion of entities into, and the removal of entities from, the collection. Indirectly, such history provides a way to reconstruct the contents of the collection.

Collection

A collection is a multiset of entities (it is a multiset, rather than a set, because it may not be possible to verify that two distinct entity identifiers do not, in fact, denote the same entity).

PROV-DM defines the following types related to collections:

  • prov:Collection denotes an entity of type Collection, i.e. an entity that can participate in relations amongst collections;
  • prov:EmptyCollection denotes an empty collection.
entity(c0, [prov:type='prov:EmptyCollection' ])  // c0 is an empty collection
entity(c1, [prov:type='prov:Collection'  ])      // c1 is a collection, with unknown content

In PROV, the concept of Collection is provided as an extensibility point for specialized kinds of collections. One of these, Dictionary, is defined next.

Collection Membership

A collection membership relation is defined, to allow stating the members of a Collection.

A membership relation, written memberOf(id; c, {e_1, ..., e_n}, cplt, attrs), has:
  • id: an OPTIONAL identifier identifying the relation;
  • collection: an identifier (c) for the collection whose members are asserted;
  • entity-set: a set of entities e_1, ..., e_n that are members of the collection;
  • complete: an OPTIONAL boolean Value (cplt). It is interpreted as follows:
    • if it is present and set to true, then c is believed to include all and only the members specified in the entity-set;
    • if it is present and set to false, then c is believed to include more members in addition to those specified in the entity-set;
    • if it is not present, then c is believed to include all the members specified in the entity-set, and it MAY include more.
  • attributes: an OPTIONAL set (attrs) of attribute-value pairs representing additional information about this relation.

Note that the attribute complete indicates that the membership relation provides a complete description of the collection membership. It is possible for different provenance descriptions to provide different membership statements regarding the same collection. The resolution of any potential conflict amongst such membership statements is defined by applications.
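
The three readings of the complete attribute can be sketched as a three-valued membership test. This is only an illustrative sketch, not part of PROV-DM: the function names is_member and has_unlisted_members are hypothetical, entities are represented by plain identifier strings, and None stands for "unknown".

```python
from typing import FrozenSet, Optional


def is_member(entity: str, stated: FrozenSet[str], complete: Optional[bool]) -> Optional[bool]:
    """Return True or False when membership is determined, None when it is unknown.

    `stated` is the entity-set of a memberOf relation and `complete` is its
    optional boolean (None models an absent attribute). Hypothetical helper.
    """
    if entity in stated:
        return True       # listed members always belong to the collection
    if complete is True:
        return False      # "all and only" the listed members
    return None           # the collection includes, or may include, other members


def has_unlisted_members(complete: Optional[bool]) -> Optional[bool]:
    """Say whether the collection is believed to have members beyond the stated ones."""
    if complete is True:
        return False
    if complete is False:
        return True       # more members exist, but they are not identified
    return None           # attribute absent: it MAY include more


stated = frozenset({"e1", "e2"})
print(is_member("e3", stated, True))    # False
print(is_member("e3", stated, None))    # None
print(has_unlisted_members(False))      # True
```

The only difference between an absent attribute and complete set to false is captured by has_unlisted_members: in the second case the existence of further members is asserted rather than merely allowed.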

Dictionary

PROV-DM defines a specific type of collection, specified as follows.

Conceptually, a dictionary has a logical structure consisting of key-entity pairs. This structure is often referred to as a map, and is a generic indexing mechanism that can abstract commonly used data structures, including associative lists, relational tables, ordered lists, and more. The specification of such specialized structures in terms of key-value pairs is out of the scope of this document.

A given dictionary forms a given structure for its members. A different structure (obtained either by insertion or removal of members) constitutes a different dictionary. Hence, for the purpose of provenance, a dictionary entity is viewed as a snapshot of a structure. Insertion and removal operations result in new snapshots, each snapshot forming an identifiable dictionary entity.

Following the earlier definitions for generic collections, PROV-DM defines the following types related to dictionaries:

  • prov:Dictionary is a subtype of prov:Collection. It denotes an entity of type dictionary, i.e. an entity that can participate in relations amongst dictionaries;
  • prov:EmptyDictionary is a subtype of prov:EmptyCollection. It denotes an empty dictionary.
entity(d0, [prov:type='prov:EmptyDictionary' ])  // d0 is an empty dictionary
entity(d1, [prov:type='prov:Dictionary'  ])      // d1 is a dictionary, with unknown content

Dictionary Membership

The dictionary membership has the same purpose as the collection membership relation, but it applies to entities having prov:type = 'prov:Dictionary'. It allows stating the members of a Dictionary.

A membership relation, written memberOf(id; c, {(key_1, e_1), ..., (key_n, e_n)}, cplt, attrs), has:
  • id: an OPTIONAL identifier identifying the relation;
  • dictionary: an identifier (c) for the dictionary whose members are asserted;
  • key-entity-set: a set of key-entity pairs (key_1, e_1), ..., (key_n, e_n) that are members of the dictionary;
  • complete: an OPTIONAL boolean Value (cplt). It is interpreted as follows:
    • if it is present and set to true, then c is believed to include all and only the members specified in the key-entity-set;
    • if it is present and set to false, then c is believed to include more members in addition to those specified in the key-entity-set;
    • if it is not present, then c is believed to include all the members specified in the key-entity-set, and it MAY include more.
  • attributes: an OPTIONAL set (attrs) of attribute-value pairs representing additional information about this relation.

The attribute complete is interpreted as for the general collection membership relation.

entity(d1, [prov:type='prov:Dictionary' ])    // d1 is a dictionary, with unknown content
entity(d2, [prov:type='prov:Dictionary' ])    // d2 is a dictionary, with unknown content

entity(e1)
entity(e2)

memberOf(d1, {("k1", e1), ("k2", e2)} )  
memberOf(d2, {("k1", e1), ("k2", e2)}, true)  

From these descriptions, we conclude:
  • d1 has the following pairs as members: ("k1", e1), ("k2", e2), and may contain others.
  • d2 has exactly the following pairs as members: ("k1", e1), ("k2", e2), and does not contain any other.

Thus, the membership of d1 is only partially known.

Dictionary Insertion

An Insertion relation, written derivedByInsertionFrom(id; d2, d1, {(key_1, e_1), ..., (key_n, e_n)}, attrs), has:

  • id: an OPTIONAL identifier identifying the relation;
  • after: an identifier (d2) for the dictionary after insertion;
  • before: an identifier (d1) for the dictionary before insertion;
  • key-entity-set: the inserted key-entity pairs (key_1, e_1), ..., (key_n, e_n) in which each key_i is a value, and e_i is an identifier for the entity that has been inserted with the key; each key_i is expected to be unique for the key-entity-set;
  • attributes: an OPTIONAL set (attrs) of attribute-value pairs representing additional information about this relation.

An Insertion relation derivedByInsertionFrom(id; d2, d1, {(key_1, e_1), ..., (key_n, e_n)}) states that d2 is the dictionary following the insertion of pairs (key_1, e_1), ..., (key_n, e_n) into dictionary d1.

entity(d0, [prov:type='prov:EmptyDictionary' ])    // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [prov:type='prov:Dictionary' ])
entity(d2, [prov:type='prov:Dictionary' ])

derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})       
derivedByInsertionFrom(d2, d1, {("k3", e3)})    
From this set of descriptions, we conclude:
  • d0 is the set { }
  • d1 is the set { ("k1", e1), ("k2", e2) }
  • d2 is the set { ("k1", e1), ("k2", e2), ("k3", e3) }

Insertion provides an "update semantics" for the keys that are already present in a dictionary, since a new pair replaces an existing pair with the same key in the new dictionary. This is illustrated by the following example.

entity(d0, [prov:type='prov:EmptyDictionary' ])    // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [prov:type='prov:Dictionary' ])
entity(d2, [prov:type='prov:Dictionary' ])

derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})       
derivedByInsertionFrom(d2, d1, {("k1", e3)})    
This is a case of update of e1 to e3 for the same key, "k1".
From this set of descriptions, we conclude:
  • d0 is the set { }
  • d1 is the set { ("k1", e1), ("k2", e2) }
  • d2 is the set { ("k1", e3), ("k2", e2) }
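
The snapshot view and the update semantics of insertion can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a normative definition: dictionaries are modelled as Python dicts from keys to entity identifiers, and the function name derived_by_insertion_from is hypothetical.

```python
from typing import Dict, Mapping


def derived_by_insertion_from(before: Mapping[str, str], inserted: Mapping[str, str]) -> Dict[str, str]:
    """Return a new snapshot; `before` is left untouched.

    A pair whose key already exists in `before` replaces the old pair,
    which mirrors the update semantics described above.
    """
    after = dict(before)      # copy, so that each snapshot stays a distinct value
    after.update(inserted)    # new pairs win over existing pairs with the same key
    return after


d0: Dict[str, str] = {}                                        # empty dictionary
d1 = derived_by_insertion_from(d0, {"k1": "e1", "k2": "e2"})
d2 = derived_by_insertion_from(d1, {"k1": "e3"})               # update of k1

print(d1)  # {'k1': 'e1', 'k2': 'e2'}
print(d2)  # {'k1': 'e3', 'k2': 'e2'}
```

Returning a copy rather than mutating the argument reflects the idea that d1 and d2 are separate, identifiable dictionary entities.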

Dictionary Removal

A Removal relation, written derivedByRemovalFrom(id; d2, d1, {key_1, ..., key_n}, attrs), has:

  • id: an OPTIONAL identifier identifying the relation;
  • after: an identifier (d2) for the dictionary after the deletion;
  • before: an identifier (d1) for the dictionary before the deletion;
  • key-set: a set of deleted keys key_1, ..., key_n, for which each key_i is a value;
  • attributes: an OPTIONAL set (attrs) of attribute-value pairs representing additional information about this relation.

A Removal relation derivedByRemovalFrom(id; d2, d1, {key_1, ..., key_n}) states that d2 is the dictionary following the removal of the set of pairs corresponding to keys key_1, ..., key_n from d1.

entity(d0, [prov:type='prov:EmptyDictionary'])    // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [prov:type='prov:Dictionary'])
entity(d2, [prov:type='prov:Dictionary'])
entity(d3, [prov:type='prov:Dictionary'])

derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(d2, d1, {("k3", e3)})
derivedByRemovalFrom(d3, d2, {"k1", "k3"})
From this set of descriptions, we conclude:
  • d0 is the set { }
  • d1 is the set { ("k1", e1), ("k2", e2) }
  • d2 is the set { ("k1", e1), ("k2", e2), ("k3", e3) }
  • d3 is the set { ("k2", e2) }
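
Removal can be sketched in the same style as a function from one snapshot to the next. The name derived_by_removal_from is hypothetical, and the treatment of keys that are absent from the dictionary before removal (silently ignored here) is an assumption of this sketch, since the text above does not address that case.

```python
from typing import Dict, Iterable, Mapping


def derived_by_removal_from(before: Mapping[str, str], removed_keys: Iterable[str]) -> Dict[str, str]:
    """Return a new snapshot without the pairs whose keys are in `removed_keys`.

    Assumption: keys that do not occur in `before` are ignored.
    """
    removed = set(removed_keys)
    return {key: entity for key, entity in before.items() if key not in removed}


d2 = {"k1": "e1", "k2": "e2", "k3": "e3"}
d3 = derived_by_removal_from(d2, {"k1", "k3"})

print(d3)  # {'k2': 'e2'}
```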

Further considerations:

PROV Notation

Membership

The following table summarizes how each constituent of a PROV-DM Membership maps to a non-terminal.

Dictionary Membership    Non-Terminal
id                       optionalIdentifier
dictionary               dIdentifier
key-entity-set           keyEntitySet
complete                 complete
attributes               optionalAttributeValuePairs
   memberOf(mId; c, {e1, e2, e3}, [])   // Collection membership
   memberOf(mId; c, {("k4", v4), ("k5", v5)}, [])   // Dictionary membership
  

Here mId is the optional membership identifier, c is the identifier for the collection whose membership is stated, {("k4", v4), ("k5", v5)} is the set of key-value pairs that are members of c, and [] is the optional (empty) set of attributes.

The remaining examples show cases for Dictionaries, where some of the optionals are omitted. Key-entity sets are replaced with Entity sets for the corresponding generic Collections examples.
memberOf(c3, {("k4", v4), ("k5", v5)})
memberOf(c3, {("k4", v4)})
memberOf(c3, {("k4", v4)}, false)
memberOf(c3, {("k4", v4)}, true)
memberOf(c3, {("k4", v4), ("k5", v5)},[])  
memberOf(c3, {("k4", v4), ("k5", v5)},true, [])  
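
How the optional constituents appear or disappear in the textual form can be sketched with a small formatter. This is only an illustration that reproduces the shape of the examples above; it is not the PROV-N grammar. The function name format_member_of and its parameters are hypothetical, and the use of a semicolon after the identifier follows the memberOf(id; c, ...) form used in the PROV-DM definition.

```python
from typing import Optional, Sequence, Tuple


def format_member_of(
    dictionary: str,
    pairs: Sequence[Tuple[str, str]],
    identifier: Optional[str] = None,
    complete: Optional[bool] = None,
    attrs: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Render a dictionary membership statement, omitting absent optionals."""
    head = f"{identifier}; " if identifier is not None else ""
    key_entity_set = "{" + ", ".join(f'("{key}", {entity})' for key, entity in pairs) + "}"
    parts = [dictionary, key_entity_set]
    if complete is not None:
        parts.append("true" if complete else "false")
    if attrs is not None:
        # attribute values are assumed to be already rendered as literals
        parts.append("[" + ", ".join(f"{name}={value}" for name, value in attrs) + "]")
    return "memberOf(" + head + ", ".join(parts) + ")"


print(format_member_of("c3", [("k4", "v4")], complete=True))
# memberOf(c3, {("k4", v4)}, true)
```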

Insertion

The following table summarizes how each constituent of a PROV-DM Insertion maps to a non-terminal.

Insertion         Non-Terminal
id                optionalIdentifier
after             cIdentifier
before            cIdentifier
key-entity-set    keyEntitySet
attributes        optionalAttributeValuePairs
 derivedByInsertionFrom(id; c1, c, {("k1", v1), ("k2", v2)}, [])  
  

Here id is the optional insertion identifier, c1 is the identifier for the collection after the insertion, c is the identifier for the collection before the insertion, {("k1", v1), ("k2", v2)} is the set of key-value pairs that have been inserted in c, and [] is the optional (empty) set of attributes.

The remaining examples show cases where some of the optionals are omitted.
 derivedByInsertionFrom(c1, c, {("k1", v1), ("k2", v2)})  
 derivedByInsertionFrom(c1, c, {("k1", v1)})  
 derivedByInsertionFrom(c1, c, {("k1", v1), ("k2", v2)}, [])

Removal

The following table summarizes how each constituent of a PROV-DM Removal maps to a non-terminal.

Removal       Non-Terminal
id            optionalIdentifier
after         cIdentifier
before        cIdentifier
key-set       keySet
attributes    optionalAttributeValuePairs
 derivedByRemovalFrom(id; c3, c, {"k1", "k3"}, [])  
  

Here id is the optional removal identifier, c3 is the identifier for the collection after the removal, c is the identifier for the collection before the removal, {"k1", "k3"} is the set of keys whose pairs have been removed from c, and [] is the optional (empty) set of attributes.

The remaining examples show cases where some of the optionals are omitted.
   derivedByRemovalFrom(c3, c1, {"k1", "k3"})               
   derivedByRemovalFrom(c3, c1, {"k1"})               
   derivedByRemovalFrom(c3, c1, {"k1", "k3"}, [])               

Dictionary Constraints

As resolved at F2F3, the material in this section goes, if anywhere, into the PROV-COLLECTIONS note.

Membership is a convenience notation, since it can be expressed in terms of an insertion into some dictionary. The membership definition is formalized by the following constraint.

memberOf(d, {(k1, v1), ...}) holds IF AND ONLY IF there exists a dictionary d0, such that derivedByInsertionFrom(d, d0, {(k1, v1), ...}).


A dictionary may be obtained by insertion or removal, or said to satisfy the membership relation. To provide an interpretation of dictionaries, PROV-DM restricts a dictionary to be involved in a single derivation by insertion or removal, or in a single membership relation. PROV-DM does not provide an interpretation for statements that consist of two (or more) insertion, removal, or membership relations that result in the same dictionary.

The following constraint ensures unique derivation.

The following constraint is unclear.

A dictionary MUST NOT be derived through multiple insertion, removal, or membership relations.

Consider the following statements about three dictionaries.
entity(d1, [prov:type='prov:Dictionary'])
entity(d2, [prov:type='prov:Dictionary'])
entity(d3, [prov:type='prov:Dictionary'])


derivedByInsertionFrom(d3, d1, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(d3, d2, {("k3", e3)})

There is no interpretation for such statements since d3 is derived multiple times by insertion.

As a particular case, dictionary d is derived multiple times from the same d1.

derivedByInsertionFrom(id1; d, d1, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(id2; d, d1, {("k3", e3), ("k4", e4)})

The interpretation of such statements is also unspecified.

To describe the insertion of the 4 key-entity pairs, one would instead write:

derivedByInsertionFrom(id1; d, d1, {("k1", e1), ("k2", e2), ("k3", e3), ("k4", e4)})
The same is true for any combination of insertions, removals, and membership relations:

The following statements

derivedByInsertionFrom(d, d1, {("k1", e1)})
derivedByRemovalFrom(d, d2, {"k2"})
have no interpretation. Nor have the following:
derivedByInsertionFrom(d, d1, {("k1", e1)})
memberOf(d, {("k2", e2)})
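
A check for the unique-derivation restriction can be sketched as a simple count over statements. This is an illustrative sketch only: each statement is reduced to a hypothetical (relation name, described dictionary) pair, where the described dictionary is the "after" dictionary of an insertion or removal, or the dictionary of a membership relation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (relation name, identifier of the dictionary that the statement describes)
Statement = Tuple[str, str]


def multiply_derived(statements: Iterable[Statement]) -> List[str]:
    """Return the dictionaries described by more than one insertion, removal, or membership."""
    counts = Counter(target for _, target in statements)
    return sorted(d for d, n in counts.items() if n > 1)


# d is described both by an insertion and by a membership relation: no interpretation.
print(multiply_derived([("derivedByInsertionFrom", "d"), ("memberOf", "d")]))  # ['d']
```

An empty result means that every dictionary is described by at most one such relation, which is the situation for which an interpretation is provided.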

Dictionary branching

It is allowed to have multiple derivations from a single root dictionary, as long as the resulting entities are distinct, as shown in the following example.
entity(d0, [prov:type='prov:EmptyDictionary'])    // d0 is an empty dictionary
entity(d1, [prov:type='prov:Dictionary'])
entity(d2, [prov:type='prov:Dictionary'])
entity(d3, [prov:type='prov:Dictionary'])
entity(e1)
entity(e2)
entity(e3)

derivedByInsertionFrom(d1, d0, {("k1", e1)})      
derivedByInsertionFrom(d2, d0, {("k2", e2)})       
derivedByInsertionFrom(d3, d1, {("k3", e3)})       
From this set of statements, we conclude:
  d1 = { ("k1", e1) }
  d2 = { ("k2", e2) }
  d3 = { ("k1", e1), ("k3", e3)}
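
Reconstructing contents when several dictionaries branch off the same root can be sketched by replaying insertion statements in dependency order. The function reconstruct and the tuple encoding (after, before, inserted pairs) are assumptions of this sketch; it handles insertions only and leaves out any statement whose "before" dictionary never becomes known.

```python
from typing import Dict, Iterable, List, Tuple

Insertion = Tuple[str, str, Dict[str, str]]  # (after, before, inserted pairs)


def reconstruct(empty_ids: Iterable[str], insertions: Iterable[Insertion]) -> Dict[str, Dict[str, str]]:
    """Replay insertions starting from the dictionaries known to be empty."""
    contents: Dict[str, Dict[str, str]] = {d: {} for d in empty_ids}
    pending: List[Insertion] = list(insertions)
    progressed = True
    while pending and progressed:
        progressed = False
        for stmt in list(pending):          # iterate over a copy while removing
            after, before, pairs = stmt
            if before in contents:          # the source snapshot is already known
                contents[after] = {**contents[before], **pairs}
                pending.remove(stmt)
                progressed = True
    return contents


result = reconstruct(
    ["d0"],
    [("d1", "d0", {"k1": "e1"}), ("d2", "d0", {"k2": "e2"}), ("d3", "d1", {"k3": "e3"})],
)
print(result["d3"])  # {'k1': 'e1', 'k3': 'e3'}
```

Because each insertion builds a fresh snapshot from its source, the branch leading to d2 does not affect d1 or d3.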

Dictionaries and Weaker Derivation Relation

Since a set of statements regarding a dictionary's evolution may be incomplete, so is the reconstructed state obtained by querying those statements. In general, all statements reflect partial knowledge regarding a sequence of data transformation events. In the particular case of dictionary evolution, in which some of the state changes may have been missed, the more generic derivation relation should be used to signal that some updates may have occurred, which cannot be expressed as insertions or removals. The following example illustrates this.

In the example, the state of d2 is only partially known because the dictionary is constructed from other, partially known dictionaries.
entity(d0, [prov:type='prov:EmptyDictionary'])    // d0 is an empty dictionary
entity(d1, [prov:type='prov:Dictionary'])    
entity(d2, [prov:type='prov:Dictionary'])    
entity(d3, [prov:type='prov:Dictionary'])    
entity(e1)
entity(e2)

derivedByInsertionFrom(d1, d0, {("k1", e1)})       
wasDerivedFrom(d2, d1)                       
derivedByInsertionFrom(d3, d2, {("k2", e2)})       
 
From this set of statements, we conclude:
  • d1 = { ("k1", e1) }
  • d2 is somehow derived from d1, but the precise sequence of updates is unknown
  • d3 includes ("k2", e2) but the earlier "gap" leaves uncertainty regarding ("k1", e1) (it may have been removed) or any other pair that may have been added as part of the derivation activities.
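
The effect of such a gap on reconstruction can be sketched by tracking, for each snapshot, the pairs known to be present together with a flag saying whether that knowledge is believed complete. The Snapshot class and the functions insert and derived_somehow are hypothetical names introduced for this sketch; treating a generic wasDerivedFrom step as "nothing is known about the result" is a deliberately conservative assumption.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class Snapshot:
    known: Dict[str, str]   # pairs known to be present
    complete: bool          # True if `known` is believed to be the whole content


def insert(before: Snapshot, pairs: Mapping[str, str]) -> Snapshot:
    """Insertion keeps what was known and adds the new pairs."""
    return Snapshot({**before.known, **pairs}, before.complete)


def derived_somehow(before: Snapshot) -> Snapshot:
    """Generic derivation: the updates are unknown, so no pair can be asserted."""
    return Snapshot({}, False)


d0 = Snapshot({}, True)              # empty dictionary, fully known
d1 = insert(d0, {"k1": "e1"})
d2 = derived_somehow(d1)             # wasDerivedFrom(d2, d1)
d3 = insert(d2, {"k2": "e2"})

print(d1)  # Snapshot(known={'k1': 'e1'}, complete=True)
print(d3)  # Snapshot(known={'k2': 'e2'}, complete=False)
```

In this sketch d3 is known to contain ("k2", e2), while the absence of "k1" from its known pairs means only that nothing can be concluded about it.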
Do the insertion/removal derivation steps imply wasDerivedFrom, wasVersionOf, alternateOf?

Dictionaries and Contents

Axiomatisation of dictionaries to be expressed here. See here.
Collection classes and properties are specializations of the Starting Point and Qualified terms that describe the provenance of collections as key-value pairs that are inserted and removed to create new collections. The classes and properties in this category are listed below and are discussed in Section 3.4.


Collections Terms

A prov:Dictionary is a prov:Entity that acts as a container for some members, which are themselves entities. Specifically, a dictionary is composed of a set of key-value pairs, where a literal key is used to identify a constituent entity within the dictionary. To illustrate this, the example below describes a dictionary :c1 that has as members the two key-value pairs ("k1", :e1) and ("k2", :e2).

{% escape %}{% include "includes/prov/examples/eg-26-provo-collections-narrative/rdf/membership.ttl" %}{% endescape %}

It is worth noting that :c1 MAY also have other members (i.e. prov:knownMembership is not functional). A dictionary MAY be empty and thus not have any known memberships, in which case it SHOULD be described as an instance of the subclass prov:EmptyDictionary.

To describe the provenance of a dictionary, PROV-O provides two kinds of involvements: prov:qualifiedInsertion is used to describe that a dictionary was obtained from an existing dictionary by inserting a set of key-value pairs. prov:qualifiedRemoval is used to specify that a given dictionary was obtained from an existing dictionary by removing a set of key-value pairs. The example below specifies that the dictionary :c1 was obtained from the empty dictionary :c by inserting the key-value pairs ("k1", :e1) and ("k2", :e2).

{% escape %}{% include "includes/prov/examples/eg-26-provo-collections-narrative/rdf/insertion.ttl" %}{% endescape %}

Similarly, the example below specifies that the dictionary :c3 was obtained by removing the key-value pairs associated with the keys "k1" and "k2" from the dictionary :c2. Thus, :c3 does not contain the members ("k1", :e1) and ("k2", :e2) from :c2.

{% escape %}{% include "includes/prov/examples/eg-26-provo-collections-narrative/rdf/removal.ttl" %}{% endescape %}

Collection Terms

The terms used to describe the provenance of collections of key-value pairs are discussed in Section 3.4.


Acknowledgements

WG membership to be listed here.