PROV-DM Part II: Constraints of the Provenance Data Model

Abstract

PROV-DM, the PROV data model, is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing. PROV-DM is structured in six components, dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and actities that happened; (3) derivations between entities; (4) properties to link entities that refer to a same thing; (5) collections of entities, whose provenance can itself be tracked; (6) a simple annotation mechanism.

This document introduces a further set of concepts underpinning the PROV-DM data model and defines constraints that well-structured provenance descriptions should follow. These constraints help provide an interpretation for provenance descriptions. They are useful for readers who develop applications that generate provenance or reason over provenance.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

PROV Family of Specifications

This document is part of the PROV family of specifications, a set of specifications aiming to define the various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. The specifications are as follows.

PROV-DM, the PROV data model for provenance,
PROV-DM-CONSTRAINTS, a set of constraints applying to the PROV data model (this document),
PROV-N, a notation for provenance aimed at human consumption,
PROV-O, the PROV ontology, an OWL-RL ontology allowing the mapping of PROV to RDF;
PROV-AQ, the mechanisms for accessing and querying provenance;
PROV-PRIMER, a primer for the PROV data model,
PROV-SEM, a formal semantics for the PROV data model.
PROV-XML, an XML schema for the PROV data model.

How to read the PROV Family of Specifications

The primer is the entry point to PROV offering a pedagogical presentation of the provenance model.
The Linked Data and Semantic Web community should focus on PROV-O defining PROV classes and properties specified in an OWL-RL ontology. For further details, PROV-DM and PROV-DM-CONSTRAINTS specify the constraints applicable to the data model, and its interpretation. PROV-SEM provides a mathematical semantics.
The XML community should focus on PROV-XML defining an XML schema for PROV-DM. Further details can also be found in PROV-DM, PROV-DM-CONSTRAINTS, and PROV-SEM.
Developers seeking to retrieve or publish provenance should focus of PROV-AQ.
Readers seeking to implement other PROV serializations should focus on PROV-DM and PROV-DM-CONSTRAINTS. PROV-O, PROV-N, PROV-XML offer examples of mapping to RDF, text, and XML, respectively.

Fourth Public Working Draft

This is the fourth public release of the PROV-DM document. Following feedback, the Working Group has decided to reorganize this document substantially, separating the data model, from its contraints, and the notation used to illustrate it. The PROV-DM release is synchronized with the release of the PROV-O, PROV-PRIMER, PROV-N, PROV-DM-CONSTRAINTS documents. We are now making clear what the entry path to the PROV family of specifications is.

This document was published by the Provenance Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.

Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction
- 1.1 Conventions
2. Data Model Refinement
3. PROV-DM Definitional Constraints and Inferences
4. PROV-DM Account Constraints
5. PROV-DM Event Ordering Constraints
6. PROV-DM Structural Constraints
7. PROV-DM Collection Constraints
- 7.1 Collection branching
- 7.2 Collections and Weaker Derivation Relation
A. Acknowledgements
B. References
- B.1 Normative references
- B.2 Informative references

1. Introduction

Provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing. A companion specification [PROV-DM] defines PROV-DM, a data model for provenance, allowing such descriptions to be expressed.

This specification is one of several specifications, referred to as the PROV family of specifications, defining the various aspects that are necessary to achieve the vision of inter-operable exchange of provenance:

A data model for provenance, which is presented in three documents:
- PROV-DM (part I): the provenance data model itself, expressed in natural language [PROV-DM];
- PROV-DM-CONSTRAINTS (part II): constraints underpinning the data model (this document);
- PROV-N (part III): a notation to express instances of that data model for human consumption [PROV-N];
PROV-O: the PROV ontology, an OWL-RL ontology allowing the mapping of PROV to RDF [PROV-O];
PROV-AQ: the mechanisms for accessing and querying provenance [PROV-AQ];
PROV-PRIMER: a primer for the PROV approach [PROV-PRIMER];
PROV-SEM: semantics of the PROV-DM data model [PROV-SEM];

PROV-DM is essentially defined without any constraints [PROV-DM]. This document introduces a further set of concepts underpinning this data model and defines constraints that well-structured provenance descriptions should follow and that provide an interpretation for these descriptions.

In [PROV-DM], a data model for provenance has been defined without introducing any constraint that this data model has to satisfy. First we introduce and refine various concepts such as attributes, event, entity, entity, interval, accounts, which underpin the PROV-DM data model. Using these notions, we explore the constraints that the PROV-DM data model has to satisfy.

Several types of constraints are specified.

Definitional constraints are constraints directly following the definition of concepts in the PROV data model (Section 4).
Account constraints have to be satisfied by provenance descriptions in the context of a given account (Section 5)
Event ordering constraints provide a "temporal interpretation" for provenance descriptions (Section 6)
Structural constraints are further constraints to be satisfied by generation descriptions (Section 7)
Collection constraints are the constraints that hold for collections (Section 8)

1.1 Conventions

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].

2. Data Model Refinement

Underpinning the PROV-DM data model is a notion of event, marking transitions in the world (when entities are generated, used, or invalidated, or activities started or ended). This notion of event is not first-class in the data model, but underpins many of its concepts and its semantics [PROV-SEM]. Thus, using this notion of event, we can provide an interpretation for the data model, which in turn can allow creators of provenance descriptons to make their descriptions more robust.

2.1 Time and Event

Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong with such a provenance claim.

Although time is critical, we should also recognize that provenance can be used in many different contexts: in a single system, across the Web, or in spatial data management, to name a few. Hence, it is a design objective of PROV-DM to minimize the assumptions about time, so that PROV-DM can be used in varied contexts.

Furthermore, consider two activities that started at the same time instant. Just by referring to that instant, we cannot distinguish which activity start we refer to. This is particularly relevant if we try to explain that the start of these activities had different reasons. We need to be able to refer to the start of an activity as a first class concept, so that we can talk about it and about its relation with respect to other similar starts.

Hence, in our conceptualization of the world, an instantaneous event, or event for short, happens in the world and marks a change in the world, in its activities and in its entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [CSP], events represent communications or interactions; they are assumed to be atomic and instantaneous.

2.1.1 Types of Events

Five kinds of instantaneous events underpin the PROV-DM data model. The activity start and activity end events demarcate the beginning and the end of activities, respectively. The entity generation, entity usage, and entity invalidation events demarcate the characterization interval for entities. More specifically:

An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it is no longer available for use.

An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity.

An entity invalidation event is the instantaneous event that marks the initial instant of the destruction, invalidation, or cessation of an entity, after which the entity is no longer available for use.

An activity start event is the instantaneous event that marks the instant an activity starts.

An activity end event is the instantaneous event that marks the instant an activity ends.

2.1.2 Event Ordering

To allow for minimalistic clock assumptions, like Lamport [CLOCK], PROV-DM relies on a notion of relative ordering of instantaneous events, without using physical clocks. This specification assumes that a partial order exists between instantaneous events.

Specifically, follows is a partial order between instantaneous events, indicating that an instantaneous event occurs at the same time as or after another. For symmetry, precedes is defined as the inverse of follows. (Hence, these relations are reflexive and transitive.)

How such partial order is realized in practice is beyond the scope of this specification. This specification only assumes that each instantaneous event can be mapped to an instant in some form of timeline. The actual mapping is not in scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport's style clocks is also beyond this specification. It is anticipated that follows and precedes correspond to some ordering over this timeline.

This specification introduces a set of "temporal interpretation" rules allowing ordering constraints between instantaneous event to inferred from provenance descriptions. According to such temporal interpretation, descriptions must satisfy such constraints. We note that the actual verification of such ordering constraints is outside the scope of this specification.

PROV-DM also allows for time observations to be inserted in specific descriptions, for each recognized instantaneous event introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mapping of this instantaneous event to the timeline. It can also help with the verification of associated ordering constraints (though, again, this verification is outside the scope of this specification).

2.2 Attributes in Entities and Beyond

When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc.

From a provenance viewpoint, it is important to identify a "partial state" of something, i.e. something with some aspects that have been fixed, so that it becomes possible to express its provenance, and what causes that thing, with these specific aspects to be as such.

It is the purpose of attributes in PROV-DM to help fix some aspect of entities. Indeed, we previously defined entities as things one wants to provide provenance for; we refine this definition as follows, using attribute-values to describe entities' "partial states", and linking them to the very existence of entities.

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some attribute-value pairs. An entity's attribute-value pairs are specified when the entity description is created and remain unchanged. An entity's attribute-value pairs are expected to describe the entity's situation and (partial) state during an entity's characterization interval, which is defined as the period comprised between its generation event and its invalidation event.

An entity fixes some aspects of a thing and its situation in the world. An alternative entity may fix other aspects, and its provenance may be different.

Different users may take different perspectives on a resource with a URL. For each perspective, an entity may be expressed:

a report available at a URL: fixes the nature of the thing, i.e. a document, and its location;
the version of the report available there today: fixes its version number, contents, and its date;
the report independent of where it is hosted and of its content over time: fixes the nature of the thing as a conceptual artifact.

The provenance of these three entities may differ, and may be along the following lines:

the provenance of a report available at a URL may include: the act of publishing it and making it available at a given location, possibly under some license and access control;
the provenance of the version of the report available there today may include: the authorship of the specific content, and reference to imported content;
the provenance of the report independent of where it is hosted over time may include: the motivation for writing the report, the overall methodology for producing it, and the broad team involved in it.

We do not assume that any entity is more important than any other; in fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity with attribute-value pairs that fix some aspect of the report appropriately.

Attributes are not restricted to entities, but they belong to a variety of PROV-DM objects, including activity, generation, usage, start, end, communication, attribution, association, responsibility, and derivation. Each object has its duration interval (potentially collapsing to a single time point), and attribute-value pairs for a given object, are expected to be descriptions that hold for the object's duration.

2.3 Description, Assertion, and Inference

PROV-DM is a provenance data model designed to express descriptions of the world.

A file at some point during its lifecycle, which includes multiple edits by multiple people, can be described by its type, its location in the file system, a creator, and content.

The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, all PROV-DM descriptions should be interpreted as what has happened, as opposed to what may or will happen.

This specification does not prescribe the means by which descriptions can be arrived at; for example, descriptions can be composed on the basis of observations, reasoning, or any other means.

Sometimes, inferences about the world can be made from descriptions conformant to the PROV-DM data model. When this is the case, this specification defines such inferences, allowing new descriptions to be inferred from existing ones. Hence, descriptions of the world can result either from direct assertion or from inference by application of inference rules defined by this specification.

2.4 Account

It is common for multiple provenance records to co-exist. For instance, when emailing a file, there could be a provenance record kept by the mail client, and another by the mail server. Such provenance records may provide different explanations about something happening in the world, because they are created by different parties or observed by different witnesses. A given party could also create multiple provenance records about an execution, to capture different levels of details, targeted at different end-users: the programmer of an experiment may be interested in a detailed log of execution, while the scientists may focus more on the scientific-level description. Given that multiple provenance records can co-exist, it is important to have details about their origin, who they are attributed to, how they were generated, etc. In other words, an important requirement is to be able to express the provenance of provenance.

An account is a entity that contains a bundle of provenance descriptions. PROV-DM does not provide an actual mechanism for creating accounts, i.e. for bundling up provenance descriptions and naming them. Accounts must satisfy some properties:

An account is as a container of provenance descriptions, hence its content may change over time.
If an account's set of descriptions changes over time, it increases monotonically with time.
A given description of e.g. an entity in a given account, in terms of its identifier and attribute-value pairs, does not change over time.

The last point is important. It indicates that within an account:

It is always possible to add new provenance descriptions, e.g. stating that a given entity was used by an activity, or derived from another. This is very much an open world assumption.
It is not permitted to add new attributes to a given entity (a form of closed world assumption from the attributes point of view), though it is always permitted to create a new description for an entity, which is a "copy" of the original description extended with novel attributes (cf Example merge-with-rename).

There is no construct in PROV-DM to create such bundles of descriptions. Instead, it is assumed that some mechanism, outside PROV-DM can create them. However, from a provenance viewpoint, such accounts are things we may want to describe the provenance of. In order to be able to do so, we need to see accounts as entities, whose origin can be described using PROV-DM vocabulary. Thus, PROV-DM introduces the reserved type Account.

3. PROV-DM Definitional Constraints and Inferences

In this section, we revisit the types and relations of PROV-DM that have constraints associated with their definitions.

PROV-DM allows for some attributes to be optionally expressed. Unless otherwise specified, when an optional attribute is not present in a description, some value should be assumed to exist for this attribute, though it is not known which.

3.1 Component 1: Entities and Activities

3.1.1 Entity

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some attribute-value pairs. An entity's attribute-value pairs are specified when the entity description is created and remain unchanged. An entity's attribute-value pairs are expected to describe the entity's situation and (partial) state during an entity's characterization interval, i.e. a continuous interval between two instantaneous events, namely its generation event and its invalidation event.

If an entity's situation or state changes, this may result in its description being invalid, because one or more attribute-value pairs no longer hold. In that case, from the PROV viewpoint, there exists a new entity, which needs to be given a distinct identifier, and associated with the attribute-value pairs that reflect its new situation or state.

Further considerations:

In order to describe something over several intervals, it is required to create multiple entities, each with its own identifier. This allow potential dependencies between the various entities to be expressed.
There is no assumption that the set of attributes is complete and that the attributes are independent or orthogonal of each other.
A characterization interval may collapse into a single instant.

For the interpretation of an entity, see usage-precedes-invalidation.

3.1.2 Activity

An activity is delimited by its start and its end events; hence, it occurs over an interval delimited by two instantaneous events. However, an activity need not mention time information, nor duration, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its interval, i.e. an interval between two instantaneous events, namely its start event and its end event.

For the interpretation of an activity, see start-precedes-end.

Further considerations:

An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics that makes it identifiable. In contrast, an activity is something that occurs, happens, unfolds, or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

3.1.3 Generation

A generation is an instantaneous world event, the completed creation of a new entity by an activity. This entity becomes available for usage after this instantaneous event. This entity did not exist before creation. This instantaneous event encompasses a description of the modalities of generation of this entity by this activity, by means of key-value pairs.

A generation's id is optional. It must be used when annotating generations or when defining derivations (see Derivation).

For the interpretation of a generation, see generation-within-activity.

A generation can be used to indicate a generation time without having to specify the involved activity. A generation time is unique, as specified by the following constraint.

Given an entity denoted by e and two time instants t1 and t2, if wasGeneratedBy(e, -, t1) and wasGeneratedBy(e, -, t2) hold, then t1=t2.

See generation-uniqueness for further structural constraints on generations.

3.1.4 Usage

A usage is an instantaneous world event: an activity beginning to consume an entity. Before this event, the activity had not begun to consume or use to this entity. The description includes the modalities of usage of this entity by this activity.

A usage id is optional. It must be present when annotating usages or when defining derivations (see Derivation).

A reference to a given entity may appear in multiple usages for a given activity identifier.

For the interpretation of a usage, see generation-precedes-usage and usage-within-activity.

3.1.5 Start

An activity start event is the instantaneous event that marks the instant an activity starts. It allows for an optional time attribute. Activities also allow for an optional start time attribute. If both are specified, they must be the same, as expressed by the following constraint.

Given an activity activity(a,t1,t2,attrs1) and its start wasStartedBy(id,a,e,t,attrs2), then t=t1.

3.1.6 End

An activity end event is the instantaneous event that marks the instant an activity ends. It allows for an optional time attribute. Activities also allow for an optional end time attribute. If both are specified, they must be the same, as expressed by the following constraint.

Given an activity activity(a,t1,t2,attrs1) and its end wasEndedBy(id,a,e,t,attrs2), then t=t2.

3.1.7 Communication

Communication is formally defined as follows.

Given two activities identified by a1 and a2, wasInformedBy(a2,a1) holds, if and only if there is an entity with some identifier e and some sets of attribute-value pairs attrs1 and attrs2, such that wasGeneratedBy(e,a1,-,attrs1) and used(a2,e,-,attrs2) hold.

For the interpretation of an information flow ordering, see wasInformedBy-ordering.

The relationship wasInformedBy is not transitive. Indeed, consider the following descriptions.

wasInformedBy(a2,a1)
wasInformedBy(a3,a2)

We cannot infer wasInformedBy(a3,a1) from these expressions. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. The following illustration shows a case for which transitivity cannot hold. The horizontal axis represents the event line. We see that e1 was generated after e2 was used. Furthermore, the illustration also shows that a3 completes before a1. So it is impossible for a3 to have used an entity generated by a1.

non transitivity of wasInformedBy — Counter-example for transitivity of wasInformedBy

3.1.8 Start by Activity

Start of a2 by activity a1 is specified as follows.

Given two activities with identifiers a1 and a2, wasStartedBy(a2,a1) holds if and only if there exist an entity with some identifier e and some attributes gAttr and sAttr, such that wasGeneratedBy(e,a1,-,gAttr) and wasStartedBy(a2,e,-,sAttr) hold.

For the interpretation of a control flow ordering, see wasStartedBy-ordering.

3.2 Component 2: Agents and Responsibility

3.2.1 Attribution

If wasAttributedTo(e,ag) holds for some identifiers e and ag, then there exists an activity with some identifier a such that the following statements hold:

activity(a, t1, t2, attr1)
wasGenerateBy(e, a, -)
wasAssociatedWith(a, ag, -, attr2)

for some sets of attribute-value pairs attr1 and attr2, time t1, and t2.

3.2.2 Association

In an association of the form wasAssociatedWith(a, ag, -, attr), the absence of a plan means: either no plan exists, or a plan exists but it is not identified.

In an association of the form wasAssociatedWith(a, -, pl, attr), an agent exists but it is not identified.

For the interpretation of an association, see wasAssociatedWith-ordering.

3.2.3 Responsibility

For the interpretation of responsibility, see actedOnBehalfOf-ordering.

3.3 Component 3: Derivations

3.3.1 Derivation

For the interpretation of a derivation, see derivation-usage-generation-ordering and derivation-generation-generation-ordering

Note that inferring derivation from usage and generation does not hold in general. Indeed, when a generation wasGeneratedBy(g, e2, a, -, attrs2) precedes used(u, a, e1, -, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2,e1) since of e2 cannot possibly be derived from e1, given the creation of e2 precedes the use of e1.

Derivation is not defined to be transitive. Domain-specific specializations of derivation may be defined in such a way that the transitivity property holds.

See derivation-use for a structural constraint on derivations.

3.3.2 Revision

A revision needs to satisfy the following constraint, linking the two entities by a derivation, and stating them to be a specialization of a third entity.

Given two identifiers e1 and e2 identifying two entities, and an identifier ag identifying an agent, if wasRevisionOf(e2,e1,ag) holds, then there exists an entity with some identifier e and some attribute-values eAttrs, dAttrs, such that the following hold:

wasDerivedFrom(e2,e1,dAttrs)
entity(e,eAttrs)
specializationOf(e2,e)
specializationOf(e1,e)
wasAttributedTo(e2,ag)

In a revision of the form wasRevisionOf(e2,e1,-,attr), the absence of an agent means: either no agent exists, or an agent exists but it is not identified.

wasRevisionOf is a strict sub-relation of wasDerivedFrom since two entities e2 and e1 may satisfy wasDerivedFrom(e2,e1) without being a variant of each other.

3.3.3 Quotation

If wasQuotedFrom(e2,e1,ag2,ag1,attrs) holds for some identifiers e2, e1, ag2, ag1, then the following hold:

wasDerivedFrom(e2,e1)
wasAttributedTo(e2,ag2)
wasAttributedTo(e1,ag1)

In a quotation of the form wasQuotedFrom(e2,e1,-,-,attrs), the absence of an agent means: either no agent exists, or an agent exists but it is not identified.

3.3.4 Traceability

Traceability allows an entity to be transitively linked to another entity it is derived from, to an agent it is attributed to, or another agent having some responsibility, or a trigger of an activity that generated it.

Traceability can be inferred from existing descriptions, or can be asserted stating that a dependency path exists without its individual steps being expressed. This is captured by the following inference and constraint, respectively.

Given two identifiers e2 and e1 for entities, the following statements hold:

If wasDerivedFrom(e2,e1,a,g2,u1) holds, for some a, g2, u1, then tracedTo(e2,e1) also holds.
If wasDerivedFrom(e2,e1) holds, then tracedTo(e2,e1) also holds.
If wasAttributedTo(e2,ag1,aAttr) holds, then tracedTo(e2,ag1) also holds.
If wasAttributedTo(e2,ag2,aAttr), wasGeneratedBy(e2,a,gAttr), and actedOnBehalfOf(ag2,ag1,a,rAttr) hold, for some a, ag2, ag1, aAttr, gAttr, and rAttr, then tracedTo(e2,ag1) also holds.
If wasGeneratedBy(e2,a,gAttr) and wasStartedBy(a,e1,sAttr) hold, for some a, gAttr , sAttr
If tracedTo(e2,e) and tracedTo(e,e1) hold for some e, then tracedTo(e2,e1) also holds.

We note that the inference rule traceability-inference does not allow us to infer attributes, which are application specific.

If tracedTo(r2,r1,attrs) holds for two identifiers r2 and r1 identifying entities, and attribute-value pairs attrs, then there exist e⁰, e¹, ..., eⁿ for n≥1, with e⁰=r2 and eⁿ=r1, and for any i such that 0≤i≤n-1, at least of the following statements holds:

wasDerivedFrom(eⁱ,eⁱ⁺¹,a,g2,u1) holds, for some a, g2, u1, or
wasDerivedFrom(eⁱ,eⁱ⁺¹) holds, or
wasAttributedTo(eⁱ,eⁱ⁺¹) holds, or
wasAttributedTo(eⁱ,e), wasGeneratedBy(eⁱ,a,gAttr), and actedOnBehalfOf(e,eⁱ⁺¹,a,rAttr) hold, for some a, e and gAttr, rAttr, or
wasGeneratedBy(eⁱ,a,gAttr) and wasStartedBy(a,eⁱ⁺¹,sAttr) hold, for some a, e, and gAttr, and sAttr.

We note that the previous constraint is not really an inference rule, since there is nothing that we can actually infer. Instead, this constraint should simply be seen as part of the definition of the traceability relation.

3.4 Component 4: Alternate Entities

3.4.1 Specialization

Specialization is neither symmetric nor anti-symmetric.

"Alice's toyota car on fifth Avenue" is a specialization of "Alice's toyota car", but the converse does not hold.

anti-symmetric counter-example???

Specialization is transitive. Indeed if specializationOf(e1,e2) holds, then there is some common thing, say T1-2 they both refer to. Likewise, if specializationOf(e2,e3) holds, then there is some common thing, say T2-3 they both refer to. The Things T1-2 and T2-3 are the same since e2 refers to only one thing.

A specialization of "this email message" would be, for example, the "printed version on my desk", which is a specialization of "my thoughts on this email thread". So, the "printed version on my desk" is also a specialization "my thoughts on this email thread".

3.4.2 Alternate

Alternate not is reflexive. Indeed, alternate(e,e) does not hold for any arbitrary entity e since e may not be a specialization of another entity.

Alternate is symmetric. Indeed, if alternate(e1,e2) holds, then there exists an unspecified entity e, such that both e1 and e2 are specialization of e. Therefore, alternate(e2,e1) also holds.

Alternate is not transitive. Indeed, if alternate(e1,e2) holds, then there exists an unspecified entity e1-2, such that both e1 and e2 are specialization of e1-2. Likewise, if alternate(e2,e3) holds, then there exists an unspecified entity e2-3, such that both e2 and e3 are specialization of e2-3. It does not imply that there is a common entity e1-3 that both e1 and e3 specialize.

At 6pm, the customer in a chair is a woman in a red dress, who happens to be Alice. After she leaves, another customer arrives at 7pm, a man with glasses, who happens to be Bob. Transitivity does not hold since the womanInRedDress is not alternate of customerInChairAt7pm.

alternate(womanInRedDress,customerInChairAt6pm)
specialization(customerInChairAt6pm,Alice)
specialization(womanInRedDress,Alice)

alternate(manWithGlasses,customerInChairAt7pm)
specialization(customerInChairAt7pm,Bob)
specialization(manWithGlasses,Bob)

alternate(customerInChairAt6pm, customerInChairAt7pm)
specialization(customerInChairAt6pm, customerInChair)
specialization(customerInChairAt7pm, customerInChair)

The above example shows that customerInChairAt6pm and customerInChairAt7pm are two alternate entities that have no overlap, while womanInRedDress and customerInChairAt6pm do overlap. The following example illustrates another case of non-overlapping alternate entities.

Two copies of the same book, where copy A was destroyed before copy B was made.

4. PROV-DM Account Constraints

PROV-DM allows for multiple descriptions of entities (and in general any identifiable object) to be expressed.

Let us consider two descriptions of a same entity, which we have taken from two different contexts. A working draft published by the w3:Consortium:

entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])

The second version of a document edited by some authors:

entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])

Both descriptions are about the same entity identified by tr:WD-prov-dm-20111215, but they contain different attributes, describing the situation or partial state of the these entities according to the context in which they occur.

Two different descriptions of a same entity cannot co-exist in a same account as formalized in unique-description-in-account.

Given an entity identifier e, there is at most one description entity(e,attrs) occurring in a given account, where attrs is some set of attribute-values. Other descriptions of the same entity can exist in different accounts.

This constraint similarly applies to all other types and relations, with explicit identity.

See Section structural-constraints for a structural constraint on accounts

In some cases, there may be a requirement for two different descriptions of a same entity to be included in a same account. To satisfy the constraint unique-description-in-account, we can adopt a different identifier for one of them, and relate the two descriptions with the alternateOf relation.

We now reconsider the same two descriptions of a same entity, but we change the identifier for one of them:

entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
entity(ex:alternate-20111215, [ prov:type="document", ex:version="2" ])
alternateOf(tr:WD-prov-dm-20111215,ex:alternate-20111215)
alternateOf(ex:alternate-20111215,tr:WD-prov-dm-20111215)

5. PROV-DM Event Ordering Constraints

Section section-time-event introduces a notion of instantaneous event marking changes in the world, in its activities and entities. PROV-DM identifies five kinds of instantaneous events, namely entity generation event, entity usage event, entity invalidation event, activity start event and activity end event. PROV-DM adopts Lamport's clock assumptions [CLOCK] in the form of a reflexive, transitive partial order follows (and its inverse precedes) between instantaneous events. Furthermore, PROV-DM assumes the existence of a mapping from instantaneous events to time clocks, though the actual mapping is not in scope of this specification.

Given that provenance consists of a description of past entities and activities, to be meaningful provenance descriptions must satisfy ordering constraints between instantaneous events, which we introduce in this section. For instance, an entity can only be used after it was generated; hence, we say that an entity's generation event precedes any of this entity's usage event. Should this ordering constraint be proven invalid, the associated generation and usage could not be credible. The rest of this section defines the temporal interpretation of provenance descriptions as a set of instantaneous event ordering constraints.

PROV-DM also allows for time observations to be inserted in specific provenance descriptions, for each of the five kinds of instantaneous events introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mapping of this instantaneous event to the timeline. The presence of time information in a provenance description instantiates the ordering constraint with that time information. It is expected that such instantiated constraint can help corroborate provenance information. We anticipate that verification algorithms could be developed, though this verification is outside the scope of this specification.

The following figure summarizes the ordering constraints in a graphical manner. For each subfigure, an event time line points to the right. Activities are represented by rectangles, whereas entities are represented by circles. Usage, generation and derivation are represented by the corresponding edges between entities and activities. The four kind of instantaneous events are represented by vertical dotted lines (adjacent to the vertical sides of an activity's rectangle, or intersecting usage and generation edges). The ordering constraints are represented by triangles: an occurrence of a triangle between two instantaneous event vertical dotted lines represents that the event denoted by the left line precedes the event denoted by the right line.

constraints between events — Summary of instantaneous event ordering constraints

The mere existence of an activity entails some event ordering, since an activity start event always precedes the corresponding activity end event. This is illustrated by Subfigure constraint-summary (a) and expressed by constraint start-precedes-end.

The following ordering constraint holds for any activity: the start event precedes the end event.

A usage and a generation for a given entity implies ordering of events, since the generation event had to precede the usage event. This is illustrated by Subfigure constraint-summary (b) and expressed by constraint generation-precedes-usage.

For any entity, the following ordering constraint holds: the generation of an entity always precedes any of its usages.

Invalidation is defined at the event at which an entity ceases to exist as such. All usages of an entity precede its invalidation, which is captured by constraint usage-precedes-invalidation (without any explicit graphical representation).

For any entity, the following ordering constraint holds: any usage of an entity always precedes its invalidation.

By transitivity with generation-precedes-usage, generation of an entity precedes its invalidation.

A usage implies ordering of events, since the usage event had to occur during the associated activity. This is illustrated by Subfigure constraint-summary (c) and expressed by constraint usage-within-activity.

Given an activity with identifier a, an entity with identifier e, a set of attribute-value pairs attrs, and optional time t, if used(a,e,attrs) or used(a,e,attrs,t) holds, then the following ordering constraint holds: the usage of the entity denoted by e precedes the end of activity denoted by a and follows its start.

A generation implies ordering of events, since the generation event had to occur during the associated activity. This is illustrated by Subfigure constraint-summary (d) and expressed by constraint generation-within-activity.

Given an activity with identifier a, an entity with identifier e, a set of attribute-value pairs attrs, and optional time t, if wasGeneratedBy(e,a,attrs) or wasGeneratedBy(e,a,attrs,t) holds, then the following ordering constraint also holds: the generation of the entity denoted by e precedes the end of activity a and follows the start of a.

If there is a derivation between e2 and e1, then this means that the entity e1 had some form of influence on the entity e2; for this to be possible, some event ordering must be satisfied. First, we consider derivations, where the activity and usage are known. In that case, the usage of e1 has to precede the generation of e2. This is illustrated by Subfigure constraint-summary (e) and expressed by constraint derivation-usage-generation-ordering.

Given an activity with identifier a, entities with identifier e1 and e2, a generation identified by g2, and a usage identified by u1, if wasDerivedFrom(e2,e1,a,g2,u1,attrs) holds, then the following ordering constraint holds: the usage of entity denoted by e1 precedes the generation of the entity denoted by e2.

When the usage is unknown, a similar constraint exists, except that the constraint refers to its generation event, as illustrated by Subfigure constraint-summary (f) and expressed by constraint derivation-generation-generation-ordering.

Given two entities denoted by e1 and e2, if wasDerivedFrom(e2,e1, attrs) holds, then the following ordering constraint holds: the generation event of the entity denoted by e1 precedes the generation event of the entity denoted by e2.

Note that event ordering is between generations of e1 and e2, as opposed to derivation where usage is known, which implies ordering ordering between the usage of e1 and generation of e2.

Communication between two activities a1 and a2 also implies ordering of events, since some entity must have been generated by the former and used by the latter, which implies that the start event of a1 cannot follow the end event of a2. This is illustrated by Subfigure constraint-summary (g) and expressed by constraint wasInformedBy-ordering.

Given two activities denoted by a1 and a2, if wasInformedBy(a2,a1) holds, then the following ordering constraint holds: the start event of the activity denoted by a1 precedes the end event of the activity denoted by a2.

Start of a2 by activity a1 also implies ordering of events, since a1 must have been active before a2 started. This is illustrated by Subfigure constraint-summary (h) and expressed by constraint wasStartedBy-ordering.

Given two activities denoted by a1 and a2, if wasStartedBy(a2,a1) holds, then the following ordering constraint holds: the start event of the activity denoted by a1 precedes the start event of the activity denoted by a2.

Further constraints appear in Figure constraint-summary2 and are discussed below.

further constraints between events — Summary of instantaneous event ordering constraints (continued)

A trigger of an activity must exist when the activity starts. This is illustrated by Subfigure constraint-summary2 (a) and expressed by constraint wasStartedByAgent-ordering.

Given an activity denoted by a and an entity denoted by e, if wasStartedBy(a,e) holds, then the following ordering constraints hold: the start event of the activity denoted by a follows the generation event for entity e, and precedes the invalidation event of the same entity.

A similar constraints exists for the trigger of activity end, illustrated by Subfigure constraint-summary2 (b).

Given an activity denoted by a and an entity denoted by e, if wasEndedBy(a,e) holds, then the following ordering constraints hold: the end event of the activity denoted by a follows the generation event for entity e, and precedes the invalidation event of the same entity.

An activity that was associated with an agent must have some overlap with the agent. The agent may be generated, or may only become associated with the activity, after the activity start: so, the agent is required to exist before the activity end. Likewise, the agent may be destructed, or may terminate its association with the activity, before the activity end: hence, the agent invalidation is required to happen after the activity start. This is illustrated by Subfigure constraint-summary2 (c) and expressed by constraint wasAssociatedWith-ordering.

Given an activity denoted by a and an agent denoted by ag, if wasAssociatedWith(a,ag) holds, then the following ordering constraints hold: the start event of the activity denoted by a precedes the invalidation event of the agent denoted by ag, and the generation event for agent denoted by ag precedes the activity end event.

An entity that was attributed to an agent must have some overlap with the agent. The agent may be generated, or may only become attributed with the activity, after its generation: so, the agent is required to exist before the entity invalidation. Likewise, the agent may be destructed, or the entity may no longer be attributed to the agent, before the entity invalidation: hence, the agent invalidation is required to happen after the entity generation. This is illustrated by Subfigure constraint-summary2 (d) and expressed by constraint wasAttributedWith-ordering.

Given an entity denoted by e and an agent denoted by ag, if wasAttributedWith(e,ag) holds, then the following ordering constraints hold: the generation event of the entity denoted by e precedes the invalidation event of the agent denoted by ag, and the generation event for agent denoted by ag precedes the entity invalidation event.

For responsibility, two agents need to have some overlap.

Given two agents ag1 and ag2, if actedOnBehalfOf(ag2,ag1) holds, then the following ordering constraints hold: the generation event of the agent denoted by ag2 precedes the invalidation event of agent ag1, and the generation event for agent denoted by ag1 precedes invalidation event for ag2.

6. PROV-DM Structural Constraints

Section 4 provides definitional constraints for data model concepts. Section 5 introduces constraints on descriptions occurring in accounts. Section 6 defines an interpretation of this data model, in terms of event ordering constraints. This section introduces further constraints on the structure of PROV-DM descriptions. Descriptions that satisfy these constraints are said to be structurally well-formed. A benefit of structurally well-formed provenance descriptions is that further inferences can be made, because descriptions are more precise, and therefore, richer.

According to the definition of a generation, an entity becomes available after this entity's generation event, and does not exist before this event. From this definition, we conclude that PROV-DM does not allow for an entity to have two generations occurring at two different instants. The rationale for this constraint is as follows. Two distinct generation events (by a same activity or by two distinct activities), occurring one after the other, necessarily create two distinct entities; otherwise, the second generation event would have resulted in an entity that existed before its creation, which contradicts the definition of generation.

So, PROV-DM allows for two distinct generations g1 and g2 referencing a same entity provided they occur simultaneously. In practice, for such a simultaneous generation to occur, the generation event has to be unique and caused by a single world activity, though provenance may contain several descriptions for the same world activity.

In the following descriptions, a workflow execution a0 consists of two sub-workflow executions a1 and a2. Sub-workflow execution a2 generates entity e, so does a0.

activity(a0, [prov:type="workflow execution"])
activity(a1, [prov:type="workflow execution"])
activity(a2, [prov:type="workflow execution"])
wasInformedBy(a2,a1)

wasGeneratedBy(e,a0)
wasGeneratedBy(e,a2)

So, we have two different generations for entity e. Such an example is permitted in PROV-DM if the two activities denoted by a0 and a2 are a single thing happening in the world but described from different perspectives.

While this example is permitted in PROV-DM, it does not make the inter-relation between activities explicit, and it mixes descriptions expressed from different perspectives together. While this may acceptable in some specific applications, it becomes challenging for inter-operability. Indeed, PROV-DM does not offer any relation describing the structure of activities. Such descriptions are said not be structurally well-formed.

Structurally well-formed provenance can be obtained by partitioning the generations into different accounts. This makes it clear that these generations provide alternative descriptions of the same real-world generation event, rather than describing two distinct generation events for the same entity. When accounts are used, the example can be encoded as follows.

The same example is now revisited, with the following descriptions that are structurally well-formed. Two accounts are introduced, and there is a single generation for entity e per account.

In a first account, entitled "summary", we find:

activity(a0,t1,t2,[prov:type="workflow execution"])
wasGeneratedBy(e,a0,-)

In a second account, entitled "detail", we find:

activity(a1,t1,t3,[prov:type="workflow execution"])
activity(a2,t3,t2,[prov:type="workflow execution"])
wasInformedBy(a2,a1)
wasGeneratedBy(e,a2,-)

Structurally well-formed provenance satisfies some constraints, which force the structure of descriptions to be exposed by means of accounts. With these constraints satisfied, further inferences can be made about structurally well-formed descriptons. The uniqueness of generations in accounts is formulated as follows.

Given an entity denoted by e, two activities denoted by a1 and a2, two time instants t1 and t2, and two sets of attribute-value pairs attrs1 and attrs2, if wasGeneratedBy(id1, e, a1, t1, attrs1) and wasGeneratedBy(id2, e, a2, t2, attrs2) exist in the scope of a given account, then id1=id2, a1=a2, t1=t2 and attrs1=attrs2.

A further inference is permitted from derivations with an explicit activity and no usage:

Given an activity a, entities denoted by e1 and e2, and sets of attribute-value pairs dAttrs, gAttrs, if wasDerivedFrom(e2,e1, a, dAttrs) and wasGeneratedBy(e2,a,-,gAttrs) hold, then used(a,e1,-,uAttrs) also holds for some set of attribute-value pairs uAttrs.

This inference is justified by the fact that the entity denoted by e2 is generated by at most one activity in a given account (see generation-uniqueness). Hence, this activity is also the one referred to by the usage of e1.

We note that the converse inference, does not hold. From wasDerivedFrom(e2,e1) and used(a,e1,-), one cannot derive wasGeneratedBy(e2,a,-) because identifier e1 may occur in usages performed by many activities, which may have not generated the entity denoted by e2.

An account is said to be structurally well-formed if it satisfies the constraint generation-uniqueness. If an account is structurally well-formed, it supports the inference derivation-use.

Taking the union of two accounts is another account, formed by the union of the descriptions they respectively contain. We note that the resulting union may or may not invalidate some constraints:

Two entity descriptions with a same identifier but different sets of attributes exist in each original account may invalidate unique-description-in-account in the union, unless some form of description merging or renaming (as per Example) occurs.
Structurally well-formed accounts are not closed under union because the constraint generation-uniqueness may no longer be satisfied in the resulting union.

How to reconcile such accounts is beyond the scope of this specification.

7. PROV-DM Collection Constraints

Membership is a convenience notation, since it can be expressed in terms of an insertion into some collection. The membership definition is formalized by constraint membership-as-insertion.

memberOf(c, {(k1, v1), ...}) holds if and only if there exists a collection c0, such that derivedByInsertionFrom(c, c0, {(k1, v1), ...}).

A collection may be obtained by insertion or removal, or said to satisfy the membership relation. To provide an interpretation of collections, PROV-DM restricts one collection to be involved in a single derivation by insertion or removal, or to one membership relation. PROV-DM does not provide an interpretation for descriptions that consist of two (or more) insertion, removal, membership relations that result in the same collection.

The following constraint ensures unique derivation.

A collection must not be derived through multiple insertions, removal, or membership relations.

Consider the following descriptions about three collections.

entity(c1, [prov:type="prov:Collection"  %% xsd:QName])
entity(c2, [prov:type="prov:Collection"  %% xsd:QName])
entity(c3, [prov:type="prov:Collection"  %% xsd:QName])


derivedByInsertionFrom(c3, c1, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(c3, c2, {("k3", e3)})

There is no interpretation for such descriptions since c3 is derived multiple times by insertion.

As a particular case, collection c is derived multiple times from the same c1.

derivedByInsertionFrom(id1, c, c1, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(id2, c, c1, {("k3", e3), ("k4", e4)})

The interpretation of such descriptions is also unspecified.

To describe the insertion of the 4 key-entity pairs, one would instead write:

derivedByInsertionFrom(id1, c, c1, {("k1", e1), ("k2", e2), ("k3", e3), ("k4", e4)})

The same is true for any combination of insertions, removals, and membership relations:

The following descriptions

derivedByInsertionFrom(c, c1, {("k1", e1)})
derivedByRemovalFrom(c, c2, {"k2"})

have no interpretation. Nor have the following:

derivedByInsertionFrom(c, c1, {("k1", e1)})
memberOf(c, {"k2"}).

7.1 Collection branching

It is allowed to have multiple derivations from a single root collection, as long as the resulting entities are distinct, as shown in the following example.

entity(c0, [prov:type="prov:EmptyCollection" %% xsd:QName])    // c0 is an empty collection
entity(c1, [prov:type="prov:Collection" %% xsd:QName])
entity(c2, [prov:type="prov:Collection" %% xsd:QName])
entity(c3, [prov:type="prov:Collection" %% xsd:QName])
entity(e1)
entity(e2)
entity(e3)

derivedByInsertionFrom(c1, c0, {("k1", e1)})      
derivedByInsertionFrom(c2, c0, {("k2", e2)})       
derivedByInsertionFrom(c3, c1, {("k3", e3)})

From this set of descriptions, we conclude:

  c1 = { ("k1", e1) }
  c2 = { ("k2", e2) }
  c3 = { ("k1", e1), ("k3", e3)}

7.2 Collections and Weaker Derivation Relation

The state of a collection is only known to the extent that a chain of derivations starting from an empty collection can be found. Since a set of descriptions regarding a collection's evolution may be incomplete, so is the reconstructed state obtained by querying those descriptions. In general, all descriptions reflect partial knowledge regarding a sequence of data transformation events. In the particular case of collection evolution, in which some of the state changes may have been missed, the more generic derivation relation should be used to signal that some updates may have occurred, which cannot be expressed as insertions or removals. The following example illustrates this.

In the example, the state of c2 is only partially known because the collection is constructed from partially known other collections.

entity(c0, [prov:type="prov:EmptyCollection" %% xsd:QName])    // c0 is an empty collection
entity(c1, [prov:type="prov:Collection" %% xsd:QName])    
entity(c2, [prov:type="prov:Collection" %% xsd:QName])    
entity(c3, [prov:type="prov:Collection" %% xsd:QName])    
entity(e1)
entity(e2)

derivedByInsertionFrom(c1, c0, {("k1", e1)})       
wasDerivedFrom(c2, c1)                       
derivedByInsertionFrom(c3, c2, {("k2", e2)})