W3C

Constraints of the Provenance Data Model

W3C Working Draft 03 May11 September 2012

This version:
http://www.w3.org/TR/2012/WD-prov-constraints-20120503/http://www.w3.org/TR/2012/WD-prov-constraints-20120911/
Latest published version:
http://www.w3.org/TR/prov-constraints/
Latest editor's draft:
http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-constraints.html
Previous version:
http://www.w3.org/TR/2012/WD-prov-constraints-20120503/
Editors:
James Cheney, University of Edinburgh
Paolo Missier, Newcastle University
Luc Moreau, University of Southampton
Author:
Tom De Nies, IBBT - Ghent University

Abstract

PROV-DM, the PROVProvenance is information about entities, activities, and people involved in producing a piece of data model,or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV-DM is the conceptual data model that forms a data model basis for the W3C provenance that describes the entities, people and activities involved in producing a piece(PROV) family of data or thing. PROV-DM is structured in six components, dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and activities that happened; (3) derivations of entities from entities; (4) properties to link entities that refer to a same thing; (5) collections forming a logical structure for its members; (6) a simple annotation mechanism. specifications.

This document introducesdefines a further setsubset of concepts usefulPROV instances called valid PROV instances. The intent of validation is ensure that a PROV instance represents a history of objects and their interactions which is consistent, and thus safe to use for understanding the the purpose of logical reasoning and other kinds of analysis. Valid PROV data model and defines instances satisfy certain definitions, inferences , and constraints. These definitions, inferences, and constraints provide a measure of consistency checking for provenance and reasoning over provenance. They can also be used to normalize PROV instances to forms that can easily be compared in order to determine whether two PROV instances are allowed on provenance statements and validity constraints that equivalent. Validity and equivalence are also defined for PROV instances should follow. These inferences and constraints are useful for readers who develop applications that generate provenance bundles (that is, named instances) and documents (that is, a toplevel instance together with zero or reason over provenance. more bundles).

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Last Call

This is the second public release of the PROV-CONSTRAINTS document. This is a Last Call Working Draft. The design is not expected to change significantly, going forward, and now is the key time for external review.

This specification identifies features at risk related to the at-risk Mention feature of [PROV-DM]: Inference 22 (mention-specialization-inference) and Constraint 31 (unique-mention). These might be removed from PROV-CONSTRAINTS.

PROV Family of Specifications

This document is part of the PROV family of specifications, a set of specifications defining various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. The specifications are:

How to read the PROV Family of Specifications

First Public Working Draft

This is the first public release of the PROV-CONSTRAINTS document. Following feedback, the Working Group has decided to reorganize the PROV-DM document substantially, separating the data model, from its constraints, and the notation used to illustrate it. The PROV-CONSTRAINTS release is synchronized with the release of the PROV-DM, PROV-O, PROV-PRIMER, and PROV-N documents. This document was published by the Provenance Working Group as a First PublicLast Call Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.orgpublic-prov-comments@w3.org (subscribe, archives). The Last Call period ends 10 October 2012. All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This is a Last Call Working Draft and thus the Working Group has determined that this document has satisfied the relevant technical requirements and is sufficiently stable to advance through the Technical Recommendation process.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1. Introduction

Provenance is a record that describes the people, institutions, entities, and activities,activities involved in producing, influencing, or delivering a piece of data or a thing. This document complements the PROV-DM specification [PROV-DM] that defines a data model for provenance on the Web.

1.1 Conventions

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].

In this document, logical formulas contain variables written as lower-case identifiers. Some of these variables are written beginning with the underscore character _, by convention, to indicate that they appear only once in the formula. Such variables are provided merely as an aid to the reader.

1.2 Purpose of this document

PROV-DMThe PROV Data Model, PROV-DM, is a conceptual data model for provenance (realizable provenance, which is realizable using different serializationsrepresentations such as PROV-N, PROV-O,PROV-N and PROV-O. A PROV document is a set of PROV statements, together with zero or PROV-XML). However, nothingmore bundles, or named sets of statements. For example, a PROV document could be a .provn document, the result of a query, a triple store containing PROV statements in the RDF, etc. A PROV instance is a set of PROV statements. The PROV-DM specification [PROV-DM] forces sets imposes minimal requirements upon PROV instances. A valid PROV instance corresponds to a consistent history of objects and interactions to which logical reasoning can be safely applied. By default, PROV statements (or instances) to need not be meaningful, that is, to correspond to a consistent history of objects and interactions. Furthermore, nothing in the PROV-DM specification enables applications to perform inferences over PROV instancesvalid.

This document specifies definitions of some provenance statements in terms of others, inferences over PROV instances that applications may employ, including definitionsand also defines a class of some provenance statements in terms of others, and also defines a class of valid PROV instances by specifying constraints that valid PROV instances must satisfy. Applications should produce valid provenance and may reject provenance that is not valid in order to increase the usefulnessThere are four kinds of provenance and reliability of applications that process it. section 2. Compliance with this document summarizes the requirements for compliance with this document, which are specified in detail in the rest of the document. This specification lists inferences and definitions together in one section (section 3. Inferences and Definitions), defines the induced notion of equivalence (section 4. Equivalence), and then considers two kinds of validity constraints (section 5. Validity Constraints): constraints: structuraluniqueness constraints that prescribe properties of PROV instances that can be checked directly by inspecting the syntax, and, event ordering constraints, impossibility constraints that require that the records, and type constraints. Further discussion of the semantics of PROV statements, which justifies the definitions, inferences and constraints, can be found in a PROV instance are consistent with a sensible ordering of events relating the activities, entities and agents involved. In separate sections we consider additional constraints specific to collections and accounts (section 6. Collection Constraints and section 7. Account Constraints). Finally, the specification includes a section (section 8. Rationale for inferences and constraints) describing the rationale for the inferences and constraints in greater detail, particularly background on events, attributes, the role of inference, and accounts. Athe formal mathematical model that further justifies the constraints and inferences is found insemantics [PROV-SEM].

We define validity and equivalence in terms of a concept called normalization. Definitions, inferences, and uniqueness constraints can be applied to normalize PROV instances, and event ordering, typing, and impossibility constraints can be checked on the normal form to determine validity. Equivalence of two PROV instances can be determined by comparing their normal forms. For PROV documents, validity and equivalence amount to checking the validity or pairwise equivalence of their respective documents.

This document outlines an algorithm for validity checking based on normalization. Applications may implement validity and equivalence checking using normalization, as suggested here, or in any other way as long as the same instances or documents are considered valid or equivalent, respectively.

Checking validity or equivalence are recommended, but not required, for applications compliant with PROV. This specification defines how validity and equivalence are to be checked, if an application elects to support them at all. Applications producing provenance should ensure that it is valid, and similarly applications consuming provenance may reject provenance that is not valid. Applications which are determining whether PROV instances or documents convey the same information should check equivalence as specified here, and should treat equivalent instances or documents in the same way.

1.3 Structure of this document

Section 2 gives a brief rationale for the definitions, inferences and constraints.

Section 3 summarizes the requirements for compliance with this document, which are specified in detail in the rest of the document.

Section 4 presents definitions and inferences. Definitions allow replacing shorthand notation in [PROV-N] with more explicit and complete statements; inferences allow adding new facts representing implicit knowledge about the structure of provenance.

Section 5 presents four kinds of constraints, uniqueness constraints that prescribe that certain statements must be unique within PROV instances, event ordering constraints that require that the records in a PROV instance are consistent with a sensible ordering of events relating the activities, entities and agents involved, impossibility constraints that forbid certain patterns of statements in valid PROV instances, and type constraints that classify the types of identifiers in valid PROV instances.

Section 6 defines the notions of validity, equivalence and normalization.

1.4 Audience

The audience for this document is the same as for [PROV-DM]: developers and users who wish to create, process, share or integrate provenance records on the (Semantic) Web. Not all PROV-compliant applications need to perform inferences or check validity when processing provenance. However, applications that create or transform provenance should attempt to produce valid provenance, but manyto make it more useful to other applications could benefit from the inference rules specified here. Conversely, applications that createby ruling out nonsensical or transform provenance should try to produce valid provenance, to make it more useful to other applications.inconsistent information.

This document assumes familiarity with [PROV-DM] and employs the [PROV-N] notation.

2. Rationale

This section is non-normative.

This section gives a high-level rationale that provides some further background for the constraints, but does not affect the technical content of the rest of the specification.

2.1 Entities, Activities and Agents

One of the central challenges in representing provenance information is how to deal with change. Real-world objects, information objects and Web resources change over time, and the characteristics that make them identifiable in a given situation are sometimes subject to change as well. PROV allows for things to be described in different ways, with different descriptions of their state.

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some fixed attributes. An entity has a lifetime, defined as the period between its generation event and its invalidation event. An entity's attributes are established when the entity is created and (partially) describe the entity's situation and state during the entirety of the entity's lifetime.

A different entity (perhaps representing a different user or system perspective) may fix other aspects of the same thing, and its provenance may be different. Different entities that fix aspects of the same thing are called alternates, and the PROV relations of specializationOf and alternateOf can be used to link such entities.

Besides entities, a variety of other PROV objects have attributes, including activity, generation, usage, invalidation, start, end, communication, attribution, association, delegation, and derivation. Each object has an associated duration interval (which may be a single time point), and attribute-value pairs for a given object are expected to be descriptions that hold for the object's duration.

However, the attributes of entities have special meaning because they are considered to be fixed aspects of underlying, changing things. This motivates constraints on alternateOf and specializationOf relating the attribute values of different entities.

In order to describe the provenance of something during an interval over which relevant attributes of the thing are not fixed, a PROV instance would describe multiple entities, each with its own identifier, lifetime, and fixed attributes, and express dependencies between the various entities using events. For example, in order to describe the provenance of several versions of a document, involving attributes such as authorship that change over time, one can use different entities for the versions linked by appropriate generation, usage, revision, and invalidation events.

There is no assumption that the set of attributes listed in an entity statement is complete, nor that the attributes are independent or orthogonal of each other. Similarly, there is no assumption that the attributes of an entity uniquely identify it. Two different entities that present the same aspects of possibly different things can have the same attributes; this leads to potential ambiguity, which is mitigated through the use of identifiers.

An activity's lifetime is delimited by its start and its end events. It occurs over an interval delimited by two instantaneous events. However, an activity statement need not mention start or end time information, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its lifetime.

An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics provided. In contrast, an activity is something that occurs, happens, unfolds, or develops through time. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

2.2 Events

Although time is important for provenance, provenance can be used in many different contexts within individual systems and across the Web. Different systems may use different clocks which may not be precisely synchronized, so when provenance statements are combined by different systems, an application may not be able to align the times involved to a single global timeline. Hence, PROV is designed to minimize assumptions about time. Instead, PROV talks about (identified) events.

The PROV data model is implicitly based on a notion of instantaneous events (or just events), that mark transitions in the world. Events include generation, usage, or invalidation of entities, as well as start or end of activities. This notion of event is not first-class in the data model, but it is useful for explaining its other concepts and its semantics [PROV-SEM]. Thus, events help justify inferences on provenance as well as validity constraints indicating when provenance is self-consistent.

Five kinds of instantaneous events are used in PROV. The activity start and activity end events delimit the beginning and the end of activities, respectively. The entity generation, entity usage, and entity invalidation events apply to entities, and the generation and invalidation events delimit the lifetime of an entity. More precisely:

An activity start event is the instantaneous event that marks the instant an activity starts.

An activity end event is the instantaneous event that marks the instant an activity ends.

An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it is available for use. The entity did not exist before this event.

An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity. The described usage had not started before this instant, although the activity could potentially have used the same entity at a different time.

An entity invalidation event is the instantaneous event that marks the initial instant of the destruction, invalidation, or cessation of an entity, after which the entity is no longer available for use. The entity no longer exists after this event.

2.3 Types

As set out in other specifications, the identifiers used in PROV documents have associated type information. An identifier can have more than one type, reflecting subtyping or allowed overlap between types, and so we define a set of types of each identifier, typeOf(id). Some types are, however, required not to overlap (for example, no identifier can describe both an entity and an activity). In addition, an identifier cannot be used to identify both an object (that is, an entity, activity or agent) and a property (that is, a named event such as usage, generation, or a relationship such as attribution.) This specification includes disjointness and typing constraints that check these requirements. Here, we summarize the type constraints in Table 1.

Table 1: Summary of Typing Constraints
In relation... identifier has type(s)...
entity(e,attrs) e 'entity'
activity(a,t1,t2,attrs) a 'activity'
agent(ag,attrs) ag 'agent'
used(id; a,e,t,attrs) e 'entity'
a 'activity'
wasGeneratedBy(id; e,a,t,attrs) e 'entity'
a 'activity'
wasInformedBy(id; a2,a1,attrs) a2 'activity'
a1 'activity'
wasStartedBy(id; a2,e,a1,t,attrs) a2 'activity'
e 'entity'
a1 'activity'
wasEndedBy(id; a2,e,a1,t,attrs) a2 'activity'
e 'entity'
a1 'activity'
wasInvalidatedBy(id; e,a,t,attrs) e 'entity'
a 'activity'
wasDerivedFrom(id; e2,e1,a,g,u,attrs) e2 'entity'
e1 'entity'
a 'activity'
wasAttributedTo(id; e,ag,attr) e 'entity'
ag 'agent'
wasAssociatedWith(id; a,ag,pl,attrs) a 'activity'
ag 'agent'
pl 'entity'
actedOnBehalfOf(id; ag2,ag1,a,attrs) ag2 'agent'
ag1 'agent'
a 'activity'
alternateOf(e1,e2) e1 'entity'
e2 'entity'
specializationOf(e1,e2) e1 'entity'
e2 'entity'
mentionOf(e1,e2,b) e1 'entity'
e2 'entity'
b 'entity'
hadMember(c,e) c 'entity'
'prov:Collection'
e 'entity'
entity(c,[prov:type='prov:EmptyCollection,...]) c 'entity'
'prov:Collection'
'prov:EmptyCollection'

2.4 Validation Process Overview

This section collects common concepts and operations that are used throughout the specification, and relates them to background terminology and ideas from logic [Logic], constraint programming [CHR], and database constraints [DBCONSTRAINTS]. This section does not attempt to provide a complete introduction to these topics, but it is provided in order to aid readers familiar with one or more of these topics in understanding the specification, and to clarify some of the motivations for choices in the specification to all readers.

Constants, Variables and Placeholders

PROV statements involve identifiers, literals, placeholders, and attribute lists. Identifiers are, according to PROV-N, expressed as qualified names which can be mapped to URIs [IRI]. However, in order to specify constraints over PROV instances, we also need variables that represent unknown identifiers, literals, or placeholders. These variables are similar to those in first-order logic [Logic]. A variable is a symbol that can be replaced by other symbols, including either other variables or constant identifiers, literals, or placeholders. In a few special cases, we also use variables for unknown attribute lists. To help distinguish identifiers and variables, we also term the former 'constant identifiers' to highlight their non-variable nature.

Several definitions and inferences conclude by saying that some objects exist such that some other formulas hold. Such an inference introduces fresh existential variables into the instance. An existential variable denotes a fixed object that exists, but its exact identity is unknown. Existential variables can stand for unknown identifiers or literal values only; we do not allow existential variables that stand for unknown attribute lists.

In particular, many occurrences of the placeholder symbol - stand for unknown objects; these are handled by expanding them to existential variables. Some placeholders, however, indicate the absence of an object, rather than an unknown object. In other words, the placeholder is overloaded, with different meanings in different places.

An expression is called a term if it is either a constant identifier, literal, placeholder, or variable. We write t to denote an arbitrary term.

Substitution

A substitution is a function that maps variables to terms. Concretely, since we only need to consider substitutions of finite sets of variables, we can write substitutions as [x1 = t1,...,xn=tn]. A substitution S = [x1 = t1,...,xn=tn] can be applied to a term as follows.

  1. If the term is a variable xi, one of the variables in the domain of S, then S(xi) = ti.
  2. If the term is a constant identifier or literal c, then S(c) = c.

In addition, a substitution can be applied to an atomic formula (PROV statement) p(t1,...,tn) by applying it to each term, that is, S(p(t1,...,tn)) = p(S(t1),...,S(tn)). Likewise, a substitution S can be applied to an instance I by applying it to each atomic formula (PROV statement) in I, that is, S(I) = {S(A) | A ∈ I}.

Formulas

For the purpose of constraint checking, we view PROV statements (possibly involving existential variables) as formulas. An instance is analogous to a "theory" in logic, that is, a set of formulas all thought to describe the same situation. The set can also be thought of a single, large formula: the conjunction of all of the atomic formulas.

The atomic constraints considered in this specification can be viewed as atomic formulas:

Similarly, the definitions, inferences, and constraint rules in this specification can also be viewed as logical formulas, built up out of atomic formulas, logical connectives "and" (∧), "implies" (⇒), and quantifiers "for all" (∀) and "there exists" (∃). For more background on logical formulas, see a logic textbook such as [Logic].

Satisfying definitions, inferences, and constraints

In logic, a formula's meaning is defined by saying when it is satisfied. We can view definitions, inferences, and constraints as being satisfied or not satisfied in a PROV instance, augmented with information about the constraints.

  1. A logical equivalence as used in a definition is satisfied when the formula ∀ x1,....,xn. A ⇔ ∃ y1...ym . B1 ∧ ... ∧ Bk holds, that is, for any substitution of the variables x1,....,xn, formula A and formula ∃ y1...ym. B1 ∧ ... ∧ Bk are either both true or both false.
  2. A logical implication as used in an inference is satisfied with the formula ∀ x1,....,xn. A1 ∧ ... ∧ Al ⇒ ∃ y1...ym . B1 ∧ ... ∧ Bk holds, that is, for any substitution of the variables x1,....,xn, if A1 ∧ ... ∧ Al is true, then for some further substitution of terms for variables y1...ym, formula B1 ∧ ... ∧ Bk is also true.
  3. A uniqueness, ordering, or typing constraint is satisfied when its associated formula ∀ x1...xn. A1 ∧ ... ∧ Al ⇒ C holds, that is, for any substitution of the variables x1,....,xn, if A1 ∧ ... ∧ Al is true, then C is also true.
  4. An impossibility constraint is satisfied when the formula ∀ x1...xn. A1 ∧ ... ∧ Al ⇒ False holds. This is logically equivalent to ∄ x1...xn. A1 ∧ ... ∧ Al, that is, there exists no substitution for x1...xn making A1 ∧ ... ∧ Al true.

Merging

Merging is an operation that takes two terms and compares them to see if they are equal, or can be made equal by substituting an existential variable with another term. This operation is a special case of unification, a common operation in logical reasoning, including logic programming and automated reasoning. Merging two terms t,t' results in either substitution S such that S(t) = S(t'), or failure indicating that there is no substitution that can be applied to both t and t' to make them equal.

Applying definitions, inferences, and constraints

Formulas can also be interpreted as having computational content. That is, if an instance does not satisfy a formula, we can often apply the formula to the instance to produce another instance that does satisfy the formula. Definitions, inferences, and uniqueness constraints can be applied to instances:

As noted above, uniqueness or key constraint application can fail, if a required merging step fails. Failure of constraint application means that there is no way to add information to the instance to satisfy the constraint, which in turn implies that the instance is invalid.

The process of applying definitions, inferences, and constraints to a PROV instance until all of them are satisfied is similar to what is sometimes called chasing [DBCONSTRAINTS] or saturation [CHR]. We call this process normalization.

Termination

In general, applying sets of logical formulas of the above definition, inference, and constraint forms is not guaranteed to terminate. A simple example is the inference R(x,y) ⇒ ∃z. R(x,z) ∧R(z,y), which can be applied to {R(a,b)} to generate an infinite sequence of larger and larger instances. To ensure that normalization, validity, and equivalence are decidable, we require that normalization terminates. There is a great deal of work on termination of the chase in databases, or of sets of constraint handling rules. The termination of the notion of normalization defined in this specification is guaranteed because the definitions, inferences and uniqueness/key constraints correspond to a weakly acyclic set of tuple-generating and equality-generating dependencies, in the terminology of [DBCONSTRAINTS]. The termination of the remaining ordering, typing, and impossibility constraints is easy to show. Appendix C gives a proof that the definitions, inferences, and uniqueness and key constraints are weakly acyclic and therefore terminating.

There is an important subtlety that is essential to guarantee termination. This specification draws a distinction between knowing that an identifier has type 'entity', 'activity', or 'agent', and having an explicit entity(id), activity(id), or agent(id) statement in the instance. For example, focusing on entity statements, we can infer 'entity' ∈ typeOf(id) if entity(id) holds in the instance. In contrast, if we only know that 'entity' ∈ typeOf(id), this does not imply that entity(id) holds.

This distinction (for both entities and activities) is essential to ensure termination of the inferences, because we allow inferring that a declared entity(id,attrs) has a generation and invalidation event, using Inference 7 (entity-generation-invalidation-inference). Likewise, for activities, we allow inferring that a declared activity(id,t1,t2,attrs) has a generation and invalidation event, using Inference 8 (activity-start-end-inference). These inferences do not apply to identifiers whose types are known, but for which there is not an explicit entity or activity statement. If we strengthened the type inference constraints to add new entity or activity statements for the entities and activities involved in generating or starting other declared entities or activities, then we could keep generating new entities and activities in an unbounded chain into the past (as in the "chicken and egg" paradox). The design adopted here requires that instances explicitly declare the entities and activities that are relevant for validity checking, and only these can be inferred to have invalidation/generation and start/end events. This inference is not supported for identifiers that are indirectly referenced in other relations and therefore have type 'entity' or 'activity'.

validation process overview
Figure 1 ◊: Overview of the Validation Process

Checking ordering, typing, and impossibility constraints

The ordering, typing, and impossibility constraints are checked rather than applied. This means that they do not generate new formulas expressible in PROV, but they do generate basic constraints that might or might not be consistent with each other. Checking such constraints follows a saturation strategy similar to that for normalization:

  1. For ordering constraints, we check by generating all of the precedes and strictly-precedes relationships specified by the rules. These can be thought of as a directed graph whose nodes are terms, and whose edges are precedes or strictly-precedes relationships. An ordering constraint of the form ∀ x1...xn. A1 ∧ ... ∧ Ap ⇒ precedes(t,t') can be applied by searching for occurrences of A1 ∧ ... ∧ Ap and for each such match adding the atomic formula precedes(t,t') to the instance, and similarly for strictly-precedes constraints. After all such constraints have been checked, and the resulting edges added to the graph, the ordering constraints are violated if there is a cycle in the graph that includes a strictly-precedes edge, and satisfied otherwise.

  2. For typing constraints, we check by constructing a function typeOf(id) mapping identifiers to sets of possible types. We start with a function mapping each identifier to the empty set, reflecting no constraints on the identifiers' types. A typing constraint of the form ∀ x1...xn. A1 ∧ ... ∧ Ap ⇒ 'type' ∈ typeOf(id) is checked by adjusting the function by adding 'type' to typeOf(id) for each conclusion 'type' ∈ typeOf(id) of the rule. Typing constraints with multiple conclusions are handled analogously. Once all constraints have been checked in all possible ways, we check that the disjointness constraints hold of the resulting typeOf function. (These are essentially impossibility constraints).

  3. For impossibility constraints, we check by searching for the forbidden pattern that the impossibility constraint describes. Any match of this pattern leads to failure of the constraint checking process. An impossibility constraint of the form ∀ x1...xn. A1 ∧ ... ∧ Ap ⇒ False can be applied by searching for occurrences of A1 ∧ ... ∧ Ap in the instance, and if any such occurrence is found, signaling failure.

A normalized instance that satisfies all of the checked constraints is called valid. Validity can be, but is not required to be, checked by normalizing and then checking constraints. Any other algorithm that provides equivalent behavior (that is, accepts the same valid instances and rejects the same invalid instances) is allowed. In particular, the checked constraints and the applied definitions, inferences and uniqueness constraints do not interfere with one another, so it is also possible to mix checking and application. This may be desirable in order to detect invalidity more quickly.

Equivalence and Isomorphism

Given two normal forms, a natural question is whether they contain the same information, that is, whether they are equivalent (if so, then the original instances are also equivalent.) By analogy with logic, if we consider normalized PROV instances with existential variables to represent sets of possible situations, then two normal forms may describe the same situation but differ in inessential details such as the order of statements or of elements of attribute-value lists. To remedy this, we can easily consider instances to be equivalent up to reordering of attributes. However, instances can also be equivalent if they differ only in choice of names of existential variables. Because of this, the appropriate notion of equivalence of normal forms is isomorphism. Two instances I1 and I2 are isomorphic if there is an invertible substitution S mapping existential variables to existential variables such that S(I1) = I2. This is similar to the notion of equivalence used in [RDF], where blank nodes play an analogous role to existential variables.

Equivalence can be checked by normalizing instances, checking that both instances are valid, then testing whether the two normal forms are isomorphic. (It is technically possible for two invalid normal forms to be isomorphic, but to be considered equivalent, the two instances must also be valid.) As with validity, the algorithm suggested by this specification is just one of many possible ways to implement equivalence checking; it is not required that implementations compute normal forms explicitly, only that their determinations of equivalence match those obtained by the algorithm in this specification.

From Instances to Bundles and Documents

PROV documents can contain multiple instances: a toplevel instance consisting of the set of statements not appearing within a bundle, and zero or more named instances called bundles. For the purpose of inference and constraint checking, these instances are treated independently. That is, a PROV document is valid provided that each instance in it is valid and the names of its bundles are distinct. Similarly, a PROV document is equivalent to another if their toplevel instances are equivalent, they have the same number of bundles with the same names, and the instances of their corresponding bundles are equivalent. Analogously to blank nodes in [RDF], the scope of an existential variable in PROV is the instance level, so existential variables with the same name occurring in different instances do not necessarily denote the same term. This is a consequence of the fact that the instances of two equivalent documents only need to be pairwise isomorphic; this is a weaker property than requiring that there be a single isomorphism that works for all of the corresponding instances.

2.5 Summary of inferences and constraints

Table 2 summarizes the inferences, and constraints specified in this document, broken down by component and type or relation involved.

Table: work in progress; these entries might change when the document is updated.
Table 2: Summary of inferences and constraints for PROV Types and Relations
Type or Relation Name Inferences and ConstraintsComponent
Entity Inference 7 (entity-generation-invalidation-inference)
Inference 21 (specialization-attributes-inference)
Constraint 23 (key-object)
Constraint 56 (impossible-object-property-overlap)
Constraint 57 (entity-activity-disjoint)
1
Activity Inference 8 (activity-start-end-inference)
Constraint 23 (key-object)
Constraint 29 (unique-startTime)
Constraint 30 (unique-endTime)
Constraint 56 (impossible-object-property-overlap)
Constraint 57 (entity-activity-disjoint)
Generation Inference 6 (generation-use-communication-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 25 (unique-generation)
Constraint 36 (generation-within-activity)
Constraint 38 (generation-precedes-invalidation)
Constraint 39 (generation-precedes-usage)
Constraint 41 (generation-generation-ordering)
Constraint 43 (derivation-usage-generation-ordering)
Constraint 44 (derivation-generation-generation-ordering)
Constraint 45 (wasStartedBy-ordering)
Constraint 46 (wasEndedBy-ordering)
Constraint 47 (specialization-generation-ordering)
Constraint 49 (wasAssociatedWith-ordering)
Constraint 50 (wasAttributedTo-ordering)
Constraint 51 (actedOnBehalfOf-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Usage Inference 6 (generation-use-communication-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 35 (usage-within-activity)
Constraint 39 (generation-precedes-usage)
Constraint 40 (usage-precedes-invalidation)
Constraint 43 (derivation-usage-generation-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Communication Inference 5 (communication-generation-use-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 37 (wasInformedBy-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Start Inference 9 (wasStartedBy-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 27 (unique-wasStartedBy)
Constraint 29 (unique-startTime)
Constraint 32 (start-precedes-end)
Constraint 35 (usage-within-activity)
Constraint 36 (generation-within-activity)
Constraint 37 (wasInformedBy-ordering)
Constraint 33 (start-start-ordering)
Constraint 45 (wasStartedBy-ordering)
Constraint 49 (wasAssociatedWith-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
End Inference 10 (wasEndedBy-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 28 (unique-wasEndedBy)
Constraint 30 (unique-endTime)
Constraint 32 (start-precedes-end)
Constraint 35 (usage-within-activity)
Constraint 36 (generation-within-activity)
Constraint 37 (wasInformedBy-ordering)
Constraint 34 (end-end-ordering)
Constraint 46 (wasEndedBy-ordering)
Constraint 49 (wasAssociatedWith-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Invalidation Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 26 (unique-invalidation)
Constraint 38 (generation-precedes-invalidation)
Constraint 40 (usage-precedes-invalidation)
Constraint 42 (invalidation-invalidation-ordering)
Constraint 45 (wasStartedBy-ordering)
Constraint 46 (wasEndedBy-ordering)
Constraint 48 (specialization-invalidation-ordering)
Constraint 49 (wasAssociatedWith-ordering)
Constraint 50 (wasAttributedTo-ordering)
Constraint 51 (actedOnBehalfOf-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Derivation Inference 11 (derivation-generation-use-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 43 (derivation-usage-generation-ordering)
Constraint 44 (derivation-generation-generation-ordering)
Constraint 52 (typing)
2
Revision Inference 12 (revision-is-alternate-inference)
Quotation No specific constraints
Primary Source No specific constraints
Influence No specific constraints
Agent Constraint 23 (key-object)
Constraint 56 (impossible-object-property-overlap)
3
Attribution Inference 13 (attribution-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 50 (wasAttributedTo-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Association Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 49 (wasAssociatedWith-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Delegation Inference 14 (delegation-inference)
Inference 15 (influence-inference)
Constraint 24 (key-properties)
Constraint 51 (actedOnBehalfOf-ordering)
Constraint 55 (impossible-property-overlap)
Constraint 52 (typing)
Influence Inference 15 (influence-inference)
Constraint 24 (key-properties)
Bundle constructor No specific constraints; see section 6.2 Bundles and Documents 4
Bundle type No specific constraints; see section 6.2 Bundles and Documents
Alternate Inference 16 (alternate-reflexive)
Inference 17 (alternate-transitive)
Inference 18 (alternate-symmetric)
Constraint 52 (typing)
5
Specialization Inference 19 (specialization-transitive)
Inference 20 (specialization-alternate-inference)
Inference 21 (specialization-attributes-inference)
Constraint 47 (specialization-generation-ordering)
Constraint 48 (specialization-invalidation-ordering)
Constraint 54 (impossible-specialization-reflexive)
Constraint 52 (typing)
Mention Inference 22 (mention-specialization-inference)
Constraint 31 (unique-mention)
Constraint 52 (typing)
Collection No specific constraints 6
Membership Constraint 58 (membership-empty-collection)
Constraint 52 (typing)

2.3. Compliance with this document

TODO: Add collection and account constraint sections to the compliance list as appropriate.

For the purpose of compliance, the normative sections of this document are section 2.3. Compliance with this document, section 3.4. Definitions and Inferences and Definitions, section 4. Equivalence, and section 5. Validity Constraints, and section 6. Normalization, Validity, and Equivalence. To be compliant:

  1. When processing provenance, an application may apply the inferences and definitions in section 3.4. Definitions and Inferences and Definitions.
  2. If determining whether a PROV instance or document is valid, an application must check that all of the constraints of section 5. Constraints are satisfied on the normal form of the instance or document.
  3. If producing provenance meant for other applications to use, the application should produce valid provenance, as specified in section 6. Normalization, Validity, and Equivalence.
  4. WhenIf determining whether two PROV instances or documents are equivalent, an application must determine whether their normal forms are equal, as specified in section 4.6. Normalization, Validity, and Equivalence.
  5. When determining whether

Compliant applications are not required to explicitly compute normal forms; however, if checking validity or equivalence, the results should be the same as would be obtained by computing normal forms as defined in this specification.

All figures are for illustration purposes only. Information in tables is normative if it appears in a PROV instance normative section; specifically, Table 3 is valid, an application must check that all of the constraints of section 5. Validity Constraints are satisfiednormative. on the normal form of the instance. When producing provenance meant for other applications to use, the application should produce valid provenance. Should we specifyText in appendices and in boxes labeled "Remark" is informative. Where there is any apparent ambiguity between the descriptive text and the formal text in a way for PROV instances to say whether they are meant to be validated "definition", "inference" or not? Seems outside the scope of this document, may require changes to PROV-N. "constraint" box, the formal text takes priority.

3. 4. Definitions and Inferences and Definitions

In this section, we describeThis section describes definitions and inferences and definitions that may be used on provenance data, and preserve equivalence on valid PROV instances (as detailed in section 6. Normalization, Validity, and Equivalence). A definition is a notion of equivalence on rule that can be applied to PROV instances. instances to replace defined expressions with definitions. An inference is a rule that can be applied to PROV instances to add new PROV statements. A definition states that a provenance statement is equivalent to some other statements, whereas an inference only states one direction of an implication; thus, defined provenance statements can be replaced by their definitions.

Definitions have the following general form:

defined_stmt IF AND ONLY IF there exists a rule1,..., am such that statesdefining_stmt1 and ... and defining_stmtn.

A definition can be applied to a PROV instance, since its defined_stmt is defined in terms of other statements. Applying a definition to an instance means that if an occurrence of a defined provenance expressionstatement defined_stmt can be found in a PROV instance, then we can remove it and add all of the statements defining_stmt1 ... defining_stmtn to the instance, possibly after generating fresh identifiers a1,...,am for existential variables. In other words, it is equivalentsafe to replace a defined statement with its definition.

Inferences have the following general form:

IF hyp1 and ... and hypk THEN there exists a1 and ... and am such that concl1 and ... and concln.

Inferences can be applied to PROV instances. Applying an inference to an instance means that if all of the provenance statements matching hyp1... hypk can be found in the instance, then we check whether the conclusion concl1 ... concln is satisfied for some other expressions; thus, defined provenance expressions canvalues of existential variables. If so, application of the inference has no effect on the instance. If not, then a copy the conclusion should be replaced by their definitions, and vice versa. added to the instance, after generating fresh identifiers a1,...,am for the existential variables. These fresh identifiers might later be found to be equal to known identifiers; they play a similar role in PROV constraints to existential variables in logic, to "labeled nulls" in database theory [DBCONSTRAINTS], or to blank nodes in [RDF]. In general, omitted optional parameters to [PROV-N] statements, or explicit - markers, are placeholders for existentially quantified variables; that is, they denote unknown values. There are a few exceptions to this general rule, which are specified in Definition 4 (optional-placeholders).

InferencesDefinitions and inferences can be viewed as logical formulas; similar formalisms are often used in rule-based reasoning [CHR] and in databases [DBCONSTRAINTS]. In particular, the identifiers a1 ... an should be viewed as existentially quantified variables, meaning that through subsequent reasoning steps they may turn out to be equal to other identifiers that are already known, or to other existentially quantified variables. Their treatment is analogous to that of blank nodes in RDF. In contrast, distinct URIs or literal values in PROV are assumed to be distinct for the purpose of checking validity or inferences. This issue is discussed in more detail under Uniqueness Constraints.

In a [definition|inference], term symbols such as id, start, end, e, a, attrs, are assumed to be variables unless otherwise specified. These variables are scoped at the [definition|inference|constraint] level, so the rule is equivalent to any one-for-one renaming of the variable names. When several rules are collected within a [definition|inference] as an ordered list, the scope of the variables in each rule is at the level of list elements, and so reuse of variable names in different rules does not affect the meaning.

4.1 Optional Identifiers and Attributes

Many PROV relation statements have an identifier, identifying a link between two or more related objects. Identifiers can sometimes be omitted in [PROV-O] notation. For the purpose of inference and validity checking, we generate special identifiers called existential variables denoting the unknown values.

Existential variables can be substituted with other terms. Specifically, a substitution is a function from a set of existential variables to identifiers, literals, the placeholder -, or other existential variables. A substitution S can be applied to an instance I by replacing all occurrences of existential variables x in the instance with S(x).

Definition 1 (optional-identifiers), Definition 2 (optional-attributes), and Definition 3 (definition-short-forms), explain how to expand the compact forms of PROV-N notation into a normal form. Definition 4 (optional-placeholders) indicates when other optional parameters can be replaced by existential variables.

For each r in { used, wasGeneratedBy, wasInvalidatedBy, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, the following general form: definitional rules hold:

  1. r(a1,...,an) IF hyp_1 and ... and hyp_k AND ONLY IF there exists id such that r(id; a1,...,an).
  2. r(-; a1,...,an) THENIF AND ONLY IF there exists a_1 and ... and a_mid such that conclusion_1 and ... and conclusion_n r(id; a1,...,an).

This meansLikewise, many PROV-N statements allow for an optional attribute list. If it is omitted, this is the same as specifying an empty attribute list:

  1. For each p in {entity, activity, agent}, if an is not an attribute list parameter then the following definitional rule holds:

    p(a1,...,an) IF AND ONLY IF p(a1,...,an,[]).

  2. For each r in { used, wasGeneratedBy, wasInvalidated, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, if an is not an attribute list parameter then the following definition holds:

    r(id; a1,...,an) IF AND ONLY IF r(id; a1,...,an,[]).

Definitions Definition 1 (optional-identifiers) and Definition 2 (optional-attributes). do not apply to alternateOf, specializationOf, and mentionOf, which do not have identifiers and attributes.

Finally, many PROV statements have other optional arguments or short forms that can be used if allnone of the provenance expressions matching hyp_1... hyp_k can be found in a PROV instance, we can add all of the expressions concl_1 ... concl_n to the instance, possibly after generating fresh identifiers a_1,...,a_m for unknown objects.optional arguments is present. These are handled by specific rules listed below.

  1. activity(id,attrs) IF AND ONLY IF activity(id,-,-,attrs).
  2. wasGeneratedBy(id; e,attrs) IF AND ONLY IF wasGeneratedBy(id; e,-,-,attrs).
  3. used(id; a,attrs) IF AND ONLY IF used(id; a,-,-,attrs).
  4. wasStartedBy(id; a,attrs) IF AND ONLY IF wasStartedBy(id; a,-,-,-,attrs).
  5. wasEndedBy(id; a,attrs) IF AND ONLY IF wasEndedBy(id; a,-,-,-,attrs).
  6. wasInvalidatedBy(id; e,attrs) IF AND ONLY IF wasInvalidatedBy(id; e,-,-,attrs).
  7. wasDerivedFrom(id; e2,e1,attrs) IF AND ONLY IF wasDerivedFrom(id; e2,e1,-,-,-,attrs).
  8. wasAssociatedWith(id; e,attrs) IF AND ONLY IF wasAssociatedWith(id; e,-,-,attrs).
  9. actedOnBehalfOf(id; a2,a1,attrs) IF AND ONLY IF actedOnBehalfOf(id; a2,a1,-,attrs).

There are no expansion rules for entity, agent, communication, attribution, influence, alternate, specialization, or mention relations, because these have no optional parameters aside from the identifier and attributes, which are expanded by the rules in Definition 1 (optional-identifiers) and Definition 2 (optional-attributes).

Finally, most optional parameters (written -) are, for the purpose of this document, considered to be distinct, fresh existential variables. Optional parameters are defined in [PROV-DM] and in [PROV-N] for each type of PROV statement. Thus, before proceeding to apply other definitions or inferences, most occurrences of - are to be replaced by fresh existential variables, distinct from any others occurring in the instance. The only exceptions to this general rule, where - are to be left in place, are the activity, generation, and usage parameters in wasDerivedFrom and the plan parameter in wasAssociatedWith. This is further explained in remarks below.

The treatment of optional parameters is specified formally using the auxiliary concept of expandable parameter. An expandable parameter is one that can be omitted using the placeholder -, and if so, it is to be replaced by a fresh existential identifier. Table 3 defines the expandable parameters of the properties of PROV, needed in Definition 4 (optional-placeholders). For emphasis, the four optional parameters that are not expandable are also listed. Parameters that cannot have value -, and identifiers might later be found to be equal to known identifiers; these fresh identifiers play that are expanded by Definition 1 (optional-identifiers), are not listed.

Table 3: Expandable and Non-Expandable Parameters
Relation Expandable Non-expandable
used(id; a,e,t,attrs) e,t
wasGeneratedBy(id; e,a,t,attrs) a,t
wasStartedBy(id; a2,e,a1,t,attrs) e,a1,t
wasEndedBy(id; a2,e,a1,t,attrs) e,a1,t
wasInvalidatedBy(id; e,a,t,attrs) a,t
wasDerivedFrom(id; e2,e1,-,g,u,attrs) g,u
wasDerivedFrom(id; e2,e1,a,g,u,attrs)
(where a similar role in PROV constraints to existential variables in logic. TODO: Is this re-inventing blank nodes in PROV-DM, and do we want to do this? A lot of the inferences have existentially quantified conclusions (and there is some theory that supports this). TODO: Make sure conjunctive reading of conclusion is clear. not placeholder -)
g,u a
wasAssociatedWith(id; a,ag,pl,attrs) ag pl
actedOnBehalfOf(id; ag2,ag1,a,attrs) a

DefinitionsDefinition 4 (optional-placeholders) states how parameters are to be expanded, using the expandable parameters defined in Table 3. The last two parts, 4 and 5, indicate how to handle expansion of parameters for wasDerivedFrom expansion, which is only allowed for the generation and use parameters when the activity is specified. Essentially, the definitions state that parameters g,u are expandable only if the activity is specified, i.e., if parameter a is provided. The rationale for this is that when a is provided, then there have to be two events, namely u and g, which account for the usage of e1 and the generation of e2, respectively, by a. Conversely, if a is not provided, then one cannot tell whether one or more activities are involved in the derivation, and the explicit introduction of such events, which correspond to a single acitivity, would therefore not be justified.

A later constraint, Constraint 53 (impossible-unspecified-derivation-generation-use), forbids specifying generation and use parameters when the activity is unspecified.

  1. activity(id,-,t2,attrs) IF AND ONLY IF there exists t1 such that activity(id,t1,t2,attrs). Here, t2 may be a placeholder.
  2. activity(id,t1,-,attrs) IF AND ONLY IF there exists t2 such that activity(id,t1,t2,attrs). Here, t1 must not be a placeholder.
  3. For each r in { used, wasGeneratedBy, wasStartedBy, wasEndedBy, wasInvalidatedBy, wasAssociatedWith, actedOnBehalfOf }, if the ith parameter of r is an expandable parameter of r as specified in Table 3 then the following general form:definition holds:

    r(a0;...,ai-1, -, ai+1, ...,an) IF AND ONLY IF there exists a' such that r(a0;...,ai-1,a',ai+1,...,an).

  4. If a is not the placeholder -, and u is any term, then the following definition holds:

    wasDerivedFrom(id;e2,e1,a,-,u,attrs) IF AND ONLY IF there exists g such that wasDerivedFrom(id; e2,e1,a,g,u,attrs).

  5. If a is not the placeholder -, and g is any term, then the following definition holds:

    wasDerivedFrom(id;e2,e1,a,g,-,attrs) IF AND ONLY IF there exists u such that wasDerivedFrom(id; e2,e1,a,g,u,attrs).

In an association of the form wasAssociatedWith(id; a,ag,-,attr), the absence of a plan means: either no plan exists, or a plan exists but it is not identified. Thus, it is not equivalent to wasAssociatedWith(id; a,ag,p,attr) where a plan p is given.

A derivation wasDerivedFrom(id; e2,e1,a,gen,use,attrs) that specifies an activity explicitly indicates that this activity achieved the derivation, with a usage use of entity e1, and a generation gen of entity e2. It differs from a derivation of the form wasDerivedFrom(id; e2,e1,-,-,-,attrs) with missing activity, generation, and usage. In the latter form, it is not specified if one or more activities are involved in the derivation.

Let us consider a system, in which a derivation is underpinned by multiple activities. Conceptually, one could also model such a system with a new activity that encompasses the two original activities and underpins the derivation. The inferences defined in this specification do not allow the latter modelling to be inferred from the former. Hence, the two modellings of the same system are regarded as different in the context of this specification.

4.2 Entities and Activities

Communication between activities implies the existence of an underlying entity generated by one activity and used by the other, and vice versa.

defined_exp holds

IF AND ONLY IF wasInformedBy(_id; a2,a1,_attrs) THEN there exists a_1,..., a_m such that defining_exp_1 andexist e, ... and defining_exp_n. This means that a provenance expression defined_exp is defined in terms of other expressions. This can be viewed as a two-way inference: If defined_exp can be found in a PROV instance, we can add all of the expressions defining_exp_1 ... defining_exp_n to the instance, possibly after generating fresh identifiers a_1,...,a_m for unknown objects. Conversely, if there exist identifiers a_1...a_m such that defining_exp_1 and ... and defining_exp_n hold in the instance, we can add the defined expression def_exp. When an expression is defined in terms of others, it is in a sense redundant; it is safe to replace it with its definition. 3.1 Component 1: Entities and Activities Communication between activities is defined in terms as the existence of an underlying entity generated by one activity and used by the other. Given two activities identified by a1 and a2_gen, _t1, _use, and _t2, wasInformedBy(a2,a1) holds such that wasGeneratedBy(_gen; e,a1,_t1,[]) and used(_use; a2,e,_t2,[]) hold.

IF AND ONLY IF wasGeneratedBy(_gen; e,a1,_t1,_attrs1) and used(_id2; a2,e,_t2,_attrs2) hold THEN there is an entity with some identifier e and some sets of attribute-value pairs attrs1 and attrs2,exists _id such that wasGeneratedBy(-,e,a1,-,attrs1) and used(-,a2,e,-,attrs2) hold. wasInformedBy(_id; a2,a1,[])

The relationship wasInformedBy is not transitive.transitive. Indeed, consider the following statements.

wasInformedBy(a2,a1)
wasInformedBy(a3,a2)

We cannot infer wasInformedBy(a3,a1) from these expressions.statements alone. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. The following illustration shows a case for which transitivity cannot hold.counterexample to transitivity. The horizontal axis represents the event line. We see that e1 was generated after e2 was used. Furthermore, the illustration also shows that a3 completes before a1. started. So in this example (with no other information) it is impossible for a3 to have used an entity generated by a1. This is illustrated in Figure 2.

non transitivity of wasInformedBy
Figure 2 ◊: Counter-example for transitivity of wasInformedBy

From an entity statement, we can infer the existence of generation and invalidation events.

IF entity(e,_attrs) THEN there exist _gen, _a1, _t1, _inv, _a2, and _t2 such that wasGeneratedBy(_gen; e,_a1,_t1,[]) and wasInvalidatedBy(_inv; e,_a2,_t2,[]).


From an activity statement, we can infer start and end events whose times match the start and end times of the activity, respectively.

Start IF activity(a,t1,t2,_attrs) THEN there exist _start, _e1, _a1, _end, _a2, and _e2 such that wasStartedBy(_start; a,_e1,_a1,t1,[]) and wasEndedBy(_end; a,_e2,_a2,t2,[]).


The start of a2an activity a triggered by entity e1 implies that e1 was generated by the starting activity a1 is defined as follows..

Given two activities with identifiers a1 and a2, wasStartedByActivity(a2,a1) holds

IF AND ONLY wasStartedBy(_id; a,e1,a1,_t,_attrs), THEN there exist _gen and _t1 such that wasGeneratedBy(_gen; e1,a1,_t1,[]).


Likewise, the ending of activity a by triggering entity e1 implies that e1 was generated by the ending activity a1.

IF wasEndedBy(_id; a,e1,a1,_t,_attrs), THEN there exists an entity e exist _gen and _t1 such that wasGeneratedBy(-,e,a1,-,-) and wasStartedBy(-,a2,e,-,-) hold.wasGeneratedBy(_gen; e1,a1,_t1,[]).

3.2 Component 2: 4.3 Derivations


Derivations with explicit activity, generation, and usage admit the following inference:

In this inference, none of a, gen2 or use1 can be placeholders -.

IF wasDerivedFrom(_id; e2,e1,a,gen2,use1,_attrs), THEN there exists _t1 and _t2 such that used(use1; a,e1,_t1,[]) and wasGeneratedBy(gen2; e2,a,_t2,[]).


A revision admits the following inference, stating that the two entities linked by a revision are also alternates.

In this inference, any of _a, _g or _u may be placeholders.

IF wasDerivedFrom(_id; e2,e1,_a,_g,_u,[prov:type='prov:Revision']), THEN alternateOf(e2,e1).

There is no inference stating that wasDerivedFrom is transitive.

4.4 Agents

Attribution identifiesis the ascribing of an agent as responsible forentity to an entity.agent. An entity can only be ascribed to an agent can only be responsible for an entity if itthe agent was associated with an activity that generated the entity. If the activity, generation and association events are not explicit in the instance, they can be inferred.

IF wasAttributedTo(e,ag) holds for some identifiers e and ag, wasAttributedTo(_att; e,ag,_attrs) THEN there exists an activity with some identifier exist a , _t, _gen, _assoc, _pl, such that the following statements hold: activity(a, -, -,-) wasGeneratedBy(-,e, a, -,_) wasAssociatedWith(-,a, ag, -, -) wasGeneratedBy(_gen; e,a,_t,[]) and wasAssociatedWith(_assoc; a,ag,_pl,[]).

Responsibility
In the above inference, _pl is an existential variable, so it can be merged with a constant identifier, another existential variable, or a placeholder -, as explained in the definition of merging.

Delegation relates agents where one agent acts on behalf of another, in the context of some activity. The supervising agent delegates some responsibility for part of the activity to the subordinate agent, while retaining some responsibility for the overall activity. Both agents are associated with this activity.

@@TODO: Could this be an inference? Does it imply

IF actedOnBehalfOf(_id; ag1, ag2, a, _attrs) THEN there exist _id1, _pl1, _id2, and _pl2 such that a1 is associated with all activities a2 is associated with? wasAssociatedWith(_id1; a, ag1, _pl1, []) and wasAssociatedWith(_id2; a, ag2, _pl2, []).

The two associations between the agents and the activity may have different identifiers, different plans, and different attributes. In particular, the plans of the two agents need not be the same, and one, both, or neither can be the placeholder - indicating that there is no plan, because the existential variables _pl1 and _pl2 can be replaced with constant identifiers, existential variables, or placeholders - independently, as explained in the definition of merging.

The wasInfluencedBy relation is implied by other relations, including usage, start, end, generation, invalidation, communication, derivation, attribution, association, and delegation. To capture this explicitly, we allow the following inferences:

  1. IF wasGeneratedBy(id; e,a,_t,attrs) THEN wasInfluencedBy(id; e, a, attrs).
  2. IF used(id; a,e,_t,attrs) THEN wasInfluencedBy(id; a, e, attrs).
  3. IF wasInformedBy(id; a2,a1,attrs) THEN wasInfluencedBy(id; a2, a1, attrs).
  4. IF wasStartedBy(id; a2,e,a1,_t,attrs) THEN wasInfluencedBy(id; a2, e, attrs).
  5. IF wasEndedBy(id; a2,e,_a1,_t,attrs) THEN wasInfluencedBy(id; a2, e, attrs).
  6. IF wasInvalidatedBy(id; e,a,_t,attrs) THEN wasInfluencedBy(id; e, a, attrs).
  7. IF wasDerivedFrom(id; e2, e1, a, g, u, attrs) THEN wasInfluencedBy(id; e2, e1, attrs). Here, a, g, u may be placeholders -.
  8. IF wasAttributedTo(id; e,ag,attrs) THEN wasInfluencedBy(id; e, ag, attrs).
  9. IF wasAssociatedWith(id; a,ag,_pl,attrs) THEN wasInfluencedBy(id; a, ag, attrs). Here, _pl may be a placeholder -.
  10. IF actedOnBehalfOf(id; ag2,ag1,_a,attrs) THEN wasInfluencedBy(id; ag2, ag1, attrs).
The inferences above permit the use of same identifier for an influence relationship and a more specific relationship.

3.3 Component 3: Derivations4.5 Alternate and Specialized Entities


The relation alternateOf is an equivalence relation: that is, it is reflexive, transitive and symmetric. As a consequence, the following inferences can be applied:

Derivations with an explicit activity and no usage admit the following inference: Given an activity a, entities denoted by e1 and e2, IF wasDerivedFrom(-,e2,e1, a, -) and wasGeneratedBy(-,e2,a,-,-) hold, entity(e) THEN used(-,a,e1,-,-) also holds. alternateOf(e,e).

This inference is justified by the fact that the entity denoted by e2 is generated by at most one activity in a given account (see generation-uniqueness). Hence, this activity is also the one referred to by the usage of e1. There is some redundancy in the following discussion. The converse inference does not hold. From wasDerivedFrom(e2,e1) and used(a,e1,-), one cannot derive wasGeneratedBy(e2,a,-) because identifier e1 may occur in usages performed by many activities, which may have not generated the entity denoted by e2. Note that derivation cannot in general be inferred from the existence of related usage and generation events. Indeed, when a generation wasGeneratedBy(g, e2, a, -, attrs2) precedes used(u, a, e1, -, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2,e1) since e2 cannot possibly be derived from e1, given the creation of e2 precedes the use of e1. That is, if e1 is generated by an activity before e2 is used, then obviously e2 cannot have been derived from e1. However, even if e2 happens used before e1 is generated, it is not safe to assume that e2 was derived from e1. Derivation is not defined to be transitive. Domain-specific specializations of derivation may be defined in such a way that the transitivity property holds. A revision admits the following inference, linking the two entities by a derivation, and stating them to be alternates. Given two identifiers e1 and e2 identifying two entities, and an identifier ag identifying an agent, IF wasRevisionOf(-,e2,e1,ag) holds, THEN the following hold: wasDerivedFrom(-,e2,e1,-) alternateOf(e1,e2) wasAttributedTo(e2,ag) The following doesn't make sense because wasRevisionOf and wasDerivedFrom have different types. wasRevisionOf is a strict sub-relation of wasDerivedFrom since two entities e2 and e1 may satisfy wasDerivedFrom(e2,e1) without being a variant of each other. Motivation for quotation inference IF wasQuotedFrom(e2,e1,ag2,ag1,attrs) holds for some identifiers e2, e1, ag2, ag1, THEN the following hold: wasDerivedFrom(e2,e1) wasAttributedTo(e2,ag2) wasAttributedTo(e1,ag1) Traceability allows an entity to be transitively linked to another entity it is derived from, to an agent it is attributed to, or another agent having some responsibility, or a trigger of an activity that generated it. Traceability can be inferred from existing statements, or can be asserted stating that a dependency path exists without its individual steps being expressed. This is captured by the following inferences: Given two identifiers e2 and e1 for entities, the following statements hold: IF wasDerivedFrom(e2,e1,a,g2,u1) holds, for some a, g2, u1, THEN tracedTo(e2,e1) also holds. IF wasDerivedFrom(e2,e1) holds, THEN tracedTo(e2,e1) also holds. IF wasAttributedTo(e2,ag1,aAttr) holds, THEN tracedTo(e2,ag1) also holds. IF wasAttributedTo(e2,ag2,aAttr), wasGeneratedBy(-,e2,a,-,gAttr), and actedOnBehalfOf(ag2,ag1,a,rAttr) hold, for some a, ag2, ag1, aAttr, gAttr, and rAttr, THEN tracedTo(e2,ag1) also holds. IF wasGeneratedBy(e2,a,gAttr) and wasStartedBy(a,e1,sAttr) hold, for some a, gAttr , sAttr then tracedTo(e2,e1) holds. IF tracedTo(e2,e) and tracedTo(e,e1) hold for some e, THEN tracedTo(e2,e1) also holds. We note that the inference rule traceability-inference does not allow us to infer anything about the attributes of the related entities, agents or events. 3.4 Component 4: Alternate Entities TODO: There is currently no consensus what inferences on alternate or specialization should be assumed. The following section lists possible inferences that may or may not be adopted. Section is under review, pending ISSUE-29. The relation alternateOf is an equivalence relation: reflexive, transitive and symmetric. For any entity e, we have alternateOf(e,e). For any entities e1, e2, e3, IF alternateOf(e1,e2) and alternateOf(e2,e3) THEN alternateOf(e1,e3). For any entity e1, e2, IF alternateOf(e1,e2) THEN alternateOf(e2,e1).


IF alternateOf(e1,e2) and alternateOf(e2,e3) THEN alternateOf(e1,e3).


IF alternateOf(e1,e2) THEN alternateOf(e2,e1).


Similarly, specialization is a strict partial order:order: it is irreflexive, anti-symmetric and transitive. Irreflexivity is handled later as Constraint 54 (impossible-specialization-reflexive)

IF specializationOf(e1,e2) For any entity e, it is not the case that have specializationOf(e,e)and specializationOf(e2,e3) THEN specializationOf(e1,e3).

For any entities e1, e2, it is not the case that specializationOf(e1,e2) and specializationOf(e2,e1). For any entities e1, e2, e3, IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3). Finally, if


If one entity specializes another, then they are also alternates:

For any entities e1, e2,

IF specializationOf(e1,e2) THEN alternateOf(e1,e2).


If one entity specializes another then all attributes of the more general entity are also attributes of the more specific one.

IF entity(e1, attrs) and specializationOf(e2,e1), THEN entity(e2, attrs).


TODO: Possible inferences about attributes, generation, invalidation?

Note: The following inference is associated with a feature "at risk" and may be removed from this specification based on feedback. Please send feedback to public-prov-comments@w3.org.

The following sections are retained from an older version, and are not consistent with the above constraints. This will be revised once the consensus on ISSUE-29

If one entity is clearer.a mention of another in a bundle, then the former is also a specialization of the latter:

IF mentionOf(e2,e1,b) THEN specializationOf(e2,e1).

3.4.1 Specialization Specialization is neither symmetric nor anti-symmetric. "Alice's toyota car on fifth Avenue" is a specialization of "Alice's toyota car", but the converse does not hold.
anti-symmetric counter-example???
Specialization is transitive. Indeed if specializationOf(e1,e2) holds, then there is some common thing, say T1-2 they both refer to, and e1 is a more specific aspect of this thing than e2. Likewise, if specializationOf(e2,e3) holds, then there is some common thing, say T2-3 they both refer to, and e2 is a more specific aspect of this thing than e3. The things T1-2 and T2-3 are the same since e2 is an aspect of both of them, so specializationOf(e1,e3) follows since e1 and e3 are aspects fo the same thing and e1 is more specific than e3. A specialization of "this email message" would be, for example, the "printed version on my desk", which is a specialization of "my thoughts on this email thread". So, the "printed version on my desk" is also a specialization "my thoughts on this email thread". 3.4.2 Alternate Alternate not is reflexive. Indeed, alternate(e,e) does not hold for any arbitrary entity e since e may not be a specialization of another entity. Alternate is symmetric. Indeed, if alternate(e1,e2) holds, then there exists an unspecified entity e, such that both e1 and e2 are specialization of e. Therefore, alternate(e2,e1) also holds. Alternate is not transitive. Indeed, if alternate(e1,e2) holds, then there exists an unspecified entity e1-2, such that both e1 and e2 are specialization of e1-2. Likewise, if alternate(e2,e3) holds, then there exists an unspecified entity e2-3, such that both e2 and e3 are specialization of e2-3. It does not imply that there is a common entity e1-3 that both e1 and e3 specialize. At 6pm, the customer in a chair is a woman in a red dress, who happens to be Alice. After she leaves, another customer arrives at 7pm, a man with glasses, who happens to be Bob. Transitivity does not hold since the womanInRedDress\ is not alternate of customerInChairAt7pm. alternate(womanInRedDress,customerInChairAt6pm) specialization(customerInChairAt6pm,Alice) specialization(womanInRedDress,Alice) alternate(manWithGlasses,customerInChairAt7pm) specialization(customerInChairAt7pm,Bob) specialization(manWithGlasses,Bob) alternate(customerInChairAt6pm, customerInChairAt7pm) specialization(customerInChairAt6pm, customerInChair) specialization(customerInChairAt7pm, customerInChair) The above example shows that customerInChairAt6pm and customerInChairAt7pm are two alternate entities that have no overlap, while womanInRedDress and customerInChairAt6pm do overlap. The following example illustrates another case of non-overlapping alternate entities. Two copies of the same book, where copy A was destroyed before copy B was made. 4. Equivalence For the purpose of checking inferences and constraints, we define a notion of equivalence of PROV s. Equivalence has the following characteristics: Missing attributes that are interpreted as omitted values are handled by generating a fresh identifier for the omitted value. Redundant expressions are merged according to uniqueness constraints. The order of provenance expressions is irrelevant to the meaning of a PROV instance. That is, a PROV instance is equivalent to any other instance obtained by permuting its expressions. Inference rules and definitions preserve equivalence. That is, a PROV instance is equivalent to the instance obtained by applying any inference rule. Equivalence is reflexive, symmetric, and transitive. 4.1 Optional Attributes TODO: Clarify how optional attributes are handled; clarify merging. The following is not very explicit about the difference between "not present" and "omitted but inferred". PROV-DM allows for some attributes to be optionally expressed. Unless otherwise specified, when an optional attribute is not present in a statement, some value should be assumed to exist for this attribute, though it is not known which. The only exceptions are: Activities also allow for an optional start time attribute. If both are specified, they must be the same, as expressed by the following constraint. Activities also allow for an optional end time attribute. If both are specified, they must be the same, as expressed by the following constraint. In a quotation of the form wasQuotedFrom(e2,e1,-,-,attrs), the absence of an agent means: either no agent exists, or an agent exists but it is not identified. In an association of the form wasAssociatedWith(a, ag, -, attr), the absence of a plan means: either no plan exists, or a plan exists but it is not identified. In an association of the form wasAssociatedWith(a, -, pl, attr), an agent exists but it is not identified. In a a delegation of the form actedOnBehalfOf(a, ag2, ag1, -, attr), the absence of an activity means that a2 acts on behalf of a1 for all activities with which a2 is associated. 4.2 Normalization We define the normal form of a PROV instance as the set of provenance expressions resulting from merging all of the overlapping expressions in the instance and applying all possible inference rules to this set. Formally, we say that two PROV instances are equivalent if they have the same normal form (that is, after applying all possible inference rules, the two instances produce the same set of PROV-DM expressions.) We should check that normal forms exist, i.e. that applying rules and definitions eventually terminates. More clarity is needed about enforcing uniqueness via merging vs. constraint checking. An application that processes PROV-DM data should handle equivalent instances in the same way. (Common exceptions to this rule include, for example, pretty printers that seek to preserve the original order of statements in a file and avoid expanding inferences.)

5. Validity Constraints

This section defines a collection of constraints on PROV instances. A PROV instance is valid if, after applying all possible inference and definition rules from Section 2, the resulting instance satisfies all of the constraints specified in this section. There are twothree kinds of constraints:

The PROV data modelAs in a [definition|inference], term symbols such as id, start, end, e, a, attrs in a constraint, are assumed to be variables unless otherwise specified. These variables are scoped at the constraint level, so the rule is implicitly based onequivalent to any one-for-one renaming of the variable names. When several rules are collected within a notionconstraint as an ordered list, the scope of instantaneous events (or just events), that mark transitionsthe variables in the world. Events include generation, usage, or invalidationeach rule is at the level of entities, as well as starting or endinglist elements, and so reuse of activities. This notion of event is variable names in different rules does not first-class in the data model, but it is useful for explaining its other concepts and its semantics [PROV-SEM]. Thus, events help justify inferences on provenance as well as validity constraints indicating when provenance is self-consistent. In section 8.4 Events and Time we discuss the motivation for instantaneous events and their relationship to time in greater detail.affect the meaning.

PROV-DM identifies five kinds of instantaneous events, namely entity generation event, entity usage event, entity invalidation event, activity start event and activity end event. PROV-DM adopts Lamport's clock assumptions [CLOCK] in the form of a reflexive, transitive partial order follows (and its inverse precedes) between instantaneous events. Furthermore, PROV-DM assumes the existence of a mapping from instantaneous events to time clocks, though the actual mapping is not in scope of this specification. TODO: More about what it means for constraints to be satisfied; constraint template(s)

5.1 Uniqueness Constraints

Attribute

In the absence of existential variables, uniqueness constraints? constraints could be checked directly by checking that no identifier appears more than once for a given statement. However, in the presence of existential variables, we need to be more careful to combine partial information that might be present in multiple compatible statements, due to inferences. Uniqueness constraints are enforced through merging pairs of statements subject to equalities. For example, suppose we have two activity statements activity(a,2011-11-16T16:00:00,_t1,[a=1]) and activity(a,_t2,2011-11-16T18:00:00,[b=2]), with existential variables _t1 and _t2. The merge of these two statements (describing the same activity a) is activity(a,2011-11-16T16:00:00,2011-11-16T18:00:00,[a=1,b=2]).

Merging is an operation that can be applied to a pair of terms, or a pair of attribute lists. The result of merging is either a substitution (mapping existentially quantified variables to terms) or failure, indicating that the merge cannot be performed. Merging of pairs of terms, attribute lists, or statements is defined as follows.

Merging for terms is analogous to unification in logic programming and theorem proving, restricted to flat terms with no function symbols. No occurs check is needed because there are no function symbols.

We assume A typical uniqueness constraint is as follows:

IF hyp1 and ... and hypn THEN t1 = u1 and ... and tn = un.

Such a constraint is enforced as follows:

  1. Suppose PROV instance I contains all of the hypotheses hyp1 and ... and hypn.
  2. Attempt to merge all of the equated terms in the conclusion t1 = u1 and ... and tn = un.
  3. If merging fails, then the constraint is unsatisfiable, so application of the constraint to I fails. If this failure occurs during normalization prior to validation, then I is invalid, as explained in Section 6.
  4. If merging succeeds with a substitution S, then S is applied to the instance I, yielding result S(I).

Key constraints are uniqueness constraints that the various identified objectsspecify that a particular key field of PROV-DM have unique statements describing them within a PROV instance.relation uniquely determines the other parameters. Key constraints are written as follows:

Given an entity identifier e, there is at most one expression entity(e,attrs), where attrs is some set of attribute-values. Given an activity identifier a, there is at most one expression activity(a,t1,t2,attrs), where attrs is some set of attribute-values.
TODO: Same goes

The ak field is a KEY for relation r(a0; a1,...,an).

Because of the presence of attributes, key constraints do not reduce directly to uniqueness constraints. Instead, we enforce key constraints as follows.

  1. Suppose r(a0; a1,...an,attrs1) and r(b0; b1,...bn,attrs2) hold in PROV instance I, where the key fields ak = bk are equal.
  2. Attempt to merge all of the corresponding parameters a0 = b0 and ... and an = bn.
  3. If merging fails, then the constraint is unsatisfiable, so application of the key constraint to I fails.
  4. If merging succeeds with substitution S, then we remove r(a0; a1,...an,attrs1) and r(b0; b1,...bn,attrs2) from I, obtaining instance I', and return instance {r(S(a0); S(a1),...S(an),attrs1 ∪ attrs2)}S(I').

Thus, if a PROV instance contains an apparent violation of a uniqueness constraint or key constraint, merging can be used to determine whether the constraint can be satisfied by instantiating some existential variables with other objects: agent, note, generation, usage, invalidation, start, end, communication, start by, attribution, association, responsibility, derivation, revision, quotation.terms. We should findFor key constraints, this is the same as merging pairs of statements whose keys are equal and whose corresponding arguments are compatible, because after merging respective arguments and attribute lists, the two statements become equal and one can be omitted.


The various identified objects of PROV must have unique statements describing them within a way valid PROV instance. This is enforced through the following key constraints:

  1. The identifier field e is a KEY for the entity(e,attrs) statement.
  2. The identifier field a is a KEY for the activity(a,t1,t2,attrs) statement.
  3. The identifier field ag is a KEY for the agent(ag,attrs) statement.

Likewise, the statements in a valid PROV instance must provide consistent information about each identified object or relationship. The following key constraints require that all of saying this once concisely. the information about each identified statement can be merged into a single, consistent statement:

We assume that an

  1. The identifier field id is a KEY for the wasGeneratedBy(id; e,a,t,attrs) statement.
  2. The identifier field id is a KEY for the used(id; a,e,t,attrs) statement.
  3. The identifier field id is a KEY for the wasInformedBy(id; a2,a1,attrs) statement.
  4. The identifier field id is a KEY for the wasStartedBy(id; a2,e,a1,t,attrs) statement.
  5. The identifier field id is a KEY for the wasEndedBy(id; a2,e,a1,t,attrs) statement.
  6. The identifier field id is a KEY for the wasInvalidatedBy(id; e,a,t,attrs) statement.
  7. The identifier field id is a KEY for the wasDerivedFrom(id; e2, e1, a, g2, u1, attrs) statement.
  8. The identifier field id is a KEY for the wasAttributedTo(id; e,ag,attr) statement.
  9. The identifier field id is a KEY for the wasAssociatedWith(id; a,ag,pl,attrs) statement.
  10. The identifier field id is a KEY for the wasAssociatedWith(id; a,ag,-,attrs) statement.
  11. The identifier field id is a KEY for the actedOnBehalfOf(id; ag2,ag1,a,attrs) statement.
  12. The identifier field id is a KEY for the wasInfluencedBy(id; o2,o1,attrs) statement.

Entities may have multiple generation or invalidation events (either or both may, however, be left implicit). An entity has exactlycan be generated by more than one activity, with one generation and invalidation event (either or both may, however,per each entity-activity pair. These events must be left implicit). So, PROV-DM allows for two distinct generations g1 and g2 referencing the same entity provided they occur simultaneouslysimultaneous, as required by Constraint 41 (generation-generation-ordering) and Constraint 42 (invalidation-invalidation-ordering). This implies that the two generation events are actually the same and caused by the same activity, though provenance may contain several statements for the same world activity.

Given an entity denoted by e, two activities denoted by a1 and a2, two time instants t1 and t2, and two sets of attribute-value pairs attrs1 and attrs2,

IF wasGeneratedBy(id1, e, a1, t1, attrs1) and wasGeneratedBy(id2, e, a2, t2, attrs2) exist,wasGeneratedBy(gen1; e,a,_t1,_attrs1) and wasGeneratedBy(gen2; e,a,_t2,_attrs2), THEN id1gen1 =id2, a1=a2, t1=t2 and attrs1=attrs2 gen2. Wouldn't the above constraint violate uniqueness? Invalidation uniqueness? A generation can be used to indicate a generation time without having to specify the involved activity. A generation time is unique, as specified by the following constraint.

Seems redundant given generation-uniqueness Given an entity denoted by e and two time instants t1 and t2, IF wasGeneratedBy(e, -, t1) and wasGeneratedBy(e, -, t2) hold, THEN t1=t2.


IF wasInvalidatedBy(inv1; e,a,_t1,_attrs1) and wasInvalidatedBy(inv2; e,a,_t2,_attrs2), THEN inv1 = inv2.

It follows from the above uniqueness and key constraints that the generation and invalidation events linking an entity and activity are unique, if specified. However, because we apply the constraints by merging, it is possible for a valid PROV instance to contain multiple statements about the same generation or invalidation event, for example:

wasGeneratedBy(id1; e,a,-,[prov:location="Paris"])
wasGeneratedBy(-; e,a,-,[color="Red"])

When the uniqueness and key constraints are applied, the instance is normalized to the following form:

wasGeneratedBy(id1; e,a,_t,[prov:location="Paris",color="Red"])

where _t is a new existential variable.


An activity may have more than one start and end event, each having a different activity (either or both may, however, be left implicit). However, the triggering entity linking any two activities in a start or end event is unique. That is, an activity may be started by several other activities, with shared or separate triggering entities. If an activity is started or ended by multiple events, they must all be simultaneous, as specified in Constraint 33 (start-start-ordering) and Constraint 34 (end-end-ordering).

IF wasStartedBy(start1; a,_e1,a0,_t1,_attrs1) and wasStartedBy(start2; a,_e2,a0,_t2,_attrs2), THEN start1 = start2.

IF wasEndedBy(end1; a,_e1,a0,_t1,_attrs1) and wasEndedBy(end2; a,_e2,a0,_t2,_attrs2), THEN end1 = end2.


An activity start event is the instantaneous event that marks the instant an activity starts. It allows for an optional time attribute. Activities also allow for an optional start time attribute. If both are specified, they must be the same, as expressed by the following constraint.

IF activity(a,t1,t2,-) and wasStartedBy(id,a,e,t,-)activity(a2,t1,_t2,_attrs) and wasStartedBy(_start; a2,_e,_a1,t,_attrs), THEN t1=t=t1.


An activity end event is the instantaneous event that marks the instant an activity ends. It allows for an optional time attribute. Activities also allow for an optional end time attribute. If both are specified, they must be the same, as expressed by the following constraint.

IF activity(a,t1,t2,-) and wasEndedBy(id,a,e,t,-)activity(a2,_t1,t2,_attrs) and wasEndedBy(_end; a2,_e,_a1,t,_attrs1), THEN t2 = t .


Note: The following constraint is associated with a feature "at risk" and may be removed from this specification based on feedback. Please send feedback to public-prov-comments@w3.org.

An entity can be the subject of at most one mention relation.

IF mentionOf(e, e1, b1) and mentionOf(e, e2, b2), THEN e1= t2e2 and b1=b2.

5.2 Event Ordering Constraints

Given that provenance consists of a description of past entities and activities, valid provenance instances must satisfy ordering constraints between instantaneous events, which we introduceare introduced in this section. For instance, an entity can only be used after it was generated; hence, we say thatin other words, an entity's generation event precedes any of this entity's usage events. Should this ordering constraint be violated, the associated generation and usage couldwould not be credible. The rest of this section defines the temporal interpretation of provenance instances as a set of instantaneous event ordering constraints.

To allow for minimalistic clock assumptions, like Lamport [CLOCK], PROV-DMPROV relies on a notion of relative ordering of instantaneous events, without using physical clocks. This specification assumes that a partial orderpreorder exists between instantaneous events.

Specifically, precedes is a partial order preorder between instantaneous events. When we sayA constraint of the form e1 precedes e2, this means that either the two events are equale1 happened at the same time as or e1 happened before e2. For symmetry, follows is defined as the inverse of precedes; that is, whena constraint of the form e1 follows e2 means that e1 happened at the same time as or after e2. Both relations are preorders, meaning that they are reflexive and transitive. Moreover, we sometimes consider strict forms of these orders: we say e1 followsstrictly precedes e2 to indicate that e1 happened before e2, this means that either the two events are equal or e1 happened after e2. Both relations are partial orders, meaning that they are reflexive, transitive, and antisymmetric. but not at the same time. This is a transitive relation.

Do we want

PROV also allows for time observations to allow an event to "precede" itself? Perhaps precedes should be strict. The following discussion is unclear: what is being said here, and why? PROV-DM also allowsinserted in specific provenance statements, for time observations to be inserted in specific provenance statements, for each of the five kinds of instantaneous events introduced in this specification. Times in provenance records arising from different sources might be with respect to different timelines (e.g. different time zones) leading to apparent inconsistencies. For the purpose of checking ordering constraints, the times associated with events are irrelevant; thus, there is no inference that time ordering implies event ordering, or vice versa. However, an application may flag time values that appear inconsistent with the event ordering as possible inconsistencies. When generating provenance, an application should use a consistent imeline for related PROV statements within an instance.

A typical ordering constraint is as follows.

IF hyp1 and ... and hypn THEN evt1 precedes/strictly precedes evt2.

The presence conclusion of an ordering constraint is either precedes or strictly precedes. One way to check ordering constraints is to generate all precedes and strictly precedes relationships arising from the ordering constraints to form a time observation fordirected graph, with edges marked precedes or strictly precedes, and check that there is no cycle containing a given instantaneous event fixes the mapping of this instantaneous event to the timeline. The presence of time information in a provenance statement instantiates the ordering constraint with that time information. It is expected that such instantiated constraints can help corroborate provenance information. We anticipate that verification algorithms could be developed, though this verification is outside the scope of this specification. strictly precedes edge.

The following figure summarizes the ordering constraints in a graphical manner. For each subfigure, an event time line points to the right. Activities are represented by rectangles, whereas entities are represented by circles. Usage, generation and derivation are represented by the corresponding edges between entities and activities. The five kinds of instantaneous events are represented by vertical dotted lines (adjacent to the vertical sides of an activity's rectangle, or intersecting usage and generation edges). The ordering constraints are represented by triangles: an occurrence of a triangle between two instantaneous event vertical dotted lines represents that the event denoted by the left line precedes the event denoted by the right line. Summary of instantaneous event ordering constraints for activities

5.2.1 Activity constraints

In thisThis section we discussspecifies ordering constraints from the perspective of the lifetime of an activity. An activity starts, then during its lifetime uses, generates entities and communicatescan use, generate or invalidate entities, communicate with, start, or end other activities, or be associated with or starts other activities, agents, and finally it ends. The following constraints amount to checking that all of the events associated with an activity take place within the activity's lifetime, and the start and end events mark the start and endpoints of its lifetime.

Figure 3 summarizes the ordering constraints on activities in a graphical manner. For this and subsequent figures, an event time line points to the right. Activities are represented by rectangles, whereas entities are represented by circles. Usage, generation and invalidation are represented by the corresponding edges between entities and activities. The five kinds of instantaneous events are represented by vertical dotted lines (adjacent to the vertical sides of an activity's rectangle, or intersecting usage and generation edges). The ordering constraints are represented by triangles: an occurrence of a triangle between two instantaneous event vertical dotted lines represents that the event denoted by the left line precedes the event denoted by the right line.

Miscellaneous suggestions about figures (originally from Tim Lebo):
  • I think it would help if the "corresponding edges between entities and activities" where the same visual style as the vertical line marking the time the Usage, generation and derivation occurred. A matching visual style provides a Gestalt that matches the concept. I am looking at subfigures b and c in 5.2.
constraints between events
Figure 3 ◊: Summary of instantaneous event ordering constraints for activities

The existence of an activity implies that the activity start event always precedes the corresponding activity end event. This is illustrated by Subfigure ordering-activity-fig Figure 3 (a) and expressed by constraint start-precedes-endConstraint 32 (start-precedes-end).

IF wasStartedBy(start,a,-,-)wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and wasEndedBy(end,a,-,-)wasEndedBy(end; a,_e2,_a2,_t2,_attrs2) THEN start precedes end.


If an activity is started by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two start events that start the same activity, then one precedes end the other. Using this constraint in both directions means that each event precedes the other.

IF wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasStartedBy(start2; a,_e2,_a2,_t2,_attrs2) THEN start1 precedes start2.


If an activity is ended by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two end events that end the same activity, then one precedes the other. Using this constraint in both directions means that each event precedes the other, that is, they are simultaneous.

IF wasEndedBy(end1; a,_e2,_a1,_t1,_attrs1) and wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN end1 precedes end2.


A usage implies ordering of events, since the usage event had to occur during the associated activity. This is illustrated by Subfigure ordering-activity-figFigure 3 (b) and expressed by constraint usage-within-activityConstraint 35 (usage-within-activity).

  1. IF used(use,a,e,-,-) and wasStartedBy(start,a,-,-)wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and used(use; a,_e2,_t2,_attrs2) THEN start precedes use.
  2. IF used(use,a,e,-,-)used(use; a,_e1,_t1,_attrs1) and wasEndedBy(end,a,-,-)wasEndedBy(end; a,_e2,_a2,_t2,_attrs2) THEN use precedes end.


A generation implies ordering of events, since the generation event had to occur during the associated activity. This is illustrated by Figure 3 (c) and expressed by Constraint 36 (generation-within-activity).

  1. IF wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen; _e2,a,_t2,_attrs2) THEN start precedes gen.
  2. IF wasGeneratedBy(gen; _e,a,_t,_attrs) and wasEndedBy(end; a,_e1,_a1,_t1,_attrs1) THEN gen precedes end.


Communication between two activities a1 and a2 also implies ordering of events, since some entity must have been generated by the former and used by the latter, which implies that the start event of a1 cannot follow the end event of a2. This is illustrated by Figure 3 (d) and expressed by Constraint 37 (wasInformedBy-ordering).

IF wasInformedBy(_id; a2,a1,_attrs) and wasStartedBy(start; a1,_e1,_a1',_t1,_attrs1) and wasEndedBy(end; a2,_e2,_a2',_t2,_attrs2) THEN start precedes end.

5.2.2 Entity constraints

The figure(s) in this section should have vertical lines with visual styles that match the diagonal arrow that they go with.

As with activities, entities have lifetimes: they are generated, then can be used, other entities can be derived from them, and finally they can be invalidated. The constraints on these events are illustrated graphically in Figure 4 and Figure 5.

ordering constraints for entities
Figure 4 ◊: Summary of instantaneous event ordering constraints for entities


Generation of an entity precedes its invalidation. (This follows from other constraints if the entity is used, but it is stated explicitly here to cover the case of an entity that is generated and invalidated without being used.)

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN gen precedes inv.


A usage and a generation for a given entity implies ordering of events, since the generation event had to precede the usage event. This is illustrated by Figure 4(a) and expressed by Constraint 39 (generation-precedes-usage).

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and used(use; _a2,e,_t2,_attrs2) THEN gen precedes use.


All usages of an entity precede its invalidation, which is captured by Constraint 40 (usage-precedes-invalidation) (without any explicit graphical representation).

IF used(use; _a1,e,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN use precedes inv.



If an entity is generated by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two generation events that generate the same entity, then one precedes the other. Using this constraint in both directions means that each event precedes the other.

A generation implies ordering of events, since the generation event had to occur during the associated activity. This is illustrated by Subfigure ordering-activity-fig (c) and expressed by constraint generation-within-activity. IF wasGeneratedBy(gen,a,e,-,-)wasGeneratedBy(gen1; e,_a1,_t1,_attrs1) and wasStartedBy(start,a,-,-)wasGeneratedBy(gen2; a,_a2,_t2,_attrs2) THEN startgen1 precedes gengen2. IF wasGeneratedBy(gen,a,e,-,-) and wasEndedBy(end,a,-,-) THEN gen precedes end. Communication between two activities a1 and a2 also implies ordering of events, since some entity must have been generated by the former and used by the latter, which implies that the start event of a1 cannot follow the end event of a2. This is illustrated by Subfigure ordering-activity-fig (d) and expressed by constraint wasInformedBy-ordering.

IF wasInformedBy(a2,a1) and wasStartedBy(start,a1,-,-) and wasEndedBy(end,a2,-,-) THEN start precedes end.

If an entity is invalidated by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two invalidation events that invalidate the same entity, then one precedes the other. Using this constraint in both directions means that each event precedes the other, that is, they are simultaneous.

Start IF wasInvalidatedBy(inv1; a,_e2,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; a,_e2,_a2,_t2,_attrs2) THEN inv1 precedes inv2.

If there is a derivation relationship linking e2 and e1, then this means that the entity e1 had some influence on the entity e2; for this to be possible, some event ordering must be satisfied. First, we consider derivations, where the activity and usage are known. In that case, the usage of a2e1 has to precede the generation of e2. This is illustrated by activity a1 also implies ordering of events, since a1 must have been active before a2 started. This is illustrated by Subfigure ordering-activity-fig (e)Figure 4 (b) and expressed by constraint wasStartedByActivity-ordering Constraint 43 (derivation-usage-generation-ordering).

In this constraint, _a, gen2, use1 must not be placeholders.

IF wasStartedByActivity(a2,a1) and wasStartedBy(start1,a1,-,-) and wasStartedBy(start2,a2,-,-)wasDerivedFrom(_d; _e2,_e1,_a,gen2,use1,_attrs) THEN start1use1 precedes start2gen2.

5.2.2 Entity constraints

As with activities, entities have lifetimes: they are generated, then can be used, revised, or other entities can be derived from them, and finally are invalidated.

Summary of instantaneous event ordering constraints for entities
Generation of an entity precedes

When the activity, generation or usage is unknown, a similar constraint exists, except that the constraint refers to its invalidation. (This follows from other constraints if the entity is used, but we state it explicitly to cover the case of an entity that is generated and invalidated without being used.) generation event, as illustrated by Figure 4 (c) and expressed by Constraint 44 (derivation-generation-generation-ordering).

In this constraint, any _a, _g, _u may be placeholders.

IF wasGeneratedBy(gen,e,_,_) and wasInvalidatedBy(inv,e,-,-) wasDerivedFrom(_d; e2,e1,_a,_g,_u,attrs) and wasGeneratedBy(gen1; e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e2,_a2,_t2,_attrs2) THEN gengen1 strictly precedes inv. A usage and a generation for a given entity implies ordering of events, since the generation event had to precede the usage event. This is illustrated by Subfigure ordering-entity-fig (a) and expressed by constraint generation-precedes-usage. IF wasGeneratedBy(gen,e,_,_) and used(use,_,e,-) THEN gen precedes usegen2.

This constraint requires the derived entity to be generated strictly following the generation of the original entity. This follows from the [PROV-DM] definition of derivation: A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity, thus the derived entity must be newer than the original entity.

The event ordering is between generations of e1 and e2, as opposed to derivation where usage is known, which implies ordering between the usage of e1 and generation of e2.


All usages

The entity that triggered the start of an entity precede its invalidation, whichactivity must exist before the activity starts. This is captured illustrated by constraint usage-precedes-invalidation (without any explicit graphical representation). IF used(use,_,e,-) and wasInvalidatedBy(inv,e,_,_) THEN use precedes inv. If there is a derivation between e2 and e1, then this means that the entity e1 had some form of influence on the entity e2; for this to be possible, some event ordering must be satisfied. First, we consider derivations, where the activity and usage are known. In that case, the usage of e1 has to precede the generation of e2. This is illustrated by Subfigure ordering-entity-fig (b)Figure 5(a) and expressed by constraint derivation-usage-generation-ordering. IF wasDerivedFrom(d,e2,e1,a,g2,u1,-) THEN u1 precedes g2. When the usage is unknown, a similar constraint exists, except that the constraint refers to its generation event, as illustrated by Subfigure ordering-entity-fig (c) and expressed by constraint derivation-generation-generation-ordering. IF wasDerivedFrom(e2,e1,attrs) and wasGeneratedBy(gen1,e1,_,_) and wasGeneratedBy(gen2,e2,_,_) THEN gen1 precedes gen2. Note that event ordering is between generations of e1 and e2, as opposed to derivation where usage is known, which implies ordering ordering between the usage of e1 and generation of e2. The entity that triggered the start of an activity must exist before the activity starts. This is illustrated by Subfigure ordering-entity-trigger-fig (a) and expressed by constraint wasStartedBy-orderingConstraint 45 (wasStartedBy-ordering).

  1. IF wasStartedBy(start,a,e,-)wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasGeneratedBy(gen,e,-,-)wasStartedBy(start; _a,e,_a2,_t2,_attrs2) THEN gen precedes start.
  2. IF wasStartedBy(start,a,e,-)wasStartedBy(start; _a,e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv,e,-,-)wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN start precedes inv.

Similarly, the entity that triggered the end of an activity must exist before the activity ends, as illustrated by Subfigure ordering-entity-trigger-fig Figure 5(b).

  1. IF wasEndedBy(end,a,e,-)wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasGeneratedBy(gen,e,-,-) wasEndedBy(end; _a,e,_a2,_t2,_attrs2) THEN gen precedes end.
  2. IF wasEndedBy(end,a,e,-)wasEndedBy(end; _a,e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv,e,-,-)wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN end precedes inv.
ordering constraints for trigger entities
Figure 5 ◊: Summary of instantaneous event ordering constraints for trigger entities

If an entity is a specialization of another, then the more specific entity must have been generated after the less specific entity was generated.

IF specializationOf(e2,e1) and wasGeneratedBy(gen1; e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e2,_a2,_t2,_attrs2) THEN gen1 precedes inv gen2.

Summary


Similarly, if an entity is a specialization of instantaneousanother entity, and then the invalidation event ordering constraints for trigger entities of the more specific entity precedes that of the less specific entity.

IF specializationOf(e1,e2) and wasInvalidatedBy(inv1; e1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; e2,_a2,_t2,_attrs2) THEN inv1 precedes inv2.

5.2.3 Agent constraints

Like entities and activities, agents have lifetimes that follow a familiar pattern:pattern. An agent that is also an entity can be generated and invalidated; an agent that is generated,also an activity can be started or ended. During its lifetime, an agent can participate in interactions such as starting,starting or ending other activities, association with an activity, attribution, or association with an activity, attribution, or delegation, and finally the agent is invalidated. delegation.

Further constraints associated with agents appear in Figure ordering-agents6 and are discussed below.

ordering constraints for agents
Figure 6 ◊: Summary of instantaneous event ordering constraints (continued) for agents

An activity that was associated with an agent must have some overlap with the agent. The agent must have been generated (or started), or must have become associated with the activity, after the activity start: so, the agent must exist before the activity end. Likewise, the agent may be generated,destructed (or ended), or may only become associatedterminate its association with the activity, before the activity end: hence, the agent invalidation (or end) is required to happen after the activity start: so, the agentstart. This is required to exist before the activity end. Likewise, the agent may be destructed, or may terminate its association with the activity, before the activity end: hence, the agent invalidation is required to happen after the activity start. This is illustrated by Subfigure ordering-agentsFigure 6 (a) and expressed by constraint wasAssociatedWith-orderingConstraint 49 (wasAssociatedWith-ordering).

In the following inferences, _pl may be a placeholder -.

  1. IF wasAssociatedWith(a,ag)wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasStartedBy(start,a,-,-)wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv,ag,-,-)wasInvalidatedBy(inv2; ag,_a2,_t2,_attrs2) THEN startstart1 precedes invinv2.
  2. IF wasAssociatedWith(a,ag)wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasGeneratedBy(gen,ag,-,-)wasGeneratedBy(gen1; ag,_a1,_t1,_attrs1) and wasEndedBy(end,a,-,-)wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN gengen1 precedes end. An entity that was attributed to an agent must have some overlap with the agent. The agent is required to exist before the entity invalidation. Likewise, the entity generation must precede the agent destruction. This is illustrated by Subfigure ordering-agents (b) and expressed by constraint wasAttributedTo-ordering. IF wasAttributedTo(e,ag) and wasGeneratedBy(gen,e,-,-) and wasInvalidatedBy(inv,ag,-,-) THEN gen precedes invend2.
  3. IF wasAttributedTo(e,ag)wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasGeneratedBy(gen,ag,-,-)wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv,e,-,-)wasEndedBy(end2; ag,_e2,_a2,_t2,_attrs2) THEN genstart1 precedes invend2.
  4. IF wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasStartedBy(start1; ag,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN start1 precedes end2.


An agent to which an entity was attributed, must exist before this entity was generated. This is illustrated by Figure 6 (b) and expressed by Constraint 50 (wasAttributedTo-ordering).

  1. IF wasAttributedTo(_at; e,ag,_attrs) and wasGeneratedBy(gen1; ag,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e,_a2,_t2,_attrs2) THEN gen1 precedes gen2.
  2. IF wasAttributedTo(_at; e,ag,_attrs) and wasStartedBy(start1; ag,_e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e,_a2,_t2,_attrs2) THEN start1 precedes gen2.


For responsibility,delegation, two agents need to have some overlap in their lifetime.

  1. IF actedOnBehalfOf(ag2,ag1)actedOnBehalfOf(_del; ag2,ag1,_a,_attrs) and wasGeneratedBy(gen,ag1,-,-)wasGeneratedBy(gen1; ag1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv,ag2,-,-)wasInvalidatedBy(inv2; ag2,_a2,_t2,_attrs2) THEN gengen1 precedes invinv2.
  2. IF actedOnBehalfOf(_del; ag2,ag1,_a,_attrs) and wasStartedBy(start1; ag1,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; ag2,_e2,_a2,_t2,_attrs2) THEN start1 precedes end2.

5.3 Type Constraints

The following rule establishes types denoted by identifiers from their use within expressions. The function typeOf gives the set of types denoted by an identifier. That is, typeOf(e) returns the set of types associated with identifier e. The function typeOf is not a term of PROV, but a construct introduced to validate PROV statements.

For any identifier id, typeOf(id) is a subset of {'entity', 'activity', 'agent', 'prov:Collection', 'prov:EmptyCollection'}. For identifiers that do not have a type, typeOf gives the empty set. Identifiers can have more than one type, because of subtyping (e.g. 'prov:EmptyCollection' is a subtype of 'prov:Collection') or because certain types are not disjoint (such as 'agent' and 'entity'). The set of types does not reflect all of the distinctions among objects, only those relevant for checking validity. In particular, subtypes such as 'plan' and 'bundle' are omitted, and statements such as wasAssociatedWith and mentionOf that have plan or bundle parameters only check that these parameters are entities.

To check if a PROV instance satisfies type constraints, one obtains the types of identifiers by application of Constraint 52 (typing) and check that none of the impossibility constraints Constraint 57 (entity-activity-disjoint) and Constraint 58 (membership-empty-collection) are violated as a result.

  1. IF entity(e,attrs) THEN 'entity' ∈ typeOf(e).
  2. IF agent(ag,attrs) THEN 'agent' ∈ typeOf(ag).
  3. IF activity(a,attrs) THEN 'activity' ∈ typeOf(a).
  4. IF used(u; a,e,t,attrs) THEN 'activity' ∈ typeOf(a) AND 'entity' ∈ typeOf(e).
  5. IF wasGeneratedBy(gen; e,a,t,attrs) THEN 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a).
  6. IF wasInformedBy(id; a2,a1,attrs) THEN 'activity' ∈ typeOf(a2) AND 'activity' ∈ typeOf(a1).
  7. IF wasStartedBy(id; a2,e,a1,t,attrs) THEN 'activity' ∈ typeOf(a2) AND 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a1).
  8. IF wasEndedBy(id; a2,e,a1,t,attrs) THEN 'activity' ∈ typeOf(a2) AND 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a1).
  9. IF wasInvalidatedBy(id; e,a,t,attrs) THEN 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a).
  10. IF wasDerivedFrom(id; e2, e1, a, g2, u1, attrs) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1) AND 'activity' ∈ typeOf(a). In this constraint, a, g2, and u1 must not be placeholders.
  11. IF wasDerivedFrom(id; e2, e1, -, -, -, attrs) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
  12. IF wasAttributedTo(id; e,ag,attr) THEN 'entity' ∈ typeOf(e) AND 'agent' ∈ typeOf(ag).
  13. IF wasAssociatedWith(id; a,ag,pl,attrs) THEN 'activity' ∈ typeOf(a) AND 'agent' ∈ typeOf(ag) AND 'entity' ∈ typeOf(pl). In this constraint, pl must not be a placeholder.
  14. IF wasAssociatedWith(id; a,ag,-,attrs) THEN 'activity' ∈ typeOf(a) AND 'agent' ∈ typeOf(ag).
  15. IF actedOnBehalfOf(id; ag2,ag1,a,attrs) THEN 'agent' ∈ typeOf(ag2) AND 'agent' ∈ typeOf(ag1) AND 'activity' ∈ typeOf(a).
  16. IF alternateOf(e2, e1) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
  17. IF specializationOf(e2, e1) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
  18. IF mentionOf(e2,e1,b) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1) AND 'entity' ∈ typeOf(b).
  19. IF hadMember(c,e) THEN 'prov:Collection' ∈ typeOf(c) AND 'entity' ∈ typeOf(c) AND 'entity' ∈ typeOf(e).
  20. IF entity(c,[prov:type='prov:EmptyCollection']) THEN 'entity' ∈ typeOf(c) AND 'prov:Collection' ∈ typeOf(c)AND 'prov:EmptyCollection' ∈ typeOf(c).

5.4 Impossibility constraints

Impossibility constraints require that certain patterns of statements never appear in valid PROV instances. Impossibility constraints have the following general form:

IF hyp1 and ... and hypn THEN INVALID.

Checking an impossibility constraint on instance I means checking whether there is any way of matching the pattern hyp1, ..., hypn. If there is, then checking the constraint on I fails (which implies that I is invalid).


A derivation with unspecified activity wasDerivedFrom(id;e1,e2,-,g,u,attrs) represents a derivation that takes one or more steps, whose activity, generation and use events are unspecified. It is forbidden to specify a generation or use event without specifying the activity.

In the following rules, g and u must not be -.

  1. IF wasDerivedFrom(_id;_e2,_e1,-,g,-,attrs) THEN INVALID.
  2. IF wasDerivedFrom(_id;_e2,_e1,-,-,u,attrs) THEN INVALID.
  3. IF wasDerivedFrom(_id;_e2,_e1,-,g,u,attrs) THEN INVALID.

As noted previously, specialization is a strict partial order: it is irreflexive and transitive.

IF specializationOf(e,e) THEN INVALID.


Furthermore, identifiers of basic relationships are disjoint.

For each r and s in { used, wasGeneratedBy, wasInvalidatedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf} such that r and s are different relation names, the following constraint holds:

IF r(id; a1,...,an) and s(id; b1,...,bn) THEN INVALID.

Since wasInfluencedBy is a superproperty of many other properties, it is excluded from the set of properties whose identifiers are required to be pairwise disjoint. The following example illustrates this observation:

wasInfluencedBy(id;e2,e1)
wasDerivedFrom(id;e2,e1)

This satisfies the disjointness constraint.

There is, however, no constraint requiring that every influence relationship is accompanied by a more specific relationship having the same identifier. The following valid example illustrates this observation:

wasInfluencedBy(id; e2,e1)

This is valid; there is no inferrable information about what kind of influence relates e2 and e1, other than its identity.

Identifiers of entities, agents and activities cannot also be identifiers of properties.

For each p in {entity, activity or agent} and for each r in { used, wasGeneratedBy, wasInvalidatedBy, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, the following impossibility constraint holds:

IF p(id,a1,...,an) and r(id; b1,...,bn) THEN INVALID.


The set of entities and activities are disjoint, expressed by the following constraint:

IF 'entity' ∈ typeOf(id) AND 'activity' ∈ typeOf(id) THEN INVALID.

There is no disjointness between entities and agents. This is because one might want to make statements about the provenance of an agent, by making it an entity. For example, one can assert both entity(a1) and agent(a1) in a valid PROV instance. Similarly, there is no disjointness between activities and agents, and one can assert both activity(a1) and agent(a1) in a valid PROV instance. However, one should keep in mind that some specific types of agents may not be suitable as activities. For example, asserting statements such as agent(Bob, [type=prov:Person]) and activity(Bob) is discouraged. In these cases, disjointness can be ensured by explicitly asserting the agent as both agent and entity, and applying Constraint 57 (entity-activity-disjoint).

An empty collection cannot contain any member, expressed by the following constraint:

IF hasMember(c,e) and 'prov:EmptyCollection' ∈ typeOf(c) THEN INVALID.

6. Collection ConstraintsNormalization, Validity, and Equivalence

Work on collections and on

We define the notions of normalization, validity and equivalence of PROV documents and instances. We first define these constraints is deferred until after the next working draft, so this section may not be stable. Membership is a convenience notation, since it can be expressed in terms of an insertion into some collection. The membership definition is formalized by constraint membership-as-insertion.concepts for PROV instances and then extend them to PROV documents.

memberOf(c, {(k1, v1), ...}) holds IF AND ONLY IF there exists

6.1 Instances

Implementations should decide up front what reasoning about co-reference should be applied, and rewrite the instance (by replacing co-referent identifiers with a collection c0, suchsingle common identifier) to make this explicit, before doing validation, equivalence checking, or normalization. All of the following definitions assume that derivedByInsertionFrom(c, c0, {(k1, v1), ...}). the application has already determined which URIs in the PROV instance are co-referent (e.g. owl:sameAs as a result of OWL reasoning).

We define the normal form of a PROV instance as the set of provenance statements resulting from applying all definitions, inferences, and uniqueness constraints.

  1. Apply all definitions to I by replacing each defined statement by its definition (possibly introducing fresh existential variables in the process), yielding an instance I1.
  2. Apply all inferences to I1 by adding the conclusion of each inference whose hypotheses are satisfied and whose entire conclusion does not already hold (again, possibly introducing fresh existential variables), yielding an instance I2.
  3. Apply all uniqueness constraints to I2 by merging terms or statements and applying the resulting substitution to the instance, yielding an instance I3. If some uniqueness constraint cannot be applied, then normalization fails.
  4. If no definitions, inferences, or uniqueness constraints can be applied to instance I3, then I3 is the normal form of I.
  5. Otherwise, the normal form of I is the same as the normal form of I3 (that is, proceed by recursively normalizing I3).

Because of the potential interaction among definitions, inferences, and constraints, the above algorithm is recursive. Nevertheless, all of our constraints fall into a class of tuple-generating dependencies and equality-generating dependencies that satisfy a termination condition called weak acyclicity that has been studied in the context of relational databases [DBCONSTRAINTS]. Therefore, the above algorithm terminates, independently of the order in which inferences and constraints are applied. Appendix C gives a proof that normalization terminates and produces a unique (up to isomorphism) normal form.

A collection mayPROV instance is valid if its normal form exists and satisfies all of the validity constraints; this implies that the instance satisfies all of the definitions, inferences, and constraints. The following algorithm can be used to test validity:

  1. Normalize the instance I, obtaining normalized instance I'. If normalization fails, then I is not valid.
  2. Apply all event ordering constraints to I' to build a graph G whose nodes are event identifiers and edges are labeled by "precedes" and "strictly precedes" relationships among events induced by the constraints.
  3. Determine whether there is a cycle in G that contains a "strictly precedes" edge. If so, then I is not valid.
  4. Apply the type constraints (section 5.3) to determine whether there are any violations of disjointness. If so, then I is not valid.
  5. Check that none of the impossibility constraints (section 5.4) are violated. If any are violated, then I is not valid. Otherwise, I is valid

A normal form of a PROV instance does not exist when a uniqueness constraint fails due to merging failure.

Two PROV instances are equivalent if they have isomorphic normal forms (that is, after applying all possible inference rules, the two instances produce the same set of PROV statements, up to reordering of statements and attributes within attribute lists, and renaming of existential variables). Equivalence has the following characteristics:

An application that processes PROV data should handle equivalent instances in the same way. (Common exceptions to this rule include, for example, pretty-printers that seek to preserve the original order of collections, PROV-DM restricts one collection to be involvedstatements in a single derivation by insertion or removal, or to one membership relation. PROV-DM does not provide an interpretation for statements that consist of two (or more) insertion, removal, membership relations that result in the same collection.file and avoid expanding inferences.)

The following constraint ensures unique derivation. The following constraint is unclear. A collection must not be derived through multiple insertions, removal, or membership relations.
Consider the following statements about three collections. entity(c1, [prov:type="prov:Collection" %% xsd:QName]) entity(c2, [prov:type="prov:Collection" %% xsd:QName]) entity(c3, [prov:type="prov:Collection" %% xsd:QName]) derivedByInsertionFrom(c3, c1, {("k1", e1), ("k2", e2)}) derivedByInsertionFrom(c3, c2, {("k3", e3)}) There is no interpretation for such statements since c3 is derived multiple times by insertion. As a particular case, collection c is derived multiple times from the same c1. derivedByInsertionFrom(id1, c, c1, {("k1", e1), ("k2", e2)}) derivedByInsertionFrom(id2, c, c1, {("k3", e3), ("k4", e4)}) The interpretation of such statements is also unspecified. To describe the insertion of the 4 key-entity pairs, one would instead write: derivedByInsertionFrom(id1, c, c1, {("k1", e1), ("k2", e2), ("k3", e3), ("k4", e4)}) The same is true for any combination of insertions, removals, and membership relations: The following statements derivedByInsertionFrom(c, c1, {("k1", e1)}) derivedByRemovalFrom(c, c2, {"k2"}) have no interpretation. Nor have the following: derivedByInsertionFrom(c, c1, {("k1", e1)}) memberOf(c, {"k2"}). 6.1 Collection branching It is allowed to have multiple derivations from a single root collection, as long as the resulting entities are distinct, as shown in the following example. entity(c0, [prov:type="prov:EmptyCollection" %% xsd:QName]) // c0 is an empty collection entity(c1, [prov:type="prov:Collection" %% xsd:QName]) entity(c2, [prov:type="prov:Collection" %% xsd:QName]) entity(c3, [prov:type="prov:Collection" %% xsd:QName]) entity(e1) entity(e2) entity(e3) derivedByInsertionFrom(c1, c0, {("k1", e1)}) derivedByInsertionFrom(c2, c0, {("k2", e2)}) derivedByInsertionFrom(c3, c1, {("k3", e3)}) From this set of statements, we conclude: c1 = { ("k1", e1) } c2 = { ("k2", e2) } c3 = { ("k1", e1), ("k3", e3)}

6.2 Collections and Weaker Derivation RelationBundles and Documents

The statedefinitions, inferences, and constraints, and the resulting notions of normalization, validity and equivalence, assume a collectionPROV document that consists only of a toplevel instance, containing all PROV statements in the top level of the document (that is, not enclosed in a named bundle). In this section, we describe how to deal with general PROV documents, possibly including multiple named bundles. Briefly, each bundle is only known to the extent that a chain handled independently; there is no interaction between bundles from the perspective of derivations starting from an empty collection can be found. Since a set of statements regarding a collection's evolution may be incomplete, so is the reconstructed state obtained by querying those statements. In general, all statements reflect partial knowledge regarding a sequence of data transformation events. In the particular case of collection evolution, in which some of the state changes may have been missed, the more generic derivation relation should be used to signal that some updates may have occurred, which cannot be expressed as insertionsapplying definitions, inferences, or removals. The following example illustrates this. In the example, the state of c2 is only partially known because the collection is constructed from partially known other collections. entity(c0, [prov:type="prov:EmptyCollection" %% xsd:QName]) // c0 is an empty collection entity(c1, [prov:type="prov:Collection" %% xsd:QName]) entity(c2, [prov:type="prov:Collection" %% xsd:QName]) entity(c3, [prov:type="prov:Collection" %% xsd:QName]) entity(e1) entity(e2) derivedByInsertionFrom(c1, c0, {("k1", e1)}) wasDerivedFrom(c2, c1) derivedByInsertionFrom(c3, c2, {("k2", e2)}) From this set of statements, we conclude: c1 = { ("k1", e1) } c2 is somehow derived from c1, but the precise sequence of updates is unknown c3 includes ("k2", e2) but the earlier "gap" leaves uncertainty regarding ("k1", e1) (it may have been removed)constraints, computing normal forms, or any other pair that may have been added as part of the derivation activities. Do the insertion/removal derivation steps imply wasDerivedFrom, wasVersionOf, alternateOf? 7. Account Constraints Work on accounts has been deferred until after the next working draft, so this section is very unstable PROV-DM allows for multiple descriptions of entities (and in general any identifiable object) to be expressed. Let us consider two statements about the same entity, which we have taken from two different contexts. A working draft published by the w3:Consortium: entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ]) The second version of a document edited by some authors: entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ]) Both statements are about the same entity identified by tr:WD-prov-dm-20111215, but they contain different attributes, describing the situationchecking validity or partial state of the these entities according to the context in which they occur. Two different statements about the same entity cannot co-exist in PROV instance as formalized in entity-unique. In some cases, there may be a requirement for two different statements concerning the same entity to be included in the same account. To satisfy the constraint entity-unique, we can adopt a different identifier for one of them, and relate the two statements with the alternateOf relation. We now reconsider the same two statements of a same entity, but we change the identifier for one of them: entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ]) entity(ex:alternate-20111215, [ prov:type="document", ex:version="2" ]) alternateOf(tr:WD-prov-dm-20111215,ex:alternate-20111215) Since we are not specifying ways to take the union of two accounts, we may drop this discussion Taking the union of two accounts is another account, formed by the union of the statements they respectively contain. We note that the resulting union may or may not invalidate some constraints: Two entity statements with the same identifier but different sets of attributes exist in each PROV instance may invalidate entity-unique in the union, unless some form of statement merging or renaming (as per Example) occurs. Structurally well-formed accounts are not closed under union because the constraint generation-uniqueness may no longer be satisfied in the resulting union. How to reconcile such accounts is beyond the scope of this specification. Material transplanted from old structural well-formedness constraints section. This example isn't very clear, since the sub-workflow-ness isn't represented in the data. According to what was written above, we should conclude that a0 and a2 are equal! In the following statements, a workflow execution a0 consists of two sub-workflow executions a1 and a2. Sub-workflow execution a2 generates entity e, so does a0. activity(a0, [prov:type="workflow execution"]) activity(a1, [prov:type="workflow execution"]) activity(a2, [prov:type="workflow execution"]) wasInformedBy(a2,a1) wasGeneratedBy(e,a0) wasGeneratedBy(e,a2) So, we have two different generations for entity e. Such an example is permitted in PROV-DM if the two activities denoted by a0 and a2 are a single thing happening in the world but described from different perspectives. While this example is permitted in PROV-DM, it does not make the inter-relation between activities explicit, and it mixes statements expressed from different perspectives together. While this may acceptable in some specific applications, it becomes challenging for inter-operability. Indeed, PROV-DM does not offer any relation describing the structure of activities. Such instances are said not to be structurally well-formed.equivalence.

Structurally well-formed provenance can We model a general PROV document, containing n named bundles b1...bn, as a tuple (B0,[b1=B1,...,bn=Bn]) where B0 is the set of statements of the toplevel instance, and for each i, Bi is the set of statements of bundle bi. Names b1...bn are assumed to be obtained by partitioning the generations into different accounts. distinct. This makes it clear that these generations provide alternative descriptions of the same real-world generation event, rather than describing two distinct generation eventsnotation is shorthand for the same entity. When accounts are used, the example can be encoded as follows. following PROV-N syntax:

The same example is now revisited, with the following statements that are structurally well-formed. Two accounts are introduced, and there is a single generation for entity e per account.
B0
bundle b1
  B1
endBundle
...
bundle bn
  Bn
endBundle

In The normal form of a first account, entitled "summary", we find: activity(a0,t1,t2,[prov:type="workflow execution"]) wasGeneratedBy(e,a0,-) In a second account, entitled "detail", we find: activity(a1,t1,t3,[prov:type="workflow execution"]) activity(a2,t3,t2,[prov:type="workflow execution"]) wasInformedBy(a2,a1) wasGeneratedBy(e,a2,-) Structurally well-formed provenance satisfies some constraints, which force the structurePROV document (B0,[b1=B1,...,[bn=Bn]) is (B'0,[b1=B'1,...,bn=B'n]) where B'i is the normal form of statements to be exposed by means of accounts. With these constraints satisfied, further inferences can be made about structurally well-formed statements. The uniqueness of generations in accounts is formulated as follows. 8. RationaleBi for inferences and constraintsThis section is non-normative. This section collects all of the explanatory material that I was not certain how to interpret as an unambiguous inference or constraint. Some of these observations may need to be folded into the explanatory text in respective sections (for example for events, accounts or collections). Editing is also needed to decrease redundancy. 8.1 Entities and Attributes When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc. However, to write precise descriptions of the provenance of things that change over time, we need ways of disambiguating which versions of things we are talking about. To describe the provenance of things that can change over time, PROV-DM uses the concept of entities with fixed attributes. From a provenance viewpoint, it is important to identify a partial state of something, i.e. something with some aspects that have been fixed, so that it becomes possible to express its provenance (i.e. what caused the thing with these specific aspects). An entity encompasses a part of a thing's history during which some of the attributes are fixed. An entity can thus be thought of as a part of a thing with some associated partial state. Attributes in PROV-DM are used to fix certain aspects of entities. An entity is a thing one wants to provide provenance for and whose situation in the world is described by some fixed attributes. An entity has a characterization interval, or lifetime, defined as the period each i between its generation event and its invalidation event0 and n. An entity's attributes are established when the entity is created and describe the entity's situation and (partial) state during an entity's lifetime. A different entity (perhaps representing a different user or system perspective) may fix other aspects of the same thing, and its provenance may be different. Different entities that are aspects of the same thing are called alternate, and the PROV-DM relations of specialization and alternate can be used to link such entities. Different users may take different perspectives on a resource with a URL. A provenance record might use one (or more) different entities to talk about different perspectives, such as: a report available at a URL: fixes the nature of the thing, i.e. a document, and its location; the version of the report available there today: fixes its version number, contents, and its date; the report independent of where it is hosted and of its content over time: fixes the nature of the thing as a conceptual artifact. The provenance of these three entities may differ, and may be along the following lines: the provenance of a report available at a URL may include: the act of publishing it and making it available at a given location, possibly under some license and access control; the provenance of the version of the report available there today may include: the authorship of the specific content, and reference to imported content; the provenance of the report independent of where it is hosted over time may include: the motivation for writing the report, the overall methodology for producing it, and the broad team involved in it. We do not assume that any entity is a better or worse description of reality than any other. That is, we do not assume an absolute ground truth with respect to which we can judge correctness or completeness of descriptions. In fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity with attribute-value pairs that fix some aspects of the report appropriately. Besides entities, a variety of other PROV-DM objects have attributes, including activity, generation, usage, start, end, communication, attribution, association, responsibility, and derivation. Each object has an associated duration interval (which may be a single time point), and attribute-value pairs for a given object are expected to be descriptions that hold for the object's duration. However, the attributes of entities have special meaning because they are considered to be fixed aspects of underlying, changing things. This motivates constraints on alternateOf and specializationOf relating the attribute values of different entities. TODO: Constraints on alternateOf/specializationOf for this? TODO: Further discussion of entities moved from the old "Definitional constraints" section. Should merge with the surrounding discussion to avoid repetition. An entity is a thing one wants to provide provenance for and whose situation in the world is described by some attribute-value pairs. An entity's attribute-value pairs are established as part of the entity statement and their values remain unchanged for the lifetime of the entity. An entity's attribute-value pairs are expected to describe the entity's situation and (partial) state during an entity's characterization interval. If an entity's situation or state changes, this may result in its statement being invalid, because one or more attribute-value pairs no longer hold. In that case, from the PROV viewpoint, there exists a new entity, which needs to be given a distinct identifier, and associated with the attribute-value pairs that reflect its new situation or state. Further considerations: In order to describe the provenance of something during an interval over which relevant attributes of the thing are not fixed, it is required to create multiple entities, each with its own identifier, characterization interval, and fixed attributes, and express dependencies between the various entities using events. For example, if we want to describe the provenance of several versions of a document, involving attributes such as authorship that change over time, we need different entities for the versions linked by appropriate generation, usage, revision, and invalidation events. There is no assumption that the set of attributes is complete, nor that the attributes are independent or orthogonal of each other. There is no assumption that the attributes of an entity uniquely identify it. Two different entities that are aspects of different things can have the same attributes. A characterization interval may collapse into a single instant. 8.2 Activities TODO: Further discussion of activities moved from old "Definitional constraints and inferences" section. Edit to avoid repeating information. An activity is delimited by its start and its end events; hence, it occurs over an interval delimited by two instantaneous events. However, an activity record need not mention start or end time information, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its interval, i.e. an interval between two instantaneous events, namely its start event and its end event. Further considerations: An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics that makes it identifiable. In contrast, an activity is something that occurs, happens, unfolds, or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic]. 8.3 Description, Assertion, and Inference PROV-DM is a provenance data model designed to express descriptions of the world. A file at some point during its lifecycle, which includes multiple edits by multiple people, can be described by its type, its location in the file system, a creator, and content. The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, PROV-DM descriptions are intended to be interpreted as what has happened, as opposed to what may or will happen. This specification does not prescribe the means by which descriptions can be arrived at; for example, descriptions can be composed on the basis of observations, reasoning, or any other means. Sometimes, inferences about the world can be made from descriptions conformant to the PROV-DM data model. This specification defines some such inferences, allowing new descriptions to be inferred from existing ones. Hence, descriptions of the world can result either from direct assertion or from inference by application of inference rules defined by this specification. 8.4 Events and Time Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong with such a provenance claim. Although time is critical, we should also recognize that provenance can be used in many different contexts within individual systems and across the Web. Different systems may use different clocks which may not be precisely synchronized, so when provenance records are combined by different systems, we may not be able to align the times involved to a single global timeline. Hence, PROV-DM is designed to minimize assumptions about time. Hence, to talk about the constraints on valid PROV-DM data, we refer to instantaneous events that correspond to interactions between activities and entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [CSP], events represent communications or interactions; they are assumed to be atomic and instantaneous. 8.4.1 Event Ordering The following paragraphs are unclear and need to be revised, to address review concerns: if we aren't saying anything about how events and time relate, and time is the only concrete information about event ordering in PROV-DM, then how can implementations check that event ordering constraints are satisfied? How the precedes partial order is implemented in practice is beyond the scope of this specification. This specification only assumes that each instantaneous event can be mapped to an instant in some form of timeline. The actual mapping is not in scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport-style clocks is also beyond this specification. The follows and precedes orderings of events should be consistent with the ordering of their associated times over these timelines. This specification defines event ordering constraints between instantaneous events associated with provenance descriptions. PROV-DM data must satisfy such constraints.

PROV-DM also allows for time observations to be inserted in specific statements, forA PROV document is valid if each recognized instantaneous event introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mappingthe bundles B0, ..., Bn are valid and none of this instantaneous event to the timeline. It can also help with the verification of associated ordering constraints (though, again, this verification is outside the scope of this specification). 8.4.2 Types of Events Five kinds of instantaneous eventsthe bundle identifiers bi are used for the PROV-DM data model. The activity start and activity end events delimit the beginning and the end of activities, respectively. The entity usage, entity generation, and entity invalidation events apply to entities, and the generation and invalidation events delimit the characterization interval of an entity. More specifically: repeated.

An activity start eventTwo (valid) PROV documents (B0,[b1=B1,...,bn=Bn]) and (B'0,[b1'=B'1,...,b'm=B'm]) are equivalent if B0 is the instantaneous event equivalent to B'0 and n = m and there exists a permutation P : {1..n} -> {1..n} such that marks the instant an activity starts. An activity end eventfor each i, bi = b'P(i) and Bi is the instantaneous event that marks the instant an activity ends. An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity. Before this instant the entity had not begunequivalent to be used by the activity. An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it is available for use. The entity did not exist before this event. An entity invalidation event is the instantaneous event that marks the initial instant of the destruction, invalidation, or cessation of an entity, after which the entity is no longer available for use. The entity no longer exists after this event.B'P(i).

8.5 Account Some of this discussion may belong in the account constraint section as motivation, or as formal constraints/inferences. In particular, the must, may, should statements should be clarified and put into the normative section. It is common for multiple provenance records to co-exist. For instance, when emailing a file, there could be a provenance record kept by the mail client, and another by the mail server. Such provenance records may provide different explanations about something happening in the world, because they are created by different parties or observed by different witnesses. A given party could also create multiple provenance records about an execution, to capture different levels of details, targeted at different end-users: the programmer of an experiment may be interested in a detailed log of execution, while the scientists may focus more on the scientific-level description. Given that multiple provenance records can co-exist, it is important to have details about their origin, who they are attributed to, how they were generated, etc. In other words, an important requirement is to be able to express the provenance of provenance. See ISSUE-343. An account is an entity that contains an instance, or set of PROV statements. PROV-DM does not provide an actual mechanism for creating accounts, i.e. for bundling up provenance descriptions and naming them. Accounts must satisfy some properties: An account is a bundle of provenance descriptions whose content may change over time. If an account's set of statements changes over time, it should increase monotonically with time. A given description of e.g. an entity in a given account, in terms of its identifier and attribute-value pairs, does not change over time. The last point is important. It indicates that within an account: It is always possible to add new provenance statements, e.g. stating that a given entity was used by an activity, or derived from another. This is very much an open world assumption. It is not permitted to add new attributes to a given entity (a form of closed world assumption from the attributes point of view), though it is always permitted to create a new statement describing an entity, which is a "copy" of the original statement extended with novel attributes (cf Example merge-with-rename). There is no construct in PROV-DM to create such bundles of statements. Instead, it is assumed that some mechanism, outside PROV-DM can create them. However, from a provenance viewpoint, such accounts are things whose provenance we may want to describe. In order to be able to do so, we need to see accounts as entities, whose origin can be described using PROV-DM vocabulary. Thus, PROV-DM introduces the reserved type Account. A. Acknowledgements WG membership to be listed here.

B.7. Glossary

A. Termination of normalization

This section is non-normative.

We will show that normalization terminates, that is, that applying definitions, inferences and uniqueness/key constraints eventually either fails (due to constraint violation) or terminates with a normal form.

First, since the inferences and constraints never introduce new defined statements, for the purpose of termination we always expand the definitions first and then consider only normalization of instances in which there are no remaining defined statements.

We will prove termination for the simple case where there are no attributes. For the general case, we will show that any nontermination arising from an instance that does involve attributes would also arise from one with no attributes.

Termination for instances without attributes. For these instances, uniqueness and key constraints can be As shown in [DBCONSTRAINTS], termination of normalization can be shown by checking that the inference rules are weakly acyclic. In addition, weak acyclicity can be checked in a modular fashion for our system, because there are only a few possible cycles among statements. The following table summarizes seven stages of the inference rules; because there are no cycles among stages, it is sufficient to check weak acyclicity of each stage independently.

Stage # Inference Hypotheses Conclusions
1 19, 20, 21, 22 specializationOf, mentionOf specializationOf, entity
2 7, 8, 13, 14 entity, activity, wasAttributedTo, actedOnBehalfOf wasInvalidatedBy, wasStartedBy, wasEndedBy
3 9, 10 wasStartedBy, wasEndedBy wasGeneratedBy
4 11, 12 wasDerivedFrom wasGeneratedBy, used, alternateOf
5 16, 17, 18 alternateOf, entity alternateOf
6 5, 6 wasInformedBy, generated, used wasInformedBy, generated, used
7 15 many wasInfluencedBy

For each stage, we show that the stage is weakly acyclic.

Graph illustrating weak
       acyclicity of stage 6

Termination for instances with attributes. We can translate an instance with attributes to an alternative, purely relational language by introducing a relation attribute(id,a,v) and replacing every statement of the form r(id;a1,...,an,[(k1,v1),...,(km,vm)]) with r(id;a1,...,an),attribute(id,k1,v1),...,attribute(id,km,vm), and similarly for entity, activity and agent attributes. The inference rules can also be translated so as to work on these instances, and a similar argument to the above shows that inference is terminating on instances with explicit attributes. Any infinite sequence of normalization steps on the original instance would lead to an infinite sequence of translated normalization steps on instances with explicit attributes.

B. Acknowledgements

This document has been produced by the PROV Working Group, and its contents reflect extensive discussion within the Working Group as a whole. The editors extend special thanks to Ivan Herman (W3C/ERCIM), Paul Groth, Tim Lebo, Simon Miles, Stian Soiland-Reyes, for their thorough reviews.

Members of the PROV Working Group at the time of publication of this document were: Ilkay Altintas (Invited expert), Reza B'Far (Oracle Corporation), Khalid Belhajjame (University of Manchester), James Cheney (University of Edinburgh, School of Informatics), Sam Coppens (IBBT), David Corsar (University of Aberdeen, Computing Science), Stephen Cresswell (The National Archives), Tom De Nies (IBBT), Helena Deus (DERI Galway at the National University of Ireland, Galway, Ireland), Simon Dobson (Invited expert), Martin Doerr (Foundation for Research and Technology - Hellas(FORTH)), Kai Eckert (Invited expert), Jean-Pierre EVAIN (European Broadcasting Union, EBU-UER), James Frew (Invited expert), Irini Fundulaki (Foundation for Research and Technology - Hellas(FORTH)), Daniel Garijo (Universidad Politécnica de Madrid), Yolanda Gil (Invited expert), Ryan Golden (Oracle Corporation), Paul Groth (Vrije Universiteit), Olaf Hartig (Invited expert), David Hau (National Cancer Institute, NCI), Sandro Hawke (W3C/MIT), Jörn Hees (German Research Center for Artificial Intelligence (DFKI) Gmbh), Ivan Herman, (W3C/ERCIM), Ralph Hodgson (TopQuadrant), Hook Hua (Invited expert), Trung Dong Huynh (University of Southampton), Graham Klyne (University of Oxford), Michael Lang (Revelytix, Inc.), Timothy Lebo (Rensselaer Polytechnic Institute), James McCusker (Rensselaer Polytechnic Institute), Deborah McGuinness (Rensselaer Polytechnic Institute), Simon Miles (Invited expert), Paolo Missier (School of Computing Science, Newcastle university), Luc Moreau (University of Southampton), James Myers (Rensselaer Polytechnic Institute), Vinh Nguyen (Wright State University), Edoardo Pignotti (University of Aberdeen, Computing Science), Paulo da Silva Pinheiro (Rensselaer Polytechnic Institute), Carl Reed (Open Geospatial Consortium), Adam Retter (Invited Expert), Christine Runnegar (Invited expert), Satya Sahoo (Invited expert), David Schaengold (Revelytix, Inc.), Daniel Schutzer (FSTC, Financial Services Technology Consortium), Yogesh Simmhan (Invited expert), Stian Soiland-Reyes (University of Manchester), Eric Stephan (Pacific Northwest National Laboratory), Linda Stewart (The National Archives), Ed Summers (Library of Congress), Maria Theodoridou (Foundation for Research and Technology - Hellas(FORTH)), Ted Thibodeau (OpenLink Software Inc.), Curt Tilmes (National Aeronautics and Space Administration), Craig Trim (IBM Corporation), Stephan Zednik (Rensselaer Polytechnic Institute), Jun Zhao (University of Oxford), Yuting Zhao (University of Aberdeen, Computing Science).

C. References

C.1 Normative references

[IRI]
M. Duerst, M. Suignard. Internationalized Resource Identifiers (IRI). January 2005. Internet RFC 3987. URL: http://www.ietf.org/rfc/rfc3987.txt
[RDF]
Graham Klyne and Jeremy J. Carroll (eds.) Resource Description Framework (RDF): Concepts and Abstract Syntax. 2004, W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt

C.2 Informative references

[CHR]
Thom Frühwirth Constraint Handling Rules. Cambridge University Press URL: http://constraint-handling-rules.org/
[CLOCK]
Lamport, L. Time, clocks, and the ordering of events in a distributed system.Communications. Communications of the ACM 21 (7): 558–565. 1978. URL: http://research.microsoft.com/users/lamport/pubs/time-clocks.pdf DOI: doi:10.1145/359545.359563.
[CSP]
[DBCONSTRAINTS]
Hoare, C. A. R. Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa Communicating Sequential ProcessesData exchange: Semantics and query answering.Prentice-Hall. 1985URL: http://www.usingcsp.com/cspbook.pdf. Theoretical computer science 336(1):89-124 Elsevier URL: http://dx.doi.org/10.1016/j.tcs.2004.10.033
[Logic]
W. E. JohnsonLogic: Part III.1924. URL: http://www.ditext.com/johnson/intro-3.html
[PROV-DM]
Luc Moreau and Paolo Missier (eds.) Khalid Belhajjame, Reza B'Far, James Cheney, Stephen Cresswell, Yolanda Gil, Paul Groth, Graham Klyne, Jim McCusker, Simon Miles, James Myers, Satya Sahoo, and Curt Tilmes PROV-DM: The PROV Data Model. 2012, Working Draft. URL: http://www.w3.org/TR/prov-dm/
[PROV-N]
Luc Moreau and Paolo Missier (eds.), James Cheney, Stian Soiland-Reyes PROV-N: The Provenance Notation. 2011, Working Draft. URL: http://www.w3.org/TR/prov-n/
[PROV-O]
Timothy Lebo, Satya Sahoo and Deborah McGuinness (eds.) Khalid Belhajjame, James Cheney, David Corsar, Daniel Garijo, Stian Soiland-Reyes, and Stephan Zednik Provenance Formal Model. 2011, Working Draft. URL: http://www.w3.org/TR/prov-o/
[PROV-SEM]
James Cheney Formal Semantics Strawman. 2011, Work in progress. URL: http://www.w3.org/2011/prov/wiki/FormalSemanticsStrawman