Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications.

This document presents a model-theoretic semantics for the PROV data model, viewing PROV-DM statements as atomic formulas in the sense of first-order logic, and viewing the constraints and inferences specified in PROV-CONSTRAINTS as a first-order theory. It is shown that valid PROV instances (in the sense of PROV-CONSTRAINTS) correspond to satisfiable theories. This information may be useful to researchers or users of PROV to understand the intended meaning and use of PROV for modeling information about the actual history, derivation or evolution of Web resources. It may also be useful for development of additional constraints or inferences for reasoning about PROV or integration of PROV with other Semantic Web vocabularies. It is not proposed as a canonical or required semantics of PROV and does not place any constraints on the use of PROV.

The PROV Document Overview describes the overall state of PROV, and should be read before other PROV documents.

#### PROV Family of Documents

This document is part of the PROV family of documents, a set of documents defining various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. These documents are listed below. Please consult the [[PROV-OVERVIEW]] for a guide to reading these documents.
• PROV-OVERVIEW (Note), an overview of the PROV family of documents [[PROV-OVERVIEW]];
• PROV-PRIMER (Note), a primer for the PROV data model [[PROV-PRIMER]];
• PROV-O (Recommendation), the PROV ontology, an OWL2 ontology allowing the mapping of PROV to RDF [[PROV-O]];
• PROV-DM (Recommendation), the PROV data model for provenance [[PROV-DM]];
• PROV-N (Recommendation), a notation for provenance aimed at human consumption [[PROV-N]];
• PROV-CONSTRAINTS (Recommendation), a set of constraints applying to the PROV data model [[PROV-CONSTRAINTS]];
• PROV-XML (Note), an XML schema for the PROV data model [[PROV-XML]];
• PROV-AQ (Note), the mechanisms for accessing and querying provenance [[PROV-AQ]];
• PROV-DICTIONARY (Note) introduces a specific type of collection, consisting of key-entity pairs [[PROV-DICTIONARY]];
• PROV-DC (Note) provides a mapping between PROV and Dublin Core Terms [[PROV-DC]];
• PROV-SEM (Note), a declarative specification in terms of first-order logic of the PROV data model (this document);

#### Implementations Encouraged

The Provenance Working Group encourages implementations that make use of or extend the semantics in this document. Although work on this document by the Provenance Working Group is complete, errors may be recorded in the errata or and these may be addressed in future revisions.

## Introduction

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing. [[PROV-DM]] This document complements the PROV-DM specification [[PROV-DM]] that defines a data model for provenance on the Web, and the PROV-CONSTRAINTS specification [[PROV-CONSTRAINTS]] that specifies definitions, inferences, and constraints that can be used to reason about PROV documents, or determine their validity. This document provides a formal semantics of PROV, providing a formal counterpart to the informal descriptions and motivations given elsewhere in PROV specifications.

### Purpose of this document

The PROV-DM and PROV-CONSTRAINTS specifications give motivating examples that provide an intuition about the meaning of the constructs. For some concepts, such as use, start, end, generation, invalidation, and derivation, the meaning is either obvious or situation-dependent. However, during the development of PROV, the importance of additional concepts became evident, but the intuitive meaning or correct use of these concepts were not clear. For example, the $alternateOf$ and $specializationOf$ relations are used in PROV to relate different entities that present aspects of "the same thing". Over time the working group came to a consensus about these concepts and how they are to be used, but this understanding is based on abstract notions that are not explicit in PROV documents; instead, some of their properties are captured formally through certain constraints and inferences, while others are not captured in PROV specifications at all.

The purpose of this document is to present the working group's consensus view of the semantics of PROV, using tools from mathematical logic, principally model theory (though our use of these tools is lightweight). This information may be useful to users for understanding the intent behind certain features of PROV, to researchers investigating richer forms of reasoning over provenance, or to future efforts building upon PROV. It is intended as an exploration of one semantics for PROV, not a definitive specification of the only semantics of PROV. We provide a semantics that satisfies all of the constraints on valid PROV instances, and such that valid PROV instances correspond to satisfiable theories: every valid instance has a model, and vice versa.

The semantics has some appealing properties. Specifically, it provides a declarative counterpart to the operational definition of validity taken in PROV-CONSTRAINTS. In the specification, validity is defined via a normalization process followed by constraint checking on the normal form. This approach was adopted to keep the specification closer to implementations, although other implementations are possible and allowed. In addition to providing a semantics, this document shows that the operational presentation of PROV validity checking is equivalent to the declarative presentation adopted here. This could help justify alternative approaches to validity checking.

This document mostly considers the semantics of PROV statements and instances. PROV documents can consist of multiple instances, such as named bundles. The semantics do not cover general PROV documents, but the semantics can be used on each instance in a document separately, just as PROV-CONSTRAINTS specifies that each instance in a document is to be validated separately. So, in the rest of this document, we discuss only PROV instances and not PROV documents. The semantics of extensions of PROV, such as dictionaries [[PROV-DICTIONARY]] and linking across bundles [[PROV-LINKS]], are beyond the scope of this document.

This document has been reviewed by the Working Group, but the theorems and proofs have not been formally peer-reviewed in the sense of an academic paper. Thus, the Working Group believes this document is an appropriate starting point for future study of the semantics of PROV, but further work may be needed.

#### Structure of this document

• Section 2 summarizes the basic concepts from mathematical logic used in the semantics, recapitulates how PROV statements can be viewed as atomic formulas, and introduces some auxiliary formulas.
• Section 3 presents the mathematical structures used for situations that PROV statements can describe.
• Section 4 defines the semantics of PROV statements and auxiliary formulas, indicating when a given formula is satisfied in a structure.
• Section 5 presents the inferences and constraints from PROV-CONSTRAINTS as first-order formulas, and gives brief justifications for their soundness.
• Section 6 summarizes the main results relating PROV-CONSTRAINTS validation to the semantics, including soundness and a weak form of completeness: a PROV instance is valid if and only if it has a model.

### Audience

This document assumes familiarity with [[PROV-DM]] and [[PROV-CONSTRAINTS]] and employs (a simplified form of) [[PROV-N]] notation. In particular it assumes familiarity with the concepts from logic, and the relationship between PROV statements and instances and first-order formulas and theories, respectively, presented in Section 2.5 of PROV-CONSTRAINTS.

This document may be useful to users of PROV who have a formal background and are interested in the rationale for some of the constructs of PROV; for researchers investigating extensions of PROV or alternative approaches to reasoning about PROV; or for future efforts on provenance standardization.

## Basics

### Identifiers

A lowercase symbol $x,y,...$ on its own denotes an identifier. Identifiers are viewed as variables from the point of view of logic. Identifiers denote objects, and two different identifiers $x$ and $y$ may denote equal or different objects. We write $Identifiers$ for the set of identifiers of interest in a given situation (typically, the set of identifiers present in the PROV instance of interest).

### Attributes and Values

We assume a set $Attributes$ of attribute labels and a set $Values$ of possible values of attributes. To allow for the fact that some attributes can have undefined or multiple values, we sometimes use the set $P(Values)$, that is, the set of sets of values. Thus, we can use the empty set to stand for an undefined value and $\{a,b\}$ to stand for the set of values of a two-valued attribute.

### Times

We assume an ordered set $(Times,\leq)$ of time instants, where $Times \subseteq Values$ and $\leq$ is a linear order.

## Atomic Formulas

The following atomic formulas correspond to the statements of PROV-DM. We assume that definitions 1-4 of PROV-CONSTRAINTS have been applied in order to expand all optional parameters; thus, we use uniform notation $r(id,a_1,\ldots,a_n)$ instead of the semicolon notation $r(id;a_1,\ldots,a_n)$.

Each parameter is either an identifier, a constant (e.g. a time or other literal value in an attribute list), or a null symbol "$-$". Placeholder symbols "$-$" can only appear in the specified arguments $pl$ in $wasAssociatedWith$ and $a,g,u$ in $wasDerivedFrom$, as shown in the grammar below.


We include the standard PROV collection types ($Collection$ and $EmptyCollection$) and the membership relation $hadMember$; however, we do not model dictionaries or the insertion or deletion relations in PROV-DICTIONARY [[PROV-DICTIONARY]], since these are not part of the PROV recommendations. If these features are incorporated into future standards, their semantics (and the soundness of the associated constraints) should be modeled. We omit the $prov$ prefixes from the $Collection$ and $EmptyCollection$ types.

As stated in the Introduction, we do not explicitly model bundles or PROV documents; however, each instance can be viewed as a set of formulas and can be modeled separately. The semantics of the standard features of PROV can be defined without talking about multiple instances; however, the $mentionOf$ relation in PROV-LINKS [[PROV-LINKS]] is intended to support linking across bundles. Future editions of PROV may incorporate $mentionOf$ or other cross-instance assertions, and if so this semantics should be generalized in order to provide a rationale for such an extension and to establish the soundness of constraints associated with $mentionOf$.

### First-Order Formulas

We also consider the usual connectives and quantifiers of first-order logic.

$$\begin{array}{rcl} \phi &{::=}& atomic\_formula\\ & | & True\\ & | & False\\ &|&x = y\\ & | & \neg~\phi\\ &|& \phi_1 \wedge \phi_2\\ &|& \phi_1 \vee \phi_2\\ &|& \phi_1 \Rightarrow \phi_2\\ &|& \forall x. \phi\\ &|& \exists x. \phi\\ \end{array}$$

## Structures and Interpretations

In this section we define mathematical structures $W$ that can be used to interpret PROV formulas and instances. A structure consists of a collection of sets, functions and relations. The components of a structure $W$ are given in the rest of the section in components, highlighted in boxes.

We use the term "component" here in a different sense than in PROV-DM. Here, the components are parts of a large definition, whereas PROV-DM defines six components that group different parts of the PROV data model.

### Things

Things is a set of things in the situation being modeled. Each thing has an associated set of $Events$ and attributes whose values can change over time. Different kinds of $Events$ are specified further below.

To model this, a structure $W$ includes:

1. a set $Things$ of things
2. a set $Events$ of events
3. a function $events : Things \to P(Events)$ from things to associated sets of events.
4. a function $value : Things \times Attributes \times Events \to P(Values)$ giving the possible values of each attribute of a $Thing$ at the instant of a given event.
5. Attributes are only defined during the events of a thing, that is, $value(T,a,evt) \neq \emptyset$ implies $evt \in events(T)$.

The range of $value$ is the set $P(Values)$, indicating that $value$ is essentially a multi-valued function that returns a set of values (possibly empty). When $value(x,a,evt) = \emptyset$, we say that attribute $a$ is undefined for $x$ at event $evt$.

Note that this description does not say what the structure of a $Thing$ is, only how it may be described in terms of its events and attribute values. A thing could be a record of fixed attribute values; it could be a bear; it could be the Royal Society; it could be a transcendental number like $\pi$. All that matters from our point of view is that we know how to map the $Thing$ to its events and attribute mapping.

The identity of a Thing is not observable through its attributes or events, so it is possible for two different $Things$ to be indistinguishable by their attribute values and events. That is, if the set of $Things = \{T_0,T_1\}$ and the attributes are specified as $value(T_0,a,evt) = value(T_1,a,evt)$ for each $evt\in Events$ and $a \in Attributes$, this does not imply that $T_0 = T_1$.

$Things$ are associated with certain kinds of $Objects$ called $Entities$, defined in the next subsection. Specifically, the function $thingOf$ associates an $Entity$ to a $Thing$.

### Objects

$Things$ are things in the world that have attributes that can change over time. $Things$ may not have distinguishing features that are readily observable and permanent. In PROV, we do not talk explicitly about $Things$, but instead we talk about various objects that have discrete, fixed features, and relationships among these objects. Some objects, called $Entities$, are associated with $Things$, and their fixed attributes need to match those of the associated $Thing$ during their common events. Others correspond to agents, activities, or identifiable interactions among them.

In this section, we detail the different subsets of $Objects$, and give disjointness constraints and associated functions. Generally, these constraints are necessary to validate disjointness constraints from PROV-CONSTRAINTS [[PROV-CONSTRAINTS]].

An Object is described by a set of events and attributes with fixed values. Objects encompass entities, activities, agents, and interactions (i.e., usage, generation, and other events or influence relations). To model this, a structure includes:

1. a set $Objects$
2. a function $events : Objects \to P(Events)$ from objects to associated sets of events.
3. a function $value : Objects \times Attributes \to P(Values)$.

Intuitively, $events(e)$ is the set of events in which $e$ participated. The set $value(e,a)$ is the set of values of attribute $a$ during the object's events.

As with Things, the range of $value$ is sets of values, making $value$ effectively a multivalued function. It is also possible to have two different objects that are indistinguishable by their attributes and associated events. Objects are not things, and the sets of $Objects$ and $Things$ are disjoint; however, certain objects, namely entities, are associated with things.

Disjointness between $Objects$ and $Things$ is not necessary but is assumed in order to avoid confusion between the different categories (time-varying $Things$ vs fixed $Objects$).

#### Entities

An entity is a kind of object that fixes some aspects of a thing. We assume:

1. a set $Entities \subseteq Objects$ of entities, disjoint from $Activities$ below.
2. a function $thingOf : Entities \to Things$ that associates each $Entity$ $e$ with a $Thing$, such that $events(e) \subseteq events(thingOf(e))$ and for each $evt \in events(e)$ and for each attribute $a$ we have $value(e,a) \subseteq value(thingOf(e),a,evt)$.

Although both entities and things can have undefined or multiple attribute values, their meaning is slightly different: for a thing, $value(x,a,evt) = \emptyset$ means that the attribute $a$ has no value at event $evt$, whereas for an entity, $value(x,a) = \emptyset$ only means that the thing associated to entity $x$ need not have a fixed value for $a$ during the events of $x$. This does not imply that $value(thingOf(e),a,evt) = \emptyset$ when $evt \in events(e)$.

Furthermore, all of the attribute values of the entity must be present in the associated thing throughout the events of the entity. For example, suppose $value(thingOf(e),a,evt)$ is $\{1\}$ at some event $evt \in events(e)$ and $value(thingOf(e),a,evt') = \{2\}$ at some other event $evt'$. Then $value(e,a)$ must be $\emptyset$ because there is no other set of values that is simultaneously contained in both $\{1\}$ and $\{2\}$.

In the above description of how $Entities$ relate to $Things$, we require $value(e,a) \subseteq value(thingOf(e),a,evt)$ whenever $evt \in events(e)$. Intuitively, this means that if we are talking about a $Thing$ indirectly by describing an $Entity$, then any attributes we ascribe to the $Entity$ must also describe the associated $Thing$ during their common events. Attributes of both $Entities$ and $Things$ are multi-valued, so there is no inconsistency in saying that an entity has two different values for some attribute. In some situations, further uniqueness constraints or range constraints could be imposed on attributes.

Only $Entities$ are associated with $Things$, and this association is necessary to provide an interpretation for the $alternateOf$ and $specializationOf$ relations. It might also make sense to associate $Agents$, $Activities$, and $Interactions$ with $Things$, or with some other structures; however, this is not necessary to model any of the current features of PROV, so in the interest of simplicity we do not do this.

##### Plans

We identify a specific subset of the entities called plans:

A set $Plans \subseteq Entities$ of plans.

##### Collections

We identify another specific subset of the entities called collections, with the following associated structure:

• A set $Collections \subseteq Entities$
• A membership function $members : Collections \to P( Entities)$ mapping each collection to its set of members.

#### Activities

An activity is an object corresponding to a continuing process rather than an evolving thing. We introduce:

1. A set $Activities \subseteq Objects$ of activities.
2. Functions $startTime : Activities \to Times$ and $endTime :Activities \to Times$ giving the start and end time of each activity.
3. Activities are disjoint from Entities: $Entities\cap Activities = \emptyset$.

#### Agents

An agent is an object that can act, by controlling, starting, ending, or participating in activities. An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity. Agents can act on behalf of other agents. An agent may be a particular type of entity or activity; an agent cannot be both entity and activity because the sets of entities and activities are disjoint. We introduce:

A set $Agents \subseteq Objects$ of agents.

There is no requirement that every agent is either an activity or an entity.

#### Influences

We consider a set $Influences \subseteq Objects$ which has disjoint subsets $Events$ connecting entities and activities, $Associations$ between agents and activities, $Attributions$ between entities and agents, $Communications$ between pairs of activities, $Delegations$ between pairs of agents, and $Derivations$ that describe chains of generation and usage steps. These kinds of influences are discussed further below. Influences are disjoint from entities, activities and agents.

1. A set $Influences = Events \cup Associations \cup Communications \cup Delegations \cup Derivations \subseteq Objects$
2. The sets $Events$, $Associations$, $Communications$, $Delegations$ and $Derivations$ are all pairwise disjoint.
3. Influences are disjoint from entities, agents and activities: $Influences \cap (Entities \cup Activities \cup Agents) = \emptyset$
4. An associated function $influenced : Influences \to Objects \times Objects$ giving the source and target of each influence.
##### Events

An $Event$ is an instantaneous influence that relates an activity to an entity (either of which could also be an agent). Events have types including usage, generation, invalidation, starting and ending. Events are instantaneous. We introduce:

1. A set $Events \subseteq Influences$ of events, partitioned into disjoint subsets $Starts, Ends, Generations, Usages, Invalidations$.
2. A function $time : Events \to Times$.
3. A quasi-ordering on events $\preceq \subset Events \times Events$. We write $e \prec e'$ when $e \preceq e'$ and $e' \not\preceq e$ hold.
4. A function $started : Starts \to Activities \times Entities \times Activities$, such that $started(start) = (a,e,a')$ implies $start \in events(a) \cap events(e) \cap events(a')$.
5. A function $ended : Ends \to Activities \times Entities \times Activities$, such that $ended(end) = (a,e,a')$ implies $end \in events(a) \cap events(e) \cap events(a')$.
6. A function $used : Usages \to Activities \times Entities$ such that $used(use) = (a,e)$ implies $use \in events(a) \cap events(e)$.
7. A function $generated : Generations \to Entities \times Activities$ such that $generated(gen) = (a,e)$ implies $gen \in events(a) \cap events(e)$.
8. A function $invalidated : Invalidations \to Entities \times Activities$ such that $invalidated(inv) = (a,e)$ implies $inv \in events(a) \cap events(e)$.
##### Associations

An $Association$ is an influence relating an agent to an activity and optional plan. To model associations, we introduce:

A set $Associations \subseteq Influences$ with associated function $associatedWith : Associations \to Agents \times Activities \times Plans_\bot$.

An $Attribution$ is an influence relating an entity to an agent. To model attributions, we introduce:

A set $Attributions \subseteq Influences$ with associated function $attributedTo : Attributions \to Entities \times Agents$.

##### Communications

A $Communication$ is an influence indicating exchange of information between activities. To model communications, we introduce:

A set $Communications \subseteq Influences$ with associated function $communicated : Communications \to Activities \times Activities$.

##### Delegations

A $Delegation$ is an influence relating two agents. To model delegations, we introduce:

A set $Delegations \subseteq Influences$ and associated function $actedFor : Delegations \to Agents \times Agents \times Activities$

##### Derivations

A $Derivation$ is an influence chaining one or more generation and use steps. To model derivations, we introduce an auxiliary notion of derivation path. These paths are of the form

$$ent_n\cdot g_n\cdot act_n\cdot u_n\cdot ent_{n-1}\cdot ...\cdot ent_1\cdot g_1\cdot act_1\cdot u_1\cdot ent_0$$

where the $ent_i$ are entities, $act_i$ are activities, $g_i$ are generations, and $u_i$ are usages.

Formally, we consider the (regular) language:

$$DerivationPaths = Entities \cdot (Generations \cdot Activities \cdot Usages \cdot Entities)^+$$

with the constraints that for each derivation path:

• for each substring $ent\cdot g \cdot act$ we have $generated(g) = (ent,act)$, and
• for each substring $act \cdot u \cdot ent$ we have $used(u) = (act,ent)$.

A set $Derivations \subseteq Influences$ with an associated function $derivationPath : Derivations \to DerivationPaths$ linking each derivation to a derivation path.

The $derivationPath$ function links each $d \in Derivations$ to a derivation path. A derivation has exactly one associated derivation path. However, if the PROV-N statement wasDerivedFrom(e_2,e_1,-,-,-) is asserted in an instance, there may be multiple derivation paths linking $e_2$ to $e_1$, each corresponding to a different path, identified by different derivations $d \in Derivations$.

A derivation path implies the existence of at least one chained generation and use step. However, not all such potential derivation paths are associated with derivations; there can (and in general will) be many such paths that are not associated with derivation steps. In other words, because we require derivations to be explicitly associated with derivation paths, it is not sound to infer the existence of a derivation from the existence of an alternating generation/use chain.

The reason why we need paths and not just individual derivation steps is to reflect that $wasDerivedFrom(id,e_2,e_1,-,-,-,attrs)$ formulas can represent multiple derivation steps. However, there is no way to force a derivation to take multiple steps. Any valid PROV instance has a model in which all derivation paths are one-step.

Above we have stated some properties of the components. We impose some additional properties that relate several components, as follows:

1. If $generated(g) = (e,a_1)$ and $used(u) = (a_2,e)$ then there exists $c \in Communications$ such that $communicated(c) = (a_2,a_1)$.
2. If $e \in Entities$ then there exist $gen,inv,a,a'$ such that $generated(gen) = (e,a)$ and $invalidated(inv) = (e,a')$.
3. If $started(start) = (a_2,e,a_1)$ then there exists $gen$ such that $generated(gen) = (e,a_1)$.
4. If $ended(end) = (a_2,e,a_1)$ then there exists $gen$ such that $generated(gen) = (e,a_1)$.
5. If $d \in Derivations$ and $prov:Revision \in value(d,prov:type)$ and there exists $w \in (Generations \cup Activities \cup Uses \cup Entities)^*$ such that $derivationPath(deriv) = e_2 \cdot w \cdot e_1 \in DerivationPaths$ then $thingOf(e_1) = thingOf(e_2)$.
6. If $attributedTo(att) = (e,ag)$ then there exist $gen$, $assoc$ and $a$ such that $generated(gen) = (e,a)$ and $associatedWith(assoc) = (a,ag)$.
7. If $actedFor(deleg) = (ag_2,ag_1,act)$ then there exist $assoc_1,assoc_2,pl_1,pl_2$ such that $associatedWith(assoc_1) = (ag_1,act,pl_1)$ and $associatedWith(assoc_2) = (ag_2,act,pl_2)$.
8. If $generated(id) = (e,a)$ then $influenced(id) = (e,a)$.
9. If $used(id) = (e,a)$ then $influenced(id) = (e,a)$.
10. If $communicated(id) = (a_2,a_1)$ then $influenced(id) = (a_2,a_1)$.
11. If $started(id) = (a_2,e,a_1)$ then $influenced(id) = (a_2,e)$.
12. If $ended(id) = (a_2,e,a_1)$ then $influenced(id) = (a_2,e)$.
13. If $invalidated(id) = (e,a)$ then $influenced(id) = (e,a)$.
14. If $derivationPath(id) = e_2 \cdot w \cdot e_1$ then $influenced(id) = (e_2,e_1)$.
15. If $attributedTo(id) = (e,ag)$ then $influenced(id) = (e,ag)$.
16. If $associatedWith(id) = (a,ag,pl)$ then $influenced(id) = (a,ag)$.
17. If $actedFor(id) = (ag_2,ag_1)$ then $influenced(id) = (ag_2,ag_1)$.
18. If $generated(gen) = (e,a) = generated(gen')$ then $gen = gen'$.
19. If $invalidated(inv) = (e,a) = invalidated(inv')$ then $inv=inv'$.
20. If $started(st) = (a,e_1,a')$ and $started(st') = (a,e_2,a')$ then $st=st'$.
21. If $ended(end) = (a,e_1,a')$ and $ended(end') = (a,e_2,a')$ then $end=end'$.
22. If $started(st) = (a,e,a')$ then $st \preceq evt$ for all $evt \in events(a) - Invalidations$.
23. If $ended(end) = (a,e,a')$ then $evt \preceq end$ for all $evt \in events(a) - Invalidations$.
24. If $generated(gen) = (e,a)$ then $gen \preceq evt$ for all $evt \in events(e)$.
25. If $invalidated(inv) = (e,a)$ then $evt\preceq inv$ for all $evt \in events(e)$.
26. For any derivation $deriv$, with path $derivationPath(deriv) = w$, if $e_2 \cdot g \cdot a \cdot u \cdot e_1$ is a substring of $w$ where $e_1,e_2 \in Entities$, $g \in Generations$, $u \in Usages$ and $a \in Activities$ then $u \preceq g$.
27. For any derivation $deriv$, with path $derivationPath(deriv) = e_2 \cdot w \cdot e_1$, if $generated(gen_1) = (e_1,a_1)$ and $generated(gen_2) = (e_2,a_2)$ then $gen_1 \prec gen_2$.
28. If $associatedWith(assoc) = (a,ag,pl)$ and $started(start) = (a,e_1,a_1)$ and $invalidated(inv) = (ag,a_2)$ then $start \preceq inv$.
29. If $associatedWith(assoc) = (a,ag,pl)$ and $generated(gen) = (ag,a_1)$ and $ended(end) = (a,e_2,a_2)$ then $gen \preceq end$.
30. If $associatedWith(assoc) = (a,ag,pl)$ and $started(start) = (a,e_1,a_1)$ and $ended(end) = (ag,e_2,a_2)$ then $start \preceq end$.
31. If $associatedWith(assoc) = (a,ag,pl)$ and $started(start) = (ag,e_1,a_1)$ and $ended(end) = (a,e_2,a_2)$ then $start \preceq end$.
32. If $attributedTo(attrib) = (e,ag)$ and $generated(gen_1) = (ag_1,a_1)$ and $generated(gen_2) = (e,a_2)$ then $gen_1 \preceq gen_2$.
33. If $attributedTo(attrib) = (e,ag)$ and $started(start) = (ag_1,e_1,a_1)$ and $generated(gen) = (e,a_2)$ then $start \preceq gen$.
34. If $actedFor(deleg) = (ag_2,ag_1,a)$ and $generated(gen) = (ag_1,a_1)$ and $invalidated(inv) = (ag_2,a_2)$ then $gen \preceq inv$.
35. If $actedFor(deleg) = (ag_2,ag_1,a)$ and $started(start) = (ag_1,e_1,a_1)$ and $ended(end) = (ag_2,e_2,a_2)$ then $start \preceq end$.
36. If $e \in Entity$ and $prov:emptyCollection \in value(e,prov:type)$ then $e \in Collections$ and $members(e) = \emptyset$.

These properties are called axioms, and they are needed to ensure that the PROV-CONSTRAINTS inferences and constraints hold in all structures.

Axioms 22 and 23 do not require that invalidation events originating from an activity follow the activity's start event(s) or precede its end event(s). This is because there is no such constraint in PROV-CONSTRAINTS. Arguably, there should be a constraint analogous to Constraint 34 that specifies that any invalidation event in which an activity participates must follow the activity's start event(s) and precede its end event(s).

Here, we exempt invalidations from axioms 22 and 23 in order to simplify the proof of weak completeness.

### Putting it all together

A PROV structure $W$ is a collection of sets, functions, and relations containing all of the above described components and satisfying all of the associated properties and axioms. If we need to talk about the objects or relations of more than one structure then we may write $W_1.Objects$, $W_1.Things$, etc.; otherwise, to decrease notational clutter, when we consider a fixed structure then the names of the sets, relations and functions above refer to the components of that model.

Some features of PROV structures are relatively obvious or routine, corresponding directly to features of PROV and associated inferences. For example, the functions $used, generated, invalidated, started, ended$ mapping events to their associated entities or activities, and $communicated, associatedWith, attributedTo, actedFor$ associating other types of influences with appropriate data.

On the other hand, some features are more distinctive, and represent areas where formal modeling has been used to guide the development of PROV. Derivation paths are one such distinctive feature; they correspond to an intuition that derivations may describe one or multiple generation-use steps leading from one entity to another. Another distinctive feature is the use of $Things$, which correspond to changing, real-world things, as opposed to $Entities$, which correspond to limited views or perspectives on $Things$, with some fixed aspects. The semantic structures of $Things$ and $Entities$ provide a foundation for the $alternateOf$ and $specializationOf$ relations.

### Interpretations

We need to link identifiers to the objects they denote. We do this using a function which we shall call an interpretation. An interpretation is a function $\rho : Identifiers \to Objects$ describing which object is the target of each identifier. The mapping from identifiers to objects may not change over time; only $Objects$ can be denoted by $Identifiers$.

## Semantics

In what follows, let $W$ be a fixed structure with the associated sets and relations discussed in the previous section, and let $\rho$ be an interpretation of identifiers as objects in $W$. The annotations [WF] refer to well-formedness constraints that correspond to typing constraints.

### Satisfaction

Consider a formula $\phi$, a structure $W$ and an interpretation $\rho$. We define notation $W,\rho \models \phi$ which means that $\phi$ is satisfied in $W,\rho$. For atomic formulas, the definition of the satisfaction relation is given in the next few subsections. We give the standard definition of the semantics of the other formulas:

1. $W,\rho \models True$ always holds.
2. $W,\rho \models False$ never holds.
3. $W,\rho \models x = y$ holds if and only if $\rho(x) = \rho(y)$.
4. $W,\rho \models \neg \phi$ holds if and only if $W,\rho \models \phi$ does not hold.
5. $W,\rho \models \phi \wedge \psi$ holds if and only if $W,\rho \models \phi$ and $W,\rho \models \psi$.
6. $W,\rho \models \phi \vee \psi$ holds if either $W,\rho \models \phi$ or $W,\rho \models \psi$.
7. $W,\rho \models \phi \Rightarrow \psi$ holds if $W,\rho \models \phi$ implies $W,\rho \models \psi$.
8. $W,\rho \models \exists x. \phi$ holds if there exists some $obj \in Objects$ such that $W,\rho[x:=obj] \models \phi$.
9. $W,\rho \models \forall x. \phi$ holds if there for every $obj \in Objects$ we have $W,\rho[x:=obj] \models \phi$.

In the semantics above, note that the domain of quantification is the set of $Objects$; that is, quantifiers range over entities, activities, agents, or influences (which are in turn further subdivided into types of influences). $Things$ and relations cannot be referenced directly by identifiers.

A PROV instance $I$ consists of a set of statements, each of which can be translated to an atomic formula following the definitional rules in PROV-CONSTRAINTS, possibly by introducing fresh existential variables. Thus, we can view an instance $I$ as a set of atomic formulas $\{\phi_1,\ldots,\phi_n\}$, or equivalently a single formula $\exists x_1,\ldots,x_k.~\phi_1 \wedge \cdots \wedge \phi_n$, where $x_1,\ldots,x_k$ are the existential variables of $I$.

### Attribute matching

We say that an object $obj$ matches attributes $[attr_1=val_1,...]$ in structure $W$ provided: for each attribute $attr_i$, we have $val_i \in W.value(obj,attr_i)$. This is sometimes abbreviated as: $match(W,obj,attrs)$.

### Semantics of Element Formulas

#### Entity

An entity formula is of the form $entity(id,attrs)$ where $id$ denotes an entity.

Entity formulas $entity(id,attrs)$ can be interpreted as follows:

$W,\rho \models entity(id,attrs)$ holds if and only if:

1. [WF] $id$ denotes an entity $ent = \rho(id) \in Entities$.
2. the attributes match: $match(W,ent, attrs)$.

Not all of the attributes of an entity object are required to be present in an entity formula about that object. For example, the following formulas all hold if $x$ denotes an entity $e$ such that $value(e,a) = \{4,5\}, value(e,b) = \{6\}$ hold:

 entity(x,[])
entity(x,[a=5])
entity(x,[a=4,a=5])
entity(x,[a=4,b=6])


Note that PROV-CONSTRAINTS normalization will merge these formulas to a single one:

  entity(x,[a=4,a=5,b=6])


#### Activity

An activity formula is of the form $activity(id,st,et,attrs)$ where $id$ is a identifier referring to the activity, $st$ is a start time and $et$ is an end time, and $attrs$ are the attributes of activity $id$.

$W,\rho \models activity(id,st,et,attrs)$ holds if and only if:

1. [WF] The identifier $id$ maps to an activity $act = \rho(id) \in Activities$.
2. $\rho(st) \in Times$ is the activity's start time, that is: $startTime(act) = \rho(st)$.
3. $\rho(et)$ is the activity's end time, that is: $endTime(act) = \rho(et)$.

## Acknowledgements

This document has been produced by the PROV Working Group, and its contents reflect extensive discussion within the Working Group as a whole as well as feedback and comments from external reviewers. Thanks specifically to Khalid Belhajjame, Tom De Nies, Paolo Missier, Simon Miles, Luc Moreau, Satya Sahoo, Jan van den Bussche, Joachim Van Herwegen, and Antoine Zimmermann for detailed feedback.

We would also like to acknowledge Schloss Dagstuhl - Leibniz Center for Informatics, because significant progress was made on this document at Dagstuhl Seminar 12091 (Principles of Provenance) that took place from February 26 to March 2, 2012.

Thanks also to Robin Berjon for the ReSPec.js specification writing tool and to MathJax for their LaTeX-to-HTML conversion tools, both of which aided in the preparation of this document.

Members of the Provenance Working Group at the time of publication of this document were: Ilkay Altintas (Invited expert), Reza B'Far (Oracle Corporation), Khalid Belhajjame (University of Manchester), James Cheney (University of Edinburgh, School of Informatics), Sam Coppens (iMinds - Ghent University), David Corsar (University of Aberdeen, Computing Science), Stephen Cresswell (The National Archives), Tom De Nies (iMinds - Ghent University), Helena Deus (DERI Galway at the National University of Ireland, Galway, Ireland), Simon Dobson (Invited expert), Martin Doerr (Foundation for Research and Technology - Hellas(FORTH)), Kai Eckert (Invited expert), Jean-Pierre EVAIN (European Broadcasting Union, EBU-UER), James Frew (Invited expert), Irini Fundulaki (Foundation for Research and Technology - Hellas(FORTH)), Daniel Garijo (Universidad Politécnica de Madrid), Yolanda Gil (Invited expert), Ryan Golden (Oracle Corporation), Paul Groth (Vrije Universiteit), Olaf Hartig (Invited expert), David Hau (National Cancer Institute, NCI), Sandro Hawke (W3C/MIT), Jörn Hees (German Research Center for Artificial Intelligence (DFKI) Gmbh), Ivan Herman, (W3C/ERCIM), Ralph Hodgson (TopQuadrant), Hook Hua (Invited expert), Trung Dong Huynh (University of Southampton), Graham Klyne (University of Oxford), Michael Lang (Revelytix, Inc.), Timothy Lebo (Rensselaer Polytechnic Institute), James McCusker (Rensselaer Polytechnic Institute), Deborah McGuinness (Rensselaer Polytechnic Institute), Simon Miles (Invited expert), Paolo Missier (School of Computing Science, Newcastle university), Luc Moreau (University of Southampton), James Myers (Rensselaer Polytechnic Institute), Vinh Nguyen (Wright State University), Edoardo Pignotti (University of Aberdeen, Computing Science), Paulo da Silva Pinheiro (Rensselaer Polytechnic Institute), Carl Reed (Open Geospatial Consortium), Adam Retter (Invited Expert), Christine Runnegar (Invited expert), Satya Sahoo (Invited expert), David Schaengold (Revelytix, Inc.), Daniel Schutzer (FSTC, Financial Services Technology Consortium), Yogesh Simmhan (Invited expert), Stian Soiland-Reyes (University of Manchester), Eric Stephan (Pacific Northwest National Laboratory), Linda Stewart (The National Archives), Ed Summers (Library of Congress), Maria Theodoridou (Foundation for Research and Technology - Hellas(FORTH)), Ted Thibodeau (OpenLink Software Inc.), Curt Tilmes (National Aeronautics and Space Administration), Craig Trim (IBM Corporation), Stephan Zednik (Rensselaer Polytechnic Institute), Jun Zhao (University of Oxford), Yuting Zhao (University of Aberdeen, Computing Science).