This document defines PROV-DM, a data model for provenance, and PROV-ASN, an abstract syntax, which allows serializations of PROV-DM instances to be created in a technology independent manner, facilitates its mapping to concrete syntax, and is used as the basis for a formal semantics.

The name of the data model still has to be decided by the WG. The current placeholder name is PIDM. This is ISSUE-31.

This is a document for internal discussion, which will ultimately evolve into the first Public Working Draft of the Conceptual Model.

Introduction

Introduction needs to be written. It should indicate the aims and scope of this document, and position this document in the family of documents produced by the PROV WG.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [[!RFC2119]].

Preliminaries

A Conceptualization of the World

This specification is based on a conceptualization of the world that is described in this section. In the world (whether real or not), there are things, which can be physical, digital, conceptual, or otherwise, and activities involving things. Words such as thing or activity should be understood with their informal meaning.

When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc.

This is the single most important issue IMO: we hit readers with this "characterised thing" which is unexpected. We need to be absolutely clear about it...
ok, so it seems that characterized thing is introduced to deal with (1) imprecision, (2) disagreement amongst different "observers" of the same data-related events. I don't think this is about "disambiguation". It's about accommodating different perspectives on what is the same abstract "thing". This interpretation fits with the example: "different users may take different perspective..."

Hence, to disambiguate things and their situation in the world as perceived by us, we introduce the concept characterized thing, which refers to a thing and its situation in the world, as characterized by someone.

Different users may take different perspectives on a resource with a URL; these perspectives correspond to three different characterized things:
  • a report available at URL,
  • the version of the report available there today,
  • the report independent of where it is hosted over time.
Can we follow through from the example? There are three perspectives, possibly held by just one observer or by multiple ones. But why is it so important for reporting provenance that this distinction is made? I feel we need to connect this approach to provenance recording strongly and right away.

We do not assume that any characterization is more important than any other, and in fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different characterized thing that unambiguously characterizes the report appropriately.

In the world, activities involve things in multiple ways: they consume them, they process them, they transform them, they modify them, they change them, they relocate them, they use them, they generate them, they are controlled by them, etc.

In our conceptualization of the world, punctual events, or events for short, happen in the world, which mark changes in the world, in its activities, and in its things. In this specification, it is assumed that a partial order exists between events. How practically such order is realized is beyond the scope of this specification. Possible implementations of that ordering include a single global notion of time and Lamport's style clocks.
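As an illustration of one of the orderings mentioned above, the following Python sketch shows a Lamport-style logical clock. It is not part of PROV-DM: the class name LamportClock and its methods are hypothetical, and the sketch only shows how timestamps consistent with a "happened before" relation between events could be produced.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock kept by one observer (hypothetical helper, not defined by PROV-DM)."""
    time: int = 0

    def tick(self) -> int:
        """Record a local event and return its timestamp."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is itself an event; its timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Merge the sender's timestamp, then count the receipt as an event."""
        self.time = max(self.time, message_time) + 1
        return self.time


alice, bob = LamportClock(), LamportClock()
created = alice.tick()        # a local event at Alice
sent = alice.send()           # Alice sends a message to Bob
received = bob.receive(sent)  # Bob receives it

# Whenever one event happened before another, its timestamp is smaller.
print(created, sent, received)
```

Timestamps obtained this way only guarantee that causally related events are ordered consistently; two events with unrelated timestamps need not be ordered in the world.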

In this specification, the qualifier 'identifiable' is implicit whenever a reference is made to an activity or characterized thing.

PROV-ASN: The Provenance Abstract Syntax Notation

This specification defines PROV-DM, a data model for provenance, and relies on a language, PROV-ASN, the Provenance Abstract Syntax Notation, to express instances of that data model.

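PROV-ASN is an abstract syntax, whose goals are:
  • to allow serializations of PROV-DM instances to be created in a technology-independent manner,
  • to facilitate the mapping of PROV-DM to concrete syntaxes,
  • to serve as the basis for a formal semantics.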

This specification provides a grammar for PROV-ASN. Each expression of the PROV-DM data model is explained in terms of the productions of this grammar.

The formal semantics of PROV-DM is defined at [[PROV-SEMANTICS]] and its encoding in the OWL2 Web Ontology Language at [[PROV-OWL2]].

TODO: We need to define the BNF notation. We propose to use the same notation as in [[!OWL2-SYNTAX]] (see section BNF Notation).
Data model vs Language. Misc comments raised at ISSUE-62
Formalism used is not explained, not applied to concepts ISSUE-87.

Representation, Assertion, and Inference

PROV-DM is a provenance data model designed to express representations of the world.

A file at some point during its lifecycle, which includes multiple edits by multiple people, can be represented by its location in the file system, a creator, and content.

These representations are relative to an asserter, and in that sense constitute assertions stating properties of the world, as represented by an asserter. Different asserters will normally contribute different representations, and no attempt is made to define a notion of consistency of such different sets of assertions. The data model provides the means to associate attribution to assertions.

An alternative representation of the above file is a set of blocks in a hard disk.

The data model is designed to capture events that happened in the past, as opposed to events that may or will happen. However, this distinction is not formally enforced. Therefore, all PROV-DM assertions SHOULD be interpreted as a record of what has happened, as opposed to what may or will happen.

Can this be enforced formally?

This specification does not prescribe the means by which an asserter arrives at assertions; for example, assertions can be composed on the basis of observations, reasoning, or any other means.

Sometimes, inferences about the world can be made from representations conformant to the PROV-DM data model. When this is the case, this specification defines such inferences. Hence, representations of the world can result either from direct assertions by asserters or from application of inferences defined by this specification.

PROV-DM Overview

Conceptual model needs a high level overview ISSUE-86.
The following diagram provides a high level overview of the PROV-DM. Examples of a set of provenance assertions that conform to this schema are provided in the next section.

The model includes the following fundamental entity types:

The model includes two additional entity types: qualifiers and annotations. These are both structured as sets of attribute-value pairs. Attributes, qualifiers, and annotations are the main extensibility points in the model: individual interest groups are expected to extend PROV-DM by introducing new sets of attributes, qualifiers, and annotations as needed to address application-specific provenance modelling requirements.

Example

To illustrate PROV-DM, this section presents an example encoded according to PROV-ASN. For more detailed explanations of how PROV-DM should be used, and for more examples, we refer the reader to the Provenance Primer [[PROV-PRIMER]].
Alter example to cater for multiple ivpOf. This is ISSUE-33.
Some comments on the example. This is ISSUE-63.
Comments on section 3.2. This is ISSUE-71

A File Scenario

This scenario is concerned with the evolution of a crime statistics file (referred to as e0), which is stored on a shared file system and which journalists Alice, Bob, Charles, David, and Edith can share and edit. We consider various events in the evolution of file e0; the events listed below follow each other, unless otherwise specified.

Event evt1: Alice creates (pe0) an empty file in /shared/crime.txt. We denote this e1.

Event evt2: Bob appends (pe1) the following line to /shared/crime.txt:

There was a lot of crime in London last month.
We denote this e2.

Event evt3: Charles emails (pe2) the contents of /shared/crime.txt as an attachment, which we refer to as e4. (Our description is not precise about the nature of e4: it could be a copy of the file that is local to the mail client, that is uploaded on the mail server, or that is even in transit to a recipient.)

Event evt4: David edits (pe3) file /shared/crime.txt as follows.

There was a lot of crime in London and New York last month.
We denote this e3.

Event evt5: Edith emails (pe4) the contents of /shared/crime.txt as an attachment, referred to as e5.

Event evt6: Between events evt4 and evt5, someone (unspecified) runs a spell checker (pe5) on the file /shared/crime.txt. The file after spell checking is referred to as e6.

Encoding using PROV-ASN

In this section, the example is encoded according to the provenance data model (specified in section concepts) and expressed in PROV-ASN.

Entity Expressions (described in Section Entity). The file in its various forms and its copies are modelled as entities.

entity(e0, [ type= "File", location= "/shared/crime.txt", creator= "Alice" ])
entity(e1, [ type= "File", location= "/shared/crime.txt", creator= "Alice", content= "" ])
entity(e2, [ type= "File", location= "/shared/crime.txt", creator= "Alice", content= "There was a lot of crime in London last month."])
entity(e3, [ type= "File", location= "/shared/crime.txt", creator= "Alice", content= "There was a lot of crime in London and New York last month."])
entity(e4, [ ])
entity(e5, [ ])
entity(e6, [ type= "File", location= "/shared/crime.txt", creator= "Alice", content= "There was a lot of crime in London and New York last month.", spellchecked= "yes"])

The entities are characterized by attributes that have given values during intervals delimited by events; such intervals are referred to as characterization intervals. The following table lists all entities and their corresponding characterization intervals. When the end of the characterization interval is not delimited by an event described in this scenario, it is marked by "...".

Entity    Characterization Interval
e0        evt1 - ...
e1        evt1 - evt2
e2        evt2 - evt4
e3        evt4 - ...
e4        evt3 - ...
e5        evt5 - ...
e6        evt6 - ...

Process Execution Expressions (described in Section Process Execution): process execution represents an activity in the scenario.

processExecution(pe0,create-file,t,,[])
processExecution(pe1,add-crime-in-london,t+1,,[])
processExecution(pe2,email,t+2,,[])
processExecution(pe3,edit-London-New-York,t+3,,[])
processExecution(pe4,email,t+4,,[])
processExecution(pe5,spellcheck,,,[])

Generation Expressions (described in Section Generation): generation is the event at which a file is created in a specific form. To describe the modalities according to which the various entities are generated by a given process execution, a qualifier (construct described in Section Qualifier) is introduced. The interpretation of qualifiers is application specific. Illustrations of such qualifiers are: no specific qualifier is provided for e0; e2 was generated by the editor's save function; e4 can be found on the smtp port, in the attachment section of the mail message; e6 was produced on the standard output of pe5.

wasGeneratedBy(e0, pe0, qualifier())
wasGeneratedBy(e1, pe0, qualifier(fct="create"))
wasGeneratedBy(e2, pe1, qualifier(fct="save"))     
wasGeneratedBy(e3, pe3, qualifier(fct="save"))     
wasGeneratedBy(e4, pe2, qualifier(port=smtp, section="attachment"))  
wasGeneratedBy(e5, pe4, qualifier(port=smtp, section="attachment"))    
wasGeneratedBy(e6, pe5, qualifier(file=stdout))

Used Expressions (described in Section Use): use is the event by which a file is read by a process execution. Likewise, to describe the modalities according to which the various entities are used by a given process execution, a qualifier (construct described in Section Qualifier) is introduced. Illustrations of such qualifiers are: e1 is used in the context of pe1's load functionality; e2 is used by pe2 in the context of its attach functionality; e3 is used on the standard input by pe5.

used(pe1,e1,qualifier(fct="load"))
used(pe3,e2,qualifier(fct="load"))
used(pe2,e2,qualifier(fct="attach"))
used(pe4,e3,qualifier(fct="attach"))
used(pe5,e3,qualifier(file=stdin))

Derivation Expressions (described in Section Derivation): derivations express that an entity is derived from another. The first two are expressed in their compact version, whereas the following two are expressed in their full version, including the process execution underpinning the derivation and the qualifiers that qualify the use and generation of entities.

wasDerivedFrom(e2,e1)
wasDerivedFrom(e3,e2)
wasDerivedFrom(e4,e2,pe2,qualifier(port=smtp, section="attachment"),qualifier(fct="attach"))
wasDerivedFrom(e5,e3,pe4,qualifier(port=smtp, section="attachment"),qualifier(fct="attach"))

wasComplementOf (this relation is described in Section wasComplementOf): the crime statistics file (e0) has various contents over its existence (e1, e2, e3); the entities e1, e2, e3 complement e0 with an attribute content. Likewise, e6 complements e3 with an attribute spellchecked.

wasComplementOf(e1,e0)
wasComplementOf(e2,e0)
wasComplementOf(e3,e0)
wasComplementOf(e6,e3) 

Agent Expressions (described in Section Agent): the various users are represented as agents, themselves being a type of entity.

entity(a1, [ type= "Person", name= "Alice" ])
agent(a1)

entity(a2, [ type= "Person", name= "Bob" ])
agent(a2)

entity(a3, [ type= "Person", name= "Charles" ])
agent(a3)

entity(a4, [ type= "Person", name= "David" ])
agent(a4)

entity(a5, [ type= "Person", name= "Edith" ])
agent(a5)

Control Expressions (described in Section Control): the influence of an agent over a process execution is expressed as control, and the nature of this influence is described by a qualifier (construct described in Section Qualifier). Illustrations of such qualifiers include the role of the participating agent: creator, author, and communicator.

wasControlledBy(pe0,a1, qualifier(role="creator"))
wasControlledBy(pe1,a2, qualifier(role="author"))
wasControlledBy(pe2,a3, qualifier(role="communicator"))
wasControlledBy(pe3,a4, qualifier(role="author"))
wasControlledBy(pe4,a5, qualifier(role="communicator"))
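To show how the assertions of this example can be exploited, the following Python sketch loads a subset of them into dictionaries and answers the question "which agent, in which role, controlled the process execution that generated a given entity?". The dictionary layout and the function responsible_agent are hypothetical; they are one possible in-memory encoding, not something prescribed by PROV-DM. Keying generations by entity assumes that each entity has a single generating process execution in this account.

```python
# Subset of the wasGeneratedBy assertions: entity -> process execution.
was_generated_by = {
    "e0": "pe0", "e1": "pe0", "e2": "pe1", "e3": "pe3",
    "e4": "pe2", "e5": "pe4", "e6": "pe5",
}

# The wasControlledBy assertions: process execution -> (agent, role).
was_controlled_by = {
    "pe0": ("a1", "creator"),
    "pe1": ("a2", "author"),
    "pe2": ("a3", "communicator"),
    "pe3": ("a4", "author"),
    "pe4": ("a5", "communicator"),
}

# The name attribute of each agent's entity expression.
agent_names = {"a1": "Alice", "a2": "Bob", "a3": "Charles", "a4": "David", "a5": "Edith"}


def responsible_agent(entity_id):
    """Return (name, role) of the agent controlling the generating process
    execution, or None when no control assertion is available."""
    pe = was_generated_by.get(entity_id)
    controller = was_controlled_by.get(pe)
    if controller is None:
        return None
    agent_id, role = controller
    return agent_names[agent_id], role


print(responsible_agent("e3"))  # David's edit
print(responsible_agent("e6"))  # the spell checker run has no asserted controller
```

The second query returns no agent because the scenario leaves the person running the spell checker unspecified, so no control expression is asserted for pe5.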

Graphical Illustration

Provenance assertions can be illustrated as a graph. Details about the graphical illustration can be found in the appendix.

PROV-DM: The Provenance Data Model

This section contains the normative specification of PROV-DM, the PROV data model.

Expression

PROV-DM consists of a set of constructs to formulate representations of the world and constraints that must be satisfied by them. In PROV-ASN, such representations of the world MUST be conformant with the toplevel production expression of the grammar. These expressions are grouped in three categories: elementExpression (see section Element), relationExpression (see section Relation), and accountExpression (see section Account).

expression := elementExpression | relationExpression | accountExpression

elementExpression := entityExpression | processExecutionExpression | agentExpression | annotationExpression

relationExpression := generationExpression | useExpression | derivationExpression | controlExpression | complementExpression | peOrderingExpression | revisionExpression | participationExpression | annotationAssociationExpression
Furthermore, the PROV data model includes a "house-keeping construct" acting as a wrapper for interchanging PROV-DM expressions, which is compliant with the production provenanceContainer (see section Provenance Container).

Element

This section describes all the PROV-ASN expressions conformant to the elementExpression production of the grammar.

Entity

In PROV-DM, an entity expression is a representation of an identifiable characterized thing.

In PROV-ASN, an entity expression's text matches the entityExpression production of the grammar defined in this specification document.

entityExpression := entity ( identifier , [ attribute-values ] )
attribute-values := attribute-value | attribute-value , attribute-values
attribute-value := attribute = Literal

An instance of an entity expression, noted entity(id, [ attr1= val1, ...]) in PROV-ASN:

  • contains an identifier id identifying a characterized thing;
  • contains a set of attribute-value pairs [ attr1= val1, ...], representing this characterized thing's situation in the world.

The assertion of an instance of an entity expression, entity(id, [ attr1= val1, ...]), states, from a given asserter's viewpoint, the existence of an identifiable characterized thing, whose situation in the world is represented by the attribute-value pairs, which remain unchanged during a characterization interval, i.e. a continuous interval between two events in the world.

The following entity assertion,

entity(e0, [ type = "File", location = "/shared/crime.txt", creator = "Alice" ])
states the existence of a thing of type File, with location /shared/crime.txt and creator Alice, denoted by identifier e0, during some characterization interval.
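As a rough illustration of the entityExpression production, the following Python sketch parses entity assertions such as the one above into an identifier and a dictionary of attribute-value pairs. It is a simplification under stated assumptions: the function parse_entity and both regular expressions are hypothetical, identifiers and attribute names are assumed to consist of word characters and hyphens, and only double-quoted string literals without embedded quotes are handled.

```python
import re

# entity ( identifier , [ attribute-values ] )
ENTITY_RE = re.compile(r'^\s*entity\(\s*(?P<id>[\w-]+)\s*,\s*\[(?P<attrs>.*)\]\s*\)\s*$')
# attribute = "Literal"  (string literals only, an assumption of this sketch)
ATTR_RE = re.compile(r'(?P<name>[\w-]+)\s*=\s*"(?P<value>[^"]*)"')


def parse_entity(text):
    """Return (identifier, attributes) for a simple entity expression."""
    match = ENTITY_RE.match(text)
    if match is None:
        raise ValueError(f"not an entity expression: {text!r}")
    attributes = {
        m.group("name"): m.group("value")
        for m in ATTR_RE.finditer(match.group("attrs"))
    }
    return match.group("id"), attributes


identifier, attributes = parse_entity(
    'entity(e0, [ type = "File", location = "/shared/crime.txt", creator = "Alice" ])'
)
print(identifier, attributes)
```

A complete parser would be derived from the grammar itself, including whatever definition of Literal the specification adopts.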

Further considerations:
  • If an asserter wishes to characterize a thing with the same attribute-value pairs over several intervals, then they are required to assert multiple entity expressions, each with its own identifier.
  • There is no assumption that the set of attributes is complete, nor that the attributes are independent of (orthogonal to) each other.
  • A characterization interval may collapse into a single instant.
  • An entity assertion is about a characterized thing, whose situation in the world may be variant. An entity assertion is made at a particular point and is invariant, in the sense that its attributes are assigned a value as part of that assertion.
  • Activities are not represented by entities, but instead by process executions, as explained below.
Characterized entity may be variant. This is ISSUE-32
How is domain specific data combined with the provenance model? This is ISSUE-65.
Comments on bob ISSUE-60.
The name entity is used as a replacement for placeholder BOB. This is ISSUE-30.
Definition of Entity is confusing, maybe over-complex ISSUE-85.

Process Execution

In PROV-DM, a process execution expression is a representation of an identifiable activity, which performs a piece of work.

In PROV-ASN, a process execution expression's text matches the processExecutionExpression production of the grammar defined in this specification document.

processExecutionExpression := processExecution ( identifier [ , recipeLink ] , [ time ] , [ time ] , other-attribute-values )
other-attribute-values := attribute-values

The activity that a process execution expression is a representation of has a duration, delimited by its start and its end events; hence, it occurs over an interval delimited by two events. However, a process execution expression need not mention time information or duration, because they may not be known. Further characteristics of the activity in the world can be represented by the attribute-value pairs, which remain unchanged during the activity duration.

An instance of a process execution expression, noted processExecution(id, rl, st, et, [ attr1= val1, ...]) in PROV-ASN:

  • contains an identifier id;
  • MAY contain a recipe link rl;
  • MAY contain a start time st;
  • MAY contain an end time et;
  • contains a set of attribute-value pairs [ attr1= val1, ...], representing other attributes of this activity that hold for its whole duration.

The following process execution assertion

processExecution(pe1,add-crime-in-london,t+1,t+1+epsilon,[host="server.example.com"])
identified by identifier pe1, states the existence of an activity with recipe link add-crime-in-london, start time t+1, and end time t+1+epsilon, running on host server.example.com. The attribute host is application specific, but MUST hold for the duration of the activity.

The mere existence of a process execution assertion entails some event ordering in the world, since the start event precedes the end event. This is expressed by constraint start-precedes-end.
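A check of this ordering could be sketched in Python as follows. The class ProcessExecution, its field names, and the function satisfies_start_precedes_end are hypothetical, and wall-clock datetime values stand in for the events; the sketch also assumes that equal start and end times are acceptable and that an absent time cannot violate the constraint.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ProcessExecution:
    """In-memory stand-in for processExecution(id, rl, st, et, [attrs])."""
    identifier: str
    recipe_link: Optional[str] = None
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    attributes: dict = field(default_factory=dict)


def satisfies_start_precedes_end(pe: ProcessExecution) -> bool:
    """True unless both times are present and the end is before the start."""
    if pe.start is None or pe.end is None:
        return True  # nothing to compare: times are optional
    return pe.start <= pe.end


pe1 = ProcessExecution(
    "pe1",
    recipe_link="add-crime-in-london",
    start=datetime(2011, 7, 1, 10, 0),  # illustrative times, not from the scenario
    end=datetime(2011, 7, 1, 10, 5),
    attributes={"host": "server.example.com"},
)
print(satisfies_start_precedes_end(pe1))
```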

A process execution expression is not an entity expression. Indeed, an entity expression represents a thing that exists in full at any point in its characterization interval, persists during this interval, and preserves the characteristics that make it identifiable. In contrast, an activity is something that happens, unfolds, or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration.

Should process execution be defined as a subclass of BOB? This is ISSUE-66.

Agent

An agent expression is a representation of a characterized thing capable of activity.

In PROV-ASN, an agent expression's text matches the agentExpression production of the grammar defined in this specification document.

agentExpression := agent ( identifier )

An agent expression, written agent(e) in PROV-ASN, refers to an entity expression denoted by identifier e and representing the characterized thing capable of activity.

For a characterized thing, one can assert an agent expression or, alternatively, one can infer an agent expression from its involvement in an activity represented by a process execution expression.

With the following assertions,

entity(e1, [employee= "1234", name= "Alice"])  and agent(e1)

entity(e2) and wasControlledBy(pe,e2,qualifier(role="author"))
the entity expression identified by e1 is accompanied by an explicit assertion of an agent expression, and this assertion holds irrespective of the process executions it may be involved in. In contrast, from the entity expression identified by e2, one can infer an agent expression, as per the following inference.

If the expressions entity(e,av) and wasControlledBy(pe,e) hold for some identifiers pe, e, and attribute-values av, then the expression agent(e) also holds.
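The inference above can be sketched in Python as follows. The function infer_agents and its argument layout are hypothetical: entities is the set of identifiers for which an entity expression holds, and controls is a collection of (process execution, controller) pairs taken from wasControlledBy expressions, with qualifiers omitted.

```python
def infer_agents(entities, controls, asserted_agents=frozenset()):
    """Return the identifiers for which an agent expression holds.

    An agent expression holds for e when it was asserted explicitly, or when
    entity(e, av) and wasControlledBy(pe, e) both hold for some pe.
    """
    inferred = {controller for (_pe, controller) in controls if controller in entities}
    return set(asserted_agents) | inferred


entities = {"e1", "e2"}
controls = [("pe", "e2")]   # wasControlledBy(pe, e2, qualifier(role="author"))
asserted = {"e1"}           # agent(e1)

print(sorted(infer_agents(entities, controls, asserted)))
```

In this sketch a controller without a corresponding entity expression is not turned into an agent, because the rule requires both expressions to hold.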

Annotation

An annotation expression is a set of name value pairs, whose meaning is application specific. It may or may not be a representation of something in the world.

In PROV-ASN, an annotation expression's text matches the annotationExpression production of the grammar defined in this specification document.

annotationExpression := annotation ( identifier , name-values )
name-values := name-value | name-value , name-values
name-value := name = Literal

A separate PROV-DM expression is used to associate an annotation with an expression (see Section on annotation association). A given annotation may be associated with multiple expressions.

The following annotation expression

annotation(ann1,[color="blue", screenX=20, screenY=30])
consists of a list of application-specific name-value pairs, intended to help the rendering of the expression it is associated with, by specifying its color and its position on the screen. In this example, these name-value pairs do not constitute a representation of something in the world; they are just used to help render provenance.

Name-value pairs occurring in annotations differ from attribute-value pairs (occurring in entity expressions and process execution expressions). Attribute-value pairs MUST be a representation of something in the world, which remains constant for the duration of the characterization interval (for entity expressions) or the activity duration (for process execution expressions). It is OPTIONAL for name-value pairs to be representations of something in the world. If they are a representation of something in the world, then they MAY change value during the corresponding duration.

Relation

This section describes all the PROV-ASN expressions conformant to the relationExpression production of the grammar.

Generation

In PROV-DM, a generation expression is a representation of a world event, the creation of a new characterized thing by an activity. This characterized thing did not exist before creation. The representation of this event encompasses a description of the modalities of generation of this thing by this activity.

In PROV-ASN, a generation expression's text matches the generationExpression production of the grammar defined in this specification document.

generationExpression := wasGeneratedBy ( identifier , identifier , generationQualifier [, time] )

An instance of a generation expression, noted wasGeneratedBy(e,pe,q,t) in PROV-ASN:

  • contains an identifier e identifying an entity expression that represents the characterized thing that is created;
  • contains an identifier pe identifying a process execution expression that represents the activity that creates the characterized thing;
  • contains a generationQualifier q that describes the modalities of generation of this thing by this activity;
  • MAY contain a "generation time" t, the time at which the characterized thing was created.

The following generation assertions

  wasGeneratedBy(e1,pe1,qualifier(port="p1", order=1),t1)
  wasGeneratedBy(e2,pe1,qualifier(port="p1", order=2),t2)
state the existence of two events in the world, at which new characterized things, represented by entity expressions identified by e1 and e2, are created by an activity, itself represented by a process execution expression identified by pe1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order in these expressions are application specific.

A given entity can be generated at most by one process execution in the scope of a given account. The rationale for this constraint is as follows. If two process executions sequentially set different values to some attribute by means of two different generate events, then they generate distinct entities. Alternatively, for two process executions to generate an entity simultaneously, they would require some synchronization by which they agree the entity is released for use; the end of this synchronization would constitute the actual generation of the entity, but is performed by a single process execution. This unicity constraint is formalized as follows.

Given an entity expression denoted by e, two process execution expressions denoted by pe1 and pe2, and two qualifiers q1 and q2, if the expressions wasGeneratedBy(e,pe1,q1) and wasGeneratedBy(e,pe2,q2) exist in the scope of a given account, then pe1=pe2 and q1=q2.
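A checker for the unicity constraint above might look like the following Python sketch. The function generation_conflicts is hypothetical, the input triples are assumed to come from a single account, and qualifiers are encoded as tuples of (name, value) pairs so that they can be compared and hashed.

```python
from collections import defaultdict


def generation_conflicts(generations):
    """Return the entities generated with more than one (pe, qualifier) pair.

    `generations` holds (entity, process_execution, qualifier) triples taken
    from the wasGeneratedBy expressions of one account.
    """
    seen = defaultdict(set)
    for entity, pe, qualifier in generations:
        seen[entity].add((pe, qualifier))
    return {entity: pairs for entity, pairs in seen.items() if len(pairs) > 1}


account = [
    ("e1", "pe0", (("fct", "create"),)),
    ("e2", "pe1", (("fct", "save"),)),
    ("e2", "pe3", (("fct", "save"),)),  # deliberately conflicting, for illustration
]
conflicts = generation_conflicts(account)
print(sorted(conflicts))
```

Repeating the same generation expression twice is not reported, since in that case pe1=pe2 and q1=q2 hold.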
A generation event SHOULD have some visibility on the attributes of the generated entity, as expressed by the following constraint. The assertion of a generation event implies ordering of events in the world.
Need to say identifiable activity. This is ISSUE-39. The qualifier 'identifiable' is said to be implicit in section 4.
Comments on generation ISSUE-59.
Added justification for generation by a single process ISSUE-67.

Use

In PROV-DM, a use expression is a representation of a world event: the consumption of a characterized thing by an activity. The representation includes a description of the modalities of use of this thing by this activity.

In PROV-ASN, a use expression's text matches the useExpression production of the grammar defined in this specification document.

useExpression := used ( identifier , identifier , useQualifier [, time] )

An instance of a use expression, noted used(pe,e,q,t) in PROV-ASN:

  • refers to a process execution expression identified by pe, which represents the consuming activity;
  • refers to an entity expression identified by e, which represents the characterized thing that is consumed;
  • contains a useQualifier q, which describes the modalities of use of this thing by this activity;
  • MAY contain a "use time" t, the time at which the characterized thing was used.

The following use assertions

  used(pe1,e1,qualifier(parameter="p1"),t1)
  used(pe1,e2,qualifier(parameter="p2"),t2)
state that the activity, represented by the process execution expression identified by pe1, consumed two characterized things, represented by entity expressions identified by e1 and e2, at times t1 and t2, respectively; the first one was found as the value of parameter p1, whereas the second was found as the value of parameter p2. The semantics of parameter in these expressions is application specific.

A reference to a given entity expression MAY appear in multiple use expressions that refer to a given process execution expression. If one wants to annotate a use expression, or to express a pe-linked-derivationExpression referring to this entity and process execution expressions, the qualifier occurring in this use assertion MUST be unique among the qualifiers qualifying use expressions for this process execution.

Should we define a taxonomy of use? This is ISSUE-23.
Various comments raised at ISSUE-64.

Derivation

In PROV-DM, a derivation expression is a representation that some characterized thing is transformed from, created from, or affected by another characterized thing in the world.

PROV-DM offers two different forms of derivation expressions. The first one is tightly connected to the notion of activity (represented by a process execution expression), whereas the second one is not. The first kind of assertion is particularly suitable for asserters who have an intimate knowledge of activities, is more prescriptive, but offers a more precise description of derivation, whereas the second does not put such a requirement on the asserter, and allows a less precise description of derivation to be formulated. Both expressions need to be asserted by asserters, since PROV-DM does not provide the means to infer them; however, from these assertions, further derivations can be inferred by transitive closure.
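The transitive closure mentioned above can be illustrated with the following Python sketch, which chains asserted derivations into further ones. The function transitive_closure is hypothetical, each derivation is encoded as a (derived, source) pair of entity identifiers, and the sketch does not decide which PROV-DM expression the inferred pairs correspond to.

```python
def transitive_closure(derivations):
    """Return all (derived, source) pairs obtainable by chaining derivations."""
    closure = set(derivations)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # a derives from b, and b derives from d: a derives from d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# The four derivations asserted in the file scenario.
asserted = {("e2", "e1"), ("e3", "e2"), ("e4", "e2"), ("e5", "e3")}
closure = transitive_closure(asserted)
print(sorted(closure - asserted))
```

For the file scenario, the closure adds that e3, e4, and e5 each depend on e1, and that e5 depends on e2.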

In PROV-ASN, a derivation expression's text matches the derivationExpression production of the grammar defined in this specification document.

derivationExpression := pe-linked-derivationExpression | pe-independent-derivationExpression | transitiveDerivationExpression
pe-linked-derivationExpression:= wasDerivedFrom ( identifier , identifier [, identifier , generationQualifier , useQualifier] )
pe-independent-derivationExpression:= wasEventuallyDerivedFrom ( identifier , identifier )
transitiveDerivationExpression:= dependedOn ( identifier , identifier )

The three kinds of derivation expressions are successively introduced.

Process Execution Linked Derivation Assertion

A process execution linked derivation expression, which, by definition of a derivation expression, is a representation that some characterized thing is transformed from, created from, or affected by another characterized thing, also entails the existence of a process execution expression that represents an activity that transforms, creates or affects this characterized thing.

In its full form, a process-execution linked derivation expression, noted wasDerivedFrom(e2,e1,pe,q2,q1) in PROV-ASN:

  • refers to an entity expression identified by e2, which is a representation of the generated characterized thing;
  • refers to an entity expression identified by e1, which is a representation of the used characterized thing;
  • refers to a process execution expression identified by pe, which is a representation of the activity using and generating the above characterized things;
  • contains a qualifier q2, which qualifies the generation expression pertaining to e2 and pe;
  • contains a qualifier q1, which qualifies the use expression pertaining to e1 and pe.
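The roles of the five components listed above suggest how a full-form derivation relates to a generation expression and a use expression. The following Python sketch makes that reading explicit; the function expand_derivation and the tuple encoding are hypothetical, qualifiers are kept as opaque strings, and the sketch is an illustration of the component roles rather than a normative inference rule.

```python
def expand_derivation(e2, e1, pe, q2, q1):
    """Split wasDerivedFrom(e2, e1, pe, q2, q1) into the generation and use
    expressions that its components refer to."""
    generation = ("wasGeneratedBy", e2, pe, q2)  # q2 qualifies the generation of e2 by pe
    use = ("used", pe, e1, q1)                   # q1 qualifies the use of e1 by pe
    return generation, use


# The derivation of e4 from e2 in the file scenario.
generation, use = expand_derivation(
    "e4", "e2", "pe2",
    'qualifier(port=smtp, section="attachment")',
    'qualifier(fct="attach")',
)
print(generation)
print(use)
```

The two tuples correspond to the wasGeneratedBy(e4, pe2, ...) and used(pe2, e2, ...) assertions given in the file scenario.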

For convenience, PROV-DM allows for a compact, process-execution linked derivation assertion, written wasDerivedFrom(e2,e1) in PROV-ASN, which:

  • refers to an entity expression identified by e2, which is a representation of the generated characterized thing;
  • refers to an entity expression identified by e1, which is a representation of the used characterized thing.

The following derivation assertions

wasDerivedFrom(e5,e3,pe4,qualifier(channel=out),qualifier(channel=in))
wasDerivedFrom(e3,e2)

state the existence of process execution linked derivations. The first expresses that the activity represented by the process execution expression pe4, by using the thing represented by e3, obtained on channel in, derived the thing represented by e5 and generated it on channel out. The second is similar for e3 and e2, but it leaves the process execution expression and associated qualifiers implicit. The meaning of "channel" is application specific.

The following inference rule states that a generation and use event can be inferred from a process execution linked derivation expression.

The compact version has the same meaning as the fully formed process-execution linked derivation expression, except that a process execution expression is known to exist, though it does not need to be asserted. This is formalized by the following inference rule, referred to as process execution introduction:

If a derivation expression holds for e2 and e1, then it means that the thing represented by the entity expression identified by e1 has an influence on the thing represented by the entity expression identified by e2, which is captured by a dependency between their attribute values; it also implies temporal ordering. These are specified as follows:

Should this dependency of attributes be made explicit as argument of the derivation expression? By making it explicit, we would allow someone to verify the validity of the derivation expression.

Note that inferring derivation from use and generation does not hold in general. Indeed, when a generation wasGeneratedBy(e2,pe,r2) precedes used(pe,e1,r1), for some e1, e2, r1, r2, and pe, one cannot infer derivation wasDerivedFrom(e2,e1,pe,r2,r1) or wasDerivedFrom(e2,e1) since the values of attributes of e2 cannot possibly be determined by the values of attributes of e1, given the creation of e2 precedes the use of e1.

A further inference is permitted from the compact version of derivation expression:

This inference is justified by the fact that the characterized thing represented by entity expression identified by e2 is generated by at most one activity in a given account (see generation-unicity). Hence, this process execution expression is also the one referred to in the use expression of e1.

We note that the "symmetric" inference does not hold. From wasDerivedFrom(e2,e1) and used(pe,e1), one cannot derive wasGeneratedBy(e2,pe,r2), because identifier e1 may occur in use expressions referring to many process execution expressions, and these need not be referred to in generation expressions containing identifier e2.
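
The asymmetry described above can be sketched in Python. This is an illustrative reading only: the text of the inference rule itself is not reproduced in this section, so the rule encoded here (from wasDerivedFrom(e2,e1) and wasGeneratedBy(e2,pe,r2), infer a use of e1 by pe) is an assumption reconstructed from the justification given. The function name infer_uses and the tuple encodings are hypothetical.

```python
def infer_uses(derivations, generations):
    """Infer (pe, e1) use pairs; the use qualifier is left unspecified.

    derivations: set of (e2, e1) pairs, one per wasDerivedFrom(e2,e1).
    generations: set of (e2, pe) pairs, one per wasGeneratedBy(e2,pe,r2).
    Assumes generation-unicity: each entity has at most one generator.
    """
    generator = dict(generations)  # entity identifier -> its unique generator
    return {(generator[e2], e1) for e2, e1 in derivations if e2 in generator}


inferred = infer_uses({("e3", "e2")}, {("e3", "pe4")})
not_inferred = infer_uses({("e3", "e2")}, set())
```

No function is given for the symmetric direction, because a known use of e1 by some pe says nothing about which process execution generated e2.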

Process Execution Independent Derivation Expression

A process execution independent derivation expression is a representation of a derivation, which occurred by any means whether direct or not, and regardless of any activity in the world.

A process-execution independent derivation expression, written wasEventuallyDerivedFrom (e2, e1) in PROV-ASN,

  • contains an identifier e2, denoting an entity expression, which represents the generated characterized thing;
  • contains an identifier e1, denoting an entity expression, which represents the used characterized thing.

If a derivation expression (wasEventuallyDerivedFrom) holds for e2 and e1, then this means that the thing represented by the entity expression identified by e1 has an influence on the thing represented by the entity expression identified by e2, which at the minimum implies temporal ordering, specified as follows:

Note that temporal ordering is between generations of e1 and e2, as opposed to process execution linked derivation, which implies temporal ordering between the use of e1 and generation of e2 (see derivation-use-generation-ordering). Indeed, in the case of wasEventuallyDerivedFrom, nothing is known about the use of e1, since there is no associated process execution.

A process execution linked derivation expression is richer than a process execution independent derivation expression, since it contains or implies the existence of a process execution expression. Hence, from the former, we can infer the latter.

Given two entity expressions denoted by e1 and e2, if the assertion wasDerivedFrom(e2,e1) holds, then the assertion wasEventuallyDerivedFrom(e2,e1) also holds.

Hence, a process-execution independent derivation expression can be directly asserted or can be inferred (by means of derivation-linked-independent).

Should we link wasEventuallyDerivedFrom to attributes as we did for wasDerivedFrom? If so, this type of inference should be presented upfront, for both.

Transitive Derivation Expression

If wasDerivedFrom(e2,e1) holds because attribute a2.1 of e2 is determined by attribute a1.1 of e1, and if wasDerivedFrom(e3,e2) holds because attribute a3.1 of e3 is determined by attribute a2.2 of e2, it is not necessarily the case that an attribute of e3 is determined by an attribute of e1; so, an asserter may not be able to assert wasDerivedFrom(e3,e1), since it would fail to satisfy constraint derivation-attributes. Hence, the constraint on attributes as expressed in derivation-attributes invalidates transitivity in the general case.

However, there is a sense in which e3 still depends on e1, since e3 could not be generated without e1 existing. Hence, we introduce a weaker notion of derivation expression, which is transitive.

An instance of a transitive derivation expression, written dependedOn(e2, e1) in PROV-ASN:
  • contains an identifier e2, denoting an entity expression, which represents the characterized thing that is the result of the derivation;
  • contains an identifier e1, denoting an entity expression, which represents the characterized thing that the derivation relies upon.

The expression dependedOn can only be inferred; in other words, it cannot be asserted. It is transitive by definition and relies on the previously defined derivation assertions for its base case.

  • If wasDerivedFrom(e2,e1) or wasDerivedFrom(e2,e1,pe,r2,r1) holds, then dependedOn(e2,e1) holds.
  • If wasEventuallyDerivedFrom(e2,e1) holds, then dependedOn(e2,e1) holds.
  • If dependedOn(e3,e2) and dependedOn(e2,e1) hold, then dependedOn(e3,e1) holds.
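
The three rules above amount to a transitive closure over the asserted derivations. The following Python sketch illustrates this under stated assumptions: derivations are encoded as (later, earlier) pairs of identifiers, the full form wasDerivedFrom(e2,e1,pe,r2,r1) is reduced by the caller to the pair (e2,e1), and the function name depended_on is hypothetical.

```python
from itertools import product


def depended_on(was_derived_from, was_eventually_derived_from):
    """Return every (e_later, e_earlier) pair for which dependedOn holds."""
    # Base cases: each asserted derivation, of either kind, is a dependency.
    closure = set(was_derived_from) | set(was_eventually_derived_from)
    # Transitive step: keep chaining pairs until nothing new is added.
    changed = True
    while changed:
        changed = False
        for (e3, mid_a), (mid_b, e1) in product(tuple(closure), repeat=2):
            if mid_a == mid_b and (e3, e1) not in closure:
                closure.add((e3, e1))
                changed = True
    return closure


deps = depended_on({("e3", "e2")}, {("e2", "e1")})
```

Here dependedOn(e3,e1) is obtained even though neither wasDerivedFrom(e3,e1) nor wasEventuallyDerivedFrom(e3,e1) was asserted.
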
Is derivation transitive? If so, it should not be introduced as an assertion. This is ISSUE-45.
Should derivation have a time? Which time? This is ISSUE-43.
Should we specifically mention derivation of agents? This is ISSUE-42.
Transitivity does not seem to follow from above definition. This is ISSUE-56.
What's the difference between one step and multi-step derivation assertion. Justification of why one entity can be generated at most once. Multi-step derivation is also transitive. This is all in ISSUE-67.

Control

A control expression is a representation of the involvement of a characterized thing (represented as an agent expression or an entity expression) in an activity, which is represented by a process execution expression; a control qualifier qualifies this involvement.

In PROV-ASN, a control expression's text matches the controlExpression production of the grammar defined in this specification document.

controlExpression := wasControlledBy ( identifier, identifier, controlQualifier )

An instance of a control expression, noted wasControlledBy(pe,ag,q) in PROV-ASN:

  • contains an identifier pe denoting a process execution expression, representing the controlled activity;
  • refers to an agent expression or an entity expression ag, representing the controlling characterized thing;
  • contains a qualifier q, qualifying the involvement of the thing in the activity.

The following control assertion

wasControlledBy(pe3,a4,qualifier(role="author"))
states that the activity represented by the process execution expression denoted by pe3 saw the involvement of a characterized thing, represented by the entity expression denoted by a4, in the capacity of author.

Complementarity

A complementarity expression is a relationship between two characterized things stated to have compatible characterization over some continuous interval between two events.

The rationale for introducing this relationship is that in general, at any given time, for a thing in the world, there may be multiple ways of characterizing it, and hence multiple representations can be asserted by different asserters. In the example that follows, suppose thing "Royal Society" is represented by two asserters, each using a different set of attributes. If the asserters agree that both representations refer to "The Royal Society", the question of whether any correspondence can be established between the two representations arises naturally. This is particularly relevant when (a) the sets of properties used by the two representations overlap partially, or (b) one set is subsumed by the other. In both these cases, each of the two asserters has a partial view of "The Royal Society", and establishing a correspondence between them on the shared properties is beneficial: in case (a) each of the two representations complements the other, and in case (b) one of the two (that with the additional properties) complements the other.

This intuition is made more precise by considering the entities that form the representations of characterized things at a certain point in time. An entity expression represents, by means of attribute-value pairs, a thing and its situation in the world, which remain constant over a characterization interval. As soon as the thing's situation changes, this marks the end of the characterization interval for the entity expression representing it. The thing's novel situation is represented by an attribute with a new value, or an entirely different set of attribute-value pairs, embodied in another entity expression, with a new characterization interval. Thus, if we overlap the timelines (or, more generally, the sequences of value-changing events) for the two characterized things, we can hope to establish correspondences amongst the entity expressions that represent them at various points along that event line. The figure below illustrates this intuition.

Relation complement-of between two entity expressions is intended to capture these correspondences, as follows. Suppose entity expressions A and B share a set P of properties, and each of them has other properties in addition to P. If the values assigned to each property in P are compatible between A and B, then we say that A is-complement-of B, and B is-complement-of A, in a symmetrical fashion. In the particular case where the set of properties of B is a strict superset of A's properties, we say that B is-complement-of A, but the opposite does not hold: in this case, the relation is not symmetric. As a special case, A and B may not share any attributes at all, and yet the asserters may still stipulate that they are representing the same thing "Royal Society"; the symmetric relation may hold trivially in this case.

The term compatible used above means that a mapping can be established amongst the values of attributes in P and found in the two entities. This generalizes to the case where attribute sets P1 and P2 of A and B, respectively, are not identical but can be mapped to one another. The simplest case is the identity mapping, in which A and B share attribute set P, and furthermore the values assigned to attributes in P match exactly.
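
The identity-mapping case can be illustrated with a short Python sketch. It is one possible reading, not a normative definition: entity expressions are modelled as dictionaries of attribute-value pairs, the function names and the attribute name "name" are hypothetical, and the direction of the relation is decided by an assumed rule (X is-complement-of Y when the shared values match and X carries at least one attribute that Y lacks).

```python
def compatible(a, b):
    """Identity mapping: every attribute shared by a and b has the same value."""
    shared = a.keys() & b.keys()
    return all(a[k] == b[k] for k in shared)


def is_complement_of(x, y):
    """Assumed reading: x complements y if they are compatible and x has
    at least one attribute that y does not have."""
    return compatible(x, y) and bool(x.keys() - y.keys())


location_view = {"name": "Royal Society", "location": "The Mall"}
membership_view = {"name": "Royal Society", "membership": "270"}
name_only = {"name": "Royal Society"}

symmetric = (is_complement_of(location_view, membership_view)
             and is_complement_of(membership_view, location_view))
one_way = (is_complement_of(location_view, name_only)
           and not is_complement_of(name_only, location_view))
```

Under this reading, partially overlapping attribute sets give the symmetric case, and a strict superset gives the one-directional case. Two expressions with no shared attributes are trivially compatible.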

It is important to note that the relation holds only as long as the entities involved are valid. As soon as one attribute changes value in one of them, new correspondences need to be found amongst the new entities. Thus, the relation has a validity span that can be expressed in terms of the event lines of the thing.

In PROV-ASN, a complementarity expression's text matches the complementarityExpression production of the grammar defined in this specification document.

complementarityExpression := wasComplementOf ( identifier , identifier )

An instance of a complementarity expression is written wasComplementOf(e2,e1) in PROV-ASN, where e1 and e2 are two identifiers denoting entity expressions.

entity(rs,[created: "1870"])

entity(rs_l1,[location: "loc2"])
entity(rs_l2,[location: "The Mall"])

entity(rs_m1,[membership: "250", year: "1900"])
entity(rs_m2,[membership: "300", year: "1945"])
entity(rs_m3,[membership: "270",  year: "2010"])

wasComplementOf(rs_m3, rs_l2)
wasComplementOf(rs_m2, rs_l1)
wasComplementOf(rs_m2, rs_l2)
wasComplementOf(rs_m1, rs_l1)

wasComplementOf(rs_l1, rs)
wasComplementOf(rs_l2, rs)

Mutual ivpOf each other should be agreed. This is ISSUE-29
Do we need a sameAsEntity relation. This is ISSUE-35
Is ivpOf transitive? This is ISSUE-45
Comments on ivpof in ISSUE-57.

Ordering of Process Executions

PROV-DM allows two forms of temporal relationships between activities to be expressed. An information flow ordering expression is a representation that a characterized thing was generated by an activity, represented by a process execution expression, before it was used by another activity, also represented by a process execution expression. A control ordering expression is a representation that the end of an activity, represented by a process execution expression, precedes the start of another activity, also represented by a process execution expression.

In PROV-ASN, a process execution ordering expression's text matches the peOrderingExpression production of the grammar defined in this specification document.

peOrderingExpression := informationFlowOrderingExpression | controlOrderingExpression
informationFlowOrderingExpression  := wasInformedBy ( identifier , identifier )
controlOrderingExpression  := wasScheduledAfter ( identifier , identifier )

An instance of an information flow ordering expression, written as wasInformedBy(pe2,pe1) in PROV-ASN:

and states information flow ordering between the activities represented by these expressions, specified as follows.
Given two process execution expressions denoted by pe1 and pe2, the expression wasInformedBy(pe2,pe1) holds, if and only if there is an entity expression denoted by e and qualifiers q1 and q2, such that wasGeneratedBy(e,pe1,q1) and used(pe2,e,q2) hold.
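
Since the definition above is an "if and only if", the set of wasInformedBy pairs can be computed from the generation and use expressions. The following Python sketch illustrates this; the tuple encodings and the function name was_informed_by are assumptions made for the example.

```python
def was_informed_by(generations, uses):
    """Compute all (pe2, pe1) pairs such that wasInformedBy(pe2,pe1) holds.

    generations: iterable of (e, pe1, q1), one per wasGeneratedBy(e,pe1,q1).
    uses: iterable of (pe2, e, q2), one per used(pe2,e,q2).
    """
    # Index the generating process executions by entity identifier.
    generated_by = {}
    for e, pe1, _q1 in generations:
        generated_by.setdefault(e, set()).add(pe1)
    # A use of e by pe2 links pe2 to every process execution that generated e.
    pairs = set()
    for pe2, e, _q2 in uses:
        for pe1 in generated_by.get(e, ()):
            pairs.add((pe2, pe1))
    return pairs


informed = was_informed_by(
    generations=[("e1", "pe1", "q1")],
    uses=[("pe2", "e1", "q2"), ("pe3", "e9", "q3")],
)
```

The use of e9 by pe3 yields no pair, because no generation expression for e9 is known.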

An instance of a control ordering expression, written as wasScheduledAfter(pe2,pe1) in PROV-ASN:

and states control ordering between pe2 and pe1, specified as follows.
Given two process execution expressions denoted by pe1 and pe2, the expression wasScheduledAfter(pe2,pe1) holds, if and only if there are two entity expressions denoted by e1 and e2, such that wasControlledBy(pe1,e1,qualifier(role="end")) and wasControlledBy(pe2,e2,qualifier(role="start")) and wasDerivedFrom(e2,e1).
This definition assumes that the activities represented by process execution expressions identified by pe1 and pe2 are controlled by some agents, represented by expressions identified by e1 and e2, where the first agent terminates (control qualifier qualifier(role="end")) the first activity, and the second initiates (control qualifier qualifier(role="start")) the second. The second agent being "derived" from the first enforces temporal ordering.

In the following assertions, we find two process execution expressions, identified by pe1 and pe2, denoting two activities, which took place on two separate hosts.

processExecution(pe1,long-workflow,t1,t2,[host="server1.example.com"])
processExecution(pe2,long-workflow,t3,t4,[host="server2.example.com"])
entity(e1,[type="scheduler",state=1])
entity(e2,[type="scheduler",state=2])
wasControlledBy(pe1,e1,qualifier(role="end"))
wasControlledBy(pe2,e2,qualifier(role="start"))
wasDerivedFrom(e2,e1)
wasScheduledAfter(pe2,pe1)
The one identified by pe2 is said to be scheduled after the one identified by pe1 because the scheduler terminated the activity (represented by process execution identified by pe1) to relocate it to the new host.
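
The definition of wasScheduledAfter can be checked against this example with a small Python sketch. The encoding is an assumption: control expressions are reduced to (process execution, controller, role) triples, derivations to (e2, e1) pairs, and the function name was_scheduled_after is hypothetical.

```python
def was_scheduled_after(controls, derivations):
    """Compute all (pe2, pe1) pairs such that wasScheduledAfter(pe2,pe1) holds.

    controls: set of (pe, e, role) for wasControlledBy(pe,e,qualifier(role=...)).
    derivations: set of (e2, e1) for wasDerivedFrom(e2,e1).
    """
    enders = [(pe, e) for pe, e, role in controls if role == "end"]
    starters = [(pe, e) for pe, e, role in controls if role == "start"]
    return {
        (pe2, pe1)
        for pe1, e1 in enders
        for pe2, e2 in starters
        if (e2, e1) in derivations  # the starter is derived from the ender
    }


controls = {("pe1", "e1", "end"), ("pe2", "e2", "start")}
derivations = {("e2", "e1")}
scheduled = was_scheduled_after(controls, derivations)
```

Without the derivation between the two scheduler states, no ordering would be obtained, since nothing would tie the end of pe1 to the start of pe2.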

Suggested definition for process ordering. This is ISSUE-50.

Revision

A revision expression is a representation of the creation of a characterized thing considered to be a variant of another. Deciding whether something is made available as a revision of something else usually involves an agent who represents someone in the world who takes responsibility for declaring that the former is a variant of the latter.

In PROV-ASN, a revision expression's text matches the revisionExpression production of the grammar defined in this specification document.

revisionExpression := wasRevisionOf ( identifier , identifier [, identifier] )

An instance of a revision expression, noted wasRevisionOf(e2,e1,ag) in PROV-ASN:

  • contains an identifier e2 denoting an entity that represents a newer version of a thing;
  • contains an identifier e1 denoting an entity that represents an older version of a thing;
  • MAY refer to a responsible agent denoted by identifier ag.

A revision expression can only be asserted, since it needs to include a reference to an agent who represents someone in the real world who bears responsibility for declaring a variant of a thing. However, it needs to satisfy the following constraint, linking the two entity expressions by a derivation, and seeing them as complements of a third entity expression.

Given two identifiers old and new denoting two entities, and an identifier ag denoting an agent, if an expression wasRevisionOf(new,old,ag) is asserted, then there exists an entity e and attribute-values av, such that the following expressions hold:
  • wasEventuallyDerivedFrom(new,old);
  • entity(e,av);
  • wasComplementOf(new,e);
  • wasComplementOf(old,e).
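
A checker for this constraint can be sketched in Python as follows. The encodings are assumptions: expressions are held as sets of identifier pairs, entities is the set of identifiers for which an entity expression exists, and the function name revision_constraint_holds is hypothetical.

```python
def revision_constraint_holds(new, old, eventually_derived, complements, entities):
    """Check the constraint attached to wasRevisionOf(new,old,ag).

    eventually_derived: set of (e2, e1) for wasEventuallyDerivedFrom(e2,e1).
    complements: set of (e2, e1) for wasComplementOf(e2,e1).
    entities: identifiers e for which some entity(e,av) holds.
    """
    # The newer entity must be derived, by any means, from the older one.
    if (new, old) not in eventually_derived:
        return False
    # Both must complement a common third entity expression.
    return any(
        (new, e) in complements and (old, e) in complements
        for e in entities
    )


ok = revision_constraint_holds(
    "e3", "e2",
    eventually_derived={("e3", "e2")},
    complements={("e3", "doc"), ("e2", "doc")},
    entities={"e2", "e3", "doc"},
)
```

The identifier doc stands for the third entity expression, for example a characterization of the document that is independent of its version.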

wasRevisionOf is a strict sub-relation of wasEventuallyDerivedFrom, since two entities e2 and e1 may satisfy wasEventuallyDerivedFrom(e2,e1) without being variants of each other.

The following revision assertion

wasRevisionOf(e3,e2,a4)
states that the document represented by the entity expression identified by e3 is declared to be a revision of the document represented by the entity expression identified by e2 by the agent represented by the entity expression denoted by a4.

Revision should be a class not a property. This is ISSUE-48.
What's the difference with derivation? Is it necessary? This is ISSUE-61.

Participation

A participation expression is a representation of the involvement of a characterized thing in an activity. A participation expression can be asserted or inferred.

In PROV-ASN, a participation expression's text matches the participationExpression production of the grammar defined in this specification document.

participationExpression := hadParticipant ( identifier , identifier )

An instance of a participation expression, noted hadParticipant(pe,e) in PROV-ASN:

  • refers to a process execution expression denoted by identifier pe and representing an activity;
  • contains an identifier e denoting an entity expression, which is a representation of a characterized thing involved in this activity.

A thing's participation in an activity can be by direct use or direct control. In addition, if a thing and its situation are characterized in two complementary manners (and are represented by two entity expressions related by wasComplementOf), and one of them participates in an activity, so does the other. The following captures the definition of participation.

Is there a need for a similar concept that includes generated entities?
Suggested definition for participation. This is ISSUE-49.

Annotation Association

An annotation association expression establishes a link between an identifiable PROV-DM expression and an annotation expression referred to by its identifier. Multiple annotation expressions can be associated with a given PROV-DM expression; symmetrically, multiple PROV-DM expressions can be associated with a given annotation expression. Since annotation expressions have identifiers, they can also be annotated. The annotation mechanism (with annotation expression and the annotation association expression) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).

In PROV-ASN, an annotation association expression's text matches the annotationAssociationExpression production of the grammar defined in this specification document.

annotationAssociationExpression := hasAnnotation ( identifier , identifier ) | hasAnnotation ( relationIdentification , identifier )
relationIdentification := identifier identifier qualifier

The following expressions

entity(e1,[type="document"])
entity(e2,[type="document"])
annotation(ann1,[icon="doc.png"])
hasAnnotation(e1,ann1)
hasAnnotation(e2,ann1)
assert the existence of two documents in the world (attribute-value pair: type="document") represented by entity expressions identified by e1 and e2, and annotate these expressions with an annotation indicating that the icon (an application specific way of rendering provenance) is doc.png.

Bundle

In this section, two constructs are introduced to group PROV-DM expressions. The first one, accountExpression, is itself an expression, whereas the second one, provenanceContainer, is not.

Account

In PROV-DM, an account expression is a wrapper of expressions with a dual purpose:

  • It is the mechanism by which attribution of provenance can be asserted; it allows asserters to bundle up their assertions, and assert suitable attribution;
  • It provides a scoping mechanism for expression identifiers and for some constraints (such as generation-unicity and derivation-use).

In PROV-ASN, an account expression's text matches the accountExpression production of the grammar defined in this specification document.

accountExpression := account ( identifier , asserter , { expression } )

An instance of an account expression, noted account(id, http://x.com/user1, exprs) in PROV-ASN:

  • contains an identifier id to identify this account;
  • contains an asserter identified by URI http://x.com/user1;
  • contains a set of provenance expressions exprs.

Currently, the non-terminal asserter is defined as URI. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM. We seek inputs on how to resolve this issue.

The following account expression

account(acc0,
        http://x.com/asserter, 
          entity(e0, [ type= "File", location= "/shared/crime.txt", creator= "Alice" ])
          ...
          wasDerivedFrom(e2,e1)
          ...
          processExecution(pe0,create-file,t)
          ...
          wasGeneratedBy(e0,pe0,outFile)     
          ...
          wasControlledBy(pe4,a5, qualifier(role=communicator))  )
contains the set of provenance expressions of section example-prov-asn-encoding, is asserted by agent http://x.com/asserter, and is identified by identifier acc0.

Account expressions constitute a scope for identifiers. An identifier within the scope of an account is intended to denote a single expression. However, nothing prevents an asserter from asserting an account containing, for example, multiple entity expressions with the same identifier but different attribute-values. In that case, they should be understood as a single entity expression with this identifier, with the union of all attribute-values, as formalized in identified-entity-in-account.

Given an identifier e, two sets of attribute-values av1 and av2, two entity expressions entity(e,av1) and entity(e,av2) occurring in an account are equivalent to the entity expression entity(e,av) where av is the set formed by the union of av1 and av2.

Whilst constraint identified-entity-in-account specifies how to understand multiple entity expressions with the same identifier within a given account, it does not guarantee that the entity expression formed with the union of all attributes is consistent. Indeed, a given attribute may be assigned multiple values, resulting in an inconsistent entity expression, as illustrated by the following example.

In the following account expression, we find two entity expressions with the same identifier e.

account(acc1,
        http://x.com/id,
          entity(e,[type="person",age=20])
          entity(e,[type="person",age=30])
          ...)
Application of identified-entity-in-account results in an entity expression containing the attributes age=20 and age=30. This results in an inconsistent characterization of a person. We note that deciding whether a set of attribute-values is consistent or not is application specific.
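
The application of identified-entity-in-account to this example can be sketched in Python. The encoding and the function names are assumptions made for illustration. Because consistency is application specific, the second function only reports attributes that carry more than one value and leaves the decision to the application.

```python
from collections import defaultdict


def merge_entities(expressions):
    """Merge entity expressions that share an identifier within one account.

    expressions: iterable of (identifier, [(attribute, value), ...]).
    Returns a dict mapping each identifier to the union of its attribute-values.
    """
    merged = defaultdict(set)
    for identifier, attribute_values in expressions:
        merged[identifier].update(attribute_values)
    return dict(merged)


def multi_valued_attributes(attribute_values):
    """Report attributes with several values (candidates for inconsistency)."""
    values = defaultdict(set)
    for attribute, value in attribute_values:
        values[attribute].add(value)
    return {a: v for a, v in values.items() if len(v) > 1}


acc1 = [
    ("e", [("type", "person"), ("age", 20)]),
    ("e", [("type", "person"), ("age", 30)]),
]
merged = merge_entities(acc1)
conflicts = multi_valued_attributes(merged["e"])
```

The duplicated pair type="person" collapses under the union, whereas age ends up with the two values 20 and 30.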

Account expressions can be nested since an account expression can occur among the expressions being wrapped by another account.

An account is said to be well-formed if it satisfies the constraints generation-unicity and derivation-use.

The union of two accounts is another account, containing the unions of their respective expressions, where expressions with a same identifier should be understood according to constraint identified-entity-in-account. Well-formed accounts are not closed under union because the constraint generation-unicity may no longer be satisfied in the resulting union.

Indeed, let us consider another account expression

account(acc2,
        http://x.com/asserter2, 
          entity(e0, [ type= "File", location= "/shared/crime.txt", creator= "Alice" ])
          ...
          processExecution(pe1,create-file,t1)
          ...
          wasGeneratedBy(e0,pe1,outFile)     
          ... )
with identifier acc2, containing assertions by asserter http://x.com/asserter2 stating that the thing represented by the entity expression identified by e0 was generated by an activity represented by the process execution expression identified by pe1 instead of pe0 in the previous account acc0. If accounts acc0 and acc2 are merged together, the resulting set of expressions violates generation-unicity.
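
The loss of well-formedness under union can be illustrated with a Python sketch. It assumes that generation-unicity means an entity is generated by at most one process execution within an account, and it reduces each generation expression to an (entity, process execution) pair; the function name is hypothetical.

```python
def generation_unicity_violations(generations):
    """Return the entities that have more than one generating process execution.

    generations: iterable of (entity_id, process_execution_id) pairs,
    one per wasGeneratedBy expression of an account.
    """
    generators = {}
    for e, pe in generations:
        generators.setdefault(e, set()).add(pe)
    return {e: pes for e, pes in generators.items() if len(pes) > 1}


acc0_generations = {("e0", "pe0")}
acc2_generations = {("e0", "pe1")}

separately_ok = (not generation_unicity_violations(acc0_generations)
                 and not generation_unicity_violations(acc2_generations))
union_violations = generation_unicity_violations(acc0_generations | acc2_generations)
```

Each account satisfies the constraint on its own, but their union reports e0 with the two generators pe0 and pe1.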

Account expressions constitute a scope for identifiers. Since accounts can be nested, their scope can also be nested; thus, the scope of identifiers should be understood in the context of such nested scopes. When an identifiable expression occurs directly within an account, then its identifier denotes this identifiable expression in the scope of this account, except in sub-accounts where expressions with the same identifier occur.

The following account expression is inspired from section example-prov-asn-encoding. This account, identified by acc3, declares entity expression identified by e0, which is being referred to in the nested account acc4. The scope of identifier e0 is account acc3, including subaccount acc4.

account(acc3,
        http://x.com/asserter1, 
          entity(e0, [ type= "File", location= "/shared/crime.txt", creator= "Alice" ])
          processExecution(pe0,create-file,t)
          wasGeneratedBy(e0,pe0,outFile)  
          account(acc4,
                 http://x.com/asserter2,
                 entity(e1, [ type= "File", location= "/shared/crime.txt", creator= "Alice", content="" ])
                 processExecution(pe0,copy-file,t)
                 wasGeneratedBy(e1,pe0,outFile)
                 wasComplementOf(e1,e0)))
Alternatively, a process execution expression identified by pe0 occurs in each of the two accounts. Therefore, each process execution expression is asserted in a separate scope, and therefore may represent different activities in the world.
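
The nested scoping rule can be sketched in Python as a lookup that walks from an account to its enclosing accounts. The Account class and its methods are hypothetical and only model identifier resolution; expressions are kept as opaque labels.

```python
class Account:
    """A scope for identifiers, optionally nested in a parent account."""

    def __init__(self, identifier, parent=None):
        self.identifier = identifier
        self.parent = parent
        self.expressions = {}  # identifier -> expression declared directly here

    def declare(self, identifier, expression):
        self.expressions[identifier] = expression

    def resolve(self, identifier):
        """Return (account id, expression) for the innermost declaration."""
        scope = self
        while scope is not None:
            if identifier in scope.expressions:
                return scope.identifier, scope.expressions[identifier]
            scope = scope.parent
        raise KeyError(identifier)


acc3 = Account("acc3")
acc3.declare("e0", "entity e0")
acc3.declare("pe0", "processExecution(pe0,create-file,t)")

acc4 = Account("acc4", parent=acc3)
acc4.declare("e1", "entity e1")
acc4.declare("pe0", "processExecution(pe0,copy-file,t)")
```

Resolving e0 from acc4 reaches the declaration in acc3, while pe0 resolves to a different expression in each account, since the inner declaration takes precedence over the outer one.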

The account expression is the hook by which further meta information can be expressed about provenance, such as asserter, time of creation, signatures. How general meta-information is expressed is beyond the scope of this specification, except for asserters.

Provenance Container

A provenance container is a house-keeping construct of PROV-DM, also capable of bundling PROV-DM expressions. A provenance container is not an expression, but can be exploited to return all the provenance assertions in response to a request for the provenance of something ([[PROV-PAQ]]).

In PROV-ASN, a provenance container's text matches the provenanceContainer production of the grammar defined in this specification document.

provenanceContainer := provenanceContainer ( { namespaceDeclaration } , { identifier } , { expression } )

An instance of a provenance container, noted provenanceContainer(decls, ids, exprs) in PROV-ASN:

  • contains a set of namespace declarations decls, declaring namespaces and associated prefixes, which can be used in attributes (conformant to production attribute) and in names (conformant to production name) in exprs;
  • contains a set of identifiers ids naming all accounts occurring (at any nesting level) in exprs;
  • contains one or more expressions exprs.

All the expressions in exprs are implicitly wrapped in a default account, scoping all the identifiers they declare directly, and constituting a toplevel account in the hierarchy of accounts.

The following container

provenanceContainer([x http://x.com],[acc1,acc2],
          account(acc1,http://x.com/asserter1,...)
          account(acc2,http://x.com/asserter1,...))
illustrates how two accounts with identifiers acc1 and acc2 can be returned in a PROV-ASN serialization of the provenance of something.

Asserter needs to be defined. This is ISSUE-51.
Scope and Identifiers. This is ISSUE-81.

Other Expressions

This section specifies the productions of sub-expressions of PROV-DM expressions.

Qualifier

A qualifier is an ordered list of name-value pairs, used to qualify use expressions, generation expressions and control expressions.

In PROV-ASN, a qualifier's text matches the qualifier production of the grammar defined in this specification document.

useQualifier := qualifier
generationQualifier := qualifier
controlQualifier := qualifier

qualifier := qualifier ( name-values )
name-values := name-value | name-value , name-values
name-value := name = Literal

Use, generation, and control expressions MUST contain a qualifier. A qualifier's sequence of name-value pairs MAY be empty.

The interpretation of a qualifier is specific to the process execution expression it occurs in, which means that a same qualifier may appear in two different process execution expressions with different interpretations. From this specification's viewpoint, a qualifier's interpretation is out of scope.

By definition, a use (resp. generation, control) expression does not contain an identifier. If one wants to annotate a use (resp. generation, control) expression, this expression MUST be identifiable from its constituents, i.e. its source's identifier, its destination's identifier, and its qualifier.

To be able to annotate use (resp. generation, control) expressions that refer to a given process execution identifier, any qualifier occurring in use expressions (resp. generation, control) with this identifier and a given entity expression identifier MUST be unique.

It may seem strange that we do not require use expressions to have an identifier. Mandating the presence of identifiers in use expressions would facilitate their annotation. However, this would make it difficult for use expressions to be encoded as properties in OWL2.

Qualifiers are used in determining the exact source and destination of a pe-linked-derivationExpression. Hence, if one wants to express a pe-linked-derivationExpression referring to an entity expression and a process execution expression,

  • the useQualifier MUST be unique among the qualifiers occurring in use expressions for this process execution expression;
  • the generationQualifier MUST be unique among the qualifiers occurring in generation expressions for this process execution expression.
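
The uniqueness requirement above can be checked mechanically. The following Python sketch assumes that a qualifier is encoded as a tuple of (name, value) pairs, which preserves the order of the pairs, and that use (or generation) expressions are given as triples; the function name duplicate_qualifiers is hypothetical.

```python
from collections import Counter


def duplicate_qualifiers(expressions):
    """Find qualifiers that occur more than once for a process execution.

    expressions: iterable of (pe, e, qualifier), all of the same kind
    (all use expressions, or all generation expressions).
    Returns the set of offending (pe, qualifier) pairs.
    """
    counts = Counter((pe, qualifier) for pe, _e, qualifier in expressions)
    return {key for key, n in counts.items() if n > 1}


uses = [
    ("pe4", "e3", (("channel", "in"),)),
    ("pe4", "e6", (("channel", "in"),)),   # same qualifier for the same pe4
    ("pe4", "e7", (("channel", "aux"),)),
    ("pe5", "e3", (("channel", "in"),)),   # different process execution: fine
]
duplicates = duplicate_qualifiers(uses)
```

With these use expressions, a derivation referring to pe4 with use qualifier channel=in could not identify a single source, since both e3 and e6 match.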

The PROV data model introduces a specific qualifier role to denote the function of a characterized thing with respect to an activity, in the context of a use/generation/control relation. The value associated with a role attribute MUST be conformant with Literal.

The following control expression qualifies the role of the agent identified by a5 in this control relation.

          wasControlledBy(pe4,a5, qualifier(role="communicator"))

Decide the level of requirements: MUST/SHOULD and justify. This is ISSUE-40 and ISSUE-41. The circumstances in which the requirement is of type MUST are now made explicit.

Attribute

An attribute is a finite sequence of characters. (Exact production TBD).

attribute := a qualified name

If a namespace prefix occurs in the qualified name, it refers to a namespace declared in the provenance container.

Name

A name is a finite sequence of characters. (Exact production TBD).

name := a qualified name

If a namespace prefix occurs in the qualified name, it refers to a namespace declared in the provenance container.
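
Since the exact production is still to be decided, the following is only a sketch of how a prefix could be resolved against the declarations of a provenance container. It assumes the prefix:localname form of OWL2 abbreviated IRIs; the expand function, the ex prefix, and the example namespace are hypothetical.

```python
def expand(qualified_name, namespaces):
    """Expand prefix:local into a full name using the container's namespace declarations."""
    prefix, sep, local = qualified_name.partition(":")
    if not sep:
        # No prefix occurs in the qualified name: return it unchanged.
        return qualified_name
    if prefix not in namespaces:
        raise KeyError(f"prefix {prefix!r} is not declared in the provenance container")
    return namespaces[prefix] + local


# Hypothetical declarations made in a provenance container.
declared = {"ex": "http://example.org/app#"}
full_name = expand("ex:version", declared)
```

An undeclared prefix is treated as an error here; the specification does not yet say how such a case should be handled.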

Proposed to adopt the abbreviatedIRI definition of OWL2 [[!OWL2-SYNTAX]] (see section IRIs).

Identifier

An identifier is a finite sequence of characters.

Do we require identifiers to be URIs? All the examples in this document so far use simple labels as identifiers. Would this be acceptable? Could they be understood as a default namespace and a local name?
identifier := ?????

Literal

Literals represent data values such as particular strings or integers.
It is proposed to use the definition of OWL2 Literal [[!OWL2-SYNTAX]] (see section Literals).

Time

Time instants are defined according to xsd:dateTime [[!XMLSCHEMA-2]].

It is OPTIONAL to assert time in use, generation, and process execution.

Is it appropriate to refer to ISO8601? Point in Time, Interval? This is ISSUE-58.

Temporal Events

Four kinds of discrete events underpin the PROV-DM data model. They are:
  1. Generation of an entity by a process execution: identifies the final instant of an entity's creation timespan, after which the characterized thing represented by an entity becomes available for use.
  2. Use of an entity by a process execution: identifies the first instant of an entity's consumption timespan.
  3. Start of a process execution: identifies the instant an activity represented by a process execution starts.
  4. End of a process execution: identifies the instant an activity represented by a process execution ends.

Event Ordering

Follows is a partial order between events, indicating that an event occurs after another. For convenience, precedes is defined as the inverse of follows.

This specification introduces inference rules allowing such event ordering to be inferred from provenance constructs.
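
The relationship between follows and precedes can be sketched as follows. This is a minimal illustration, not one of the inference rules of this specification: the event labels evA, evB, evC and the directly asserted pairs are hypothetical, and the ordering is simply closed under transitivity.

```python
def transitive_closure(pairs):
    """Close a set of (later, earlier) pairs under transitivity."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


def follows(closure, later, earlier):
    """True if the event 'later' is known to occur after the event 'earlier'."""
    return (later, earlier) in closure


def precedes(closure, earlier, later):
    """precedes is the inverse of follows: same pairs, arguments swapped."""
    return follows(closure, later, earlier)


# Hypothetical direct orderings: evB follows evA, and evC follows evB.
ordering = transitive_closure({("evB", "evA"), ("evC", "evB")})
c_follows_a = follows(ordering, "evC", "evA")
a_precedes_c = precedes(ordering, "evA", "evC")
```

Because the order is partial, two events may be unrelated, in which case both follows and precedes return False for that pair.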

Asserter

An asserter is a creator of PROV-DM expressions. An asserter is denoted by an IRI.

asserter := an IRI
Currently, the non-terminal asserter is defined as an IRI. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM. We seek inputs on how to resolve this issue.

Namespace Declaration

A namespace declaration declares a namespace and a prefix to denote it.

namespaceDeclaration := ... TBD

Location

Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, row, column, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations in assertions.

Location is an OPTIONAL attribute of entity expressions and process execution expressions. The value associated with the location attribute MUST be a Literal, expected to denote a location.

Collections

The purpose of this section is to enable modelling of part-of relationships amongst entities. In particular, a form of collection entity type is introduced, with relations for asserting
  • that a new entity has been added to the collection
  • that an entity has been removed from the collection
  • that an entity is a member of the collection
A collection expression is defined as follows.
collectionExpression := membershipExpression | entityRemovalExpression | entityAdditionExpression
membershipExpression := wasMemberOf ( entityExpression , entityExpression , position )
entityRemovalExpression := wasRemovedFrom ( entityExpression , entityExpression )
entityAdditionExpression := wasAddedTo ( entityExpression , entityExpression , position )
In the expression wasMemberOf(e1,e2,p):
  • entity e1 is a member of a collection;
  • entity e2 denotes the collection;
  • p denotes the position of e1 in e2. This argument is optional. The nature of the position depends on the specific collection type, and is left unspecified (in the case of a simple list, this is the position in the list)

Similarly, in expression wasRemovedFrom(e1,e2):

  • e1 is a former member of a collection, which is now no longer a member;
  • e2 denotes the collection.

Symmetrically, in expression wasAddedTo(e1,e2,p):
  • e1 is a new member of a collection, which was previously not a member;
  • e2 denotes the collection;
  • p denotes the position of e1 in e2 after insertion.

The following collection expressions:

wasAddedTo(e1,c1,p1)
wasMemberOf(e1,c1,p1)
wasMemberOf(e2,c1,p2)
wasMemberOf(c1,c2,p3)
wasRemovedFrom(e1,c1)
assert that:
  • e1 was added to collection c1 at position p1
  • e1 is a member of collection c1 at position p1
  • e2 is a member of collection c1 at position p2
  • c1 (itself a collection) is a member of collection c2 at position p3. This illustrates that collections can be nested within one another.
  • e1 was removed from collection c1
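
The collection expressions above can be represented with a small data structure, sketched below. It follows the argument lists used in the examples (member, collection, optional position). The CollectionExpression class, its field names, and the members_asserted helper are hypothetical; the helper only reports what is asserted and does not attempt to compute the state of a collection over time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectionExpression:
    # Hypothetical in-memory form of a collection expression.
    relation: str                   # "wasMemberOf", "wasAddedTo" or "wasRemovedFrom"
    member: str                     # identifier of the member entity
    collection: str                 # identifier of the collection entity
    position: Optional[str] = None  # optional; its nature depends on the collection type

    def to_asn(self):
        """Render the expression in the name(arg0, arg1, ...) style used in this document."""
        args = [self.member, self.collection]
        if self.position is not None:
            args.append(self.position)
        return f"{self.relation}({','.join(args)})"


def members_asserted(expressions, collection):
    """Identifiers asserted, via wasMemberOf, to be members of the given collection."""
    return sorted({x.member for x in expressions
                   if x.relation == "wasMemberOf" and x.collection == collection})


expressions = [
    CollectionExpression("wasAddedTo", "e1", "c1", "p1"),
    CollectionExpression("wasMemberOf", "e1", "c1", "p1"),
    CollectionExpression("wasMemberOf", "e2", "c1", "p2"),
    CollectionExpression("wasMemberOf", "c1", "c2", "p3"),
    CollectionExpression("wasRemovedFrom", "e1", "c1"),
]
rendered = [x.to_asn() for x in expressions]
```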

PROV-DM Extensibility Points

The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:

  • Attributes are constructs of the data model that allow representations of aspects of the world's things and activities to be expressed. Applications are free to introduce application-specific attributes, according to their perspective on the world. Attributes for a given application can be distinguished by qualifying them with a prefix denoting a namespace declared in a namespace declaration.

    The PROV DM namespace (TBD) declares a set of reserved attributes: type, location.

  • Annotation expressions allow arbitrary metadata to be associated with identifiable expressions of PROV-DM. Annotation expressions consist of name-value pairs. Like attributes, names are qualified by a namespace.
  • Use, generation, and control qualifiers offer a mechanism to describe modal aspects of use, generation, and control expressions. They consist of an ordered sequence of name-value pairs. Such names are also qualified by a namespace.

    The PROV DM namespace (TBD) declares a reserved qualifier: role.

  • Namespaces allow attributes and names to be qualified.
  • Domain specific values can be expressed by means of typed literals.

The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure interoperability, specializations of the PROV data model that exploit the extensibility points summarized in this section MUST preserve the semantics specified in this document. For instance, a qualified attribute on a domain specific entity expression MUST represent an aspect of a characterized thing and this aspect MUST remain unchanged during the characterization interval of this entity expression.

Shortcuts and extensions

There are a number of commonly used provenance relations in particular for the web that are not in the model. For practical use and uptake, it would be good to have definitions of these in the provenance model. These concepts should be defined in terms of the already existing "core" concepts. This is ISSUE-44.

Quotation

Quotation represents the repeating or copying of some part of a characterized thing.

An assertion wasQuoteOf, noted wasQuoteOf(e2,e1,ag,ag2):

  • refers to an entity e2, denoting the quote;
  • refers to an entity e1, denoting the entity being quoted;
  • MAY refer to an agent ag, who is doing the quoting;
  • MAY refer to an agent ag2, who is being quoted.

wasQuoteOf is a sub-relation of wasRevisionOf

Attribution

Attribution represents that a characterized thing is ascribed to an agent.

An assertion wasAttributedTo, noted wasAttributedTo(e1,ag):

  • refers to an entity e1, denoting the entity;
  • refers to an agent ag, to whom the entity is attributed.

wasQuoteOf is a strict sub-relation of wasEventuallyDerivedFrom

Summary

Represents a characterized thing that is a synopsis or abbreviation of another entity.

An assertion wasSummaryOf, noted wasSummaryOf(e2,e1):

  • refers to an entity e2, denoting the summary;
  • refers to an entity e1, denoting the entity being summarized.

wasSummaryOf is a strict sub-relation of wasEventuallyDerivedFrom

Original Source

Represents a characterized thing in which another characterized thing first appeared.

An assertion hasOriginalSource, noted hasOriginalSource(e2,e1):

  • refers to an entity e2, denoting the entity that first appeared;
  • refers to an entity e1, denoting the entity where that entity first appeared.

hasOriginalSource is a strict sub-relation of wasEventuallyDerivedFrom

Illustration and Notation Conventions

In this section, we summarize the conventions adopted for the graphical illustration and the abstract syntax notation appearing in this specification.

Should we formalize the graphical illustration and abstract syntax notation? Where? Should they become normative?

Illustration Conventions

  • The graphical illustration aims to illustrate the provenance model. It is not intended to represent all the details of the model, and therefore cannot be seen as an alternate notation for expressing provenance.
  • The graphical illustration is a graph.
  • Entities, process executions and agents are represented as nodes, with oval, rectangular, and half-hexagonal shapes, respectively.
  • Use, Generation, Derivation, IVPof are represented as directed edges.
  • Entities are laid out according to temporal order (the temporal event at which they are generated). Time SHOULD progress from left to right or from top to bottom. This means that edges for Use, Generation and Derivation typically point from right to left or from bottom to top.

PROV-ASN Conventions

This appendix is now obsolete. It should be deleted, making sure all information has been transcribed where appropriate, and all cross-references deleted.
  • Constructs are expressed as name(arg0, arg1, ...), where the name of the construct occurs first, and is followed by its arguments.
  • For use, generation, and derivation events, the first argument is the 'effect' (i.e. most recent item) and the second argument is the 'cause' (i.e. least recent item). This order is compatible with the temporal layout of the graphical notation.
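
The name(arg0, arg1, ...) convention can be read mechanically. The sketch below splits a construct into its name and top-level arguments, leaving nested constructs such as qualifier(...) intact. It is an illustration only, not a normative PROV-ASN parser: parse_construct is a hypothetical helper, and it assumes balanced parentheses and double-quoted strings.

```python
def parse_construct(text):
    """Split 'name(arg0, arg1, ...)' into (name, [arg0, arg1, ...])."""
    text = text.strip()
    if "(" not in text or not text.endswith(")"):
        raise ValueError("expected a construct of the form name(arg0, arg1, ...)")
    open_idx = text.index("(")
    name = text[:open_idx].strip()
    body = text[open_idx + 1:-1]

    args, current, depth, in_string = [], [], 0, False
    for ch in body:
        if ch == '"':
            in_string = not in_string
        if not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # A comma outside any nested construct separates two arguments.
                args.append("".join(current).strip())
                current = []
                continue
        current.append(ch)

    tail = "".join(current).strip()
    if tail or args:
        args.append(tail)
    return name, args


name, args = parse_construct('wasControlledBy(pe4,a5, qualifier(role="communicator"))')
```

With the convention above, args[0] is the 'effect' and args[1] the 'cause' for use, generation, and derivation constructs.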

Acknowledgements

WG membership to be listed here.