This document defines a data model for Provenance.

This is a document for internal discussion, which will ultimately evolve into the first Public Working Draft of the Conceptual Model.

Introduction

To be written

Motivation and Requirements

Example

A File Scenario

Let us consider a shared file system in which journalists Alice, Bob, Charles, David, and Edith can share and edit a crime statistics file.

Time t: Alice creates an empty file in /share/crime.txt

Time t+1: Bob appends the following line to /share/crime.txt:

There was a lot of crime in London last month.

Time t+2: Charles emails the contents of /share/crime.txt

cat /share/crime.txt | sendmail ...

Time t+3: David edits file /share/crime.txt as follows.

There was a lot of crime in London and New York last month.

Time t+4: Edith emails the contents of /share/crime.txt

cat /share/crime.txt | sendmail ...

Encoding in PIL

BOBs:

bob(e0, [ type: "File", location: "/shared/crime.txt", creator: "Alice" ])
bob(e1, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "" ])
bob(e2, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "There was a lot of crime in London last month."])
bob(e3, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "There was a lot of crime in London and New York last month."])
bob(e4)
bob(e5)

e0: holds in interval [t,t+4[
e1: holds in interval [t,t+1[
e2: holds in interval [t+1,t+3[
e3: holds in interval [t+3,t+4[
e4: the information piped to sendmail at t+2 (that's a copy of e2's content)
e5: the information piped to sendmail at t+4 (that's a copy of e3's content)

Derivations:

isDerivedFrom(e2,e1)
isDerivedFrom(e3,e2)
isDerivedFrom(e4,e2)
isDerivedFrom(e5,e3)

Generations:

isGeneratedBy(e2,pe1,out)     
isGeneratedBy(e3,pe3,out)     
isGeneratedBy(e4,pe2,out)     
isGeneratedBy(e5,pe4,out)     

Use:

use(pe1,e1,in)
use(pe3,e2,in)
use(pe2,e2,in)
use(pe4,e3,in)

Process Executions:

processExecution(pe1,add-crime-in-london,t+1)
processExecution(pe2,copy,t+2)
processExecution(pe3,edit-London-New-York,t+3)
processExecution(pe4,copy,t+4)

IVP of:

IVPof(e1,e0)
IVPof(e2,e0)
IVPof(e3,e0)

Agents:

bob(ag_al, [ type: "Person", name: "Alice" ])
agent(ag_al)

bob(ag_bo, [ type: "Person", name: "Bob" ])
agent(ag_bo)

bob(ag_ch, [ type: "Person", name: "Charles" ])
agent(ag_ch)

bob(ag_da, [ type: "Person", name: "David" ])
agent(ag_da)

bob(ag_ed, [ type: "Person", name: "Edith" ])
agent(ag_ed)

Control:

isControlledBy(pe1,ag_bo,"author")
isControlledBy(pe2,ag_ch, "communicator")
isControlledBy(pe3,ag_da,"author")
isControlledBy(pe4,ag_ed, "communicator")
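The derivation assertions of this encoding can be read as edges of a graph. The following Python sketch is illustrative only: it copies the four isDerivedFrom assertions above into a list and walks them back from a given BOB. The helper name derivation_chain is hypothetical, and the traversal is a plain graph walk; this document does not state that derivation is transitive.

```python
# Asserted derivations from the file scenario, as (derived, source) pairs.
DERIVATIONS = [("e2", "e1"), ("e3", "e2"), ("e4", "e2"), ("e5", "e3")]


def derivation_chain(bob_id, derivations):
    """Follow isDerivedFrom edges back from bob_id (hypothetical helper).

    Assumes each BOB has at most one asserted source, which holds for
    this scenario but is not required by the model.
    """
    source_of = dict(derivations)
    chain = [bob_id]
    while chain[-1] in source_of:
        chain.append(source_of[chain[-1]])
    return chain


# The content emailed by Edith (e5) traces back to the empty file (e1).
print(derivation_chain("e5", DERIVATIONS))
```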

Graphical Illustration

Provenance assertions as graph:

About the Provenance Language

In the world (whether real or not), there are entities, which can be physical, digital, conceptual, or otherwise, and activities involving entities. Words such as entity or activity should be understood with their informal meaning.

Furthermore, this specification is concerned with characterized entities, that is, entities and their situation in the world, as perceived by their asserters.

In the rest of the document, we are concerned with the representation of such entities; their situation in the world will be represented using sets of attributes.

Example: a file at some point during its lifecycle, which includes multiple edits by multiple people, can be represented by its location in the file system, a creator, and content.

PIL is a language by which representations of the world can be expressed using terms that are drawn from a controlled vocabulary. These representations are relative to an asserter, and in that sense constitute assertions about the world. Different asserters will normally contribute different representations, and no attempt is made to define a notion of consistency of such different sets of assertions. The language provides the means to associate attribution to assertions.

All assertions in PIL SHOULD be interpreted as a record of what has happened, as opposed to what may or will happen.

This specification does not prescribe the means by which assertions are made, for example on the basis of observations, inferences, or any other means.

The language introduces a notion of "provenance container", which provides a default scope for assertions. The model may define additional scoping rules for assertions. Identifiers can safely be used within that scope. Optionally, identifiers can be exported so that they can be used outside their default scope. The language does not prescribe the mechanisms by which identifiers are generated.

In this specification, when an assertion is defined to refer to another assertion about something, it does so by means of that thing's identifier.

Sometimes, inferences about the world can be made from assertions of the provenance data model. When this is the case, this specification defines such inferences.

Provenance Data Model

The language defines the following types of constructs.

BOB

A BOB represents an identifiable characterized entity.

A BOB assertion is about a characterized entity, whose situation in the world is variant. A BOB assertion is made at a particular point and is invariant, in the sense that all the attributes are assigned a value as part of that assertion.

A BOB assertion, noted bob(id, [ attr: val, ...]):

bob(e0, [ type: "File", location: "/shared/crime.txt", creator: "Alice" ])

A BOB assertion MUST describe a characterized entity over a continuous time interval in the world (which may collapse into a single instant). Characterizing an entity over multiple time intervals requires multiple BOB assertions, each with its own identifier. Some attributes may retain their values across multiple assertions.

There is no assumption that the set of attributes is complete, nor that the attributes are independent of (orthogonal to) each other.
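As an illustration of BOB assertions that share some attribute values, the following Python sketch models two of the assertions from the file scenario. The Bob class and the retained_attributes helper are hypothetical names introduced here, not part of the model; time intervals are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Bob:
    """One BOB assertion: an identifier plus a set of attribute values."""
    id: str
    attributes: dict = field(default_factory=dict)


def retained_attributes(a, b):
    """Attributes that carry the same value in both assertions."""
    return {
        name: value
        for name, value in a.attributes.items()
        if name in b.attributes and b.attributes[name] == value
    }


e1 = Bob("e1", {"type": "File", "location": "/shared/crime.txt",
                "creator": "Alice", "content": ""})
e2 = Bob("e2", {"type": "File", "location": "/shared/crime.txt",
                "creator": "Alice",
                "content": "There was a lot of crime in London last month."})

# type, location and creator are retained; content differs.
print(retained_attributes(e1, e2))
```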

Process Execution

A process execution represents an identifiable activity, which performs a piece of work.

The activity that a process execution represents has a duration, delimited by its start and its end; hence, it occurs over a continuous time interval. However, the process execution representing the activity need not mention time information, nor duration, because they may not be known.

A process execution assertion, noted processExecution(id,rl,st,et):

processExecution(pe1,add-crime-in-london,t+1,t+1+epsilon)

From the assertion of a process execution, one can infer that the start precedes the end of the represented activity.
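The inference above can be sketched as a consistency check. This Python fragment is a minimal illustration under stated assumptions: times are plain numbers rather than ISO 8601 values, the comparison is non-strict (so an instantaneous activity is accepted), and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessExecution:
    """Mirrors processExecution(id,rl,st,et); st and et may be unknown."""
    id: str
    rl: str  # named as in the notation; its meaning is not detailed here
    st: Optional[float] = None
    et: Optional[float] = None

    def is_consistent(self):
        """True unless both times are known and the end precedes the start."""
        if self.st is None or self.et is None:
            return True
        return self.st <= self.et


pe1 = ProcessExecution("pe1", "add-crime-in-london", 1.0, 1.1)
print(pe1.is_consistent())
```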

Generation

Generation represents the creation of a new characterized entity by an activity. This characterized entity did not exist before creation.

A Generation assertion, noted isGeneratedBy(b,pe,r,t):

isGeneratedBy(e2,pe1,out)     

A given BOB can be generated by at most one process execution.

Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), the activity denoted by pe and the entities used by pe determine values of some of x's attributes.

Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), one can infer that the generation of the entity denoted by x precedes the end of pe and follows the beginning of pe.
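The constraint that a BOB is generated by at most one process execution can be checked mechanically. The following Python sketch is one possible check, not a prescribed algorithm; the tuple layout (bob, pe, role) and the function name are assumptions made for the example.

```python
from collections import defaultdict

# Generation assertions from the file scenario, as (bob, pe, role) tuples.
GENERATIONS = [("e2", "pe1", "out"), ("e3", "pe3", "out"),
               ("e4", "pe2", "out"), ("e5", "pe4", "out")]


def multiply_generated(generations):
    """Return {bob: set of pes} for BOBs generated by more than one pe."""
    generators = defaultdict(set)
    for bob, pe, _role in generations:
        generators[bob].add(pe)
    return {bob: pes for bob, pes in generators.items() if len(pes) > 1}


print(multiply_generated(GENERATIONS))  # no violations in the scenario
```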

Use

Use represents the consumption of a characterized entity by an activity.

A Use assertion, noted use(pe,b,r,t):

use(pe1,e1,in,t)

A reference to a given BOB may appear in multiple use assertions that refer to a given process execution, but each of those use assertions MUST have a distinct role.

Given an assertion use(pe,x,r) or use(pe,x,r,t), at least one value of x's attributes is a pre-condition for the activity denoted by pe to terminate.

Given an assertion use(pe,x,r) or use(pe,x,r,t), one can infer that the use of the entity denoted by x precedes the end of pe and follows the beginning of pe. Furthermore, one can infer that the generation of the entity denoted by x always precedes its use.
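The temporal inferences attached to a use assertion can be written as a small check. This Python sketch assumes numeric times and non-strict comparisons, neither of which is mandated by this document; the function name is hypothetical.

```python
def use_is_consistent(pe_start, pe_end, use_time, generation_time=None):
    """Check the orderings inferred from a use assertion.

    The use must fall within the process execution, and, when the
    generation time of the used entity is known, must not precede it.
    """
    within_execution = pe_start <= use_time <= pe_end
    after_generation = generation_time is None or generation_time <= use_time
    return within_execution and after_generation


# Hypothetical times: an execution over [3, 3.5] uses, at time 3,
# an entity that was generated at time 1.
print(use_is_consistent(3, 3.5, 3, generation_time=1))
```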

Derivation

Derivation expresses that some characterized entity is transformed from, created from, or affected by another characterized entity.

A Derivation assertion, noted isDerivedFrom(b1,b2):

isDerivedFrom(e5,e3)

From an assertion isDerivedFrom(B,A), one can infer that the values of some characteristics of B are at least partially determined by the values of some characteristics of A.

Given an assertion isDerivedFrom(B,A), one can infer that the use of the characterized entity denoted by A precedes the generation of the characterized entity denoted by B.

Agent

An agent represents a characterized entity capable of activity.

An agent assertion, noted agent(b):

A characterized entity can be asserted to be an agent or can be inferred to be an agent by involvement in a process execution.

bob(alice, [ Employee: "1234" ]) and agent(alice)

bob(david) and isControlledBy(pe,david)
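The two ways of establishing that a characterized entity is an agent can be sketched as follows. This Python fragment is illustrative only; the function name and the tuple layout for control assertions are assumptions.

```python
def known_agents(agent_assertions, control_assertions):
    """Union of asserted agents and agents inferred from isControlledBy.

    control_assertions holds tuples whose second element is the
    controlling entity, e.g. ("pe", "david") or ("pe3", "david", "author").
    """
    agents = set(agent_assertions)
    agents.update(assertion[1] for assertion in control_assertions)
    return agents


# alice is asserted to be an agent; david is inferred to be one.
print(sorted(known_agents({"alice"}, [("pe", "david")])))
```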

Control

Control represents the involvement of an agent or a BOB in a process execution; a role qualifies this involvement.

A Control assertion, noted isControlledBy(pe,ag,r):

isControlledBy(pe3,david,"author")

IVP of

We propose to replace the relation "IVP of" with "complement of". The new term is used in the text below to "test" and see how it fits...

IVP of is a relationship between two characterized entities asserted to have compatible characterizations over some continuous time interval.

The rationale for introducing this relationship is that in general, at any given time, there will be multiple representations of a characterized entity, which are reflected in assertions possibly made by different asserters. In the example that follows, suppose entity "Royal Society" is represented by two asserters, each using a different set of attributes. If the asserters agree that both representations refer to "The Royal Society", the question of whether any correspondence can be established between the two representations arises naturally. This is particularly relevant when (a) the sets of properties used by the two representations overlap partially, or (b) one set is subsumed by the other. In both cases, each of the two asserters has a partial view of "The Royal Society", and establishing a correspondence between them on the shared properties is beneficial: in case (a) each of the two representations complements the other, and in case (b) one of the two (the one with the additional properties) complements the other.

This intuition is made more precise by considering the BOBs that embody the representation of a characterized entity at a certain point in time. A BOB, as defined above, exists only as long as none of its attributes changes value. As soon as one attribute, say X, changes value, say from v1 to v2, the BOB no longer exists and is replaced by a new one in which X=v2. Thus, if we overlap the timelines (or, more generally, the sequences of value-changing events) for the two characterized entities, we can hope to establish correspondences amongst the BOBs that represent them at various points along that event line. Fig. TBD-fig3. illustrates this intuition.

Relation complement-of between two BOBs is intended to capture these correspondences, as follows. Suppose BOBs A and B share a set P of properties, and each of them has other properties in addition to P. If the values assigned to each property in P are compatible between A and B, then we say that A is-complement-of B, and B is-complement-of A, in a symmetrical fashion. In the particular case where the set of properties of B is a strict superset of A's properties, we say that B is-complement-of A, but the opposite does not hold; in this case, the relation is not symmetric. As a special case, A and B may not share any attributes at all, and yet the asserters may still stipulate that they are representing the same entity "Royal Society"; the symmetric relation may hold trivially in this case.

The term compatible used above means that a mapping can be established amongst the values of the attributes in P found in the two BOBs. This generalizes to the case where attribute sets P1 and P2 of A and B, respectively, are not identical but can be mapped to one another. The simplest case is the identity mapping, in which A and B share attribute set P, and furthermore the values assigned to attributes in P match exactly.

It is important to note that the relation holds only as long as the BOBs involved are valid. As soon as one attribute changes value in one of them, new correspondences need to be found amongst the new BOBs. Thus, the relation has a validity span that can be expressed in terms of the event lines of the entity.

An IVP assertion is denoted IVPof(B,A), where A and B are two BOBs.

bob(rs,[created: "1870"])

bob(rs_l1,[location: "loc2"])
bob(rs_l2,[location: "The Mall"])

bob(rs_m1,[membership: "250", year: "1900"])
bob(rs_m2,[membership: "300", year: "1945"])
bob(rs_m3,[membership: "270",  year: "2010"])

IVPof(rs_m3, rs_l2)
IVPof(rs_m2, rs_l1)
IVPof(rs_m2, rs_l2)
IVPof(rs_m1, rs_l1)

IVPof(rs_l1, rs)
IVPof(rs_l2, rs)

An assertion "B is an IVP of A" holds over the temporal intersection of A and B, only if:

  1. if a mapping can be established from an attribute X of B to an attribute Y of A, then the values of A and B must be consistent with that mapping
  2. B has some attribute that A does not have
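The two conditions above can be sketched for the simplest case, the identity mapping, in which attributes with the same name must carry the same value. This Python fragment is a rough illustration: the function name is hypothetical, attribute sets are plain dictionaries, and the temporal intersection of A and B is not modelled.

```python
def satisfies_ivp_conditions(b_attrs, a_attrs):
    """Necessary conditions for "B is an IVP of A" under identity mapping.

    1. Attributes present in both must have equal values.
    2. B must have at least one attribute that A does not have.
    """
    shared = set(b_attrs) & set(a_attrs)
    values_consistent = all(b_attrs[name] == a_attrs[name] for name in shared)
    b_has_extra = bool(set(b_attrs) - set(a_attrs))
    return values_consistent and b_has_extra


e0 = {"type": "File", "location": "/shared/crime.txt", "creator": "Alice"}
e2 = dict(e0, content="There was a lot of crime in London last month.")

print(satisfies_ivp_conditions(e2, e0))  # e2 adds content to e0's attributes
print(satisfies_ivp_conditions(e0, e2))  # e0 has nothing that e2 lacks
```

Because both conditions are stated as "only if", passing this check does not by itself establish the relation; it only rules out candidate pairs.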

Time

Time is defined according to ISO 8601.

It is OPTIONAL to assert time in use, generation, and process execution.
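As an illustration of ISO 8601 time values, the following Python sketch parses two timestamps and compares them. The timestamps are invented for the example, and Python's datetime.fromisoformat accepts only a subset of ISO 8601, so this is not a complete treatment of the standard.

```python
from datetime import datetime


def parse_time(text):
    """Parse an ISO 8601 date-time string with an explicit UTC offset."""
    return datetime.fromisoformat(text)


# Hypothetical start and end times of a process execution.
start = parse_time("2011-06-01T10:00:00+00:00")
end = parse_time("2011-06-01T10:00:01+00:00")

print(start < end)
```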

Role

A role is a label that names the function assumed by a BOB or an agent with respect to a specific process execution.

Use, Generation, and Control assertions MUST contain a role.

The set of all Use (resp. Generation, Control) assertions that refer to a given process execution MUST contain at most one occurrence of a given role.
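The uniqueness constraint on roles can be checked per kind of assertion. The Python sketch below handles one kind at a time (here, Use); the tuple layout (pe, bob, role) and the function name are assumptions made for the example.

```python
from collections import Counter

# Use assertions from the file scenario, as (pe, bob, role) tuples.
USES = [("pe1", "e1", "in"), ("pe3", "e2", "in"),
        ("pe2", "e2", "in"), ("pe4", "e3", "in")]


def duplicate_roles(assertions):
    """Return the (pe, role) pairs that occur more than once."""
    counts = Counter((pe, role) for pe, _other, role in assertions)
    return sorted(pair for pair, n in counts.items() if n > 1)


print(duplicate_roles(USES))  # the scenario has no duplicated role
```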

The interpretation of a role is specific to the process execution it relates to, which means that the same role may appear in relation to two different process executions with different interpretations. From this specification's viewpoint, a role's interpretation is out of scope.

Location

Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be specified, such as by a coordinate, address, landmark, row, column, and so forth.

Location is an OPTIONAL characteristic of BOB, process execution, and agent.

Ordering of Processes

Various relationships of a temporal nature exist between process executions. Control ordering means that the end of a process execution precedes the start of another process execution by the same agent. Information flow means that a characterized entity was generated by a process execution before it was used by another process execution.
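Information flow, as described above, can be sketched by joining generation and use assertions on the BOB they share. This Python fragment uses the assertions of the file scenario; the function name and the (consumer, producer) ordering of the result are assumptions, since the argument order of isInformedBy is not defined in this section.

```python
GENERATIONS = [("e2", "pe1"), ("e3", "pe3"), ("e4", "pe2"), ("e5", "pe4")]
USES = [("pe1", "e1"), ("pe3", "e2"), ("pe2", "e2"), ("pe4", "e3")]


def information_flows(generations, uses):
    """Pairs (consumer, producer): consumer used a BOB made by producer."""
    generated_by = dict(generations)  # bob -> generating pe
    return sorted(
        (pe, generated_by[bob])
        for pe, bob in uses
        if bob in generated_by and generated_by[bob] != pe
    )


print(information_flows(GENERATIONS, USES))
```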

An assertion isScheduledAfter:

An assertion isInformedBy:

Revision

Revision represents the creation of a characterized entity considered to be a variant of another.

An assertion isRevisionOf, noted isRevisionOf(b2,b1,ag):

From an assertion isRevisionOf(new,old,ag), one can infer that:

Participation

Provenance Container

It should be possible for asserters to annotate the container with a description of the justification for the assertions it contains, as well as additional meta-information, such as authorship of the assertions.

View Or Account

An account is a set of assertions, forming a perspective on the world.

Collection

Acknowledgements

WG membership to be listed here.