This document defines a data model for Provenance.
This is a document for internal discussion, which will ultimately evolve into the first Public Working Draft of the Conceptual Model.
To be written
Let us consider a shared file system in which journalists Alice, Bob, Charles, David, and Edith can share and edit a crime statistics file.
Time t: Alice creates an empty file in /share/crime.txt
Time t+1: Bob appends the following line to /share/crime.txt:
There was a lot of crime in London last month.
Time t+2: Charles emails the contents of /share/crime.txt
cat /share/crime.txt | sendmail ...
Time t+3: David edits file /share/crime.txt as follows.
There was a lot of crime in London and New York last month.
Time t+4: Edith emails the contents of /share/crime.txt
cat /share/crime.txt | sendmail ...
BOBs:
bob(e0, [ type: "File", location: "/shared/crime.txt", creator: "Alice" ])
bob(e1, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "" ])
bob(e2, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "There was a lot of crime in London last month."])
bob(e3, [ type: "File", location: "/shared/crime.txt", creator: "Alice", content: "There was a lot of crime in London and New York last month."])
bob(e4)
bob(e5)
e0: holds in interval [t,t+4[
e1: holds in interval [t,t+1[
e2: holds in interval [t+1,t+3[
e3: holds in interval [t+3,t+4[
e4: the information piped to sendmail at t+2 (that's a copy of e2's content)
e5: the information piped to sendmail at t+4 (that's a copy of e3's content)
Derivations:
isDerivedFrom(e2,e1)
isDerivedFrom(e3,e2)
isDerivedFrom(e4,e2)
isDerivedFrom(e5,e3)
Generations:
isGeneratedBy(e2,pe1,out)
isGeneratedBy(e3,pe3,out)
isGeneratedBy(e4,pe2,out)
isGeneratedBy(e5,pe4,out)
Use:
use(pe1,e1,in)
use(pe3,e2,in)
use(pe2,e2,in)
use(pe4,e3,in)
Process Executions:
processExecution(pe1,add-crime-in-london,t+1)
processExecution(pe2,copy,t+2)
processExecution(pe3,edit-London-New-York,t+3)
processExecution(pe4,copy,t+4)
IVP of:
IVPof(e1,e0)
IVPof(e2,e0)
IVPof(e3,e0)
Agents:
bob(ag_al, [ type: "Person", name: "Alice" ])
agent(ag_al)
bob(ag_bo, [ type: "Person", name: "Bob" ])
agent(ag_bo)
bob(ag_ch, [ type: "Person", name: "Charles" ])
agent(ag_ch)
bob(ag_da, [ type: "Person", name: "David" ])
agent(ag_da)
bob(ag_ed, [ type: "Person", name: "Edith" ])
agent(ag_ed)
Control:
isControlledBy(pe1,ag_bo,"author")
isControlledBy(pe2,ag_ch,"communicator")
isControlledBy(pe3,ag_da,"author")
isControlledBy(pe4,ag_ed,"communicator")
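As an informal illustration, the derivation assertions of this example can be held as plain pairs and walked to list everything a given BOB ultimately depends on. This is only a sketch: the `ancestors` helper is hypothetical, and following isDerivedFrom transitively is an assumption of the sketch rather than something the model states.

```python
# Derivation assertions from the file example, as (derived, source) pairs.
DERIVED_FROM = [("e2", "e1"), ("e3", "e2"), ("e4", "e2"), ("e5", "e3")]


def ancestors(bob_id, derivations=DERIVED_FROM):
    """Return every BOB reachable from bob_id by following isDerivedFrom links.

    Assumption: derivation links may be chained; the model does not say so.
    """
    found = []
    frontier = [bob_id]
    while frontier:
        current = frontier.pop()
        for derived, source in derivations:
            if derived == current and source not in found:
                found.append(source)
                frontier.append(source)
    return found


print(ancestors("e5"))  # ['e3', 'e2', 'e1']
```

Under that assumption, the content Edith emailed (e5) traces back through David's edit (e3) and Bob's line (e2) to the empty file (e1).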
In the world (whether real or not), there are entities, which can be physical, digital, conceptual, or otherwise, and activities involving entities. Words such as entity or activity should be understood with their informal meaning.
Furthermore, this specification is concerned with characterized entities, that is, entities and their situation in the world, as perceived by their asserters.
In the rest of the document, we are concerned with the representation of such entities; their situation in the world will be represented using sets of attributes.
PIL is a language by which representations of the world can be expressed using terms that are drawn from a controlled vocabulary. These representations are relative to an asserter, and in that sense constitute assertions about the world. Different asserters will normally contribute different representations, and no attempt is made to define a notion of consistency of such different sets of assertions. The language provides the means to associate attribution with assertions.
All assertions in PIL SHOULD be interpreted as a record of what has happened, as opposed to what may or will happen.
This specification does not prescribe the means by which assertions are made, for example on the basis of observations, inferences, or any other means.
The language introduces a notion of "provenance container", which provides a default scope for assertions. The model may define additional scoping rules for assertions. Identifiers can safely be used within that scope. Optionally, identifiers can be exported so that they can be used outside their default scope. The language does not prescribe the mechanisms by which identifiers are generated.
In this specification, when an assertion is defined to refer to another assertion about something, it does so by means of that thing's identifier.
Sometimes, inferences about the world can be made from assertions of the provenance data model. When this is the case, this specification defines such inferences.
The language defines the following types of constructs.
A BOB represents an identifiable characterized entity.
A BOB assertion is about a characterized entity, whose situation in the world is variant. A BOB assertion is made at a particular point and is invariant, in the sense that all the attributes are assigned a value as part of that assertion.
A BOB assertion, noted bob(id, [ attr: val, ...]):
bob(e0, [ type: "File", location: "/shared/crime.txt", creator: "Alice" ])
A BOB assertion MUST describe a characterized entity over a continuous time interval in the world (which may collapse into a single instant). Characterizing an entity over multiple time intervals requires multiple BOB assertions, each with its own identifier. Some attributes may retain their values across multiple assertions.
There is no assumption that the set of attributes is complete, nor that the attributes are independent of (orthogonal to) each other.
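A BOB assertion can be pictured as an identifier paired with a fixed set of attribute-value pairs. The following is a minimal sketch, not a normative encoding: the `Bob` class and `make_bob` helper are hypothetical names, and the read-only mapping is just one way to mirror the invariance of an assertion.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Bob:
    """One BOB assertion: an identifier and its attribute values."""
    identifier: str
    attributes: Mapping[str, str]


def make_bob(identifier, **attributes):
    # A read-only view: once asserted, the attribute values cannot change.
    return Bob(identifier, MappingProxyType(dict(attributes)))


e0 = make_bob("e0", type="File", location="/shared/crime.txt", creator="Alice")
e1 = make_bob("e1", type="File", location="/shared/crime.txt", creator="Alice",
              content="")

# Attributes that retain their value across the two assertions.
shared = {k: v for k, v in e0.attributes.items() if e1.attributes.get(k) == v}
print(sorted(shared))  # ['creator', 'location', 'type']
```

Characterizing the file over a different interval means creating a new `Bob` with its own identifier, rather than mutating an existing one.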
A process execution represents an identifiable activity, which performs a piece of work.
The activity that a process execution represents has a duration, delimited by its start and its end; hence, it occurs over a continuous time interval. However, the process execution representing the activity need not mention time information, nor duration, because they may not be known.
A process execution assertion, noted processExecution(id,rl,st,et):
processExecution(pe1,add-crime-in-london,t+1,t+1+epsilon)
From the assertion of a process execution, one can infer that the start precedes the end of the represented activity.
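The inference above can be sketched as a consistency check on a process execution whose times are optional. The `ProcessExecution` class is hypothetical, times are plain numbers with t taken as 0 and epsilon as 0.5 purely for illustration, and reading "precedes" as "not later than" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessExecution:
    identifier: str
    recipe_link: Optional[str] = None
    start: Optional[float] = None  # may be unknown
    end: Optional[float] = None    # may be unknown

    def is_time_consistent(self) -> bool:
        """True unless both times are known and the end comes before the start."""
        if self.start is None or self.end is None:
            return True
        return self.start <= self.end


# processExecution(pe1,add-crime-in-london,t+1,t+1+epsilon) with t = 0, epsilon = 0.5
pe1 = ProcessExecution("pe1", "add-crime-in-london", start=1.0, end=1.5)
# A process execution asserted without any time information.
pe2 = ProcessExecution("pe2", "copy")

print(pe1.is_time_consistent(), pe2.is_time_consistent())  # True True
```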
Generation represents the creation of a new characterized entity by an activity. This characterized entity did not exist before creation.
A Generation assertion, noted isGeneratedBy(b,pe,r,t):
isGeneratedBy(e2,pe1,out)
A given BOB can be generated by at most one process execution.
Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), the activity denoted by pe and the entities used by pe determine values of some of x's attributes.
Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), one can infer that the generation of the entity denoted by x precedes the end of pe and follows the beginning of pe.
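The constraint that a BOB is generated by at most one process execution lends itself to a simple check over a set of generation assertions. The sketch below assumes assertions are held as (bob, process execution, role) tuples; the `multiply_generated` helper is hypothetical.

```python
from collections import defaultdict

# Generation assertions from the file example, as (bob, pe, role) tuples.
GENERATIONS = [
    ("e2", "pe1", "out"),
    ("e3", "pe3", "out"),
    ("e4", "pe2", "out"),
    ("e5", "pe4", "out"),
]


def multiply_generated(generations):
    """Map each BOB generated by more than one process execution to those executions."""
    producers = defaultdict(set)
    for bob, pe, _role in generations:
        producers[bob].add(pe)
    return {bob: sorted(pes) for bob, pes in producers.items() if len(pes) > 1}


print(multiply_generated(GENERATIONS))  # {}
```

An empty result means the assertions respect the constraint; any entry names a BOB together with the conflicting process executions.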
Use represents the consumption of a characterized entity by an activity.
A Use assertion, use(pe,b,r,t):
use(pe1,e1,in,t)
A reference to a given BOB may appear in multiple use assertions that refer to a given process execution, but each of those use assertions MUST have a distinct role.
Given an assertion use(pe,x,r) or use(pe,x,r,t), at least one value of x's attributes is a pre-condition for the activity denoted by pe to terminate.
Given an assertion use(pe,x,r) or use(pe,x,r,t), one can infer that the use of the entity denoted by x precedes the end of pe and follows the beginning of pe. Furthermore, we can infer that the generation of the entity x always precedes its use.
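The two temporal inferences for use can be combined into one check when the relevant times happen to be known. This is a sketch under stated assumptions: times are plain numbers, "precedes" and "follows" are read non-strictly, and `use_is_consistent` is a hypothetical helper.

```python
def use_is_consistent(pe_start, pe_end, use_time, generation_time=None):
    """Check a use against the inferences drawn from a use assertion.

    The use must fall between the beginning and the end of the process
    execution and, when the generation time of the used entity is known,
    must not come before it.
    """
    within_pe = pe_start <= use_time <= pe_end
    after_generation = generation_time is None or generation_time <= use_time
    return within_pe and after_generation


# A use at time 3.2 during a process execution running from 3.0 to 3.5,
# of an entity generated at time 1.0.
print(use_is_consistent(3.0, 3.5, 3.2, generation_time=1.0))  # True
# The same use of an entity that would only be generated at time 4.0.
print(use_is_consistent(3.0, 3.5, 3.2, generation_time=4.0))  # False
```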
Derivation expresses that some characterized entity is transformed from, created from, or affected by another characterized entity.
A Derivation assertion, isDerivedFrom(b1,b2):
isDerivedFrom(e5,e3)
Given an assertion isDerivedFrom(B,A), the values of some characteristics of B are at least partially determined by the values of some characteristics of A.
Given an assertion isDerivedFrom(B,A), one can infer that the use of the characterized entity denoted by A precedes the generation of the characterized entity denoted by B.
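In the file example, each derivation is backed by a process execution that used the source BOB and generated the derived one. The sketch below looks those executions up; the tuple layout and the `supporting_executions` helper are hypothetical, and nothing here implies that the model requires such an execution to be asserted for every derivation.

```python
# Use assertions as (pe, bob, role); generation assertions as (bob, pe, role).
USES = [("pe1", "e1", "in"), ("pe3", "e2", "in"),
        ("pe2", "e2", "in"), ("pe4", "e3", "in")]
GENERATIONS = [("e2", "pe1", "out"), ("e3", "pe3", "out"),
               ("e4", "pe2", "out"), ("e5", "pe4", "out")]


def supporting_executions(derived, source):
    """Process executions that both used `source` and generated `derived`."""
    users = {pe for pe, bob, _role in USES if bob == source}
    generators = {pe for bob, pe, _role in GENERATIONS if bob == derived}
    return sorted(users & generators)


print(supporting_executions("e3", "e2"))  # ['pe3']
print(supporting_executions("e4", "e2"))  # ['pe2']
```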
An agent represents a characterized entity capable of activity.
An agent assertion, agent(b):
A characterized entity can be asserted to be an agent or can be inferred to be an agent by involvement in a process execution.
bob(alice, [Employee="1234"]) and agent(alice)
bob(david) and isControlledBy(pe,david)
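The two ways of becoming an agent, explicit assertion and involvement in a process execution, can be sketched as a set union. The `known_agents` helper and the tuple layout for control assertions are hypothetical.

```python
def known_agents(agent_assertions, control_assertions):
    """Union of explicitly asserted agents and agents inferred from control.

    control_assertions holds tuples (pe, agent) or (pe, agent, role).
    """
    inferred = {agent for _pe, agent, *_role in control_assertions}
    return set(agent_assertions) | inferred


# agent(alice) is asserted; david is only mentioned in isControlledBy(pe,david).
print(sorted(known_agents(["alice"], [("pe", "david")])))  # ['alice', 'david']
```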
Control represents the involvement of an agent or a BOB in a process execution; a role qualifies this involvement.
A Control assertion, noted isControlledBy(pe,ag,r):
isControlledBy(pe3,david,"author")
IVP of is a relationship between two characterized entities asserted to have compatible characterization over some continuous time interval.
The rationale for introducing this relationship is that, in general, at any given time there will be multiple representations of a characterized entity, reflected in assertions possibly made by different asserters. In the example that follows, suppose the entity "Royal Society" is represented by two asserters, each using a different set of attributes. If the asserters agree that both representations refer to "The Royal Society", the question naturally arises of whether any correspondence can be established between the two representations. This is particularly relevant when (a) the sets of properties used by the two representations overlap partially, or (b) one set is subsumed by the other. In both cases, each of the two asserters has a partial view of "The Royal Society", and establishing a correspondence between them on the shared properties is beneficial: in case (a) each of the two representations complements the other, and in case (b) the representation with the additional properties complements the other.
An IVP assertion is denoted IVPof(B,A), where A and B are two BOBs.
bob(rs,[created: "1870"])
bob(rs_l1,[location: "loc2"])
bob(rs_l2,[location: "The Mall"])
bob(rs_m1,[membership: "250", year: "1900"])
bob(rs_m2,[membership: "300", year: "1945"])
bob(rs_m3,[membership: "270", year: "2010"])
IVPof(rs_m3, rs_l2)
IVPof(rs_m2, rs_l1)
IVPof(rs_m2, rs_l2)
IVPof(rs_m1, rs_l1)
IVPof(rs_l1, rs)
IVPof(rs_l2, rs)
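Since the relationship concerns a continuous time interval common to both BOBs, it can help to compute the overlap of their intervals. The sketch below uses the half-open intervals listed for e0 to e3 in the file example, taking t as 0; the `intersection` helper is hypothetical and does not handle intervals that collapse into a single instant.

```python
# Half-open intervals [start, end[ from the file example, with t = 0.
INTERVALS = {"e0": (0, 4), "e1": (0, 1), "e2": (1, 3), "e3": (3, 4)}


def intersection(a, b):
    """Overlap of two half-open intervals, or None if they do not overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start < end else None


# IVPof(e1,e0): e1 lies entirely within e0.
print(intersection(INTERVALS["e1"], INTERVALS["e0"]))  # (0, 1)
# e1 and e2 are adjacent, so they share no interval.
print(intersection(INTERVALS["e1"], INTERVALS["e2"]))  # None
```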
An assertion "B is an IVP of A" holds over the temporal intersection of A and B, only if:
Time is defined according to ISO 8601.
It is OPTIONAL to assert time in use, generation, and process execution.
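As a sketch of how optional ISO 8601 times might be handled, the standard library's `datetime.fromisoformat` can parse common ISO 8601 date-time strings (it accepts only a subset of the standard in older Python versions). The timestamps below are invented for illustration, and `parse_time` is a hypothetical helper.

```python
from datetime import datetime


def parse_time(text):
    """Parse an ISO 8601 date-time string, or pass through None for an unasserted time."""
    return None if text is None else datetime.fromisoformat(text)


# Hypothetical start and end times for a process execution.
start = parse_time("2011-07-01T10:00:00")
end = parse_time("2011-07-01T10:05:00")
# A time that was simply not asserted.
unknown = parse_time(None)

print(start < end, unknown)  # True None
```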
A recipe link is an association between a process execution and a process specification that underpins the process execution.
It is OPTIONAL to assert recipe links in process executions.
Process specifications, as referred to by recipe links, are out of scope of this specification.
A role is a label that names the function assumed by a BOB or an agent with respect to a specific process execution.
Use, Generation, and Control assertions MUST contain a role.
The set of all Use (resp. Generation, Control) assertions that refer to a given process execution MUST contain at most one occurrence of a given role.
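The uniqueness constraint on roles can be checked by counting (process execution, role) pairs within one kind of assertion. This is a minimal sketch; the `duplicate_roles` helper and the pair layout are hypothetical.

```python
from collections import Counter


def duplicate_roles(assertions):
    """Return the (process execution, role) pairs that occur more than once.

    `assertions` should contain pairs drawn from a single kind of assertion
    (all Use, all Generation, or all Control).
    """
    counts = Counter(assertions)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Roles of the use assertions in the file example: no role repeats per execution.
print(duplicate_roles([("pe1", "in"), ("pe3", "in"), ("pe2", "in"), ("pe4", "in")]))  # []
# A violating set: role "in" appears twice for pe1.
print(duplicate_roles([("pe1", "in"), ("pe1", "in"), ("pe1", "aux")]))  # [('pe1', 'in')]
```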
The interpretation of a role is specific to the process execution it relates to, which means that the same role may appear in relation to two different process executions with different interpretations. From this specification's viewpoint, a role's interpretation is out of scope.
Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be specified, such as by a coordinate, address, landmark, row, column, and so forth.
Location is an OPTIONAL characteristic of BOB, process execution, and agent.
Various relationships of a temporal nature exist between process executions. Control ordering means that the end of one process execution precedes the start of another process execution by the same agent. Information flow means that a characterized entity was generated by one process execution before it was used by another process execution.
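Information flow, as described above, can be sketched by pairing each use of a BOB with the process execution that generated it. The `information_flows` helper is hypothetical, the (consumer, producer) ordering of the result is an assumption, and excluding an execution that uses its own output is a choice made for this sketch.

```python
# Assertions from the file example.
GENERATIONS = [("e2", "pe1", "out"), ("e3", "pe3", "out"),
               ("e4", "pe2", "out"), ("e5", "pe4", "out")]
USES = [("pe1", "e1", "in"), ("pe3", "e2", "in"),
        ("pe2", "e2", "in"), ("pe4", "e3", "in")]


def information_flows(generations, uses):
    """Pairs (consumer, producer): consumer used a BOB that producer generated."""
    producer = {bob: pe for bob, pe, _role in generations}
    flows = set()
    for pe, bob, _role in uses:
        if bob in producer and producer[bob] != pe:
            flows.add((pe, producer[bob]))
    return sorted(flows)


print(information_flows(GENERATIONS, USES))
# [('pe2', 'pe1'), ('pe3', 'pe1'), ('pe4', 'pe3')]
```

Here both Charles's copy (pe2) and David's edit (pe3) consume what Bob's process execution (pe1) produced, and Edith's copy (pe4) consumes the output of David's edit.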
An assertion isScheduledAfter:
An assertion isInformedBy:
Revision represents the creation of a characterized entity considered to be a variant of another.
An assertion isRevisionOf, noted isRevisionOf(b2,b1,ag):
From an assertion isRevisionOf(new,old,ag), one can infer that:
An account is a set of assertions, forming a perspective on the world.
WG membership to be listed here.