PROV-DM is a data model for provenance for building representations of the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but is equipped with extensibility points allowing further domain-specific and application-specific extensions to be defined. PROV-DM is accompanied by PROV-ASN, a technology-independent abstract syntax notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates its mapping to concrete syntax, and which is used as the basis for a formal semantics.

This document is part of the PROV family of specifications, a set of specifications aiming to define the various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. This document defines the PROV-DM data model for provenance, accompanied by a notation to express instances of that data model for human consumption. Four other documents are:

Introduction

For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable: provenance can help those users to make trust judgments.

The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.

Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.

A set of specifications, referred to as the PROV family of specifications, define the various aspects that are necessary to achieve this vision in an inter-operable way, the first of which is this document:

The PROV-DM data model for provenance consists of a set of core concepts, and a few common relations, based on these core concepts. PROV-DM is a domain-agnostic model, but with clear extensibility points allowing further domain-specific and application-specific extensions to be defined.

This specification also introduces PROV-ASN, an abstract syntax that is primarily aimed at human consumption. PROV-ASN allows serializations of PROV-DM instances to be written in a technology independent manner, it facilitates its mapping to concrete syntax, and it is used as the basis for a formal semantics. This specification uses instances of provenance written in PROV-ASN to illustrate the data model.

Structure of this Document

In section 2, a set of preliminaries are introduced, including concepts that underpin PROV-DM and motivations for the PROV-ASN notation.

Section 3 provides an overview of PROV-DM listing its core types and their relations.

In section 4, PROV-DM is applied to a short scenario, encoded in PROV-ASN, and illustrated graphically.

Section 5 provides the normative definition of PROV-DM and the notation PROV-ASN.

Section 6 introduces further relations offered by PROV-DM, including relations for data collections and domain-independent common relations.

Section 7 provides an interpretation of PROV-DM in terms of ordering constraints between events, and also presents a set of structural constraints to be satisfied by PROV-DM.

Section 8 summarizes PROV-DM extensibility points.

Section 9 discusses how PROV-DM can be applied to the notion of resource.

PROV-DM Namespace

The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).

All the elements, relations, reserved names and attributes introduced in this specification belong to the PROV-DM namespace.

There is a desire to use a single namespace that all specifications of the PROV family can share to refer to common provenance terms. This is ISSUE-224.

Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [[!RFC2119]].

Preliminaries

A Conceptualization of the World

Entity, Activity, Agent

This specification is based on a conceptualization of the world that is described in this section. In the world (whether real or not), there are things, which can be physical, digital, conceptual, or otherwise, and activities involving things.

When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc.

Hence, to accommodate different perspectives on things and their situation in the world as perceived by us, we introduce the idea of a characterized thing, which refers to a thing and its situation in the world, as characterized by someone. We then define an entity as an identifiable characterized thing. An entity fixes some aspects of a thing and its situation in the world, so that it becomes possible to express its provenance, and what causes these specific aspects to be as such. An alternative entity may fix other aspects, and its provenance may be different.

Different users may take different perspectives on a resource with a URL. These perspectives in this conceptualization of the world are referred to as entities. Three such entities may be expressed:
  • a report available at a URL: fixes the nature of the thing, i.e. a document, and its location;
  • the version of the report available there today: fixes its version number, contents, and its date;
  • the report independent of where it is hosted and of its content over time: fixes the nature of the thing as a conceptual artifact.
The provenance of these three entities may differ, and may be along the following lines:
  • the provenance of a report available at a URL may include: the act of publishing it and making it available at a given location, possibly under some license and access control;
  • the provenance of the version of the report available there today may include: the authorship of the specific content, and reference to imported content;
  • the provenance of the report independent of where it is hosted over time may include: the motivation for writing the report, the overall methodology for producing it, and the broad team involved in it.

We do not assume that any characterization is more important than any other, and in fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity that characterizes the report appropriately.

In the world, activities involve entities in multiple ways: consuming them, processing them, transforming them, modifying them, changing them, relocating them, using them, generating them, being associated with them, etc.

An agent is a type of entity that takes an active role in an activity such that it can be assigned some degree of responsibility for the activity taking place. This definition intentionally stays away from using concepts such as enabling, causing, initiating, affecting, etc, because any entities also enable, cause, initiate, and affect in some way the activities. So the notion of having some degree of responsibility is really what makes an agent.

Even software agents can be assigned some responsibility for the effects they have in the world, so for example if one is using a Text Editor and one's laptop crashes, then one would say that the Text Editor was responsible for crashing the laptop. If one invokes a service to buy a book, that service can be considered responsible for drawing funds from one's bank to make the purchase (the company that runs the service and the web site would also be responsible, but the point here is that we assign some measure of responsibility to software as well). So when someone models software as an agent for an activity in our model, they mean the agent has some responsibility for that activity.

PROV-DM considers agents as a type of entity so that the model can be used to represent the provenance of the agents themselves. For example, grammar checker software may be an agent of a document preparation activity, but it can itself have a provenance record that states who its vendor is.

In this specification, the qualifier 'identifiable' is implicit whenever a reference is made to an activity, agent, or an entity.

Time and Event

Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong with such a provenance claim.

Although time is critical, we should also recognize that provenance can be used in many different contexts: in a single system, across the Web, or in spatial data management, to name a few. Hence, it is a design objective of PROV-DM to minimize the assumptions about time, so that PROV-DM can be used in varied contexts.

Furthermore, consider two activities that started at the same time instant. Just by referring to that instant, we cannot distinguish which activity start we refer to. This is particularly relevant if we try to explain that the start of these activities had different reasons. We need to be able to refer to the start of an activity as a first class concept, so that we can talk about it and about its relation with respect to other similar starts.

Hence, in our conceptualization of the world, an instantaneous event, or event for short, happens in the world and marks a change in the world, in its activities and in its entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [[CSP]], events represent communications or interactions; they are assumed to be atomic and instantaneous.

Types of Events

Four kinds of instantaneous events underpin the PROV-DM data model. The activity start and activity end events demarcate the beginning and the end of activities, respectively. The entity generation and entity usage events demarcate the characterization interval for entities. More specifically:

An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it becomes available for use.

An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity.

An activity start event is the instantaneous event that marks the instant an activity starts.

An activity end event is the instantaneous event that marks the instant an activity ends.

Event Ordering

To allow for minimalistic clock assumptions, like Lamport [[CLOCK]], PROV-DM relies on a notion of relative ordering of instantaneous events, without using physical clocks. This specification assumes that a partial order exists between instantaneous events.

Specifically, follows is a partial order between instantaneous events, indicating that an instantaneous event occurs at the same time as or after another. For symmetry, precedes is defined as the inverse of follows. (Hence, these relations are reflexive and transitive.)
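The relationship between precedes and follows can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event names s1 to s4 and the helper follows_closure are hypothetical, and the specification does not prescribe how the ordering is computed or stored. The sketch takes a handful of asserted precedes pairs, closes them reflexively and transitively, and inverts the result to obtain follows.

```python
from itertools import product


def follows_closure(events, asserted_precedes):
    """Return the 'follows' relation as a set of (later, earlier) pairs.

    'precedes' is built as the reflexive-transitive closure of the asserted
    pairs; 'follows' is then taken as its inverse.
    """
    # Reflexive part: every event occurs at the same time as itself.
    precedes = {(e, e) for e in events}
    precedes |= set(asserted_precedes)

    # Naive transitive closure: keep adding (a, d) whenever (a, b) and (b, d)
    # are both present, until nothing new appears.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(precedes), repeat=2):
            if b == c and (a, d) not in precedes:
                precedes.add((a, d))
                changed = True

    # follows is the inverse of precedes.
    return {(later, earlier) for (earlier, later) in precedes}


# Hypothetical events: s1 precedes s2, and s2 precedes both s3 and s4.
events = ["s1", "s2", "s3", "s4"]
asserted = [("s1", "s2"), ("s2", "s3"), ("s2", "s4")]
follows = follows_closure(events, asserted)

print(("s3", "s1") in follows)  # s3 follows s1 by transitivity
print(("s3", "s4") in follows)  # s3 and s4 are not ordered by the assertions
```

Because the order is only partial, s3 and s4 remain unrelated: neither follows the other unless a further pair is asserted.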

How such a partial order is realized in practice is beyond the scope of this specification. This specification only assumes that each instantaneous event can be mapped to an instant in some form of timeline. The actual mapping is not in scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport-style clocks is also beyond this specification. It is anticipated that follows and precedes correspond to some ordering over this timeline.

This specification introduces a set of "temporal interpretation" rules allowing the derivation of instantaneous event ordering constraints from provenance records. According to such temporal interpretation, provenance records MUST satisfy such constraints. We note that the actual verification of such ordering constraints is outside the scope of this specification.

PROV-DM also allows for time observations to be inserted in specific provenance records, for each recognized instantaneous event introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mapping of this instantaneous event to the timeline. It can also help with the verification of associated ordering constraints (though, again, this verification is outside the scope of this specification).

We need to refine the definition of entity and activity, and all the concepts in general. This is ISSUE-223.

PROV-ASN: The Provenance Abstract Syntax Notation

This specification defines PROV-DM, a data model for provenance, consisting of records describing how people, entities, and activities, were involved in producing, influencing, or delivering a piece of data or a thing in the world.

This specification also relies on a language, PROV-ASN, the Provenance Abstract Syntax Notation, to express instances of that data model. For each construct of PROV-DM, a corresponding ASN expression is introduced, by way of a production in the ASN grammar.

PROV-ASN is an abstract syntax, whose goals are:

This specification provides a grammar for PROV-ASN. Each record of the PROV-DM data model is explained in terms of the production of this grammar.

The formal semantics of PROV-DM is defined at [[PROV-SEM]] and its encoding in the OWL2 Web Ontology Language at [[!PROV-O]].

Representation, Record, Assertion, and Inference

PROV-DM is a provenance data model designed to express representations of the world. Such representations are structured according to a set of records.

A file at some point during its lifecycle, which includes multiple edits by multiple people, can be represented by its location in the file system, a creator, and content.

These records are relative to an asserter, and in that sense constitute assertions stating properties of the world, as represented by an asserter. Different asserters will normally contribute different records expressing different representations of the world. This specification does not define a notion of consistency between different sets of records (whether by the same asserter or different asserters). The data model provides the means to associate attribution to assertions.

An alternative representation of the above file is a set of blocks in a hard disk.

The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, all PROV-DM records SHOULD be interpreted as a description of what has happened, as opposed to what may or will happen.

This specification does not prescribe the means by which an asserter arrives at records; for example, records can be composed on the basis of observations, reasoning, or any other means.

Sometimes, inferences about the world can be made from records conformant to the PROV-DM data model. When this is the case, this specification defines such inferences, allowing new provenance records to be inferred from existing ones. Hence, representations of the world can result either from direct assertions of records by asserters or from inference of new records by application of inference rules defined by this specification.

Grammar Notation

This specification includes a grammar for PROV-ASN expressed using the Extended Backus-Naur Form (EBNF) notation.

Each production rule (or production, for short) in the grammar defines one non-terminal symbol, in the form:

E ::= expression

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
  • E: matches term satisfying rule for symbol E.
  • 'abc': matches the literal string inside the single quotes.
  • (expression)?: matches expression or nothing; optional expression.
  • (expression)+: matches one or more occurrences of expression.
  • (expression)*: matches zero or more occurrences of expression.

PROV-DM: An Overview

The following ER diagram provides a high level overview of the structure of PROV-DM records. Examples of provenance assertions that conform to this schema are provided in the next section.

PROV-DM overview
Overview diagram does not represent the sub-relations -- proposal to use a UML notation instead of ER.

The model includes the following elements:

A set of attribute-value pairs can be associated to elements and relations of the PROV model in order to further characterize their nature. The alternateOf and specializationOf relationships are used to denote that two entities represent two alternative characterizations of the same thing, and that one of the two is a more precise characterization than the other, respectively.
The attributes role and type are pre-defined.

In addition to the kinds of record introduced in the overview figure, PROV-DM also features a notion of Account Record that allows attribution of provenance records to be expressed.

The set of relations presented here forms a core, which is further extended with additional relations, defined in Section Common Relations.

The model includes a further additional element: notes. These are also structured as sets of attribute-value pairs. Notes are used to provide additional, "free-form" information regarding any identifiable construct of the model, with no prescribed meaning. Notes are described in detail here.

Attributes and notes are the main extensibility points in the model: individual interest groups are expected to extend PROV-DM by introducing new attributes and notes as needed to address applications-specific provenance modelling requirements.

Example

There is a suggestion that a better example should be adopted for this document. Possibly, several shorter examples. This is ISSUE-132

To illustrate PROV-DM, this section presents an example encoded according to PROV-ASN. For more detailed explanations of how PROV-DM should be used, and for more examples, we refer the reader to the Provenance Primer [[!PROV-PRIMER]].

A File Scenario

This scenario is concerned with the evolution of a crime statistics file (referred to as e0) stored on a shared file system and which journalists Alice, Bob, Charles, David, and Edith can share and edit. The file e0 evolution can be mapped to an event line, in which we consider various events; events listed below follow each other, unless otherwise specified.

Event evt1: Alice creates (activity: a0) an empty file in /share/crime.txt. We denote this file e1.

Event evt2: Bob appends (activity: a1) the following line to /share/crime.txt:

There was a lot of crime in London last month.

We denote the revised file e2.

Event evt3: Charles emails (activity: a2) the contents of /share/crime.txt, as an attachment, which we refer to as e4. (We specifically refer to a copy of the file that is uploaded on the mail server.)

Event evt4: David edits (activity: a3) file /share/crime.txt as follows.

There was a lot of crime in London and New York last month.

We denote the revised file e3.

Event evt5: Edith emails (activity: a4) the contents of /share/crime.txt as an attachment, referred to as e5.

Event evt6: between events evt4 and evt5, someone (unspecified) runs a grammar checker (activity: a5) on the file /share/crime.txt, using a set of grammatical rules (referred to as gr1). The file after grammatical checking is referred to as e6.

Encoding using PROV-ASN

In this section, the example is encoded according to the provenance data model (specified in section PROV-DM: The Provenance Data Model) and expressed in PROV-ASN.

Entity Records (described in Section Entity). The file in its various forms and its copies are modelled as entity records, corresponding to multiple characterizations, as per scenario. The entity records are identified by e0, ..., e6.

entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", 
             ex:content="" ])
entity(e2, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", 
             ex:content="There was a lot of crime in London last month."])
entity(e3, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", 
             ex:content="There was a lot of crime in London and New York last month."])
entity(e4)
entity(e5)
entity(e6, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", 
             ex:content="There was a lot of crime in London and New York last month.", 
             ex:grammarchecked="yes"])

These entity records list attributes that have been given values during intervals delimited by events; such intervals are referred to as characterization intervals. The following table lists all entity identifiers and their corresponding characterization intervals. When the end of the characterization interval is not delimited by an event described in this scenario, it is marked by "...".

Entity    Characterization Interval
e0        evt1 - ...
e1        evt1 - evt2
e2        evt2 - evt4
e3        evt4 - ...
e4        evt3 - ...
e5        evt5 - ...
e6        evt6 - ...
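As a rough illustration, the table can be turned into a small lookup that answers which entity records are characterized at a given event. The Python sketch below rests on two assumptions that are not stated by the specification: the events are placed in the single total order suggested by the scenario (evt6 between evt4 and evt5), and an interval is treated as including its start event and excluding its end event. The names EVENT_ORDER, INTERVALS and characterized_at are hypothetical.

```python
# Assumed total order of the scenario's events; evt6 falls between evt4 and evt5.
EVENT_ORDER = ["evt1", "evt2", "evt3", "evt4", "evt6", "evt5"]

# Characterization intervals copied from the table. None stands for "...",
# i.e. an end that is not delimited by an event of the scenario.
INTERVALS = {
    "e0": ("evt1", None),
    "e1": ("evt1", "evt2"),
    "e2": ("evt2", "evt4"),
    "e3": ("evt4", None),
    "e4": ("evt3", None),
    "e5": ("evt5", None),
    "e6": ("evt6", None),
}


def characterized_at(event):
    """Return the sorted entity identifiers whose interval covers `event`.

    Assumption: intervals are half-open, [start, end).
    """
    pos = EVENT_ORDER.index(event)
    covered = []
    for entity, (start, end) in INTERVALS.items():
        has_started = EVENT_ORDER.index(start) <= pos
        not_ended = end is None or pos < EVENT_ORDER.index(end)
        if has_started and not_ended:
            covered.append(entity)
    return sorted(covered)


print(characterized_at("evt3"))  # ['e0', 'e2', 'e4']
print(characterized_at("evt5"))  # ['e0', 'e3', 'e4', 'e5', 'e6']
```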

Activity Records (described in Section Activity) represent activities in the scenario. Each activity record contains the activity identifier, a start time and a type attribute characterizing the nature of the activity.

activity(a0, 2011-11-16T16:00:00,,[prov:type="createFile"])
activity(a1, 2011-11-16T16:05:00,,[prov:type="edit"])
activity(a2, 2011-11-16T17:00:00,,[prov:type="email"])
activity(a3, 2011-11-17T09:00:00,,[prov:type="edit"])
activity(a4, 2011-11-17T09:50:00,,[prov:type="email"])
activity(a5, 2011-11-17T09:30:00, ,[prov:type="grammarcheck"])

Generation Records (described in Section Generation) represent the event at which a file is created in a specific form. Attributes are used to describe the modalities according to which a given entity is generated by a given activity. The interpretation of attributes is application specific. Illustrations of such attributes for the scenario are: no attribute is provided for e0; e2 was generated by the editor's save function; e4 can be found on the smtp port, in the attachment section of the mail message; e6 was produced on the standard output of a5. Sometimes, it is necessary to refer to generation records in other records. For those cases, we introduce identifiers such as g1 and g2 to identify the generation records; these identifiers are used in derivations introduced below to reference those specific records.

wasGeneratedBy(e0, a0)
wasGeneratedBy(e1, a0, [ex:fct="create"])
wasGeneratedBy(e2, a1, [ex:fct="save"])     
wasGeneratedBy(e3, a3, [ex:fct="save"])     
wasGeneratedBy(g1, e4, a2, [ex:port="smtp", ex:section="attachment"])  
wasGeneratedBy(g2, e5, a4, [ex:port="smtp", ex:section="attachment"])    
wasGeneratedBy(e6, a5, [ex:file="stdout"])

Usage Records (described in Section Usage) represent the event by which a file is read by an activity. Likewise, attributes describe the modalities according to which the various entities are used by activities. Illustrations of such attributes are: e1 is used in the context of a1's load functionality; e2 is used by a2 in the context of its attach functionality; e3 is used on the standard input by a5. Sometimes, it is also necessary to refer to usage records in other records. To this end, for these usage records, identifiers such as u1 and u2 are introduced to identify them; these identifiers are used later in derivations introduced below to refer to these specific Usage records.

used(a1,e1,[ex:fct="load"])
used(a3,e2,[ex:fct="load"])
used(u1,a2,e2,[ex:fct="attach"])
used(u2,a4,e3,[ex:fct="attach"])
used(a5,e3,[ex:file="stdin"])

Derivation Records (described in Section Derivation Relation) express that an entity is derived from another. The first two are expressed in their compact version, whereas the following two are expressed in their full version, including the activity underpinning the derivation, and associated usage (u1, u2) and generation (g1, g2) records.

wasDerivedFrom(e2,e1)
wasDerivedFrom(e3,e2)
wasDerivedFrom(e4,e2,a2,g1,u1)
wasDerivedFrom(e5,e3,a4,g2,u2)
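The four derivation records above can be followed link by link to list what an entity was, directly or indirectly, asserted to derive from. The Python sketch below only walks the asserted pairs; it does not claim that the data model licenses any particular inference over such chains. The dictionary DERIVED_FROM and the function derivation_chain are hypothetical names, and the sketch assumes each entity has at most one asserted source, which holds for this scenario.

```python
# Asserted derivations from the scenario: derived entity -> source entity.
DERIVED_FROM = {
    "e2": "e1",
    "e3": "e2",
    "e4": "e2",
    "e5": "e3",
}


def derivation_chain(entity):
    """Follow asserted wasDerivedFrom links from `entity` back to its origin."""
    chain = []
    seen = {entity}
    while entity in DERIVED_FROM:
        entity = DERIVED_FROM[entity]
        if entity in seen:  # guard against a cyclic set of assertions
            break
        seen.add(entity)
        chain.append(entity)
    return chain


print(derivation_chain("e5"))  # ['e3', 'e2', 'e1']
print(derivation_chain("e4"))  # ['e2', 'e1']
```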

specializationOf: (this relation is described in Section alternate and specialization records). The crime statistics file (e0) has various contents over its existence (e1, e2, e3); the entity records identified by e1, e2, e3 specialize e0 with an attribute content. Likewise, the one denoted by e6 specializes the record denoted by e3 with an attribute grammarchecked.

specializationOf(e1,e0)
specializationOf(e2,e0)
specializationOf(e3,e0)
specializationOf(e6,e3) 

Agent Records (described in Section Agent): the various users are represented as agents, themselves being a type of entity. Furthermore, a sixth agent is defined to be a software agent (the grammar checker).

agent(ag1, [ prov:type="prov:Person" %% xsd:QName, ex:name="Alice" ])

agent(ag2, [ prov:type="prov:Person" %% xsd:QName, ex:name="Bob" ])

agent(ag3, [ prov:type="prov:Person" %% xsd:QName, ex:name="Charles" ])

agent(ag4, [ prov:type="prov:Person" %% xsd:QName, ex:name="David" ])

agent(ag5, [ prov:type="prov:Person" %% xsd:QName, ex:name="Edith" ])

agent(ag6, [ prov:type="prov:SoftwareAgent" %% xsd:QName, ex:name="GoodEnglish" ])

Activity Association Records (described in Section Activity Association): the association of an agent with an activity is expressed with wasAssociatedWith, and the nature of this association is described by attributes. Illustrations of such attributes include the role of the participating agent, as creator, author, communicator, and checker (role is a reserved attribute in PROV-DM).

wasAssociatedWith(a0, ag1, [prov:role="creator"])
wasAssociatedWith(a1, ag2, [prov:role="author"])
wasAssociatedWith(a2, ag3, [prov:role="communicator"])
wasAssociatedWith(a3, ag4, [prov:role="author"])
wasAssociatedWith(a4, ag5, [prov:role="communicator"])

In addition, activity a5 is associated with the grammar checker, which relied on a set of grammatical rules to perform the grammar checking. Generally, rules like these are referred to as a plan, a specific type of entity.

entity(gr1,[prov:type="prov:Plan"%% xsd:QName, 
            ex:url="http://example.org/grammarRules.html" %% xsd:anyURI])

wasAssociatedWith(a5, ag6, gr1, [prov:role="checker"])

Finally, the software agent ag6 did not act autonomously, but was operating on behalf of a user. This chain of responsibility is captured with the Responsibility Record (described in Section Responsibility Record).

actedOnBehalfOf(ag6, ag4, a5, [prov:type="delegation"])

Graphical Illustration

Provenance records can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance records. Therefore, it should not be seen as an alternate notation for expressing provenance.

The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and pentagonal shapes, respectively. Usage, Generation, Derivation, Activity Association, and Specialization are represented as directed edges.

Entities are laid out according to the ordering of their generation event. We endeavor to show time progressing from left to right. This means that edges for Usage, Generation and Derivation typically point from right to left.

Fig. 1. Graphical illustration of the example scenario

Fig. 2. Timeline for Entities and specializationOf relations

PROV-DM Core

This section contains the normative specification of PROV-DM core, the core of the PROV data model.

Record

This specification introduced provenance as a set of records describing the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. PROV-DM is a data model defining the structure and meaning of such records.

Concretely, PROV-DM consists of a set of constructs to formulate representations of the world and constraints that must be satisfied by them. A PROV-DM record is a body of information about something which is of interest from a provenance viewpoint. PROV-DM records may be asserted directly or may be inferred from others.

PROV-DM records are typed and can be among the following types, introduced one by one in this section: entity record, activity record, agent record, note record, generation record, usage record, derivation record, activity association record, responsibility record, start record, end record, alternate record, specialization record, annotation record, and account record.

Furthermore, PROV-DM includes a "house-keeping construct", a record container, used to wrap PROV-DM records and facilitate their interchange. Hence, by creating a set of PROV-DM records and packaging them into a record container, one forms a provenance record.

In PROV-ASN, such representations of the world MUST be conformant with the top-level production record of the grammar. These records are grouped in three categories: elementRecord (see section Element), relationRecord (see section Relation), and accountRecord (see section Account).

record ::= elementRecord | relationRecord | accountRecord

elementRecord ::= entityRecord | activityRecord | agentRecord | noteRecord

relationRecord ::= generationRecord | usageRecord | derivationRecord | activityAssociationRecord | responsibilityRecord | startRecord | endRecord | alternateRecord | specializationRecord | annotationRecord
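The grouping expressed by these productions can be mirrored by a small classifier that looks at the leading keyword of a PROV-ASN expression. The Python sketch below is an assumption-laden illustration: it covers only keywords that appear in the examples of this document, the pairing of each keyword with a production name is inferred from those examples, and the names PRODUCTIONS and classify are hypothetical.

```python
import re

# Keyword -> (production, category). Only keywords used in this document's
# examples are listed; the pairing is inferred from them, not normative.
PRODUCTIONS = {
    "entity": ("entityRecord", "elementRecord"),
    "activity": ("activityRecord", "elementRecord"),
    "agent": ("agentRecord", "elementRecord"),
    "wasGeneratedBy": ("generationRecord", "relationRecord"),
    "used": ("usageRecord", "relationRecord"),
    "wasDerivedFrom": ("derivationRecord", "relationRecord"),
    "wasAssociatedWith": ("activityAssociationRecord", "relationRecord"),
    "actedOnBehalfOf": ("responsibilityRecord", "relationRecord"),
    "specializationOf": ("specializationRecord", "relationRecord"),
    "alternateOf": ("alternateRecord", "relationRecord"),
}

KEYWORD_RE = re.compile(r"^\s*([A-Za-z]+)\s*\(")


def classify(expression):
    """Return (production, category) for a PROV-ASN expression."""
    match = KEYWORD_RE.match(expression)
    if match is None or match.group(1) not in PRODUCTIONS:
        raise ValueError("unrecognized record: %r" % expression)
    return PRODUCTIONS[match.group(1)]


print(classify('wasGeneratedBy(e1, a0, [ex:fct="create"])'))
print(classify("entity(e4)"))
```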

In PROV-ASN, a record container is compliant with the production recordContainer (see section Record Container).

Element

This section describes all the PROV-DM records referred to as element records. (In PROV-ASN, such records are conformant to the elementRecord production of the grammar.)

There is still some confusion about what the identifiers really denote. For instance, are they entity identifiers or entity record identifiers? This is ISSUE-183. An example and questions appear in ISSUE-215. A related issue is also raised in ISSUE-145.

Entity Record

In PROV-DM, an entity record is a representation of an entity.

Examples of entities include a car on a road, a linked data set, a sparse matrix of floating-point numbers, a document in a directory, the same document published on the Web, and meta-data embedded in a document.

An entity record, noted entity(id, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an entity;
  • attributes: an OPTIONAL set of attribute-value pairs [ attr1=val1, ...], representing this entity's situation in the world.

The assertion of an entity record states, from a given asserter's viewpoint, the existence of an entity, whose situation in the world is represented by the attribute-value pairs, which remain unchanged during a characterization interval, i.e. a continuous interval between two instantaneous events in the world.

In PROV-ASN, an entity record's text matches the entityRecord production of the grammar defined in this specification document.

entityRecord ::= entity ( identifier optional-attribute-values )

optional-attribute-values ::= , [ attribute-values ]
attribute-values ::= attribute-value | attribute-value , attribute-values
attribute-value ::= attribute = Literal

The following entity record,

entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
states the existence of an entity, denoted by identifier e0, with type File and path /shared/crime.txt in the file system, and creator Alice. The attributes path and creator are application specific, whereas the attribute type is reserved in the PROV-DM namespace.
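As an illustration only, the entity record above can be held in memory and written back out in PROV-ASN. The following Python sketch is not part of PROV-DM: the class name EntityRecord and the method to_asn are hypothetical, and only string-valued attributes are handled.

```python
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    """Hypothetical in-memory form of entity(id, [ attr1=val1, ...])."""
    id: str
    # The attribute-value pairs are OPTIONAL, hence the empty default.
    attributes: dict = field(default_factory=dict)

    def to_asn(self) -> str:
        # Omit the bracketed list entirely when no attributes are given.
        if not self.attributes:
            return f"entity({self.id})"
        # Assumption: every value is a string literal, written in quotes.
        pairs = ", ".join(f'{k}="{v}"' for k, v in self.attributes.items())
        return f"entity({self.id}, [ {pairs} ])"


e0 = EntityRecord("e0", {"prov:type": "File",
                         "ex:path": "/shared/crime.txt",
                         "ex:creator": "Alice"})
print(e0.to_asn())
```

Under these assumptions, the printed text reproduces the example record for e0.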
Further considerations:
  • The entity identifier id contained in an entity record is expected to be unique among all the identifiers contained in the current account's records. This constraint is elaborated upon in identifiable-record-in-account. It means that the current account does not contain any other record for this identifier. Effectively, id acts as a local identifier for this record. In this specification, whenever we write "an entity record identified by ... ", this identification is to be understood in the context of the account that defines it.
  • If an asserter wishes to characterize an entity with the same attribute-value pairs over several intervals, then they are required to create multiple entity records (either by direct assertion or by inference), each with its own identifier (so as to allow potential dependencies between the various entity records to be expressed).
  • There is no assumption that the set of attributes is complete, nor that the attributes are independent of or orthogonal to each other.
  • A characterization interval may collapse into a single instant.
  • An entity assertion is about a thing, whose situation in the world may be variant. An entity record is asserted at a particular point and is invariant, in the sense that its attributes are given a value as part of that assertion.
  • The attributes occurring in an entity record MUST be declared in the namespace referred to by their prefix according to Section record-attribute. Furthermore, for each attribute, a namespace also declares the number of occurrences it may have in a list of attributes. An entity record is valid if the number of occurrences of any of its attributes is compatible with this attribute's declaration in its namespace. This property applies to all types of records, and is referred to as attribute occurrence validity.
  • Activities are not represented by entity records, but instead by activity records, as explained below.
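Attribute occurrence validity, as described in the considerations above, can be sketched as a simple check. This Python fragment is an illustration under stated assumptions: the table MAX_OCCURRENCES and its contents are hypothetical, and a namespace declaration is assumed to give a maximum number of occurrences (None meaning unbounded). PROV-DM does not prescribe this representation.

```python
from collections import Counter

# Hypothetical namespace declarations: attribute name -> maximum number of
# occurrences allowed in one attribute list (None means no upper bound).
MAX_OCCURRENCES = {"prov:type": None, "ex:path": 1, "ex:creator": None}


def occurrence_valid(attributes):
    """Check a list of (attribute, value) pairs against the declarations."""
    counts = Counter(name for name, _ in attributes)
    for name, count in counts.items():
        if name not in MAX_OCCURRENCES:
            return False  # attribute is not declared in its namespace
        limit = MAX_OCCURRENCES[name]
        if limit is not None and count > limit:
            return False  # attribute occurs more often than declared
    return True
```

For instance, a list containing ex:path twice would be rejected under the hypothetical declarations above, whereas repeated prov:type attributes would be accepted.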

Furthermore, Section Activity Association Record introduces the idea of plans being associated with activities:

  • Plan: entities of type plan that represent a set of actions or steps intended by one or more agents to achieve some goals.
The characterization interval of an entity record is currently implicit. Making it explicit would allow us to define alternateOf and specializationOf more precisely. Beginning and end of characterization interval could be expressed by attributes (similarly to activities). How do we define the end of an entity? This is ISSUE-204.

Activity Record

In PROV-DM, an activity record is a representation of an identifiable activity, which performs a piece of work.

An activity, represented by an activity record, is delimited by its start and its end events; hence, it occurs over an interval delimited by two instantaneous events. However, an activity record need not mention time information, nor duration, because they may not be known.

If start and end times are known, they are expressed as attributes of an activity, where the interpretation of attribute in the context of an activity record is the same as the interpretation of attribute for entity record: an activity record's attribute remains constant for the duration of the activity it represents. Further characteristics of the activity in the world can be represented by other attribute-value pairs, which MUST also remain unchanged during the activity duration.

Examples of activities include driving a car from Boston to Cambridge, assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, editing a file, and publishing a web page.

An activity record, written activity(id, st, et, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an activity;
  • startTime: an OPTIONAL time st indicating the start of the activity;
  • endTime: an OPTIONAL time et indicating the end of the activity;
  • attributes: an OPTIONAL set of attribute-value pairs [ attr1=val1, ...], representing other attributes of this activity that hold for its whole duration.

In PROV-ASN, an activity record's text matches the activityRecord production of the grammar defined in this specification document.

activityRecord ::= activity ( identifier , time , time optional-attribute-values )

The following activity record

activity(a1,2011-11-16T16:05:00,2011-11-16T16:06:00,
        [ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])

states the existence of an activity identified by a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit (declared in some namespace with prefix ex). The attribute host is application specific, but MUST hold for the duration of the activity. The attribute type is a reserved attribute of PROV-DM, allowing for subtyping to be expressed.

For the interpretation of an activity record, see start-precedes-end.
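The constraint start-precedes-end is defined elsewhere in this specification. As a rough illustration, and assuming it requires that a known start time is not later than a known end time, a check over the two OPTIONAL time components could look as follows in Python. The function name is hypothetical.

```python
from datetime import datetime
from typing import Optional


def start_precedes_end(start: Optional[str], end: Optional[str]) -> bool:
    """Return True unless both times are given and the end is before the start."""
    if start is None or end is None:
        # Start and end times are OPTIONAL, so there is nothing to check.
        return True
    # Times are assumed to be ISO 8601 strings, as in the example above.
    return datetime.fromisoformat(start) <= datetime.fromisoformat(end)
```

With the times of the activity record a1 above, the check succeeds; swapping the two times would make it fail.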

Further considerations:

  • The activity identifier id contained in an activity record is also expected to be unique among all the identifiers contained in the current account's records. This constraint is elaborated upon in identifiable-record-in-account. It means that the current account does not contain any other record for this identifier, and that effectively id acts as a local identifier for this record in the current account.
  • An activity record is not an entity record. Indeed, an entity record represents an entity that exists in full at any point in its characterization interval, persists during this interval, and preserves the characteristics that make it identifiable. In contrast, an activity is something that happens, unfolds or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [[Logic]].

Agent Record

An agent record is a representation of an agent, which is an entity that can be assigned some degree of responsibility for an activity taking place.

Many agents can have an association with a given activity. An agent may do the ordering of the activity, another agent may do its design, another agent may push the button to start it, another agent may run it, etc. As many agents as one wishes to mention can occur in the provenance record, if it is important to indicate that they were associated with the activity.

From an inter-operability perspective, it is useful to define some basic categories of agents since it will improve the use of provenance records by applications. There should be very few of these basic categories to keep the model simple and accessible. There are three types of agents in the model since they are common across most anticipated domains of use:

  • Person: agents of type Person are people. (This type is equivalent to "foaf:Person" [[FOAF]].)
  • Organization: agents of type Organization are social institutions such as companies, societies, etc. (This type is equivalent to "foaf:Organization" [[FOAF]].)
  • SoftwareAgent: a software agent is a piece of software.

These types are mutually exclusive, though they do not cover all kinds of agent.

An agent record, noted agent(id, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an agent;
  • attributes: a set of attribute-value pairs [ attr1=val1, ...], representing this agent's situation in the world.

In PROV-ASN, an agent record's text matches the agentRecord production of the grammar defined in this specification document.

agentRecord ::= agent ( identifier optional-attribute-values )

With the following assertions,

agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])

entity(e2) and wasStartedBy(a1,e2,[prov:role="author"])

entity(e3) and wasAssociatedWith(a1,e3,[prov:role="sponsor"])

the agent record represents an explicit agent identified by e1 that holds irrespective of activities it may be associated with. On the other hand, from the entity records identified by e2 and e3, one can infer agent records, as per the following inference.

One can assert an agent record or alternatively, one can infer an agent record by its association with an activity.

If the records entity(e,attrs) and wasAssociatedWith(a,e) hold for some identifiers a, e, and attribute-values attrs, then the record agent(e,attrs) also holds.
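The inference above can be sketched procedurally. The following Python fragment is illustrative only: the function name infer_agents and the dictionary and tuple representations of records are assumptions, not part of PROV-DM. Whether wasStartedBy records are passed in as associations is left to the caller.

```python
def infer_agents(entities, associations):
    """Infer agent records from entity records and activity associations.

    entities: dict mapping an entity identifier to its attribute-value pairs,
              standing for the records entity(e, attrs).
    associations: iterable of (activity, identifier) pairs, standing for the
                  records wasAssociatedWith(a, e).
    Returns a dict of inferred records agent(e, attrs).
    """
    return {e: entities[e] for _a, e in associations if e in entities}


entities = {"e2": {}, "e3": {}, "e4": {"ex:name": "Bob"}}
associations = [("a1", "e3")]
print(infer_agents(entities, associations))
```

Only e3 is inferred to be an agent here, since e4 is not associated with any activity and e2 is not named in the associations supplied.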
Shouldn't we allow for entities (not agent) to be associated with an activity? Should we drop the inference association-agent? ISSUE-203.

Note Record

As provenance records are exchanged between systems, it may be useful to add extra information about such records. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the assertions made. Likewise, an interactive visualization component may want to enrich a set of provenance records with information helping reproduce their visual representation. To help with inter-operability, PROV-DM introduces a simple annotation mechanism allowing any identifiable record to be associated with notes.

A note record is a set of attribute-value pairs, whose meaning is application specific. It may or may not be a representation of something in the world.

In PROV-ASN, a note record's text matches the noteRecord production of the grammar defined in this specification document.

noteRecord ::= note ( identifier , attribute-values )

A separate PROV-DM record is used to associate a note with an identifiable record (see Section on annotation). A given note may be associated with multiple records.

The following note record consists of a set of application-specific attribute-value pairs, intended to help the rendering of the record it is associated with, by specifying its color and its position on the screen.

note(ann1,[ex:color="blue", ex:screenX=20, ex:screenY=30])
hasAnnotation(g1,ann1)

The note record is associated with a record g1 previously introduced (hasAnnotation is discussed in Section Annotation Record). In this example, the attribute-value pairs do not constitute a representation of something in the world; they are just used to help render provenance.

Attribute-value pairs occurring in notes differ from attribute-value pairs occurring in entity records and activity records. In entity and activity records, attribute-value pairs MUST be a representation of something in the world, which remains constant for the duration of the characterization interval (for entity records) or the activity duration (for activity records). A note record linked with an entity record consists of attribute-value pairs which may or may not represent the entity's situation in the world. If a note record's attribute-value pair represents an entity's situation in the world, no requirement is made on this situation to be unchanged for the entity's characterization interval.

Comments on notes and their attributes appear in ISSUE-189.

Relation

This section describes all the PROV-DM records representing relations between the elements introduced in Section Element. While these relations are not binary, they all involve two primary elements. They can be summarized as follows.

PROV-DM Core Relation Summary

            Entity             Activity         Agent               Note
Entity      wasDerivedFrom     wasGeneratedBy   -                   hasAnnotation
            alternateOf
            specializationOf
Activity    used               -                wasStartedBy        hasAnnotation
                                                wasEndedBy
                                                wasAssociatedWith
Agent       -                  -                actedOnBehalfOf     hasAnnotation
Note        -                  -                -                   hasAnnotation

In PROV-ASN, all these relation records are conformant to the relationRecord production of the grammar.

Activity-Entity Relation

Generation Record

In PROV-DM, a generation record is a representation of an instantaneous world event, the completed creation of a new entity by an activity. This entity becomes available for usage after this instantaneous event. This entity did not exist before creation (though another with a different characterization may have existed before). The representation of this instantaneous event encompasses a description of the modalities of generation of this entity by this activity.

A generation event may be, for example, the completed creation of a file by a program, the completed creation of a linked data set, the completed publication of a new version of a document, or the completed sending of a value on a communication channel. The point at which creation is actually complete is application specific: generation of a file may complete when a lock is released by its creator, whereas actual publication of a document may be after the embargo period that was defined for it.

A generation record, written wasGeneratedBy(id,e,a,t,attrs) in PROV-ASN, has the following components:

  • id: an OPTIONAL identifier id identifying the generation record;
  • entity: an identifier e identifying an entity record that represents the entity that is created;
  • activity: an OPTIONAL identifier a identifying an activity record that represents the activity that creates the entity;
  • time: an OPTIONAL "generation time" t, the time at which the entity was created;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describes the modalities of generation of this entity by this activity.

In PROV-ASN, a generation record's text matches the generationRecord production of the grammar defined in this specification document.

generationRecord ::= wasGeneratedBy ( identifier , eIdentifier , aIdentifier , time optional-attribute-values )

A generation record's id is OPTIONAL. It MUST be used when annotating generation records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).

The following generation assertions

  wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1", ex:order=1])
  wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p1", ex:order=2])

state the existence of two events in the world (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, represented by entity records identified by e1 and e2, are created by an activity, itself represented by an activity record identified by a1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order in these records are application specific.

In some cases, we may want to record the time at which an entity was generated without having to specify the activity that generated it. To support this requirement, the activity component in generation records is optional. Hence, the following record indicates the time at which an entity is generated, without giving the activity that did it.

  wasGeneratedBy(e,,2001-10-26T21:32:52)

For the interpretation of a generation record, see generation-within-activity.

See generation-uniqueness for a structural constraint on generation records.
We may want to assert the time at which an entity is created. The placeholder for such time information is a generation record. But a generation mandates the presence of an activity identifier. But it may not be known. It would be nice if the activity identifier was made optional in the generation record. This is ISSUE-205. This is now implemented. See also ISSUE-43.

Usage Record

In PROV-DM, a usage record is a representation of an instantaneous world event: an activity beginning to consume an entity. Before this event, the activity had not begun to consume or use this entity. The representation includes a description of the modalities of usage of this entity by this activity.

A usage event may be a procedure beginning to consume a parameter, a service starting to read a value on a port, a program beginning to read a configuration file, or the point at which an ingredient, such as eggs, is being added in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix), or leave it as such, ready for further uses (e.g. a file on a file system can be read indefinitely).

A usage record, written used(id,a,e,t,attrs) in PROV-ASN, has the following constituents:

  • id: an OPTIONAL identifier id identifying the usage record;
  • activity: an identifier a for an activity record, which represents the consuming activity;
  • entity: an identifier e for an entity record, which represents the entity that is consumed;
  • time: an OPTIONAL "usage time" t, the time at which the entity was used;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describe the modalities of usage of this entity by this activity.

In PROV-ASN, a usage record's text matches the usageRecord production of the grammar defined in this specification document.

usageRecord ::= used ( identifier , aIdentifier , eIdentifier , time optional-attribute-values )

A usage record's id is OPTIONAL, but comes in handy when annotating usage records (see Section Annotation Record) or when defining derivations.

The following usage records

  used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"])
  used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])

state that the activity, represented by the activity record identified by a1, consumed two entities, represented by entity records identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter in these records is application specific.

A usage record's id is OPTIONAL. It MUST be present when annotating usage records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).

A reference to a given entity record MAY appear in multiple usage records that share a given activity record identifier.

For the interpretation of a usage record, see generation-precedes-usage and usage-within-activity.

Activity-Agent Relation

Activity Association Record

The key purpose of agents in PROV-DM is to assign responsibility for activities. It is important to reflect that there are degrees of responsibility among agents; this is a major reason for distinguishing among all the agents that have some association with an activity and determining which ones are really the originators of the entity. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow, while it would matter a lot which researcher told the programmer to do so. Another example: a student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity; it may not matter which student published the web page, but it matters a lot that the department told the student to put up the web page. So there is some notion of responsibility that needs to be captured.

Examples of activity association include designing, participation, initiation and termination, timetabling or sponsoring.

Provenance reflects activities that have occurred. In some cases, those activities reflect the execution of a plan that was designed in advance to guide the execution. PROV-DM allows attaching a plan to an activity record, which represents what was intended to happen. The plan can be useful for various tasks, for example to validate the execution as represented in the provenance record, to manage expectation failures, or to provide explanations.

In the context of PROV-DM, a plan should be understood as the description of a set of actions or steps intended by one or more agents to achieve some goal. PROV-DM is not prescriptive about the nature of plans, their representation, the actions and steps they consist of, and their intended goals. Hence, for the purpose of this specification, a plan can be a workflow for a scientific experiment, a recipe for a cooking activity, or a list of instructions for a micro-processor execution. While PROV-DM does not specify the representations of plans, it allows for activities to be associated with plans. Furthermore, since plans may evolve over time, it may become necessary to track their provenance, and hence, plans are entities. An activity MAY be associated with multiple plans. This allows for descriptions of activities initially associated with a plan, which was changed, on the fly, as the activity progresses. Plans can be successfully executed or they can fail. We expect applications to exploit PROV-DM extensibility mechanisms to capture the rich nature of plans and associations between activities and plans.

Thus, PROV-DM offers two kinds of records. The first, introduced in this section, represents an association between an agent, a plan, and an activity; the second, introduced in Section Responsibility record, represents the fact that an agent was acting on behalf of another, in the context of an activity.

An activity association record, written wasAssociatedWith(id,a,ag2,pl,attrs) in PROV-ASN, has the following constituents:

  • id: an OPTIONAL identifier id identifying the activity association record;
  • activity: an identifier a for an activity record;
  • agent: an identifier ag2 for an agent record, which represents the agent associated with the activity;
  • plan: an OPTIONAL identifier pl for an entity record, which represents a plan adopted by the agent in the context of this activity;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describe the modalities of association of this activity with this agent.

In PROV-ASN, an activity association record's text matches the activityAssociationRecord production of the grammar defined in this specification document.

activityAssociationRecord ::= wasAssociatedWith ( identifier, aIdentifier, agIdentifier ,eIdentifier optional-attribute-values )
In the following example, a designer agent and an operator agent are asserted to be associated with an activity. The designer's goals are achieved by a workflow ex:wf.
activity(ex:a,[prov:type="workflow execution"])
agent(ex:ag1,[prov:type="operator"])
agent(ex:ag2,[prov:type="designer"])
wasAssociatedWith(ex:a,ex:ag1,[prov:role="loggedInUser", ex:how="webapp"])
wasAssociatedWith(ex:a,ex:ag2,ex:wf,[prov:role="designer", ex:context="project1"])
entity(ex:wf,[prov:type="prov:Plan"%% xsd:QName, ex:label="Workflow 1", 
              ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI])
Since the workflow ex:wf is itself an entity, its provenance can also be expressed in PROV-DM: it can be generated by some activity and derived from other entities, for instance.
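To illustrate that an activity MAY be associated with multiple plans, the following Python sketch collects the plans attached to an activity from a list of association records. The tuple layout (activity, agent, plan) and the function name plans_of are assumptions made for this example; they are not defined by PROV-DM.

```python
# Each tuple mirrors wasAssociatedWith(activity, agent, plan) from the
# example above; the plan component is OPTIONAL and may be None.
associations = [
    ("ex:a", "ex:ag1", None),
    ("ex:a", "ex:ag2", "ex:wf"),
]


def plans_of(activity, associations):
    """Return the plans attached to an activity, in order, without duplicates."""
    plans = []
    for act, _agent, plan in associations:
        if act == activity and plan is not None and plan not in plans:
            plans.append(plan)
    return plans


print(plans_of("ex:a", associations))
```

Because a plan is an entity, each identifier returned can in turn be the subject of generation or derivation records.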
The activity association record does not allow for a plan to be asserted without an agent. This seems over-restrictive. Discussed in the context of ISSUE-203.
Agents should not be inferred. WasAssociatedWith should also work with entities. This is ISSUE-206.

Start and End Records

A start record is a representation of an agent starting an activity. An end record is a representation of an agent ending an activity. Both relations are specialized forms of wasAssociatedWith. They contain attributes describing the modalities of starting or ending activities.

A start record, written wasStartedBy(id,a,ag,attrs) in PROV-ASN, contains:

  • id: an OPTIONAL identifier id identifying the start record;
  • activity: an identifier a denoting an activity record, representing the started activity;
  • agent: an identifier ag for an agent record, representing the starting agent;
  • attributes: an OPTIONAL set of attribute-value pairs attrs, describing modalities according to which the agent started the activity.

An end record, written wasEndedBy(id,a,ag,attrs) in PROV-ASN, contains:

  • id: an OPTIONAL identifier id identifying the end record;
  • activity: an identifier a denoting an activity record, representing the ended activity;
  • agent: an identifier ag for an agent record, representing the ending agent;
  • attributes: an OPTIONAL set of attribute-value pairs attrs, describing modalities according to which the agent ended the activity.

In PROV-ASN, the texts of start and end records match the startRecord and endRecord productions of the grammar defined in this specification document.

startRecord ::= wasStartedBy ( identifier, aIdentifier, agIdentifier optional-attribute-values )
endRecord ::= wasEndedBy ( identifier, aIdentifier, agIdentifier optional-attribute-values )

The following assertions

wasStartedBy(a,ag,[ex:mode="manual"])
wasEndedBy(a,ag,[ex:mode="manual"])

state that the activity, represented by the activity record denoted by a, was started and ended by an agent, represented by the record denoted by ag, in "manual" mode, an application-specific characterization of these relations.

Should we define start/end records as representations of activity start/end events? Should time be associated with these events rather than with activities? This would be similar to what we do for entities. This is ISSUE-207.

Entity-Entity or Agent-Agent Relation

Responsibility Record

To promote take-up, PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program is an agent of that activity and the person is also an agent of the activity, but we would also add that the mail software agent is running on the person's behalf. In the other example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university, and all those agents are responsible in some way for the activity to take place but we don't say explicitly who bears responsibility and to what degree.

We could also say that an agent can act on behalf of several other agents (a group of agents). This would also make it possible to indirectly reflect chains of responsibility. This also indirectly reflects control without requiring that control is explicitly indicated. In some contexts there will be a need to represent responsibility explicitly, for example to indicate legal responsibility, and that could be added as an extension to this core model. Similarly with control, since in particular contexts there might be a need to define specific aspects of control that various agents exert over a given activity.

Given an activity association record wasAssociatedWith(a,ag2,attrs), a responsibility record, written actedOnBehalfOf(id,ag2,ag1,a,attrs) in PROV-ASN, has the following constituents:

  • id: an OPTIONAL identifier id identifying the responsibility record;
  • subordinate: an identifier ag2 for an agent record, which represents an agent associated with an activity, acting on behalf of the responsible agent;
  • responsible: an identifier ag1 for an agent record, which represents the agent on behalf of which the subordinate agent ag2 acts;
  • activity: an OPTIONAL identifier a of an activity record for which the responsibility record holds;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describe the modalities of this relation.
responsibilityRecord ::= actedOnBehalfOf ( identifier, agIdentifier , agIdentifier , aIdentifier optional-attribute-values )
In the following example, a programmer agent, a researcher agent and a funder agent are asserted. The programmer and researcher are associated with a workflow activity. The programmer acts on behalf of the researcher (delegation), encoding the commands specified by the researcher; the researcher acts on behalf of the funder, who has a contractual agreement with the researcher. The terms 'delegation' and 'contract' used in this example are domain specific.
activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
agent(ag3,[prov:type="funder"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser"])
wasAssociatedWith(a,ag2)
actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"])
actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])
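Chains of responsibility, such as the one in the example above, can be recovered by following actedOnBehalfOf records from subordinate to responsible agent. The Python sketch below is illustrative: the triple layout and the function name responsibility_chain are assumptions, and only records naming the same activity are followed (records that omit the OPTIONAL activity are not treated specially).

```python
# (subordinate, responsible, activity) triples mirroring the two
# actedOnBehalfOf records of the example above.
responsibilities = [
    ("ag1", "ag2", "a"),
    ("ag2", "ag3", "a"),
]


def responsibility_chain(agent, activity, records):
    """List the agents reachable from `agent` by following actedOnBehalfOf."""
    chain = [agent]
    seen = {agent}          # guards against cycles in the records
    frontier = [agent]
    while frontier:
        current = frontier.pop(0)
        for subordinate, responsible, act in records:
            if subordinate == current and act == activity and responsible not in seen:
                seen.add(responsible)
                chain.append(responsible)
                frontier.append(responsible)
    return chain


print(responsibility_chain("ag1", "a", responsibilities))
```

The result lists the programmer, the researcher and the funder in that order. As noted above, it does not say who bears responsibility, or to what degree.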

Derivation Record

In PROV-DM, a derivation record is a representation that some entity is transformed from, created from, or affected by another entity in the world.

Examples of derivation include the transformation of a canvas into a painting, the transportation of a person from London to New York, the transformation of a relational table into a linked data set, and the melting of ice into water.

According to Section Conceptualization, for an entity to be transformed from, created from, or affected by another in some way, there must be some underpinning activities performing the necessary actions resulting in such a derivation. However, asserters may not assert or have knowledge of these activities and associated details: they may not assert or know their number, they may not assert or know their identity, they may not assert or know the attributes characterizing how the relevant entities are used or generated. To accommodate the varying circumstances of the various asserters, PROV-DM allows more or less precise records of derivation to be asserted. Hence, PROV-DM uses the terms precise and imprecise to characterize the different kinds of derivation record. We note that the derivation itself is exact (i.e., deterministic, non-probabilistic), but it is its description, expressed in a derivation record, that may be imprecise.

The lack of precision may come from two sources:

  • the number of activities that underpin a derivation is not asserted or known, or
  • any of the other details that are involved in the derivation is not asserted or known; these include activity identities, generation and usage records, and their attributes.

Hence, we can consider two axes. An activity number axis has values single, multiple, and unknown, respectively representing the cases where one activity is known to have occurred, more than one activity is known to have occurred, or an unknown number of activities have occurred. Likewise, we can consider another axis to cover other details (identities, generation and usage records, attributes), with values asserted and not asserted. We can then form a matrix of possible derivations. Out of the six possibilities, PROV-DM offers three forms of derivation record to cater for five of them, while the remaining one is not meaningful. The following table summarises names for the three kinds of derivation, which we then explain.

PROV-DM Derivation Type Summary

                                   other details axis
activity number axis    asserted                        not asserted
single                  precise-1 derivation record     imprecise-1 derivation record
multiple                imprecise-n derivation record   imprecise-n derivation record
unknown                 (not meaningful)                imprecise-n derivation record

  • The asserter asserts that derivation is due to exactly one activity, and all the details are asserted. We call this a precise-1 derivation record.
  • The asserter asserts that derivation is due to exactly one activity, but other details, whether known or unknown, are not asserted. We call this an imprecise-1 derivation record.
  • The following cases are captured by an imprecise-n derivation record.
    • The asserter knows that multiple activities are involved or does not know the number of activities involved in the derivation, and other details are not asserted.
    • The asserter knows that multiple activities are involved in the derivation, and all their details are asserted. In this case, these activities are connected by means of generated and used intermediary entities. Despite all activities and details being known, there is no guarantee that any of these activities plays an active role in the derivation; hence, this case is also regarded as imprecise. Instead, precise derivations need to be expressed between these intermediary entities.

We note that the last theoretical case cannot occur, since asserting the details of an unknown number of activities is a contradiction.
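
The matrix and the cases listed above can be read as a small lookup. The following Python sketch is illustrative only: the function name classify_derivation and the string values used for the two axes are assumptions made for this example and are not defined by PROV-DM.

```python
from typing import Optional


def classify_derivation(activity_number: str, details_asserted: bool) -> Optional[str]:
    """Map one cell of the derivation matrix to a kind of derivation record.

    activity_number is one of "single", "multiple" or "unknown".
    Returns None for the single cell that is not meaningful.
    """
    if activity_number not in ("single", "multiple", "unknown"):
        raise ValueError("unexpected activity number: " + activity_number)
    if activity_number == "single":
        return "precise-1" if details_asserted else "imprecise-1"
    if activity_number == "unknown" and details_asserted:
        # The details of an unknown number of activities cannot be asserted.
        return None
    # Multiple activities (details asserted or not), or an unknown number
    # of activities without details, are all imprecise-n.
    return "imprecise-n"
```

For instance, classify_derivation("multiple", True) yields "imprecise-n", reflecting that a derivation spanning several fully described activities is still regarded as imprecise.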

In order to represent the number of activities in a derivation, we introduce a PROV-DM attribute steps, which can take two possible values: single and any. When prov:steps="single", derivation is due to one activity; when prov:steps="any", the number of activities is multiple or not known.

The three kinds of derivation records are successively introduced. Making use of the attribute steps, we can distinguish the various derivation types.

A precise-1 derivation record, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-ASN, contains:

  • id: an OPTIONAL identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • activity: an identifier a of an activity record, which is a representation of the activity using and generating the above entities;
  • generation: an identifier g2 of the generation record pertaining to e2 and a;
  • usage: an identifier u1 of the usage record pertaining to e1 and a;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describe the modalities of this derivation, optionally including the attribute-value pair prov:steps="single".

It is OPTIONAL to include the attribute prov:steps in a precise-1 derivation since the record already refers to the one and only one activity underpinning the derivation.

An imprecise-1 derivation record, written wasDerivedFrom(id, e2,e1, t, attrs) in PROV-ASN, contains:

  • id: an OPTIONAL identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • time: an OPTIONAL "generation time" t, the time at which the entity denoted by e2 was created;
  • attributes: a set of attribute-value pairs attrs that describe the modalities of this derivation; it MUST include the attribute-value pair prov:steps="single".

An imprecise-1 derivation MUST include the attribute prov:steps, since it is the only means to distinguish this record from an imprecise-n derivation record.

An imprecise-n derivation record, written wasDerivedFrom(id, e2, e1, t, attrs) in PROV-ASN, contains:

  • id: an OPTIONAL identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • time: an OPTIONAL "generation time" t, the time at which the entity denoted by e2 was created;
  • attributes: an OPTIONAL set of attribute-value pairs attrs that describe the modalities of this derivation; it optionally includes the attribute-value pair prov:steps="any".

It is OPTIONAL to include the attribute prov:steps in an imprecise-n derivation record. It defaults to prov:steps="any".

None of the three kinds of derivation is defined to be transitive. Domain-specific specializations of these derivations may be defined in such a way that the transitivity property holds.

In PROV-ASN, a derivation record's text matches the derivationRecord production of the grammar defined in this specification document.

derivationRecord ::= wasDerivedFrom ( identifier, eIdentifier , eIdentifier , aIdentifier , gIdentifier , uIdentifier optional-attribute-values )
| wasDerivedFrom ( identifier, eIdentifier , eIdentifier , time optional-attribute-values )

The first clause of the alternative, where the activity, generation, and usage record identifiers are present, formalizes a precise-1 derivation record. The second clause of the alternative, with optional time, formalizes imprecise records. The distinction between imprecise-1 and imprecise-n is made by the attribute prov:steps.
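
As a rough illustration of how the two grammar clauses and the prov:steps attribute together determine the kind of a derivation record, consider the following Python sketch. The Derivation class and the derivation_kind function are hypothetical names introduced for this example; representing attributes as a dictionary is a simplifying assumption, not something mandated by PROV-DM.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Derivation:
    """Hypothetical in-memory form of a wasDerivedFrom record."""
    generated: str                      # e2
    used: str                           # e1
    activity: Optional[str] = None      # a  (first clause only)
    generation: Optional[str] = None    # g2 (first clause only)
    usage: Optional[str] = None         # u1 (first clause only)
    time: Optional[str] = None          # t  (second clause only)
    attrs: Dict[str, str] = field(default_factory=dict)


def derivation_kind(d: Derivation) -> str:
    """Return "precise-1", "imprecise-1" or "imprecise-n"."""
    if None not in (d.activity, d.generation, d.usage):
        # First clause: activity, generation and usage identifiers are present.
        return "precise-1"
    # Second clause: prov:steps decides, and defaults to "any" when absent.
    steps = d.attrs.get("prov:steps", "any")
    if steps not in ("single", "any"):
        raise ValueError("prov:steps must be 'single' or 'any'")
    return "imprecise-1" if steps == "single" else "imprecise-n"
```

Under these assumptions, a record holding only e2 and e1 is classified as imprecise-n, which mirrors the default value of prov:steps.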

The following assertions state the existence of derivations.

wasDerivedFrom(e5,e3,a4,g2,u2)
wasDerivedFrom(e5,e3,a4,g2,u2,[prov:steps="single"])

wasDerivedFrom(e3,e2,[prov:steps="single"])

wasDerivedFrom(e2,e1,[])
wasDerivedFrom(e2,e1,[prov:steps="any"])

wasDerivedFrom(e2,e1,2012-01-18T16:00:00, [prov:steps="any"])

The first two are precise-1 derivation records expressing that the activity denoted by a4, by using the entity denoted by e3 according to usage record u2, derived the entity denoted by e5 and generated it according to generation record g2. The third record is an imprecise-1 derivation, which states a similar derivation of e3 from e2, but leaves the activity record and associated attributes implicit. The fourth and fifth records are imprecise-n derivation records between e2 and e1, which provide no information as to the number and identity of the activities underpinning the derivation. The sixth derivation record extends the fifth with the generation time of e2.

A precise-1 derivation record is richer than an imprecise-1 derivation record, which is itself more informative than an imprecise-n derivation record. Hence, the following implications hold.

Given two entity records denoted by e1 and e2, if the assertion wasDerivedFrom(e2, e1, a, g2, u1, attrs) holds for some generation record identified by g2, and usage record identified by u1, then wasDerivedFrom(e2,e1,[prov:steps="single"] ∪ attrs) also holds.
Given two entity records denoted by e1 and e2, if the assertion wasDerivedFrom(e2, e1, [prov:steps="single"] ∪ attrs) holds, then wasDerivedFrom(e2,e1,attrs) also holds.
For the interpretation of a derivation record, see derivation-usage-generation-ordering and derivation-generation-generation-ordering.
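
The two implications above can be sketched as a function that lists the weaker records implied by a given derivation record. This is a minimal Python illustration; the tuple layout and the name implied_records are assumptions made for the example only.

```python
from typing import Dict, List, Optional, Tuple

# (generated, used, activity, generation, usage, attrs); the three middle
# fields are None in an imprecise record. This layout is hypothetical.
Record = Tuple[str, str, Optional[str], Optional[str], Optional[str], Dict[str, str]]


def implied_records(rec: Record) -> List[Record]:
    """Return the less precise derivation records implied by rec."""
    e2, e1, a, g, u, attrs = rec
    # Attributes without prov:steps, shared by all weaker records.
    rest = {k: v for k, v in attrs.items() if k != "prov:steps"}
    implied: List[Record] = []
    if None not in (a, g, u):
        # precise-1 implies imprecise-1, which in turn implies imprecise-n.
        implied.append((e2, e1, None, None, None, {**rest, "prov:steps": "single"}))
        implied.append((e2, e1, None, None, None, rest))
    elif attrs.get("prov:steps") == "single":
        # imprecise-1 implies imprecise-n.
        implied.append((e2, e1, None, None, None, rest))
    return implied
```

An imprecise-n record implies nothing further under these two rules, so the function returns an empty list for it.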

The imprecise-1 derivation has the same meaning as the precise-1 derivation, except that an activity is known to exist, though it does not need to be asserted. This is formalized by the following inference rule, referred to as activity introduction:

If wasDerivedFrom(e2,e1) holds, then there exist an activity record identified by a, a usage record identified by u, and a generation record identified by g such that:
activity(a,aAttrs)
wasGeneratedBy(g,e2,a,gAttrs)
used(u,a,e1,uAttrs)
for sets of attribute-value pairs gAttrs, uAttrs, and aAttrs.
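
The activity introduction rule can be pictured as minting fresh identifiers for the records whose existence is inferred. The Python sketch below is one possible reading: the function name, the "_a", "_g" and "_u" identifier scheme, and the choice of empty attribute sets are all assumptions, since the rule only states that such records and attribute sets exist.

```python
import itertools
from typing import Iterator, List

# Hypothetical source of fresh numbers for the inferred identifiers.
_fresh = itertools.count(1)


def introduce_activity(e2: str, e1: str, fresh: Iterator[int] = _fresh) -> List[str]:
    """Return PROV-ASN-like text for the inferred activity, generation and usage records."""
    n = next(fresh)
    a, g, u = f"_a{n}", f"_g{n}", f"_u{n}"
    return [
        f"activity({a},[])",
        f"wasGeneratedBy({g},{e2},{a},[])",
        f"used({u},{a},{e1},[])",
    ]
```

The inferred identifiers carry no meaning beyond being distinct from one another; an application would need its own policy to keep them from clashing with asserted identifiers.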

Note that inferring derivation from usage and generation does not hold in general. Indeed, when a generation wasGeneratedBy(g, e2, a, attrs2) precedes used(u, a, e1, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2,e1), since e2 cannot possibly be derived from e1, given that the creation of e2 precedes the use of e1.

In PROV-DM, the effective placeholder for an entity generation time is the generation record. The presence of time information in imprecise derivation records is merely a convenience notation for a timeless derivation record and a generation record with this generation time information.

If wasDerivedFrom(e2,e1,t,attrs) holds, then the following records also hold: wasDerivedFrom(e2,e1,attrs) and wasGeneratedBy(e2,t).
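
The expansion of a timed derivation record into its two constituent records can be sketched as follows. This Python fragment is only an illustration; the tuple representation and the function name expand_timed_derivation are assumptions.

```python
from typing import Dict, Tuple


def expand_timed_derivation(e2: str, e1: str, t: str, attrs: Dict[str, str]) -> Tuple[tuple, tuple]:
    """Split wasDerivedFrom(e2,e1,t,attrs) into a timeless derivation
    record and a generation record carrying the time t."""
    derivation = ("wasDerivedFrom", e2, e1, dict(attrs))
    generation = ("wasGeneratedBy", e2, t)
    return derivation, generation
```

The time is kept as an opaque string here; it would be an xsd:dateTime value in an actual record.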

See derivation-use for a structural constraint on derivation records.
Should derivation have a time? Which time? This is ISSUE-43. This is now addressed in this text. Optional time in derivation is generation time. See also ISSUE-205.
Several points were raised about the attribute steps. Its name, its default value ISSUE-180. ISSUE-179.
Emphasize the notion of 'affected by' ISSUE-133.

Alternate and Specialization Records

This section is currently under revision and in flux.
The purpose of the record types defined in this section is to establish a relationship between two entities, which asserts that they provide a different characterization of the same thing. Consider for example three entities:
  • e1 denoting "Bob, the holder of facebook account ABC",
  • e2 denoting "Bob, the holder of twitter account XYZ",
  • e3 denoting "Bob, the person".
One may make several assertions to establish that these entities refer to the same real-world thing, Bob, either in different contexts or at different levels of abstraction. For example:
  1. Entity denoted by e1 provides a more concrete characterization of Bob than e3 does;
  2. Entity denoted by e2 provides a more concrete characterization of Bob than e3 does;
  3. The entities denoted by e1 and e2 provide two different characterizations of the same thing, i.e., Bob.
Two relations are introduced to express these assertions:
  • e2 is a specialization of e1, written specializationOf(e2,e1), captures the intent of assertions (1) and (2);
  • e2 is an alternative characterization of e1, written alternateOf(e2,e1), captures the intent of assertion (3).
In order to further convey the intended meaning, the following properties are associated to these two relations.
  • specializationOf(e2,e1) is transitive: specializationOf(e3,e2) and specializationOf(e2,e1) implies specializationOf(e3,e1).
  • specializationOf(e2,e1) is anti-symmetric: specializationOf(e2,e1) implies that specializationOf(e1,e2) does not hold.
  • alternateOf(e2,e1) is symmetric: alternateOf(e2,e1) implies alternateOf(e1,e2).
There are proposals to make alternateOf a transitive property. This is still under discussion and the default is for alternateOf not to be transitive, and this is what the current text reflects.
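
The properties listed above can be checked mechanically over a set of asserted pairs. The following Python sketch computes the transitive closure of specializationOf, tests anti-symmetry on that closure, and adds the symmetric counterparts of alternateOf pairs (without making alternateOf transitive, in line with the current text). The function names and the pair representation are assumptions made for this example.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (e2, e1) as in specializationOf(e2,e1) or alternateOf(e2,e1)


def specialization_closure(pairs: Set[Pair]) -> Set[Pair]:
    """Transitive closure of a set of specializationOf pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def violates_antisymmetry(pairs: Set[Pair]) -> bool:
    """True if some specializationOf(x,y) and specializationOf(y,x) both follow."""
    closure = specialization_closure(pairs)
    return any((b, a) in closure for (a, b) in closure)


def alternate_closure(pairs: Set[Pair]) -> Set[Pair]:
    """Symmetric (but not transitive) closure of a set of alternateOf pairs."""
    return set(pairs) | {(b, a) for (a, b) in pairs}
```

With the Bob example, the pairs ("e1", "e3") and ("e2", "e3") pass the anti-symmetry check, and alternateOf(e1,e2) yields alternateOf(e2,e1).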

An alternate record, written alternateOf(alt1, alt2, attrs) in PROV-ASN, has the following constituents:

  • first alternate: an identifier alt1 of the first of the two entities
  • second alternate: an identifier alt2 of the second of the two entities
  • attrs: an OPTIONAL set attrs of attribute-value pairs to further describe this record.

A specialization record, written specializationOf(sub, super, attrs) in PROV-ASN, has the following constituents:

  • specialised entity: an identifier sub of the specialised entity
  • general entity: an identifier super of the entity that is being specialised
  • attrs: an OPTIONAL set attrs of attribute-value pairs to further describe this record.

An entity record identifier can optionally be accompanied by an account identifier. When this is the case, it becomes possible to use the alternateOf relation to link two entity record identifiers that appear in different accounts. (In particular, the entity identifiers in two different accounts are allowed to be the same.) When account identifiers are not available, the linking of entity records through alternateOf can only take place within the scope of a single account.

In PROV-ASN, an alternate record's text matches the alternateRecord production of the grammar defined in this specification document.

alternateRecord ::= alternateOf ( eIdentifier , eIdentifier , optional-attribute-values )
| alternateOf ( eIdentifier , accIdentifier , eIdentifier , accIdentifier , optional-attribute-values )

In PROV-ASN, a specialization record's text matches the specializationRecord production of the grammar defined in this specification document.

specializationRecord ::= specializationOf ( eIdentifier , eIdentifier , optional-attribute-values )
| specializationOf ( eIdentifier , accIdentifier , eIdentifier , accIdentifier , optional-attribute-values )
A discussion on alternative definition of these relations has not reached a satisfactory conclusion yet. This is ISSUE-29. Also ISSUE-96.

Annotation Record

An annotation record establishes a link between an identifiable PROV-DM record and a note record referred to by its identifier. Multiple note records can be associated with a given PROV-DM record; symmetrically, multiple PROV-DM records can be associated with a given note record. Since note records have identifiers, they can also be annotated. The annotation mechanism (with note record and the annotation record) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).

An annotation record, written hasAnnotation(r,n,attrs) in PROV-ASN, has the following constituents:

  • record: an identifier r of the record being annotated;
  • note: an identifier n of a note record;
  • attributes: an OPTIONAL set attrs of attribute-value pairs to further describe this record.

In PROV-ASN, an annotation record's text matches the annotationRecord production of the grammar defined in this specification document.

annotationRecord ::= hasAnnotation ( identifier , nIdentifier optional-attribute-values )

The interpretation of notes is application-specific. See Section Note for a discussion of the difference between note attributes and other records attributes. We also note the present tense in this term to indicate that it may not denote something in the past.

The following records

entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
activity(a,t1,t2)
used(u1,a,e1,[ex:file="stdin"])
wasGeneratedBy(e2, a, [ex:file="stdout"])

note(n1,[ex:icon="doc.png"])
hasAnnotation(e1,n1)
hasAnnotation(e2,n1)

note(n2,[ex:style="dotted"])
hasAnnotation(u1,n2)

assert the existence of two documents in the world (attribute-value pair: prov:type="document") identified by e1 and e2, and annotate these records with a note indicating that the icon (an application-specific way of rendering provenance) is doc.png. They also assert an activity, its usage of the first entity, and its generation of the second entity. The usage record is annotated with a style (an application-specific way of rendering this edge graphically). To be able to express this annotation, the usage record was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).

Bundle

In this section, two constructs are introduced to group PROV-DM records. The first one, the account record, is itself a record, whereas the second one, the record container, is not.

Account Record

It is common for multiple provenance records to co-exist. For instance, when emailing a file, there could be a provenance record kept by the mail client, and another by the mail server. Such provenance records may provide different explanations about something happening in the world, because they are created by different parties or observed by different witnesses. A given party could also create multiple provenance records about an execution, to capture different levels of details, targeted at different end-users: the programmer of an experiment may be interested in a detailed log of execution, while the scientists may focus more on the scientific-level description. Given that multiple provenance records can co-exist, it is important to know who asserted these records.

In PROV-DM, an account record is a wrapper of records with the following purposes:

  • It is the mechanism by which attribution of provenance can be asserted; it allows asserters to bundle up their assertions, and assert suitable attribution;
  • It provides a scoping mechanism for the uniqueness of identifiers (of elements and relations of PROV-DM);
  • It provides a scoping mechanism for structural constraints (such as generation-uniqueness and derivation-use discussed in Section structural-constraints).

An account record, written account(id, assertIRI, recs, attrs) in PROV-ASN, contains:

  • id: an identifier id that identifies this account globally;
  • asserter: an IRI, denoted by assertIRI, to identify an asserter; such IRI has no specific interpretation in the context of PROV-DM;
  • records: a set recs of provenance records;
  • attributes: an OPTIONAL set attrs of attribute-value pairs to further describe this record.

In PROV-ASN, an account record's text matches the accountRecord production of the grammar defined in this specification document.

accountRecord ::= account ( identifier , asserter , record optional-attribute-values )
Currently, the non-terminal asserter is defined as IRI and its interpretation is outside PROV-DM. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM assertions. The editors seek input on how to resolve this issue. This is ISSUE-217.

The following account record

account(ex:acc0,
        http://example.org/asserter, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          ...
          wasDerivedFrom(e2,e1)
          ...
          activity(a0,t,,[prov:type="createFile"])
          ...
          wasGeneratedBy(e0,a0)     
          ...
          wasAssociatedWith(a4, ag5, [prov:role="communicator"])  )

contains the set of provenance records of section example-prov-asn-encoding, is asserted by agent http://example.org/asserter, and is identified by identifier ex:acc0.

An identifier in a record within the scope of an account is intended to denote a single record. However, nothing prevents an asserter from asserting an account containing, for example, multiple entity records with the same identifier but different attribute-values. In that case, they should be understood as a single entity record with this identifier and the union of all attribute-values, as formalized in identifiable-record-in-account.

Given an entity record identifier e, two sets of attribute-values denoted by av1 and av2, two entity records entity(e,av1) and entity(e,av2) occurring in an account are equivalent to the entity record entity(e,av) where av is the set of attribute-value pairs formed by the union of av1 and av2.

This constraint similarly applies to all other types of records. As a result, the identifier that occurs in a record is unique and acts as a local identifier for that record in that account.
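
The constraint identifiable-record-in-account amounts to grouping records by identifier and taking the union of their attribute-value pairs. A minimal Python sketch follows; the function name merge_records and the representation of a record as an identifier paired with a list of attribute-value pairs are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

AttrValue = Tuple[str, object]


def merge_records(records: Iterable[Tuple[str, Iterable[AttrValue]]]) -> Dict[str, Set[AttrValue]]:
    """Merge records sharing an identifier within one account.

    Returns a mapping from identifier to the union of all
    attribute-value pairs asserted for that identifier.
    """
    merged: Dict[str, Set[AttrValue]] = defaultdict(set)
    for identifier, pairs in records:
        merged[identifier].update(pairs)
    return dict(merged)
```

Because the union is a set of pairs, the same attribute may appear with several distinct values in the merged record; whether that is acceptable depends on the attribute's declaration in its namespace.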

Whilst constraint identifiable-record-in-account specifies how to understand multiple entity records with the same identifier within a given account, it does not guarantee that the entity record formed with the union of all attribute-value pairs satisfies the attribute occurrence validity property, as illustrated by the following example.

In the following account record, we find two entity records with the same identifier e.

account(ex:acc1,
        http://example.org/id,
          entity(e,[prov:type="person", ex:age=20])
          entity(e,[prov:type="person", ex:weight=50, ex:age=30])
          ...)

Application of identifiable-record-in-account results in an entity record containing the attribute-value pairs age=20, weight=50, and age=30. The namespace referred to by prefix ex declares the number of occurrences that are permitted for each attribute. The resulting entity record may or may not satisfy the attribute occurrence validity, depending on this namespace. For instance, if the namespace referred to by ex declares that age must have at most one occurrence, then the resulting entity record does not satisfy the attribute occurrence validity property. This document does not specify how to handle such an entity record.
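
The check described in this example can be sketched in Python as follows. The max_occurrences dictionary is a hypothetical stand-in for what a namespace declares about its attributes; PROV-DM does not define such a structure, and only an upper bound on occurrences is modelled here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def satisfies_occurrence_validity(
    pairs: Iterable[Tuple[str, object]],
    max_occurrences: Dict[str, int],
) -> bool:
    """Check that no attribute occurs more often than its declared maximum.

    Attributes absent from max_occurrences are treated as unconstrained.
    """
    counts = Counter(attr for attr, _ in pairs)
    return all(counts[attr] <= limit for attr, limit in max_occurrences.items())
```

Applied to the merged record above with a declared maximum of one occurrence for ex:age, the function returns False; what an application should then do with the record is left open by this document.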

Account records can be nested since an account record can occur among the records being wrapped by another account.

Account records constitute a scope for identifier uniqueness. Since accounts can be nested, scopes can also be nested; thus, the requirement on uniqueness of identifiers should be understood in the context of such nested scopes. When a record with an identifier occurs directly within an account, then its identifier denotes this record in the scope of this account, except in sub-accounts where records with the same identifier occur.

The following account record is inspired by section example-prov-asn-encoding. This account, identified by ex:acc3, declares an entity record with identifier e0, which is referred to in the nested account ex:acc4. Identifier e0 uniquely identifies a record in account ex:acc3, including subaccount ex:acc4.

account(ex:acc3,
        http://example.org/asserter1, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          activity(a0,t,,[prov:type="createFile"])
          wasGeneratedBy(e0,a0,[])  
          account(ex:acc4,
                  http://example.org/asserter2,
                    entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="" ])
                    activity(a0,t,,[prov:type="copyFile"])
                    wasGeneratedBy(e1,a0,[ex:fct="create"])
                    specializationOf(e1,e0)))

In contrast, an activity record identified by a0 occurs in each of the two accounts. Each activity record is therefore asserted in a separate scope, and the two records may represent different activities in the world.
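
The nested scoping of identifiers behaves much like lexical scoping in programming languages. The Python sketch below models it with collections.ChainMap, which looks a key up in the innermost mapping first and then in the enclosing ones. The dictionaries are a simplified, hypothetical stand-in for the two accounts of the example, with abbreviated record text.

```python
from collections import ChainMap

# Records declared directly in each account, keyed by identifier.
acc3_records = {
    "e0": 'entity(e0, [prov:type="File"])',
    "a0": 'activity(a0,t,,[prov:type="createFile"])',
}
acc4_records = {
    "e1": 'entity(e1, [prov:type="File"])',
    "a0": 'activity(a0,t,,[prov:type="copyFile"])',
}

# ex:acc4 is nested in ex:acc3, so its scope extends the scope of ex:acc3.
scope_acc3 = ChainMap(acc3_records)
scope_acc4 = scope_acc3.new_child(acc4_records)


def resolve(scope: ChainMap, identifier: str) -> str:
    """Return the record that identifier denotes in the given scope."""
    return scope[identifier]
```

Here resolve(scope_acc4, "e0") finds the entity record declared in ex:acc3, whereas resolve(scope_acc4, "a0") finds the copyFile activity declared in ex:acc4 rather than the createFile activity of the enclosing account.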

The identifier of an account record is expected to be globally unique, whereas identifiers for other records are expected to be unique within the scope of the account in which their record occurs.

The account record is the hook by which further meta information can be expressed about provenance, such as asserter, time of creation, signatures. The annotation mechanism can be used for this purpose, but how general meta-information is expressed is beyond the scope of this specification, except for asserters.

See Section structural-constraints for a structural constraint on account records.

Record Container

A record container is a house-keeping construct of PROV-DM, also capable of bundling PROV-DM records. A record container is the root of a provenance record and can be exploited to package up PROV-DM records in response to a request for the provenance of something ([[!PROV-AQ]]). Given that a record container is the root of a provenance record, it is not defined as a PROV-DM record (production record), since otherwise it could appear arbitrarily nested inside accounts.

A record container, written container decls recs endContainer in PROV-ASN, contains:

  • namespaceDeclarations: a set decls of namespace declarations, declaring namespaces and associated prefixes, which can be used in attributes and identifiers occurring inside recs;
  • records: a non-empty set of records recs.

All the records in recs are implicitly wrapped in a default account, scoping all the record identifiers they declare directly, and constituting a top-level account in the hierarchy of accounts. Consequently, every provenance record is always expressed in the context of an account, either explicitly in an asserted account, or implicitly in a container's default account.

In PROV-ASN, a record container's text matches the recordContainer production of the grammar defined in this specification document.

recordContainer ::= container namespaceDeclarations record endContainer

The following container contains records related to the provenance of entity e2.

container
  prefix ex: http://example.org/,
  entity(e2, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", 
             ex:content="There was a lot of crime in London last month."])
  activity(a1, 2011-11-16T16:05:00,,[prov:type="edit"])
  wasGeneratedBy(e2, a1, [ex:fct="save"])     
  wasAssociatedWith(a1, ag2, [prov:role="author"])
  agent(ag2, [ prov:type="prov:Person" %% xsd:QName, ex:name="Bob" ])
endContainer

This container could for instance be returned by querying a provenance store for the provenance of entity e2 [[!PROV-AQ]]. All these assertions are implicitly wrapped in a default account. In the absence of an explicit account, such provenance records remain unattributed.

The following container

container
  prefix ex: http://example.org/,

  account(ex:acc1,http://example.org/asserter1,...)
  account(ex:acc2,http://example.org/asserter1,...)
endContainer

illustrates how two accounts with identifiers ex:acc1 and ex:acc2 can be returned in a PROV-ASN serialization of the provenance of something.

Clarify what records are. This is ISSUE-208.

Further Terms in Records

This section introduces further terms used in PROV-DM records.

Attribute

An attribute is a qualified name. A qualified name is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name. The namespace is denoted by an IRI [[!IRI]].

PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [[!RDF-SPARQL-QUERY]], Section 4.1.1).

A qualified name's prefix is OPTIONAL. If a prefix occurs in a qualified name, it refers to a namespace declared in the record container. In the absence of prefix, the qualified name refers to the default namespace declared in the container.
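
The mapping from a qualified name to an IRI by concatenation can be sketched as follows. This Python fragment is illustrative: the function name expand_qualified_name, the dictionary of prefix bindings, and the decision to raise an error for undeclared prefixes are assumptions, not requirements of this specification.

```python
from typing import Dict, Optional


def expand_qualified_name(
    qname: str,
    prefixes: Dict[str, str],
    default: Optional[str] = None,
) -> str:
    """Map a qualified name to an IRI by concatenating the namespace IRI
    and the local part. Unprefixed names use the default namespace."""
    prefix, sep, local = qname.partition(":")
    if sep:
        if prefix not in prefixes:
            raise KeyError("undeclared prefix: " + prefix)
        return prefixes[prefix] + local
    if default is None:
        raise KeyError("no default namespace declared")
    return default + qname
```

With the prefix ex bound to http://example.org/, the qualified name ex:acc0 expands to http://example.org/acc0.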

In PROV-ASN, an attribute's text matches the attribute production of the grammar defined in this specification document.

attribute ::= qualifiedName
qualifiedName  ::= prefixedName | unprefixedName
prefixedName  ::= prefix : localPart
unprefixedName  ::= localPart
prefix  ::= a name without colon compatible with the NC_NAME production [[!XML-NAMES]]
localPart  ::= a name without colon compatible with the NC_NAME production [[!XML-NAMES]]
Note that the XML NC_NAME production does not allow local identifiers to start with a number. Instead, should we use the productions used in SPARQL or TURTLE?

For each attribute in a record, its namespace also declares the number of occurrences it may have in a list of attributes. The property attribute occurrence validity holds for a record if the actual number of occurrences of each attribute in this record is compatible with this attribute's declaration in its namespace. How to handle records that do not satisfy the attribute occurrence validity property is beyond the scope of this specification.

From this specification's viewpoint, the interpretation of an attribute declared in a namespace other than prov-dm is out of scope.

The PROV data model introduces a fixed set of attributes in the PROV-DM namespace:

  • The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, activity association, start, end record. The attribute prov:role is allowed to occur multiple times in such records. The value associated with a prov:role attribute MUST be conformant with Literal.

    The following start record describes the role of the agent identified by ag in this start relation with activity a.

       wasStartedBy(a,ag, [prov:role="program-operator"])
    
  • The attribute prov:type provides further typing information for the element or relation asserted in the record. PROV-DM liberally defines a type as a category of things having common characteristics. PROV-DM is agnostic about the representation of types, and only states that the value associated with a prov:type attribute MUST be conformant with Literal. The attribute prov:type is allowed to occur multiple times in a record.

    The following record declares an agent of type software agent

       agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])
    
  • The attribute prov:steps defines the level of precision associated with a derivation record. The value associated with a prov:steps attribute MUST be "single" or "any". The attribute prov:steps occurs at most once in a derivation record. A derivation record without attribute prov:steps is considered to be equivalent to the same record extended with an extra attribute prov:steps and associated value "any".

    The following record declares an imprecise-1 derivation, which is known to involve one activity, though its identity, usage details of ex:e1, and generation details of ex:e2 are not asserted.

       wasDerivedFrom(ex:e2, ex:e1, [prov:steps="single"])
    
  • The attribute prov:label provides a human-readable representation of a PROV-DM element or relation.
    This is ISSUE-219.

Identifier

An identifier is a qualified name. A qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [[!RDF-SPARQL-QUERY]], Section 4.1.1).

identifier ::= qualifiedName
eIdentifier ::= identifier (intended to denote an entity record)
aIdentifier ::= identifier (intended to denote an activity record)
agIdentifier ::= identifier (intended to denote an agent record)
gIdentifier ::= identifier (intended to denote a generation record)
uIdentifier ::= identifier (intended to denote a usage record)
nIdentifier ::= identifier (intended to denote a note record)
accIdentifier ::= identifier (intended to denote an account record)

Literal

A PROV-DM Literal represents a data value such as a particular string or number. A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM.

In PROV-ASN, a Literal's text matches the Literal production of the grammar defined in this specification document.

Literal  ::= typedLiteral | convenienceNotation
typedLiteral ::= quotedString %% datatype
datatype ::= qualifiedName
convenienceNotation  ::= stringLiteral | intLiteral
stringLiteral ::= quotedString
quotedString ::= a finite sequence of characters in which " (U+22) and \ (U+5C) occur only in pairs of the form \" (U+5C, U+22) and \\ (U+5C, U+5C), enclosed in a pair of " (U+22) characters
intLiteral ::= a finite-length sequence of decimal digits (#x30-#x39) with an optional leading negative sign (-)

The non-terminals stringLiteral and intLiteral are syntactic sugar for quoted strings with datatype xsd:string and xsd:int, respectively.

In particular, a PROV-DM Literal may be an IRI-typed string (with datatype xsd:anyURI); such IRI has no specific interpretation in the context of PROV-DM.

The following examples respectively are the string "abc" (expressed using the convenience notation), the string "abc", the integer number 1, the integer number 1 (expressed using the convenience notation) and the IRI "http://example.org/foo".

  "abc"
  "abc" %% xsd:string
  "1" %% xsd:int
  1
  "http://example.org/foo" %% xsd:anyURI
The following example shows a literal of type xsd:QName (see QName [[!XMLSCHEMA-2]]). The prefix ex MUST be bound to a namespace declared in the record container.
  "ex:value" %% xsd:QName
Should we define structural equivalence of literals as in OWL2? [[!OWL2-SYNTAX]] (see section Literals).
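
The Literal grammar and its convenience notation can be exercised with a small normalizer that rewrites every literal into its typed form. The Python sketch below is an approximation: the function name normalize_literal is hypothetical, whitespace around %% is assumed to be optional, and the datatype is matched loosely as any run of non-space characters rather than as a full qualified name.

```python
import re
from typing import Tuple

# A quoted string in which " and \ occur only as \" and \\.
_QUOTED = r'"(?:[^"\\]|\\["\\])*"'
_TYPED = re.compile(r'^(' + _QUOTED + r')\s*%%\s*(\S+)$')
_STRING = re.compile(r'^' + _QUOTED + r'$')
_INT = re.compile(r'^-?[0-9]+$')


def normalize_literal(text: str) -> Tuple[str, str]:
    """Return (quoted string, datatype) for a literal in PROV-ASN notation."""
    text = text.strip()
    typed = _TYPED.match(text)
    if typed:
        return typed.group(1), typed.group(2)
    if _STRING.match(text):
        # stringLiteral is sugar for a quoted string typed xsd:string.
        return text, "xsd:string"
    if _INT.match(text):
        # intLiteral is sugar for a quoted string typed xsd:int.
        return '"' + text + '"', "xsd:int"
    raise ValueError("not a Literal: " + text)
```

Under these assumptions, the convenience form 1 and the typed form "1" %% xsd:int normalize to the same pair, as do "abc" and "abc" %% xsd:string.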

Time

Time instants are defined according to xsd:dateTime [[!XMLSCHEMA-2]].

It is OPTIONAL to assert time in usage, generation, and activity records.

Asserter

An asserter is a creator of PROV-DM records. An asserter is denoted by an IRI. Such IRI has no specific interpretation in the context of PROV-DM.

asserter ::= IRI
IRI ::= an IRI compatible with production IRI in [[!IRI]], enclosed in a pair of < (U+3C) and > (U+3E) characters
Currently, the non-terminal asserter is defined as IRI. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM. We seek inputs on how to resolve this issue. This is ISSUE-217

Namespace Declaration

A PROV-DM namespace is identified by an IRI reference [[!IRI]]. In PROV-DM, attributes, identifiers, and literals with datatype xsd:QName can be placed in a namespace using the mechanisms described in this specification.

A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every unprefixed qualified name in the scope of this default namespace declaration refers to this namespace.

namespaceDeclarations ::= | defaultNamespaceDeclaration | namespaceDeclaration namespaceDeclaration
namespaceDeclaration ::= prefix prefix IRI
defaultNamespaceDeclaration ::= default IRI
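
As an illustration of how prefix and default namespace declarations scope qualified names, the following Python sketch expands a qualified name into an IRI. It assumes that the IRI is obtained by concatenating the namespace with the local part, which this section does not mandate; the function name expand_qname and the local names used in the usage lines are hypothetical.

```python
from typing import Dict, Optional


def expand_qname(qname: str,
                 prefixes: Dict[str, str],
                 default_ns: Optional[str] = None) -> str:
    """Expand a qualified name to an IRI.

    Assumption (not stated in this section): the IRI is the namespace
    followed directly by the local part of the qualified name.
    """
    if ":" in qname:
        prefix, local = qname.split(":", 1)
        if prefix not in prefixes:
            raise KeyError(f"prefix {prefix!r} is not declared in scope")
        return prefixes[prefix] + local
    # Unprefixed names refer to the default namespace, if one is declared.
    if default_ns is None:
        raise KeyError("unprefixed name used without a default namespace")
    return default_ns + qname


# Hypothetical declarations: prefix ex <http://example.org/>, default <http://example.org/default/>
declared = {"ex": "http://example.org/"}
print(expand_qname("ex:value", declared))
print(expand_qname("e0", declared, "http://example.org/default/"))
```

A name whose prefix is not declared is rejected rather than guessed, mirroring the requirement that a prefix be bound in the enclosing scope.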

Location

Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, row, column, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations in assertions.

Location is an OPTIONAL attribute of entity records and activity records. The value associated with the attribute location MUST be a Literal, expected to denote a location.

PROV-DM Common Relations

This section contains the normative specification of common relations of PROV-DM.

The following figure summarizes the additional relations described in subsections 6.1 to 6.7.

Figure: PROV-DM Common Relations

Traceability Record

We often want to know who or what may have had some influence, whether direct or indirect, on a given entity, or who may bear some responsibility, directly or not, for a given outcome. Hence, we may want to infer such a notion from an existing set of PROV-DM records. Conversely, we may know that such influence or responsibility exists without knowing its actual details. Thus, we may also want to assert such a notion.

A traceability record states the existence of a "dependency path" between two entities, indicating that one entity can be shown to be in the lineage of another, and may have influenced it, or may bear some responsibility for it, in some way. The traceability relation subsumes derivation, activity association, and responsibility, and is defined to be transitive.

A traceability record, written tracedTo(id,e2,e1,attrs) in PROV-ASN, contains the following components:

In PROV-ASN, a traceability record's text matches the traceabilityRecord production of the grammar defined in this specification document.

traceabilityRecord ::= tracedTo ( identifier , eIdentifier , eIdentifier optional-attribute-values )

A traceability record can be inferred from existing records, or can be asserted stating that such a dependency path exists without the asserter knowing its individual steps, as expressed by the following inference and constraint, respectively.

Given two identifiers e2 and e1 identifying entity records, the following statements hold:
  1. If wasDerivedFrom(e2,e1,a,g2,u1) holds, for some a, g2, u1, then tracedTo(e2,e1) also holds.
  2. If wasDerivedFrom(e2,e1) holds, then tracedTo(e2,e1) also holds.
  3. If wasGeneratedBy(e2,a,gAttr) and wasAssociatedWith(a,e1) hold, for some a and gAttr, then tracedTo(e2,e1) also holds.
  4. If wasGeneratedBy(e2,a,gAttr), wasAssociatedWith(a,e) and actedOnBehalfOf(e,e1) hold, for some a, e, and gAttr, then tracedTo(e2,e1) also holds.
  5. If wasGeneratedBy(e2,a,gAttr) and wasStartedBy(a,e1,sAttr) hold, for some a, gAttr, and sAttr, then tracedTo(e2,e1) also holds.
  6. If tracedTo(e2,e) and tracedTo(e,e1) hold for some e, then tracedTo(e2,e1) also holds.

We note that the inference rule traceability-inference does not allow us to infer attributes, which are record and application specific.
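
The six inference rules above can be sketched as a small fixed-point computation. The Python sketch below is an illustration under stated assumptions, not a normative algorithm: records are reduced to pairs of identifiers (attributes, times, and record identifiers are dropped), and the function name infer_traced_to and its parameter names are hypothetical.

```python
from itertools import product


def infer_traced_to(derived, generated, associated, on_behalf, started):
    """Return the set of (e2, e1) pairs such that tracedTo(e2,e1) is inferable.

    derived:    pairs (e2, e1) from wasDerivedFrom(e2,e1,...)
    generated:  pairs (e, a)   from wasGeneratedBy(e,a,...)
    associated: pairs (a, ag)  from wasAssociatedWith(a,ag)
    on_behalf:  pairs (ag, ag1) from actedOnBehalfOf(ag,ag1)
    started:    pairs (a, e1)  from wasStartedBy(a,e1,...)
    """
    traced = set(derived)                          # rules 1 and 2
    for e2, a in generated:
        for a_assoc, ag in associated:
            if a_assoc == a:
                traced.add((e2, ag))               # rule 3
                for delegate, responsible in on_behalf:
                    if delegate == ag:
                        traced.add((e2, responsible))   # rule 4
        for a_started, starter in started:
            if a_started == a:
                traced.add((e2, starter))          # rule 5
    # Rule 6: close the relation under transitivity.
    changed = True
    while changed:
        changed = False
        for (x, y), (y2, z) in product(list(traced), repeat=2):
            if y == y2 and (x, z) not in traced:
                traced.add((x, z))
                changed = True
    return traced


result = infer_traced_to(
    derived={("e3", "e2")},
    generated={("e2", "a")},
    associated={("a", "ag")},
    on_behalf={("ag", "boss")},
    started=set(),
)
print(sorted(result))
```

Consistent with the note above, the sketch infers only the existence of tracedTo pairs and never any attributes.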

If the record tracedTo(r2,r1,attrs) holds for two identifiers r2 and r1 identifying entity records, and attribute-value pairs attrs, then there exist e_0, e_1, ..., e_n for n≥1, with e_0=r2 and e_n=r1, such that for any i with 0≤i≤n-1, at least one of the following statements holds:
  • wasDerivedFrom(e_i,e_{i+1},a,g2,u1) holds, for some a, g2, u1, or
  • wasDerivedFrom(e_i,e_{i+1}) holds, or
  • wasBasedOn(e_i,e_{i+1}) holds, or
  • wasGeneratedBy(e_i,a,gAttr) and wasAssociatedWith(a,e_{i+1}) hold, for some a and gAttr, or
  • wasGeneratedBy(e_i,a,gAttr), wasAssociatedWith(a,e) and actedOnBehalfOf(e,e_{i+1}) hold, for some a, e and gAttr, or
  • wasGeneratedBy(e_i,a,gAttr) and wasStartedBy(a,e_{i+1},sAttr) hold, for some a, gAttr, and sAttr.

We note that the previous constraint is not really an inference rule, since there is nothing that we can actually infer. Instead, this constraint should simply be seen as part of the definition of the traceability record.

Activity Ordering Record

PROV-DM allows dependencies amongst activities to be expressed. An information flow ordering record is a representation that an entity was generated by an activity, before it was used by another activity. A control ordering record is a representation that an activity was initiated by another activity.

In PROV-ASN, an activity ordering record's text matches the activityOrderingRecord production of the grammar defined in this specification document.

activityOrderingRecord ::= informationFlowOrderingRecord | controlOrderingRecord
informationFlowOrderingRecord  ::= wasInformedBy ( identifier , aIdentifier , aIdentifier optional-attribute-values )
controlOrderingRecord  ::= wasStartedBy ( identifier , aIdentifier , aIdentifier optional-attribute-values )

An information flow ordering record, written as wasInformedBy(id,a2,a1,attrs) in PROV-ASN, contains:

An information flow ordering record is formally defined as follows.

Given two activity records identified by a1 and a2, the record wasInformedBy(a2,a1) holds, if and only if there is an entity record identified by e and sets of attribute-value pairs attrs1 and attrs2, such that wasGeneratedBy(e,a1,attrs1) and used(a2,e,attrs2) hold.
For the interpretation of an information flow ordering record, see wasInformedBy-ordering.

The relationship wasInformedBy is not transitive. Indeed, consider the following records.

wasInformedBy(a2,a1)
wasInformedBy(a3,a2)

We cannot infer wasInformedBy(a3,a1) from them. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. The following illustration shows a case for which transitivity cannot hold. The horizontal axis represents the event line. We see that e1 was generated after e2 was used. Furthermore, the illustration also shows that a3 completes before a1. So it is impossible for a3 to have used an entity generated by a1.

Figure: Counter-example for transitivity of wasInformedBy
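
The definition of wasInformedBy, and the fact that it is not transitive, can be illustrated with a short Python sketch. It assumes records are reduced to pairs of identifiers with attributes omitted; the function name infer_was_informed_by is hypothetical.

```python
def infer_was_informed_by(generated, used):
    """Derive wasInformedBy pairs from generation and usage records.

    generated: pairs (e, a1) from wasGeneratedBy(e,a1,...)
    used:      pairs (a2, e) from used(a2,e,...)
    Returns pairs (a2, a1) such that wasInformedBy(a2,a1) holds.
    """
    return {(a2, a1)
            for (e, a1) in generated
            for (a2, e_used) in used
            if e == e_used}


# e1 flows from a1 to a2, and e2 flows from a2 to a3.
generated = {("e1", "a1"), ("e2", "a2")}
used = {("a2", "e1"), ("a3", "e2")}
informed = infer_was_informed_by(generated, used)
print(sorted(informed))
print(("a3", "a1") in informed)
```

Since no entity generated by a1 is used by a3, the pair (a3, a1) is absent from the result even though (a2, a1) and (a3, a2) are both present.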

A control ordering record, written as wasStartedBy(id,a2,a1,attrs) in PROV-ASN, contains:

Such a record states control ordering between a2 and a1, specified as follows.

Given two activity records identified by a1 and a2, the record wasStartedBy(a2,a1) holds if and only if there exist an entity record identified by e and some attributes gAttr and sAttr, such that wasGeneratedBy(e,a1,gAttr) and wasStartedBy(a2,e,sAttr) hold.

We note that a start record associates an activity with an agent, and is denoted by the name wasStartedBy. A control ordering record associates an activity with another activity, also denoted by the name wasStartedBy. Effectively, by considering both record types, the relation wasStartedBy has a range formed by the union of agents and activities.

In the following assertions, we find two activity records, identified by a1 and a2, representing two activities, which took place on two separate hosts. The third record indicates that the latter activity was started by the former.

activity(a1,t1,t2,[ex:host="server1.example.org",prov:type="workflow"])
activity(a2,t3,t4,[ex:host="server2.example.org",prov:type="subworkflow"])
wasStartedBy(a2,a1)

Alternatively, we could have asserted the existence of an entity, representing a request to create a sub-workflow. This request, issued by a1, triggered the start of a2.

entity(e,[prov:type="creation-request"])
wasGeneratedBy(e,a1)
wasStartedBy(a2,e)
For the interpretation of a control flow ordering record, see wasStartedBy-ordering.

Revision Record

A revision record is a representation of the creation of an entity considered to be a variant of another. Deciding whether something is made available as a revision of something else usually involves an agent who represents someone in the world who takes responsibility for approving that the former is a due variant of the latter.

A revision record, written wasRevisionOf(e2,e1,ag,attrs) in PROV-ASN, contains:

In PROV-ASN, a revision record's text matches the revisionRecord production of the grammar defined in this specification document.

revisionRecord ::= wasRevisionOf ( eIdentifier , eIdentifier , agIdentifier optional-attribute-values )

A revision record needs to satisfy the following constraint, linking the two entity records by a derivation, and stating them to be specializations of a third entity record.

Given two identifiers old and new identifying two entities, and an identifier ag identifying an agent, if a record wasRevisionOf(new,old,ag) is asserted, then there exist an entity record identifier e and sets of attribute-value pairs eAttrs and dAttrs, such that the following records hold:
  • wasDerivedFrom(new,old,dAttrs);
  • entity(e,eAttrs);
  • specializationOf(new,e);
  • specializationOf(old,e).
The derivation record may be imprecise-1 or imprecise-n.
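
As a rough illustration of the revision constraint above, the following Python sketch checks whether a set of records contains the derivation and the common specialized entity that a wasRevisionOf record requires. Records are reduced to identifier pairs, attributes are ignored, and the function name revision_constraint_holds is hypothetical.

```python
def revision_constraint_holds(new, old, derived, entities, specializations):
    """Check the necessary condition for wasRevisionOf(new,old,ag).

    derived:         pairs (e2, e1) from wasDerivedFrom(e2,e1,...)
    entities:        identifiers of asserted entity records
    specializations: pairs (x, e) from specializationOf(x,e)
    """
    if (new, old) not in derived:
        return False
    # Some entity record e must be specialized by both new and old.
    return any((new, e) in specializations and (old, e) in specializations
               for e in entities)


ok = revision_constraint_holds(
    "e2", "e1",
    derived={("e2", "e1")},
    entities={"e1", "e2", "doc"},
    specializations={("e2", "doc"), ("e1", "doc")},
)
print(ok)
```

Because the constraint is existential, a checker of this kind can only confirm that suitable records are present; their absence does not show that the revision record is wrong, only that the implied records have not been asserted.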

wasRevisionOf is a strict sub-relation of wasDerivedFrom since two entities e2 and e1 may satisfy wasDerivedFrom(e2,e1) without being variants of each other.

The following revision assertion

agent(ag,[prov:type="QualityController"])
entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
wasRevisionOf(e2,e1,ag)

states that the document represented by the entity record identified by e2 is a revision of the document represented by the entity record identified by e1; the agent denoted by ag is responsible for this new versioning of the document.

Attribution Record

An attribution record represents that an entity is ascribed to an agent.

An attribution record, written wasAttributedTo(e,ag,attr) in PROV-ASN, contains the following components:

Attribution models the notion of an activity generating an entity identified by e being associated with an agent ag, which takes responsibility for generating e. Formally, this is expressed as the following necessary condition.

If wasAttributedTo(e,ag) holds for some identifiers e and ag, then there exists an activity identified by a such that the following statements hold:
activity(a,t1,t2,attr1)
wasGeneratedBy(e,a)
wasAssociatedWith(a,ag,attr2)
for some sets of attribute-value pairs attr1 and attr2, and times t1 and t2.

In PROV-ASN, an attribution record's text matches the attributionRecord production of the grammar.

attributionRecord ::= wasAttributedTo ( eIdentifier , agIdentifier optional-attribute-values )

Quotation Record

A quotation record is a representation of the repeating or copying of some part of an entity.

A quotation record, written wasQuotedFrom(e2,e1,ag2,ag1,attrs) in PROV-ASN, contains:

In PROV-ASN, a quotation record's text matches the quotationRecord production of the grammar.

quotationRecord ::= wasQuotedFrom ( eIdentifier , eIdentifier , agIdentifier , agIdentifier optional-attribute-values )
If wasQuotedFrom(e2,e1,ag2,ag1,attrs) holds for some identifiers e2, e1, ag2, ag1, then the following records hold:
wasDerivedFrom(e2,e1)
wasAttributedTo(e2,ag2)
wasAttributedTo(e1,ag1)
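
The implication above can be sketched as a small expansion function that renders the three implied records in PROV-ASN. This is a minimal Python illustration; the function name expand_quotation is hypothetical, and attributes are omitted.

```python
def expand_quotation(e2, e1, ag2, ag1):
    """Return the PROV-ASN records implied by wasQuotedFrom(e2,e1,ag2,ag1)."""
    return [
        f"wasDerivedFrom({e2},{e1})",
        f"wasAttributedTo({e2},{ag2})",
        f"wasAttributedTo({e1},{ag1})",
    ]


for record in expand_quotation("e2", "e1", "ag2", "ag1"):
    print(record)
```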

Summary Record

A summary record represents that an entity (expected to be a document) is a synopsis or abbreviation of another entity (also expected to be a document).

A summary record, written wasSummaryOf(e2,e1,attrs) in PROV-ASN, contains:

wasSummaryOf is a strict sub-relation of wasDerivedFrom.

In PROV-ASN, a summary record's text matches the summaryRecord production of the grammar.

summaryRecord ::= wasSummaryOf ( eIdentifier , eIdentifier optional-attribute-values )
Drop this relation ISSUE-220.

Original Source Record

An original source record represents an entity in which another entity first appeared.

An assertion hadOriginalSource, written hadOriginalSource(e2,e1,attrs), contains:

hadOriginalSource is a strict sub-relation of wasDerivedFrom.

In PROV-ASN, an original source record's text matches the originalSourceRecord production of the grammar.

originalSourceRecord ::= hadOriginalSource ( eIdentifier , eIdentifier optional-attribute-values )

Collections

Section in flux to address various comments about collections: ISSUE-135, ISSUE-136, ISSUE-137, ISSUE-138
The intent of the collection record types and relations introduced in this section is to account for part-of relationships that may exist amongst entities. Specifically, this section:

We adopt a very generic form of collection for the purpose, namely an abstract data type consisting of a set of key-value pairs, often referred to as a map. This provides a generic indexing structure that can be used to model commonly used data structures, including associative lists (also known as "dictionaries" or maps in some programming languages), relational tables, ordered lists, and more (the specification of such specialized structures in terms of key-value pairs is out of the scope of this document).

Keys and values used in collections are entities. This allows expressing nested collections, that is, collections whose values include entities of type collection. PROV-DM does not specify how to relate these entities to literals such as "key1" or 1337.

The following relations and corresponding record types are introduced to model (a) insertion of a new key-value pair into a collection and (b) removal of a key-value pair from a collection.

Because these relations state the derivation of a collection from another, formally they are specializations of the precise-1 wasDerivedFrom relation.

The following entity types are introduced:
Given the specific nature of the derivation, the intervening activity that accounts for imprecise-1 derivation should have an equally specific type, such as "collection-insertion" and "collection-removal". This is left for a future version.

The intent of these relations and entity types is to capture the history of changes that occurred to a collection.

The following examples illustrate how these assertions are expected to be used in practice.

   entity(c, [prov:type="prov:EmptyCollection"%%xsd:QName])    // c is an empty collection
   entity(k1)
   entity(v1)
   entity(k2)
   entity(v2)
   entity(c1, [prov:type="prov:Collection"%%xsd:QName])
   entity(c2, [prov:type="prov:Collection"%%xsd:QName])
   entity(c3, [prov:type="prov:Collection"%%xsd:QName])
  
   CollectionAfterInsertion(c1, c, k1, v1)       // c1 = { (k1,v1) }
   CollectionAfterInsertion(c2, c1, k2, v2)      // c2 = { (k1,v1), (k2,v2) }
   CollectionAfterRemoval(c3, c2, k1)            // c3 = { (k2,v2) }

This representation of a collection's evolution makes no assumption regarding the underlying data structure used to store and manage collections. In particular, no assumptions are needed regarding the mutability of a data structure that is subject to updates. In fact, the state of a collection (i.e., the set of key-value pairs it contains) at a given point in a sequence of operations is never stated explicitly. Rather, it can be obtained by querying the chain of derivation assertions involving insertions and removals. Entity type prov:type="prov:EmptyCollection"%%xsd:QName can be used in this context as it marks the start of a sequence of collection operations.
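
To illustrate how the state of a collection can be obtained by querying the chain of derivation assertions, the following Python sketch walks back from a collection to an empty collection and then replays the operations in order. It is a sketch under assumptions: records are stored in dictionaries keyed by the resulting collection, the chain is assumed to be acyclic, and the function name collection_state is hypothetical.

```python
def collection_state(target, insertions, removals, empty):
    """Reconstruct the key-value pairs of a collection.

    insertions: {collAfter: (collBefore, key, value)} from CollectionAfterInsertion
    removals:   {collAfter: (collBefore, key)}        from CollectionAfterRemoval
    empty:      identifiers of entities typed prov:EmptyCollection
    """
    operations = []
    current = target
    # Walk the derivation chain backwards until an empty collection is reached.
    while current not in empty:
        if current in insertions:
            before, key, value = insertions[current]
            operations.append(("insert", key, value))
        elif current in removals:
            before, key = removals[current]
            operations.append(("remove", key, None))
        else:
            raise ValueError(f"no derivation recorded for collection {current!r}")
        current = before
    # Replay the operations from the oldest to the most recent.
    state = {}
    for op, key, value in reversed(operations):
        if op == "insert":
            state[key] = value
        else:
            state.pop(key, None)
    return state


insertions = {"c1": ("c", "k1", "v1"), "c2": ("c1", "k2", "v2")}
removals = {"c3": ("c2", "k1")}
print(collection_state("c3", insertions, removals, empty={"c"}))
```

The empty collection serves as the stopping condition of the backward walk, which is why marking the start of a sequence of operations with prov:EmptyCollection is useful in practice.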

Observations:

An assertion CollectionAfterInsertion, written CollectionAfterInsertion(collAfter, collBefore, key, value), contains:

An assertion CollectionAfterRemoval, written CollectionAfterRemoval(collAfter, collBefore, key), contains:

In PROV-ASN, a collection record's text matches the collectionRecord production of the grammar:

collectionRecord ::= collectionInsertionRecord | collectionRemovalRecord
collectionInsertionRecord ::= CollectionAfterInsertion ( cidentifier , cidentifier , kidentifier , videntifier )
collectionRemovalRecord ::= CollectionAfterRemoval ( cidentifier , cidentifier , kidentifier )

PROV-DM Constraints

The previous two sections have introduced a data model for provenance, without introducing any constraint that this data model has to satisfy. In this section, we explore the constraints that this data model has to satisfy.

PROV-DM Interpretation

Section section-time-event introduces a notion of instantaneous event marking changes in the world, in its activities and entities. PROV-DM identifies four kinds of instantaneous events, namely entity generation event, entity usage event, activity start event and activity end event. PROV-DM adopts Lamport's clock assumptions [[CLOCK]] in the form of a reflexive, transitive partial order follows (and its inverse precedes) between instantaneous events. Furthermore, PROV-DM assumes the existence of a mapping from instantaneous events to time clocks, though the actual mapping is not in scope of this specification.

Given that provenance records offer a description of past entities and activities, to be meaningful provenance records MUST satisfy instantaneous event ordering constraints, which we introduce in this section. For instance, an entity can only be used after it was generated; hence, we say that an entity's generation event precedes any of this entity's usage events. Should this ordering constraint be proven invalid, the associated generation and usage records could not be credible. The rest of this section defines the temporal interpretation of provenance records as the set of instantaneous event ordering constraints associated with provenance records.

PROV-DM also allows for time observations to be inserted in specific provenance records, for each of the four kinds of instantaneous events introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mapping of this instantaneous event to the timeline. The presence of time information in a provenance record instantiates the ordering constraint with that time information. It is expected that such instantiated constraint can help corroborate provenance information. We anticipate that verification algorithms could be developed though this verification is outside the scope of this specification.

The following figure summarizes the ordering constraints in a graphical manner. For each subfigure, an event time line points to the right. Activities are represented by rectangles, whereas entities are represented by circles. Usage, generation and derivation records are represented by the corresponding edges between entities and activities. The four kinds of instantaneous events are represented by vertical dotted lines (adjacent to the vertical sides of an activity's rectangle, or intersecting usage and generation edges). The ordering constraints are represented by triangles: an occurrence of a triangle between two instantaneous event vertical dotted lines represents that the event denoted by the left line precedes the event denoted by the right line.

Figure: Summary of instantaneous event ordering constraints

The mere existence of an activity assertion entails some event ordering in the world, since an activity start event always precedes the corresponding activity end event. This is illustrated by Subfigure constraint-summary (a) and expressed by constraint start-precedes-end.

The following ordering constraint holds for any activity record: the start event precedes the end event.

Assertion of a usage record and a generation record for a given entity implies ordering of events in the world, since the generation event had to precede the usage event. This is illustrated by Subfigure constraint-summary (b) and expressed by constraint generation-precedes-usage.

For any entity, the following ordering constraint holds: the generation of an entity always precedes any of its usages.

The assertion of a usage record implies ordering of events in the world, since the corresponding event had to occur during the associated activity. This is illustrated by Subfigure constraint-summary (c) and expressed by constraint usage-within-activity.

Given an activity record identified by a, an entity record identified by e, a set of attribute-value pairs attrs, and optional time t, if assertion used(a,e,attrs) or used(a,e,attrs,t) holds, then the following ordering constraint holds: the usage of the entity represented by entity record identified by e precedes the end of activity represented by record identified by a and follows its start.

The assertion of a generation record implies ordering of events in the world, since the corresponding event had to occur during the associated activity. This is illustrated by Subfigure constraint-summary (d) and expressed by constraint generation-within-activity.

If an assertion wasGeneratedBy(x,a,attrs) or wasGeneratedBy(x,a,attrs,t) holds, then the following ordering constraint also holds: the generation of the entity denoted by x precedes the end of a and follows the start of a.
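
When time observations are present, the four constraints above (start-precedes-end, generation-precedes-usage, usage-within-activity, and generation-within-activity) can be instantiated and checked. The Python sketch below is one possible checker, not a prescribed verification algorithm: it assumes every record carries a time (which PROV-DM leaves OPTIONAL), treats precedes as a reflexive order (so equal times are accepted), and uses the hypothetical function name check_event_ordering.

```python
from datetime import datetime


def check_event_ordering(activities, generations, usages):
    """Return a list of violated ordering constraints.

    activities:  {a: (start, end)}   from activity(a,start,end,...)
    generations: [(e, a, t)]         from wasGeneratedBy(e,a,attrs,t)
    usages:      [(a, e, t)]         from used(a,e,attrs,t)
    """
    violations = []
    for a, (start, end) in activities.items():
        if start > end:
            violations.append(("start-precedes-end", a))
    generation_time = {}
    for e, a, t in generations:
        generation_time.setdefault(e, t)
        start, end = activities[a]
        if not (start <= t <= end):
            violations.append(("generation-within-activity", e, a))
    for a, e, t in usages:
        start, end = activities[a]
        if not (start <= t <= end):
            violations.append(("usage-within-activity", a, e))
        if e in generation_time and generation_time[e] > t:
            violations.append(("generation-precedes-usage", e, a))
    return violations


def at(hour, minute=0):
    # Hypothetical helper producing times on an arbitrary example day.
    return datetime(2012, 1, 1, hour, minute)


activities = {"a1": (at(9), at(10)), "a2": (at(11), at(12))}
generations = [("e", "a1", at(9, 30))]
print(check_event_ordering(activities, generations, [("a2", "e", at(11, 30))]))
print(check_event_ordering(activities, generations, [("a2", "e", at(9, 0))]))
```

The first call reports no violation; in the second, the usage time falls outside a2 and before the generation of e, so two constraints are reported as violated.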

If a derivation record holds for e2 and e1, then this means that the entity e1 had some form of influence on the entity e2; for this to be possible, some event ordering must be satisfied. First, we consider one-activity derivations. In that case, the usage of e1 has to precede the generation of e2. This is illustrated by Subfigure constraint-summary (e) and expressed by constraint derivation-usage-generation-ordering.

Given an activity record identified by a, entity records identified by e1 and e2, generation record identified by g2, and usage record identified by u1, if the record wasDerivedFrom(e2,e1,a,g2,u1,attrs) or wasDerivedFrom(e2,e1,[prov:steps="single"] ∪ attrs) holds, then the following ordering constraint holds: the usage of entity denoted by e1 precedes the generation of the entity denoted by e2.

For imprecise-n derivations, a similar constraint exists, but in this case, no usage record can be inferred for e1. Instead, the constraint refers to its generation event, as illustrated by Subfigure constraint-summary (f) and expressed by constraint derivation-generation-generation-ordering.

Given two entity records denoted by e1 and e2, if the record wasDerivedFrom(e2,e1,[prov:steps="any"] ∪ attrs) holds, then the following ordering constraint holds: the generation event of the entity denoted by e1 precedes the generation event of the entity denoted by e2.

Note that event ordering is between generations of e1 and e2, as opposed to precise-1 derivation, which implies ordering between the usage of e1 and generation of e2. Indeed, in the case of imprecise-n derivation, nothing is known about the usage of e1, since there is no associated activity.

The assertion of an information flow ordering record between two activities a1 and a2 also implies ordering of events in the world, since some entity must have been generated by the former and used by the latter, which implies that the start event of a1 cannot follow the end event of a2. This is illustrated by Subfigure constraint-summary (g) and expressed by constraint wasInformedBy-ordering.

Given two activity records denoted by a1 and a2, if the record wasInformedBy(a2,a1) holds, then the following ordering constraint holds: the start event of the activity record denoted by a1 precedes the end event of the activity record denoted by a2.

The assertion of a control flow ordering record between two activities of a1 and a2 also implies ordering of events in the world, since a1 must have been active before a2 started. This is illustrated by Subfigure constraint-summary (h) and expressed by constraint wasStartedBy-ordering.

Given two activity records denoted by a1 and a2, if the record wasStartedBy(a2,a1) holds, then the following ordering constraint holds: the start event of the activity record denoted by a1 precedes the start event of the activity record denoted by a2.
In the following, we assume that we can talk about the end of an entity (or agent). For this, we use the term 'destruction'. This is ISSUE-204.

Further constraints appear in Figure constraint-summary2 and are discussed below.

Figure: Summary of instantaneous event ordering constraints (continued)

An agent that started an activity must exist when the activity starts. This is illustrated by Subfigure constraint-summary2 (a) and expressed by constraint wasStartedByAgent-ordering.

Given an activity denoted by a and an agent denoted by ag, if the record wasStartedBy(a,ag) holds, then the following ordering constraints hold: the start event of the activity denoted by a follows the generation event for agent denoted by ag, and precedes the destruction event of the same agent.

An activity that was associated with an agent must have some overlap with the agent. The agent may be generated, or may only become associated with the activity, after its start: so, the agent is required to exist before the activity end. Likewise, the agent may be destructed, or may terminate its association with the activity, before the activity end: hence, the agent destruction is required to happen after the activity start. This is illustrated by Subfigure constraint-summary2 (b) and expressed by constraint wasAssociatedWith-ordering.

Given an activity denoted by a and an agent denoted by ag, if the record wasAssociatedWith(a,ag) holds, then the following ordering constraints hold: the start event of the activity denoted by a precedes the destruction event of the agent denoted by ag, and the generation event for agent denoted by ag precedes the activity end event.
For completeness, we should define ordering constraints for wasAssociatedWith and actedOnBehalfOf. For wasAssociatedWith(a,ag), it feels that ag must have some overlap with a. For actedOnBehalfOf(ag1,ag2,a), it seems that ag2 should have existed before the overlap between ag1 and a. This is ISSUE-221.
It is suggested that a stronger name for wasAssociatedWith should be adopted. This is ISSUE-182.

PROV-DM Structural Constraints

Sections 5 and 6 define a data model for provenance, which, for the most part, is unconstrained. Section 7.1 defines an interpretation of this data model, in terms of event ordering constraints. This section introduces further constraints on the structure of PROV-DM records. Records that satisfy these constraints are said to be structurally well-formed. A benefit of structurally well-formed provenance records is that further inferences can be made, because records are more precise, and therefore, richer.

According to the definition of a generation record, an entity becomes available after this entity's generation event, and does not exist before this event. From this definition, we conclude that PROV-DM does not allow for an entity to have two generation records occurring at two different instants. The rationale for this constraint is as follows. Two distinct generation events (by a same activity or by two distinct activities), occurring one after the other, necessarily create two distinct entities; otherwise, the second generation event would have resulted in an entity that existed before its creation, which contradicts the definition of generation record.

So, PROV-DM allows for two distinct generation records g1 and g2 referencing a same entity record provided they occur simultaneously. In practice, for such a simultaneous generation to occur, the generation event has to be unique and caused by a single world activity, though the provenance records may contain different activity records providing alternative descriptions of that same world activity.

In the following assertions, a workflow execution a0 consists of two sub-workflow executions a1 and a2. Sub-workflow execution a2 generates entity e, so does a0.

activity(a0,,,[prov:type="workflow execution"])
activity(a1,,,[prov:type="workflow execution"])
activity(a2,,,[prov:type="workflow execution"])
wasInformedBy(a2,a1)

wasGeneratedBy(e,a0)
wasGeneratedBy(e,a2)
This example is permitted in PROV-DM if the two activity records a0 and a2 provide alternate descriptions of what happens in the world with respect to this generation event.

While this example is permitted in PROV-DM, it does not expose the hierarchical organization of executions and it mixes records providing two descriptions of a same execution. This issue is highlighted by two different generation records for entity e, which makes reasoning about this kind of provenance records unnecessarily difficult. Such assertions are said not to be structurally well-formed.

Structurally well-formed provenance records can be obtained by partitioning the generation records into different accounts. This makes it clear that these records provide alternative descriptions of the same real-world generation event, rather than describing two generation events for the same entity. When accounts are used, the example can be encoded as follows.

The same example is now revisited, with the following assertions that are structurally well-formed. Two accounts are introduced, and there is a single generation record for entity e per account.

account(ex:summary,
        http://example.org/asserter, 
        activity(a0,t1,t2,[prov:type="workflow execution"])
        wasGeneratedBy(e,a0))

account(ex:detailed,
        http://example.org/asserter, 
        activity(a1,t1,t3,[prov:type="workflow execution"])
        activity(a2,t3,t2,[prov:type="workflow execution"])
        wasInformedBy(a2,a1)
        wasGeneratedBy(e,a2))

Structurally well-formed records satisfy some constraints, which force the structure of descriptions to be exposed by means of accounts. With these constraints satisfied, further inferences can be made about structurally well-formed records. The uniqueness of generation records in accounts is formulated as follows.

Given an entity record denoted by e, two activity records denoted by a1 and a2, and two sets of attribute-value pairs attrs1 and attrs2, if the records wasGeneratedBy(e,a1,attrs1) and wasGeneratedBy(e,a2,attrs2) exist in the scope of a given account, then a1=a2 and attrs1=attrs2.
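
The generation-uniqueness constraint lends itself to a simple check over the generation records of one account. The Python sketch below is illustrative only: generation records are modelled as (entity, activity, attrs) triples with attrs given as a frozenset of attribute-value pairs, and the function name generation_uniqueness_violations is hypothetical.

```python
def generation_uniqueness_violations(generations):
    """Return the entities that have conflicting generation records.

    generations: iterable of (e, a, attrs) triples taken from the
    wasGeneratedBy records found in the scope of a single account.
    """
    first_seen = {}
    violating = set()
    for e, a, attrs in generations:
        # Remember the first generation record seen for each entity.
        reference = first_seen.setdefault(e, (a, attrs))
        if reference != (a, attrs):
            violating.add(e)
    return sorted(violating)


no_attrs = frozenset()
# Both generation records placed in one account: not structurally well-formed.
print(generation_uniqueness_violations([("e", "a0", no_attrs), ("e", "a2", no_attrs)]))
# The same records split across accounts ex:summary and ex:detailed.
print(generation_uniqueness_violations([("e", "a0", no_attrs)]))
print(generation_uniqueness_violations([("e", "a2", no_attrs)]))
```

Checked per account, the partitioned form of the workflow example reports no violation, whereas the unpartitioned form reports entity e.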

A further inference is permitted from the imprecise-1 derivation record:

Given an activity record identified by a, entity records identified by e1 and e2, and set of attribute-value pairs attrs2, if wasDerivedFrom(e2,e1, [prov:steps="single"]) and wasGeneratedBy(e2,a,attrs2) hold, then used(a,e1,attrs1) also holds for some set of attribute-value pairs attrs1.

This inference is justified by the fact that the entity represented by entity record identified by e2 is generated by at most one activity in a given account (see generation-uniqueness). Hence, this activity record is also the one referred to in the usage record of e1.
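
The derivation-use inference can be sketched as a join between single-step derivation records and generation records. This Python sketch assumes it is applied within one structurally well-formed account, reduces records to identifier pairs, and uses the hypothetical function name infer_usage.

```python
def infer_usage(single_step_derivations, generations):
    """Infer used(a,e1) pairs within a structurally well-formed account.

    single_step_derivations: pairs (e2, e1) from
        wasDerivedFrom(e2,e1,[prov:steps="single"])
    generations: pairs (e, a) from wasGeneratedBy(e,a,...)
    Returns pairs (a, e1) such that used(a,e1,attrs1) holds for some attrs1.
    """
    return {(a, e1)
            for (e2, e1) in single_step_derivations
            for (e, a) in generations
            if e == e2}


print(infer_usage({("e2", "e1")}, {("e2", "a")}))
```

The sketch yields only the existence of a usage; the attribute-value pairs attrs1 remain unknown, as in the inference itself.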

We note that the converse inference does not hold. From wasDerivedFrom(e2,e1) and used(a,e1), one cannot derive wasGeneratedBy(e2,a,attrs2) because identifier e1 may occur in usage records referring to many activity records, but they may not be referred to in generation records containing identifier e2.

An account is said to be structurally well-formed if it satisfies the constraint generation-uniqueness. If an account is structurally well-formed, it supports the inference derivation-use.

The union of two accounts is another account, containing the unions of their respective records, where records with a same identifier should be understood according to constraint identifiable-record-in-account. Structurally well-formed accounts are not closed under union because the constraint generation-uniqueness may no longer be satisfied in the resulting union.

Indeed, let us reconsider example account-example-1, and let us define another account record as follows.

account(ex:acc2,
        http://example.org/asserter2, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          ...
          activity(a1,t1,,[prov:type="createFile"])
          ...
          wasGeneratedBy(e0,a1,[ex:fct="create"])     
          ... )

with identifier ex:acc2, containing assertions by asserter http://example.org/asserter2 stating that the entity represented by the entity record identified by e0 was generated by an activity represented by the activity record identified by a1 instead of a0 in the previous account ex:acc0. If accounts ex:acc0 and ex:acc2 are merged together, the resulting set of records violates generation-uniqueness if the two activities a0 and a1 are distinct.

Can the semantics characterize better what can be achieved with structurally well-formed accounts?

PROV-DM Extensibility Points

The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:

The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure inter-operability, specializations of the PROV data model that exploit the extensibility points summarized in this section MUST preserve the semantics specified in this document. For instance, a qualified attribute on a domain specific entity record MUST represent an aspect of an entity and this aspect MUST remain unchanged during the characterization's interval of this entity record.

Resources, URIs, Entities, Identifiers, and Scope

This specification introduces the notion of an identifiable entity in the world. In PROV-DM, an entity record is a representation of such an identifiable entity. An entity record includes an identifier identifying this entity. Identifiers are qualified names, which can be mapped to IRIs.

The term 'resource' is used in a general sense for whatever might be identified by a URI [[!RFC3986]]. On the Web, a URI denotes a resource, without any expectation that the resource is accessed.

The purpose of this section is to clarify the relationship between resource and the notions of entity and entity record.

In the context of PROV-DM, a resource is just a thing in the world. One may take multiple perspectives on such a thing and its situation in the world, fixing some of its aspects.

We refer to the example of section 2.1 for a resource (at some URL) and three different perspectives, referred to as entities. Three different entity records can be expressed for this report; in the PROV-ASN sample below, they are expressed within the same account.

container
prefix app http://example.org/app/
prefix cr  http://example.org/crime/

   account(acc1,
           http://example.org/asserter1,

           entity(app:0, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           entity(app:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
           entity(app:2, [ prov:type="Document", cr:author="John" ])
        ...)
endContainer

Each entity record contains an identifier that is unique in account acc1, and therefore locally identifies the entity record it is contained in. In this example, three identifiers were minted.

Given that the report is a resource denoted by the URI http://example.org/crime.txt, we could simply use this URI as the identifier of an entity. This would avoid minting new URIs. Hence, the report URI would play a double role: as a URI, it denotes a resource accessible at that URI, and as an identifier in a PROV-DM record, it helps identify a specific characterization of this report. A given identifier occurring in an entity record must be unique within the scope of an account. Hence, below, all entity records have been given the same identifier but appear in the scope of different accounts, so as to satisfy identifiable-record-in-account.

container 
prefix app http://example.org/
prefix cr  http://example.org/crime/

   account(acc2,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           ...)

   account(acc3,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
           ...)

   account(acc4,
           http://example.org/asserter1,
           entity(app:crime.txt, [ prov:type="Document", cr:author="John" ])
           ...)
endContainer

In this case, the qualified name app:crime.txt, which maps to the URI http://example.org/crime.txt, still denotes the same resource; however, the perspectives we take about that resource are expressed by multiple entity records, which happen to all contain the same identifier but occur in different accounts.
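
The mapping from a qualified name to a URI can be sketched as follows. The sketch assumes that the mapping is the concatenation of the namespace bound to the prefix with the local part, which is consistent with the example above; the function name expand_qualified_name is hypothetical.

```python
def expand_qualified_name(qname, prefixes):
    """Expand a prefix:local qualified name against a prefix table."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        raise ValueError("not a qualified name: " + qname)
    # Assumption: expansion is plain concatenation of namespace and local part.
    return prefixes[prefix] + local


# Prefix declarations taken from the container above.
prefixes = {
    "app": "http://example.org/",
    "cr": "http://example.org/crime/",
}

uri = expand_qualified_name("app:crime.txt", prefixes)
```

Under this assumption, the identifier app:crime.txt used in accounts acc2, acc3, and acc4 expands to the same URI in all three cases.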

Alternatively, if we need to assert the existence of two different perspectives on the report within the same account, then alternate identifiers MUST be used, one of them being allowed to be the resource URI.

container 
 prefix app  http://example.org/
 prefix app2 http://example.org/app/
 prefix cr   http://example.org/crime/

   account(acc5,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           entity(app2:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])

           ...)
endContainer
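
The scoping of identifiers illustrated by accounts acc2 to acc5 can be checked with a short Python sketch. It reads the requirement that an identifier be unique within the scope of an account as "at most one entity record per identifier in an account"; this reading, the dictionary layout, and the function name are assumptions made for illustration and do not restate the full constraint identifiable-record-in-account.

```python
from collections import Counter


def duplicate_identifiers(entity_ids):
    """Return identifiers carried by more than one entity record."""
    counts = Counter(entity_ids)
    return {identifier for identifier, n in counts.items() if n > 1}


# Entity identifiers per account, as in the samples above.
accounts = {
    "acc2": ["app:crime.txt"],
    "acc3": ["app:crime.txt"],
    "acc4": ["app:crime.txt"],
    "acc5": ["app:crime.txt", "app2:1"],
}

# The same identifier may recur across accounts; only repeats
# inside a single account are reported.
problems = {name: duplicate_identifiers(ids) for name, ids in accounts.items()}
```

Placing two perspectives under the identifier app:crime.txt inside acc5 would be reported by this check, which is why acc5 uses the alternate identifier app2:1.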

Changes Since Second Public Working Draft

Acknowledgements

WG membership to be listed here.