Copyright © 2011-2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
PROV-DM, the PROV conceptual data model, is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing. PROV-DM distinguishes core structures, forming the essence of provenance descriptions, from extended structures catering for more advanced uses of provenance. PROV-DM is organized in six components, respectively dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and activities that happened; (3) derivations of entities from entities; (4) properties to link entities that refer to the same thing; (5) a notion of bundle, a mechanism to support provenance of provenance; and, (6) collections forming a logical structure for its members.
This document introduces the provenance concepts found in PROV and defines PROV-DM types and relations. The PROV data model is domain-agnostic, but is equipped with extensibility points allowing domain-specific information to be included.
Two further documents complete the specification of PROV-DM. First, a companion document specifies the set of constraints that provenance descriptions should follow. Second, a separate document describes a provenance notation for expressing instances of provenance for human consumption; this notation is used in examples in this document.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the fifth public release of the PROV-DM document. Publication as Last Call working draft means that the Working Group believes that it has satisfied the relevant technical requirements outlined in its charter on this document. The design is not expected to change significantly, going forward, and now is the key time for external review, before the implementation phase.
The PROV Working Group seeks public feedback on this Working Draft. The end date of the Last Call review period is TBD, and we would appreciate comments by that date to public-prov-comments@w3.org.
This document was published by the Provenance Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-prov-comments@w3.org (subscribe, archives). All feedback is welcome.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, where users find information that is often contradictory or questionable, provenance can help those users to make trust judgements.
We consider a generic data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and interchanged between systems. Thus, heterogeneous systems can export their native provenance into such a core data model, and applications that need to make sense of provenance can then import it, process it, and reason over it.
The PROV data model distinguishes core structures from extended structures: core structures form the essence of provenance descriptions, and are commonly found in various domain-specific vocabularies. Extended structures enhance and refine core structures with more expressive capabilities to cater for more advanced uses of provenance. The PROV data model, comprising both core and extended structures, is a domain-agnostic model, but with clear extensibility points allowing further domain-specific and application-specific extensions to be defined.
The PROV data model has a modular design and is structured according to six components covering various facets of provenance:
This specification presents the concepts of the PROV Data Model, and provenance types and relations, without specific concern for how they are applied. With these, it becomes possible to write useful provenance descriptions, and publish or embed them alongside the data they relate to.
However, if something about which provenance is expressed is subject to change, then it is challenging to express its provenance precisely (e.g. the data from which a daily weather report is derived changes from day to day). To address this challenge, it is proposed to enrich simple provenance with refined descriptions that help qualify the specific subject of provenance and provenance itself, using attributes and temporal information, and that are intended to satisfy a comprehensive set of constraints. These aspects are covered in the companion specification [PROV-CONSTRAINTS].
Section 2 provides an overview of the PROV Data Model, distinguishing a core set of types and relations, commonly found in provenance descriptions, from extended structures catering for advanced uses. It also introduces a modular organization of the data model in components.
Section 3 overviews the Provenance Notation used to illustrate examples of provenance descriptions.
Section 4 illustrates how the PROV data model can be used to express the provenance of a report published on the Web.
Section 5 provides the definitions of PROV concepts, structured according to six components.
Section 6 summarizes PROV-DM extensibility points.
Section 7 introduces the idea that constraints can be applied to the PROV data model to validate provenance descriptions; these are covered in the companion specification [PROV-CONSTRAINTS].
The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].
The following namespace prefixes are used throughout this document.
prefix | namespace uri | definition |
prov | http://www.w3.org/ns/prov# | The PROV namespace (see Section 5.7.1) |
xsd | http://www.w3.org/2000/10/XMLSchema# | XML Schema Namespace [XMLSCHEMA-2] |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | The RDF namespace [RDF-CONCEPTS] |
(others) | (various) | All other namespace prefixes are used in examples only. In particular, URIs starting with "http://example.com" represent some application-dependent URI [URI] |
Examples throughout this document use the PROV-N Provenance Notation, briefly introduced in Section 3 and specified fully in a separate document [PROV-N].
This section introduces provenance concepts with informal descriptions and illustrative examples. PROV distinguishes core structures, forming the essence of provenance descriptions, from extended structures catering for more advanced uses of provenance. Core and extended structures are respectively presented in Section 2.1 and Section 2.2. Furthermore, the PROV data model is organized according to components, which form thematic groupings of concepts (see Section 2.3).
The core of PROV consists of essential provenance structures commonly found in provenance descriptions. It is summarized graphically by the UML diagram of Figure 1, illustrating three types (entity, activity, and agent) and how they relate to each other. In the core of PROV, all associations are binary.
The concepts found in the core of PROV are introduced in the rest of this section. They are summarized in Table 2, where they are categorized as type or relation. The first column lists concepts, the second column indicates whether a concept maps to a type or a relation, whereas the third column contains the corresponding name. Names of relations have a verbal form in the past tense to express what happened in the past, as opposed to what may or will happen.
PROV Concepts | PROV-DM types or relations | Name | Overview |
Entity | PROV-DM Types | entity | 2.1.1 |
Activity | PROV-DM Types | activity | 2.1.1 |
Agent | PROV-DM Types | agent | 2.1.2 |
Generation | PROV-DM Relations | wasGeneratedBy | 2.1.1 |
Usage | PROV-DM Relations | used | 2.1.1 |
Communication | PROV-DM Relations | wasInformedBy | 2.1.1 |
Attribution | PROV-DM Relations | wasAttributedTo | 2.1.2 |
Association | PROV-DM Relations | wasAssociatedWith | 2.1.2 |
Delegation | PROV-DM Relations | actedOnBehalfOf | 2.1.2 |
Derivation | PROV-DM Relations | wasDerivedFrom | 2.1.3 |
In PROV, things we want to describe the provenance of are called entities and have some fixed aspect. The term "things" encompasses a broad diversity of notions, including digital objects such as a file or web page, physical things such as a mountain, a building, a printed book, or a car as well as abstract concepts and ideas.
An entity may be the document at URI http://www.bbc.co.uk/news/science-environment-17526723, a file in a file system, a car, or an idea.
An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities. Activities that operate on digital entities may for example move, copy, or duplicate them.
An activity may be the publishing of a document on the Web, sending a twitter message, extracting metadata embedded in a file, driving a car from Boston to Cambridge, assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, or editing a file.
Activities and entities are associated with each other in two different ways: activities utilize entities and activities produce entities. The act of utilizing or producing an entity may have a duration. The term 'generation' refers to the completion of the act of producing; likewise, the term 'usage' refers to the beginning of the act of utilizing entities. Thus, we define the following concepts of generation and usage.
Examples of generation are the completed creation of a file by a program, the completed creation of a linked data set, and the completed publication of a new version of a document.
Usage examples include a procedure beginning to consume an argument, a service starting to read a value on a port, a program beginning to read a configuration file, or the point at which an ingredient, such as eggs, is being added in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix); in contrast, the same entity may be used multiple times, possibly by different activities (e.g. a file on a file system can be read indefinitely).
The generation of an entity by an activity and its subsequent usage by another activity is termed communication.
The activity of writing a celebrity article was informed by (a communication instance) the activity of intercepting voicemails.
The motivation for introducing agents in the model is to express the agent's responsibility for activities that happened and entities that were generated.
An agent is something that bears some form of responsibility for an activity taking place or for the existence of an entity. An agent may be a particular type of entity or activity. This means that the model can be used to express provenance of the agents themselves.
Software for checking the use of grammar in a document may be defined as an agent of a document preparation activity; one can also describe its provenance, including for instance the vendor and the version history. A site selling books on the Web, the services involved in the processing of orders, and the companies hosting them are also agents.
Agents can be related to entities, activities, and other agents.
A blog post can be attributed to an author, a mobile phone to its manufacturer.
Agents are defined as having some kind of responsibility for activities.
An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity.
Examples of association between an activity and an agent are:
Delegation is the assignment of authority to an agent (by itself or by another agent) to carry out a specific activity as a delegate or representative, while the agent that it represents remains responsible for the outcome of the delegated work. The nature of this relation is intended to be broad, including contractual relation, but also altruistic initiative by the representative agent.
A student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity. It may not matter which actual student published a web page, but it may matter significantly that the department told the student to put up the web page.
Activities utilize entities and produce entities. In some cases, utilizing an entity influences the creation of another in some way. This notion of 'influence' is captured by derivations, defined as follows.
A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of an entity based on another.
Examples of derivation include the transformation of a relational table into a linked data set, the transformation of a canvas into a painting, the transportation of a work of art from London to New York, and a physical transformation such as the melting of ice into water.
While the core of PROV focuses on essential provenance structures commonly found in provenance descriptions, extended structures are designed to support more advanced uses of provenance. The purpose of this section is twofold. First, mechanisms to specify these extended structures are introduced. Second, two further kinds of provenance structures are overviewed: they cater for provenance of provenance and collections, respectively.
Extended structures are defined by a variety of mechanisms outlined in this section: subtyping, expanded relations, optional identification, and new relations.
Subtyping can be applied to core types. For example, a software agent is a special kind of agent, defined as follows.
A software agent is running software.
Subtyping can also be applied to core relations. For example, a revision is a special kind of derivation, defined as follows.
A revision is a derivation that revises an entity into a revised version.
Section 2.1 shows that seven concepts are mapped to binary relations in the core of PROV. However, some advanced uses of these concepts cannot be captured by a binary relation, but require relations to be expanded to n-ary relations.
To illustrate expanded relations, we consider the concept of association, described in section 2.1.2. Agents may adopt sets of actions or steps to achieve their goals in the context of an activity: this is captured by the notion of a plan. Thus, an activity may reflect the execution of a plan that was designed in advance to guide the execution. Hence, an expanded association relation allows a plan to be linked to an activity. Plan is defined by subtyping and full association by an expanded relation, as follows.
A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals. There exists no prescriptive requirement on the nature of plans, their representation, the actions or steps they consist of, or their intended goals. Since plans may evolve over time, it may become necessary to track their provenance, so plans themselves are entities. Representing the plan explicitly in the provenance can be useful for various tasks: for example, to validate the execution as represented in the provenance record, to manage expectation failures, or to provide explanations.
An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.
An example of association between an activity and an agent involving a plan is: an XSLT transform (an activity) launched by a user (an agent) based on an XSL style sheet (a plan).
Some concepts exhibit both a core use, expressed as binary relation, and an extended use, expressed as n-ary relation. In some cases, mapping the concept to a relation, whether binary or n-ary, is not sufficient: instead, it may be required to identify an instance of such concept. In those cases, PROV-DM allows for an optional identifier to be expressed to identify an instance of an association between two or more elements. This optional identifier can then be used to refer to an instance as part of other concepts.
A service may read the same configuration file on two different occasions. Each usage can be identified by its own identifier, allowing them to be distinguished.
Finally, PROV-DM supports further relations that are not subtypes or expanded versions of existing relations.
A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.
For users to decide whether they can place their trust in a resource, they may want to analyze the resource's provenance, but also determine who its provenance is attributed to, and when it was generated. In other words, users need to be able to determine the provenance of provenance. Hence, provenance is also regarded as an entity (of type Bundle), by which provenance of provenance can then be expressed.
A collection is an entity that provides a structure to some constituents, which are themselves entities. These constituents are said to be members of the collection. This concept allows for the provenance of the collection itself to be expressed in addition to that of the members. Many different types of collections exist, such as sets, dictionaries, or lists, all of which involve a membership relationship between the constituents and the collection.
An example of collection is an archive of documents. Each document has its own provenance, but the archive itself also has some provenance: who maintained it, which documents it contained at which point in time, how it was assembled, etc.
Besides the separation between core and extended structures, PROV-DM is further organized according to components, grouping concepts in a thematic manner.
Table 3 enumerates the six components, five of which have already been implicitly overviewed in this section. All components specify extended structures, whereas only the first three define core structures.
No. | Component | Core Structures | Overview | Specification | Description |
1 | Entities and Activities | ✔ | 2.1.1 | 5.1 | about entities and activities, and their interrelations |
2 | Agent and Responsibility | ✔ | 2.1.2 | 5.2 | about agents and concepts ascribing responsibility to them |
3 | Derivation | ✔ | 2.1.3 | 5.3 | about derivation and its subtypes |
4 | Alternate | | — | 5.4 | about relations linking entities referring to the same thing |
5 | Bundles | | 2.2.2 | 5.5 | about bundles, a mechanism to support provenance of provenance |
6 | Collections | | 2.2.3 | 5.6 | about collections and concepts capturing their transformation, such as insertion and removal |
To illustrate the application of PROV concepts to a concrete example (see Section 4) and to provide examples of concepts (see Section 5), we introduce PROV-N, a notation for writing instances of the PROV data model. For full details, the reader is referred to the companion specification [PROV-N]. PROV-N is a notation aimed at human consumption, with the following characteristics:
An activity with identifier a1 and an attribute type with value createFile.
activity(a1, [prov:type="createFile"])
Two entities with identifiers e1 and e2.
entity(e1)
entity(e2)
The activity a1 used e1, and e2 was generated by a1.
used(a1,e1)
wasGeneratedBy(e2,a1)
The same descriptions, but with an explicit identifier u1 for the usage, and the syntactic marker '-' to mark the absence of identifier in the generation. Both are followed by ';'.
used(u1;a1,e1)
wasGeneratedBy(-;e2,a1)
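The identifier conventions illustrated above can be sketched in a few lines of Python. This is only an illustration and not part of PROV-N: the helper name prov_n and its parameters are hypothetical, and the output format simply mirrors the ';' and '-' conventions of the examples above.

```python
def prov_n(relation, *args, ident=None, explicit_marker=False):
    """Render one PROV-N style statement as a string (hypothetical helper)."""
    body = ",".join(args)
    if ident is not None:
        # An explicit identifier is separated from the arguments by ';'.
        return f"{relation}({ident};{body})"
    if explicit_marker:
        # '-' marks the absence of an identifier.
        return f"{relation}(-;{body})"
    # No identifier and no marker: the plain binary form.
    return f"{relation}({body})"


print(prov_n("used", "a1", "e1"))                                   # used(a1,e1)
print(prov_n("used", "a1", "e1", ident="u1"))                       # used(u1;a1,e1)
print(prov_n("wasGeneratedBy", "e2", "a1", explicit_marker=True))   # wasGeneratedBy(-;e2,a1)
```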
Section 2 has introduced some provenance concepts, and how they are expressed as types or relations in the PROV data model. The purpose of this section is to put these concepts into practice in order to express the provenance of some document published on the Web. With this realistic example, PROV concepts are composed together, and a graphical illustration shows a provenance description forming a directed graph, rooted at the entity we want to explain the provenance of, and pointing to the entities, activities, and agents it depended on. This example also shows that, sometimes, multiple provenance descriptions about the same entity can co-exist, which then justifies the need for provenance of provenance.
In this example, we consider one of the many documents published by the World Wide Web Consortium, and describe its provenance. Specifically, we consider the document identified by http://www.w3.org/TR/2011/WD-prov-dm-20111215. Its provenance can be expressed from several perspectives: first, provenance can take the authors' viewpoint; second, it can be concerned with the W3C process. Then, attribution of these two provenance descriptions is provided.
Description: A document is edited by some editor, using contributions from various contributors.
In this perspective, provenance of the document http://www.w3.org/TR/2011/WD-prov-dm-20111215 is concerned with the editing activity as perceived by authors. This kind of information could be used by authors in their CV or in a narrative about this document.
We paraphrase some PROV-DM descriptions, express them with the PROV-N notation, and depict them with a graphical illustration (see Figure 2). Full details of the provenance record can be found here.
entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])
activity(ex:edit1,[prov:type="edit"])
wasGeneratedBy(tr:WD-prov-dm-20111215, ex:edit1, -)
agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
wasAssociatedWith(ex:edit1, ex:Paolo, -, [ prov:role="editor" ])
wasAssociatedWith(ex:edit1, ex:Simon, -, [ prov:role="contributor" ])
Provenance descriptions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance descriptions. Therefore, it should not be seen as an alternate notation for expressing provenance.
The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and pentagonal shapes, respectively. Usage, Generation, Derivation, and Association are represented as directed edges.
Entities are laid out according to the ordering of their generation. We endeavor to show time progressing from left to right. This means that edges for Usage, Generation, Derivation, and Association typically point leftwards.
Description: The World Wide Web Consortium publishes documents according to its publication policy. Working drafts are published regularly to reflect the work accomplished by working groups. Every publication of a working draft must be preceded by a "publication request" to the Webmaster. The very first version of a document must also be preceded by a "transition request" to be approved by the W3C director. All working drafts are made available at a unique URI. In this scenario, we consider two successive versions of a given document, the policy according to which they were published, and the associated requests.
We describe the kind of provenance record that the WWW Consortium could keep for auditors to check that due processes are followed. All entities involved in this example are Web resources, with well-defined URIs (some of which refer to archived email messages, available to W3C Members).
We now paraphrase some PROV descriptions, express them with the PROV-N notation, and depict them with a graphical illustration (see Figure 3). Full details of the provenance record can be found here.
entity(tr:WD-prov-dm-20111215, [ prov:type='rec54:WD' ])
activity(ex:act2,[prov:type="publish"])
wasGeneratedBy(tr:WD-prov-dm-20111215, ex:act2, -)
wasDerivedFrom(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018)
used(ex:act2, email:2011Dec/0111, -)
wasAssociatedWith(ex:act2, w3:Consortium, process:rec-advance)
This simple example has shown a variety of PROV concepts, such as Entity, Agent, Activity, Usage, Generation, Derivation, and Association. In this example, it happens that all entities were already Web resources, with readily available URIs, which we used. We note that some of the resources are public, whereas others have restricted access: provenance statements only make use of their identifiers. If identifiers do not pre-exist, e.g. for activities, then they can be generated, for instance ex:act2, occurring in the namespace identified by prefix ex. We note that the URI scheme developed by W3C is particularly suited for expressing provenance of these documents, since each URI denotes a specific version of a document. It then becomes easy to relate the various versions with PROV-DM relations. We note that an Association is a ternary relation (represented by a multi-edge labeled wasAssociatedWith) from an activity to an agent and a plan.
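The view of provenance as a directed graph, rooted at the entity being explained, can be sketched as follows. This is a minimal illustration under stated assumptions: the edge-list encoding and the function name dependencies are hypothetical, and the ternary association is flattened into two separate edges (one to the agent, one to the plan), which is a simplification of the model.

```python
from collections import deque

# Each edge points from a thing to something it depended on.
# Hypothetical encoding of the process-view statements shown above.
edges = [
    ("tr:WD-prov-dm-20111215", "wasGeneratedBy", "ex:act2"),
    ("tr:WD-prov-dm-20111215", "wasDerivedFrom", "tr:WD-prov-dm-20111018"),
    ("ex:act2", "used", "email:2011Dec/0111"),
    ("ex:act2", "wasAssociatedWith", "w3:Consortium"),
    ("ex:act2", "wasAssociatedWith", "process:rec-advance"),
]


def dependencies(root, edges):
    """Return every node reachable from root, in breadth-first order."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for source, _label, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(dependencies("tr:WD-prov-dm-20111215", edges))
```

Starting from the working draft, the traversal reaches the publication activity, the earlier draft, the email request, the consortium, and the plan.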
The two previous sections offer two different perspectives on the provenance of a document. PROV allows for multiple sources to provide the provenance of a subject. For users to decide whether they can place their trust in the document, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated, etc. In other words, we need to be able to express the provenance of provenance.
PROV-DM offers a construct to name a bundle of provenance descriptions (full details: ex:author-view).
bundle ex:author-view
  agent(ex:Paolo, [ prov:type='prov:Person' ])
  agent(ex:Simon, [ prov:type='prov:Person' ])
  ...
endBundle
Likewise, the process view can be expressed as a separate named bundle (full details: ex:process-view).
bundle ex:process-view
  agent(w3:Consortium, [ prov:type='prov:Organization' ])
  ...
endBundle
To express their respective provenance, these bundles must be seen as entities, and all PROV constructs are now available to express their provenance. In the example below, ex:author-view is attributed to the agent ex:Simon, whereas ex:process-view is attributed to w3:Consortium.
entity(ex:author-view, [prov:type='prov:Bundle' ])
wasAttributedTo(ex:author-view, ex:Simon)
entity(ex:process-view, [prov:type='prov:Bundle' ])
wasAttributedTo(ex:process-view, w3:Consortium)
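As a rough sketch of how an application might hold these two levels of provenance, the following Python fragment keeps each bundle as a named set of statements and records attribution against the bundle names. The dictionary layout and the function attributed_to are assumptions made for illustration; only the statements actually shown above are included, and the elided content of each bundle is left out.

```python
# Each bundle is a named set of provenance descriptions (kept here as strings).
bundles = {
    "ex:author-view": {
        "agent(ex:Paolo, [ prov:type='prov:Person' ])",
        "agent(ex:Simon, [ prov:type='prov:Person' ])",
    },
    "ex:process-view": {
        "agent(w3:Consortium, [ prov:type='prov:Organization' ])",
    },
}

# Provenance of provenance: bundle names double as entity identifiers.
attribution = {
    "ex:author-view": "ex:Simon",
    "ex:process-view": "w3:Consortium",
}


def attributed_to(bundle_name):
    """Return the agent a bundle is attributed to, or None if unknown."""
    return attribution.get(bundle_name)


for name in sorted(bundles):
    print(name, "wasAttributedTo", attributed_to(name))
```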
Provenance concepts, expressed as PROV-DM types and relations, are organized according to six components that are defined in this section. The components and their dependencies are illustrated in Figure 4. A component that relies on concepts defined in another is displayed above it in the figure. So, for example, component 6 (collections) depends on concepts defined in component 3 (derivation), itself dependent on concepts defined in component 1 (entity and activity).
While not all PROV-DM relations are binary, they all involve two primary elements. Hence, Table 4 indexes all relations according to their two primary elements (referred to as subject and object). The table adopts the same color scheme as Figure 4, allowing components to be readily identified. Note that for simplicity, this table does not include bundle-oriented and collection-oriented relations. Relation names appearing in bold correspond to the core structures introduced in Section 2.1.
Subject \ Object | Entity | Activity | Agent |
Entity | | wasGeneratedBy wasInvalidatedBy | wasAttributedTo |
Activity | | wasInformedBy | wasAssociatedWith |
Agent | — | — | actedOnBehalfOf |
Table 5 is a complete index of all the types and relations of PROV-DM, color-coded according to the component they belong to. In the first column, concept names link to their informal definition, whereas, in the second column, representations link to the information used to represent the concept. Concept names appearing in bold are the core structures introduced in Section 2.1.
In the rest of the section, each type and relation is defined informally, followed by a summary of the information used to represent the concept, and illustrated with PROV-N examples.
The first component of PROV-DM is concerned with entities and activities, and their interrelations: Usage, Generation, Start, End, Invalidation, and Communication. Figure 5 uses UML to depict the first component. Core structures are displayed in the yellow area, consisting of two classes (Entity, Activity) and three binary associations between them (Usage, Generation, and Communication). The rest of the figure displays extended structures, including UML association classes (see [UML], section 7.3.4, p. 42), represented in gray, to express expanded n-ary relations (for Usage, Generation, Invalidation, Start, End). The figure also makes explicit associations with time for these concepts (time being marked with the primitive stereotype).
The following expression
entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])
states the existence of an entity, denoted by identifier tr:WD-prov-dm-20111215, with type document and version number 2. The attribute ex:version is application specific, whereas the attribute type (see Section 5.7.4.4) is reserved in the PROV namespace.
The following expression
activity(a1,2011-11-16T16:05:00,2011-11-16T16:06:00, [ ex:host="server.example.org", prov:type='ex:edit' ])
states the existence of an activity with identifier a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit. The attribute host is application specific (declared in some namespace with prefix ex). The attribute type is a reserved attribute of PROV-DM, allowing for sub-typing to be expressed (see Section 5.7.4.4).
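Since the start and end times in the example are written as ISO 8601 timestamps, an application could read them with standard date handling. The following sketch computes the duration of activity a1; the function name activity_duration is hypothetical, and the sketch deliberately does not check whether the end follows the start.

```python
from datetime import datetime


def activity_duration(start, end):
    """Return the number of seconds between two ISO 8601 timestamps."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    return (ended - started).total_seconds()


# Start and end times of activity a1 from the example above.
print(activity_duration("2011-11-16T16:05:00", "2011-11-16T16:06:00"))  # 60.0
```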
Further considerations:
While each of id, activity, time, and attributes is optional, at least one of them must be present.
The following expressions
wasGeneratedBy(e1, a1, 2001-10-26T21:32:52, [ ex:port="p1" ])
wasGeneratedBy(e2, a1, 2001-10-26T10:00:00, [ ex:port="p2" ])
state the existence of two generations (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, identified by e1 and e2, are created by an activity, identified by a1. The first one is available on port p1, whereas the other is available on port p2. The semantics of port are application specific.
In some cases, we may want to record the time at which an entity was generated without having to specify the activity that generated it. To support this requirement, the activity element in generation is optional. Hence, the following expression indicates the time at which an entity is generated, without naming the activity that did it.
wasGeneratedBy(e, -, 2001-10-26T21:32:52)
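The rule that a generation needs at least one of its optional parts can be sketched as a small check. The class below is a hypothetical in-memory representation, not something defined by PROV-DM; the field names and the method is_well_formed are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Generation:
    """Hypothetical record for one generation statement."""
    entity: str
    ident: Optional[str] = None
    activity: Optional[str] = None
    time: Optional[str] = None
    attributes: dict = field(default_factory=dict)

    def is_well_formed(self):
        # At least one of id, activity, time, and attributes must be present.
        return any([self.ident, self.activity, self.time, self.attributes])


# The activity is omitted, but the time is given, so the statement is accepted.
print(Generation("e", time="2001-10-26T21:32:52").is_well_formed())  # True
# Nothing beyond the entity is given, so the statement is rejected.
print(Generation("e").is_well_formed())  # False
```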
While each of id, entity, time, and attributes is optional, at least one of them must be present.
A reference to a given entity may appear in multiple usages that share a given activity identifier.
The following usages
used(a1, e1, 2011-11-16T16:00:00, [ ex:parameter="p1" ])
used(a1, e2, 2011-11-16T16:00:01, [ ex:parameter="p2" ])
state that the activity identified by a1 used two entities identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter is application specific.
A communication implies that activity a2 is dependent on another activity a1, by way of some unspecified entity that is generated by a1 and used by a2.
Consider two activities a1 and a2, the former performed by a government agency, and the latter by a driver caught speeding.
activity(a1, [ prov:type="traffic regulations enforcing" ])
activity(a2, [ prov:type="fine paying, check writing, and mailing" ])
wasInformedBy(a2, a1)
The last line indicates that some implicit entity was generated by a1 and used by a2; this entity may be a traffic ticket that had a notice of fine, amount, and payment mailing details.
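The reading of communication given above can be sketched in Python. This is a minimal illustration, not part of PROV-DM: it assumes the converse direction (a generation and a usage of the same entity are taken as grounds for a communication), and the record types and the identifier ex:ticket are hypothetical names introduced only for this example.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Generation(NamedTuple):
    entity: str
    activity: str


class Usage(NamedTuple):
    activity: str
    entity: str


def informed_by(generations: Iterable[Generation],
                usages: Iterable[Usage]) -> Set[Tuple[str, str]]:
    """Return (a2, a1) pairs such that a2 used an entity that a1 generated."""
    # Index the generating activities by entity (assumption: plain string ids).
    generators = {}
    for g in generations:
        generators.setdefault(g.entity, set()).add(g.activity)
    return {(u.activity, a1)
            for u in usages
            for a1 in generators.get(u.entity, ())}


# Hypothetical explicit entity standing for the traffic ticket.
generations = [Generation("ex:ticket", "a1")]
usages = [Usage("a2", "ex:ticket")]
print(informed_by(generations, usages))  # {('a2', 'a1')}
```

The pair (a2, a1) corresponds to wasInformedBy(a2, a1); when the intermediate entity is not recorded, the communication has to be asserted directly, as in the example above.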
While each of id, trigger, starter, time, and attributes is optional, at least one of them must be present.
The following example contains the description of an activity a1 (a discussion), which was started at a specific time, and was triggered by an email message e1.
entity(e1, [ prov:type="email message" ])
activity(a1, [ prov:type="Discuss" ])
wasStartedBy(a1, e1, -, 2011-11-16T16:05:00)
Furthermore, if the message is also an input to the activity, this can be described as follows:
used(a1, e1, -)
Alternatively, one can also describe the activity that generated the email message.
activity(a0, [ prov:type="Write" ])
wasGeneratedBy(e1, a0)
wasStartedBy(a1, e1, a0, 2011-11-16T16:05:00)
If e1 is not known, it would also be valid to write:
wasStartedBy(a1, -, a0, 2011-11-16T16:05:00)
In the following example, a race is started by a bang, and responsibility for this trigger is attributed to an agent ex:Bob.
activity(ex:foot_race)
entity(ex:bang)
wasStartedBy(ex:foot_race, ex:bang, -, 2012-03-09T08:05:08-05:00)
agent(ex:Bob)
wasAttributedTo(ex:bang, ex:Bob)
In this example, filling fuel was started as a consequence of observing low fuel. The trigger entity is unspecified; it could, for instance, have been the low fuel warning light, the fuel tank indicator needle position, or the engine not running properly.
activity(ex:filling-fuel)
activity(ex:observing-low-fuel)
agent(ex:driver, [ prov:type='prov:Person' ])
wasAssociatedWith(ex:filling-fuel, ex:driver)
wasAssociatedWith(ex:observing-low-fuel, ex:driver)
wasStartedBy(ex:filling-fuel, -, ex:observing-low-fuel, -)
The relations wasStartedBy and used are orthogonal, and thus need to be expressed independently, according to the situation being described.
While each of id, trigger, ender, time, and attributes is optional, at least one of them must be present.
The following example is a description of an activity a1 (editing) that was ended following an approval document e1.
entity(e1, [ prov:type="approval document" ])
activity(a1, [ prov:type="Editing" ])
wasEndedBy(a1, e1)
Entities have a duration. Generation marks the beginning of an entity, whereas invalidation marks its end. An entity's lifetime can end for different reasons:
In the first two cases, the entity has physically disappeared after its termination: there is no more soup, or painting. In the last three cases, there may be an "offer voucher" that still exists, but it is no longer valid; likewise, on April 4th, the BBC news site still exists but it is not the same entity as BBC news Web site on April 3rd; or the traffic light became red and therefore is regarded as a different entity to the green light.
While each of id, activity, time, and attributes is optional, at least one of them must be present.
The Painter, a Picasso painting, is known to have been destroyed in a plane accident.
entity(ex:The-Painter)
agent(ex:Picasso)
wasAttributedTo(ex:The-Painter, ex:Picasso)
activity(ex:crash)
wasInvalidatedBy(ex:The-Painter, ex:crash, 1998-09-02, [ ex:circumstances="plane accident" ])
The BBC news home page on 2012-04-03, ex:bbcNews2012-04-03, contained a reference to a given news item bbc:news/uk-17595024, but the BBC news home page on the next day did not.
entity(ex:bbcNews2012-04-03)
memberOf(ex:bbcNews2012-04-03, {("item1", bbc:news/uk-17595024)})
wasGeneratedBy(ex:bbcNews2012-04-03, -, 2012-04-03T00:00:01)
wasInvalidatedBy(ex:bbcNews2012-04-03, -, 2012-04-03T23:59:59)
We refer to Example 40 for further descriptions of the BBC Web site, and to Section 5.6.5 for a description of the relation memberOf.
In this example, the "buy one beer, get one free" offer expired at the end of the happy hour.
entity(buy_one_beer_get_one_free_offer_during_happy_hour)
wasAttributedTo(buy_one_beer_get_one_free_offer_during_happy_hour, proprietor)
wasInvalidatedBy(buy_one_beer_get_one_free_offer_during_happy_hour, -, 2012-03-10T18:00:00)
In contrast, in the following descriptions, Bob redeemed the offer 45 minutes before it expired, and got two beers.
entity(buy_one_beer_get_one_free_offer_during_happy_hour)
wasAttributedTo(buy_one_beer_get_one_free_offer_during_happy_hour, proprietor)
activity(redeemOffer)
entity(twoBeers)
wasAssociatedWith(redeemOffer, bob)
used(redeemOffer, buy_one_beer_get_one_free_offer_during_happy_hour, 2012-03-10T17:15:00)
wasInvalidatedBy(buy_one_beer_get_one_free_offer_during_happy_hour, redeemOffer, 2012-03-10T17:15:00)
wasGeneratedBy(twoBeers, redeemOffer)
We see that the offer was both used to be converted into twoBeers and invalidated by the redeemOffer activity: in other words, the combined usage and invalidation indicate consumption of the offer.
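The notion of consumption described above (an entity both used and invalidated by the same activity) can be detected mechanically. The following Python sketch is illustrative only; the tuple layouts and the function name consumed_entities are assumptions made for this example, and an unspecified activity ("-" in PROV-N) is represented by None.

```python
def consumed_entities(usages, invalidations):
    """Return entities that one and the same activity both used and invalidated.

    usages:        iterable of (activity, entity, time)
    invalidations: iterable of (entity, activity or None, time)
    """
    used_pairs = {(activity, entity) for activity, entity, _time in usages}
    return sorted({entity
                   for entity, activity, _time in invalidations
                   if activity is not None and (activity, entity) in used_pairs})


OFFER = "buy_one_beer_get_one_free_offer_during_happy_hour"

# Bob redeems the offer: usage and invalidation share the activity redeemOffer.
redeemed_usages = [("redeemOffer", OFFER, "2012-03-10T17:15:00")]
redeemed_invalidations = [(OFFER, "redeemOffer", "2012-03-10T17:15:00")]

# The offer simply expires: no activity is named, and nothing used the offer.
expired_invalidations = [(OFFER, None, "2012-03-10T18:00:00")]

print(consumed_entities(redeemed_usages, redeemed_invalidations))
print(consumed_entities([], expired_invalidations))
```

The first call reports the offer as consumed, whereas the second does not, which mirrors the contrast between the two descriptions above.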
The second component of PROV-DM, depicted in Figure 6, is concerned with agents and the notions of Attribution, Association, and Delegation, relating agents to entities, activities, and agents, respectively. Core structures are displayed in the yellow area and include three classes and three binary associations. Outside the yellow area, extended structures comprise UML association classes to express expanded n-ary relations, and subclasses Plan, Person, SoftwareAgent, and Organization. The subclasses are marked by the UML stereotype "prov:type" to indicate that these are valid values for the attribute prov:type.
It is useful to define some basic categories of agents from an interoperability perspective. There are three types of agents that are common across most anticipated domains of use; it is acknowledged that these types do not cover all kinds of agent.
The following expression is about an agent identified by e1, which is a person, named Alice, with employee number 1234.
agent(e1, [ex:employee="1234", ex:name="Alice", prov:type='prov:Person' ])
It is optional to specify the type of an agent. When present, it is expressed using the prov:type attribute.
When an entity e is attributed to agent ag, entity e was generated by some unspecified activity that in turn was associated to agent ag. Thus, this relation is useful when the activity is not known, or irrelevant.
Revisiting the example of Section 4.1, we can ascribe tr:WD-prov-dm-20111215 to some agents without an explicit activity. The reserved attribute role (see Section 5.7.4.3) allows the role of the agent in the attribution to be specified.
agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
entity(tr:WD-prov-dm-20111215, [ prov:type='rec54:WD' ])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Paolo, [ prov:role="editor" ])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Simon, [ prov:role="contributor" ])
In the following example, a designer agent and an operator agent are associated with an activity. The designer's goals are achieved by a workflow ex:wf, described as an entity of type plan.
activity(ex:a, [ prov:type="workflow execution" ])
agent(ex:ag1, [ prov:type="operator" ])
agent(ex:ag2, [ prov:type="designer" ])
wasAssociatedWith(ex:a, ex:ag1, -, [ prov:role="loggedInUser", ex:how="webapp" ])
wasAssociatedWith(ex:a, ex:ag2, ex:wf, [ prov:role="designer", ex:context="project1" ])
entity(ex:wf, [ prov:type='prov:Plan', ex:label="Workflow 1", ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI ])
Since the workflow ex:wf is itself an entity, its provenance can also be expressed in PROV-DM: it can be generated by some activity and derived from other entities, for instance.
In some cases, one wants to indicate a plan was followed, without having to specify which agent was involved.
activity(ex:a, [ prov:type="workflow execution" ])
wasAssociatedWith(ex:a, -, ex:wf)
entity(ex:wf, [ prov:type='prov:Plan', ex:label="Workflow 1", ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI ])
In this case, it is assumed that an agent exists, but it has not been specified.
For example, a student acted on behalf of his supervisor, who acted on behalf of the department chair, who acted on behalf of the university; all those agents are responsible in some way for the activity that took place but we do not say explicitly who bears responsibility and to what degree.
The following fragment describes three agents: a programmer, a researcher, and a funder. The programmer and researcher are associated with a workflow activity. The programmer acts on behalf of the researcher (line-management) encoding the commands specified by the researcher; the researcher acts on behalf of the funder, who has a contractual agreement with the researcher. The terms 'line-management' and 'contract' used in this example are domain specific.
activity(a, [ prov:type="workflow" ])
agent(ag1, [ prov:type="programmer" ])
agent(ag2, [ prov:type="researcher" ])
agent(ag3, [ prov:type="funder" ])
wasAssociatedWith(a, ag1, [ prov:role="loggedInUser" ])
wasAssociatedWith(a, ag2)
wasAssociatedWith(a, ag3)
actedOnBehalfOf(ag1, ag2, a, [ prov:type="line-management" ])
actedOnBehalfOf(ag2, ag3, a, [ prov:type="contract" ])
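A chain of delegations such as the one above can be followed programmatically. The Python sketch below is only an illustration under simplifying assumptions: each delegation is a (delegate, responsible, activity) triple, each agent has at most one responsible agent per activity, and the function name responsibility_chain is hypothetical.

```python
def responsibility_chain(agent, activity, delegations):
    """Follow actedOnBehalfOf links from `agent` within the scope of `activity`.

    delegations: iterable of (delegate, responsible, activity) triples.
    Assumption: at most one responsible agent per (delegate, activity).
    """
    responsible_for = {(delegate, act): responsible
                       for delegate, responsible, act in delegations}
    chain = [agent]
    while (chain[-1], activity) in responsible_for:
        nxt = responsible_for[(chain[-1], activity)]
        if nxt in chain:  # stop on a cycle rather than loop forever
            break
        chain.append(nxt)
    return chain


delegations = [
    ("ag1", "ag2", "a"),  # programmer acts on behalf of researcher
    ("ag2", "ag3", "a"),  # researcher acts on behalf of funder
]
print(responsibility_chain("ag1", "a", delegations))  # ['ag1', 'ag2', 'ag3']
```

The chain lists the agents that bear some responsibility for the activity; as noted above, it says nothing about the degree of that responsibility.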
The third component of PROV-DM is concerned with: derivations of entities from others; derivation subtypes Revision, Quotation, and Original Source; derivation-related Trace. Figure 7 depicts the third component with PROV core structures in the yellow area, including two classes (Entity, Activity) and a binary association (Derivation). PROV extended structures are found outside this area. UML association classes express expanded n-ary relations.
According to Section 2, for an entity to be transformed from, created from, or resulting from an update to another, there must be some underpinning activities performing the necessary actions resulting in such a derivation. A derivation can be described at various levels of precision. In its simplest form, derivation relates two entities. Optionally, attributes can be added to represent further information about the derivation. If the derivation is the result of a single known activity, then this activity can also be optionally expressed. To provide a completely accurate description of the derivation, the generation and usage of the generated and used entities, respectively, can be provided, so as to make the derivation path, through usage, activity, and generation, explicit. Optional information such as activity, generation, and usage can be linked to derivations to aid analysis of provenance and to facilitate provenance-based reproducibility.
The following descriptions are about derivations between e2 and e1, but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.
wasDerivedFrom(e2, e1)
wasDerivedFrom(e2, e1, [prov:type="physical transform"])
The following description expresses that activity a, using the entity e1 according to usage u1, derived the entity e2 and generated it according to generation g2. It is followed by descriptions for generation g2 and usage u1.
wasDerivedFrom(e2, e1, a, g2, u1)
wasGeneratedBy(g2; e2, a, -)
used(u1; a, e1, -)
With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation, how the original entity e1 was used by the activity (for instance, which argument it was passed as, if the activity is the result of a function invocation), and which output the derived entity e2 was obtained from (say, for a function returning multiple results).
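As an illustration of how a program might exploit such a comprehensive derivation, the following Python sketch checks that the generation and usage referenced by a derivation actually describe the path through the named activity. The dictionary layouts and the function name derivation_is_consistent are assumptions made for this example; they are not defined by PROV-DM.

```python
def derivation_is_consistent(derivation, generations, usages):
    """Check that a derivation's generation and usage match its activity.

    derivation:  (generated_entity, used_entity, activity, generation_id, usage_id)
    generations: dict mapping generation id -> (entity, activity)
    usages:      dict mapping usage id -> (activity, entity)
    """
    generated, used_entity, activity, gen_id, use_id = derivation
    return (generations.get(gen_id) == (generated, activity)
            and usages.get(use_id) == (activity, used_entity))


# wasDerivedFrom(e2, e1, a, g2, u1), wasGeneratedBy(g2; e2, a, -), used(u1; a, e1, -)
derivation = ("e2", "e1", "a", "g2", "u1")
generations = {"g2": ("e2", "a")}
usages = {"u1": ("a", "e1")}

print(derivation_is_consistent(derivation, generations, usages))  # True
```

A derivation whose generation or usage points at a different activity or entity would be reported as inconsistent, which is one simple form of the analysis mentioned above.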
A revision is a derivation that revises an entity into a revised version.
Revision is a particular case of derivation of an entity into its revised version.
Revisiting the example of Section 4.2, we can now state that the report tr:WD-prov-dm-20111215 was a revision of the report tr:WD-prov-dm-20111018.
entity(tr:WD-prov-dm-20111215, [ prov:type='rec54:WD' ])
entity(tr:WD-prov-dm-20111018, [ prov:type='rec54:WD' ])
wasDerivedFrom(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018, [ prov:type='prov:WasRevisionOf' ])
A quotation is the repeat of (some or all of) an entity, such as text or image, by someone who may or may not be its original author.
Quotation is a particular case of derivation in which an entity is derived from an original entity by copying, or "quoting", some or all of it.
The following paragraph is a quote from one of the author's blogs.
"During the workshop, it became clear to me that the consensus based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database focused formalizations. I just needed to get over the differences in syntax. This could imply that we could have nice way to trace provenance across systems and through databases and be able to understand the mathematical properties of this interconnection."
If wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/ denotes the original blog by agent ex:Paul, and dm:bl-dagstuhl denotes the above paragraph, then the following descriptions express that the above paragraph was copied by agent ex:Luc from a part of the blog, attributed to the agent ex:Paul.
entity(wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/)
entity(dm:bl-dagstuhl)
agent(ex:Luc)
agent(ex:Paul)
wasDerivedFrom(dm:bl-dagstuhl, wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/, [ prov:type='prov:WasQuotedFrom' ])
wasAttributedTo(dm:bl-dagstuhl, ex:Luc)
wasAttributedTo(wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/, ex:Paul)
An original source refers to the source material that is closest to the person, information, period, or idea being studied.
An original source relation is a particular case of derivation that aims to give credit to the source that originated some information. It is recognized that it may be hard to determine which entity constitutes an original source. This definition is inspired by original-source as defined in http://googlenewsblog.blogspot.com/2010/11/credit-where-credit-is-due.html.
Let us consider the concept introduced in the current section, identified as dm:concept-original-source, and the Google page go:credit-where-credit-is-due.html, where the notion original-source was originally described (to the knowledge of the authors).
entity(dm:concept-original-source)
entity(go:credit-where-credit-is-due.html)
wasDerivedFrom(dm:concept-original-source, go:credit-where-credit-is-due.html, [ prov:type='prov:HadOriginalSource' ])
Trace is the ability to link back an entity to another by means of derivation or responsibility relations, possibly repeatedly traversed.
A trace relation between two entities e2 and e1 is a generic dependency of e2 on e1 that indicates either that e1 may have been necessary for e2 to be created, or that e1 bears some responsibility for e2's existence.
A Trace relation, written tracedTo(id; e2, e1, attrs) in PROV-N, has:
We note that the ancestor may be an agent since agents may be entities.
Derivation and attribution are particular cases of trace.
We refer to the example of Section 4.2, and specifically to Figure 3. We can see that there is a path from tr:WD-prov-dm-20111215 to w3:Consortium and to process:rec-advance. This is expressed as follows.
tracedTo(tr:WD-prov-dm-20111215, w3:Consortium)
tracedTo(tr:WD-prov-dm-20111215, process:rec-advance)
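Since derivation and attribution are particular cases of trace, and trace may be traversed repeatedly, one way to picture tracedTo is as reachability over derivation and attribution edges. The Python sketch below assumes exactly that reading; the function name traced_to and the identifier ex:earlier-notes are hypothetical, while the other identifiers are reused from earlier examples in this document.

```python
from collections import deque


def traced_to(start, edges):
    """Return everything reachable from `start` over (subject, object) edges.

    edges: pairs taken from wasDerivedFrom(subject, object) and
           wasAttributedTo(subject, object) statements.
    """
    successors = {}
    for subject, obj in edges:
        successors.setdefault(subject, set()).add(obj)

    reached = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt != start and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


edges = [
    ("tr:WD-prov-dm-20111215", "tr:WD-prov-dm-20111018"),  # revision (derivation)
    ("tr:WD-prov-dm-20111215", "ex:Paolo"),                # attribution
    ("tr:WD-prov-dm-20111215", "ex:Simon"),                # attribution
    ("tr:WD-prov-dm-20111018", "ex:earlier-notes"),        # hypothetical derivation
]
print(sorted(traced_to("tr:WD-prov-dm-20111215", edges)))
```

Every identifier in the result could appear as the second argument of a tracedTo statement whose first argument is tr:WD-prov-dm-20111215, under the assumptions stated above.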
The fourth component of PROV-DM is concerned with the relations specialization and alternate between entities. Figure 8 depicts the fourth component with a single class and two binary associations.
Two provenance descriptions about the same thing may emphasize different aspects of that thing.
User Alice writes an article. In its provenance, she wishes to refer to the precise version of the article with a date-specific IRI, as she might edit the article later. Alternatively, user Bob refers to the article in general, independently of its variants over time.
The PROV data model introduces relations, called specialization and alternate, that allow entities to be linked together. They are defined as follows.
Examples of aspects include a time period, an abstraction, and a context associated with the entity.
The BBC news home page on 2012-03-23 ex:bbcNews2012-03-23 is a specialization of the BBC news page in general bbc:news/. This can be expressed as follows.
specializationOf(ex:bbcNews2012-03-23, bbc:news/)
We have created a new qualified name, ex:bbcNews2012-03-23, in the namespace ex, to identify the specific page carrying this day's news, which would otherwise be the generic bbc:news/ page.
A given news item on the BBC News site bbc:news/science-environment-17526723 for desktop is an alternate of a bbc:news/mobile/science-environment-17526723 for mobile devices.
entity(bbc:news/science-environment-17526723, [ prov:type="a news item for desktop" ])
entity(bbc:news/mobile/science-environment-17526723, [ prov:type="a news item for mobile devices" ])
alternateOf(bbc:news/science-environment-17526723, bbc:news/mobile/science-environment-17526723)
They are both specializations of an (unspecified) entity.
Consider again the two versions of the technical report, tr:WD-prov-dm-20111215 (second working draft) and tr:WD-prov-dm-20111018 (first working draft). They are alternates of each other.
entity(tr:WD-prov-dm-20111018)
entity(tr:WD-prov-dm-20111215)
alternateOf(tr:WD-prov-dm-20111018, tr:WD-prov-dm-20111215)
They are both specializations of the page http://www.w3.org/TR/prov-dm/.
The fifth component of PROV-DM is concerned with bundles, a mechanism to support provenance of provenance. Figure 9 depicts a UML class diagram for the fifth component. It comprises a Bundle class, which is a subclass of Entity, and a novel n-ary relation, Provenance Locator.
A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.
A bundle's identifier id identifies a unique set of descriptions.
A bundle is a named set of descriptions, but it is also an entity so that its provenance can be described.
PROV defines the following type for bundles:
A bundle description is of the form entity(id,[prov:type='prov:Bundle', attr1=val1, ...]) where id is an identifier denoting a bundle, prov:Bundle is its type, and ((attr1, val1), ...) is an optional set of attribute-value pairs representing additional information about this bundle.
The provenance of provenance can then be described using PROV constructs, as illustrated by Example 43 and Example 44.
Let us consider two entities ex:report1 and ex:report2.
entity(ex:report1, [ prov:type="report", ex:version=1 ])
wasGeneratedBy(ex:report1, -, 2012-05-24T10:00:01)
entity(ex:report2, [ prov:type="report", ex:version=2 ])
wasGeneratedBy(ex:report2, -, 2012-05-25T11:00:01)
wasDerivedFrom(ex:report2, ex:report1)
Let us assume that Bob observed the creation of ex:report1. A first bundle can be expressed.
bundle bob:bundle1
  entity(ex:report1, [ prov:type="report", ex:version=1 ])
  wasGeneratedBy(ex:report1, -, 2012-05-24T10:00:01)
endBundle
In contrast, Alice observed the creation of ex:report2 and its derivation from ex:report1. A separate bundle can also be expressed.
bundle alice:bundle2
  entity(ex:report1)
  entity(ex:report2, [ prov:type="report", ex:version=2 ])
  wasGeneratedBy(ex:report2, -, 2012-05-25T11:00:01)
  wasDerivedFrom(ex:report2, ex:report1)
endBundle
The first bundle contains the descriptions corresponding to Bob observing the creation of ex:report1. Its provenance can be described as follows.
entity(bob:bundle1, [ prov:type='prov:Bundle' ])
wasGeneratedBy(bob:bundle1, -, 2012-05-24T10:30:00)
wasAttributedTo(bob:bundle1, ex:Bob)
In contrast, the second bundle is attributed to Alice who observed the derivation of ex:report2 from ex:report1.
entity(alice:bundle2, [ prov:type='prov:Bundle' ])
wasGeneratedBy(alice:bundle2, -, 2012-05-25T11:15:00)
wasAttributedTo(alice:bundle2, ex:Alice)
A provenance aggregator could merge two bundles, resulting in a novel bundle, whose provenance is described as follows.
bundle agg:bundle3
  entity(ex:report1, [ prov:type="report", ex:version=1 ])
  wasGeneratedBy(ex:report1, -, 2012-05-24T10:00:01)
  entity(ex:report2, [ prov:type="report", ex:version=2 ])
  wasGeneratedBy(ex:report2, -, 2012-05-25T11:00:01)
  wasDerivedFrom(ex:report2, ex:report1)
endBundle
entity(agg:bundle3, [ prov:type='prov:Bundle' ])
agent(ex:aggregator01, [ prov:type='ex:Aggregator' ])
wasAttributedTo(agg:bundle3, ex:aggregator01)
wasDerivedFrom(agg:bundle3, bob:bundle1)
wasDerivedFrom(agg:bundle3, alice:bundle2)
The new bundle is given a new identifier agg:bundle3 and is attributed to the ex:aggregator01 agent.
If the target is not specified, it is assumed that target is the same identifier as subject.
When the subject and optional target denote entities, a provenance locator not only provides a located context, but it also expresses an alternate relation between the entity denoted by subject and the entity described in the located context. This is an alternate since the entity denoted by subject in the current context presents other aspects than the entity in the located one.
According to the following provenance locator, provenance descriptions about ex:report1 can be found in bundle bob:bundle1.
hasProvenanceIn(ex:report1, bob:bundle1, -)
According to the following provenance locator, provenance descriptions about ex:report1 can be found in bundle bob:bundle1, which is available from the provenance service identified by the provided URI.
hasProvenanceIn(ex:report1, bob:bundle1, -, [ prov:service-uri="http://example.com/service" %% xsd:anyURI ])
Let us again consider the same scenario involving two entities ex:report1 and ex:report2.
The first bundle can be expressed with all Bob's observations about the creation of ex:report1.
bundle bob:bundle4
  entity(ex:report1, [ prov:type="report", ex:version=1 ])
  wasGeneratedBy(ex:report1, -, 2012-05-24T10:00:01)
endBundle
Likewise, Alice's observation about the derivation of ex:report2 from ex:report1 is expressed in a separate bundle.
bundle alice:bundle5
  entity(ex:report1)
  hasProvenanceIn(ex:report1, bob:bundle4, -)
  entity(ex:report2, [ prov:type="report", ex:version=2 ])
  wasGeneratedBy(ex:report2, -, 2012-05-25T11:00:01)
  wasDerivedFrom(ex:report2, ex:report1)
endBundle
In bundle alice:bundle5, there is a description for entity ex:report1, and a provenance locator pointing to bundle bob:bundle4. The locator indicates that some provenance description for ex:report1 can be found in bundle bob:bundle4. The purpose of the locator is twofold. First, it allows for incremental navigation of provenance [PROV-AQ]. Second, it makes entity ex:report1 described in alice:bundle5 an alternate of ex:report1 described in bob:bundle4.
Alternatively, Alice may have decided to use a different identifier for ex:report1.
bundle alice:bundle6
  entity(alice:report1)
  hasProvenanceIn(alice:report1, bob:bundle4, ex:report1)
  entity(ex:report2, [ prov:type="report", ex:version=2 ])
  wasGeneratedBy(ex:report2, -, 2012-05-25T11:00:01)
  wasDerivedFrom(ex:report2, alice:report1)
endBundle
Alice can specify the target in the provenance locator to be ex:report1. With such a statement, Alice states that provenance information about alice:report1 can be found in bundle bob:bundle4 under the name ex:report1. In effect, alice:report1 and ex:report1 are declared to be alternate.
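The rule that an unspecified target defaults to the subject can be made concrete with a small Python sketch. It is illustrative only: the function name located_context is hypothetical, and the unspecified target ("-" in PROV-N) is represented by None.

```python
def located_context(subject, bundle, target=None):
    """Return (bundle, name) under which provenance about `subject` is found.

    When no target is given, the subject's own identifier is assumed.
    """
    return bundle, (subject if target is None else target)


# hasProvenanceIn(ex:report1, bob:bundle4, -)
print(located_context("ex:report1", "bob:bundle4"))
# hasProvenanceIn(alice:report1, bob:bundle4, ex:report1)
print(located_context("alice:report1", "bob:bundle4", "ex:report1"))
```

Both locators resolve to the description of ex:report1 in bob:bundle4, which is why alice:report1 and ex:report1 end up being alternates in the second case.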
Consider the following bundle of descriptions, in which derivation and generations have been identified.
bundle obs:bundle7
  entity(ex:report1, [ prov:type="report", ex:version=1 ])
  wasGeneratedBy(ex:g1; ex:report1, -, 2012-05-24T10:00:01)
  entity(ex:report2, [ prov:type="report", ex:version=2 ])
  wasGeneratedBy(ex:g2; ex:report2, -, 2012-05-25T11:00:01)
  wasDerivedFrom(ex:d; ex:report2, ex:report1)
endBundle
entity(obs:bundle7, [ prov:type='prov:Bundle' ])
wasAttributedTo(obs:bundle7, ex:observer01)
Bundle obs:bundle7 is rendered by a visualisation tool. It may be useful for the tool configuration for this bundle to be shared along with the provenance descriptions, so that other users can render provenance as it was originally rendered. The original bundle obviously cannot be changed. However, one can create a new bundle, as follows.
bundle tool:bundle8
  entity(tool:bundle8, [ prov:type='viz:Configuration', prov:type='prov:Bundle' ])
  wasAttributedTo(tool:bundle8, viz:Visualizer)
  entity(ex:report1, [ viz:color="orange" ])   // ex:report1 is a reused identifier
  hasProvenanceIn(ex:report1, obs:bundle7, -)
  entity(tool:r2, [ viz:color="blue" ])        // tool:r2 is a new identifier
  hasProvenanceIn(tool:r2, obs:bundle7, ex:report2)
  wasDerivedFrom(ex:d; ex:report2, ex:report1, [ viz:style="dotted" ])
  hasProvenanceIn(ex:d, obs:bundle7, -)
endBundle
In bundle tool:bundle8, the prefix viz is used for naming visualisation-specific attributes, types or values.
Bundle tool:bundle8 is given type viz:Configuration to indicate that it consists of descriptions that pertain to the configuration of the visualisation tool. This type attribute can be used for searching bundles containing visualization-related descriptions.
For the purpose of illustration, we show that the visualisation tool reused identifier ex:report1, but created a new identifier tool:r2. They denote entities which are alternates of ex:report1 and ex:report2, described in bundle obs:bundle7, with a visualization attribute for the color to be used when rendering these entities. Likewise, the derivation has a style attribute.
According to their definition, derivations have an optional identifier. To express an alternate for a derivation, we need to be able to reference it, by means of an identifier. Hence, it is necessary for it to have an identifier in the first place (ex:d).
The sixth component of PROV-DM is concerned with the notion of collections. A collection is an entity that has some members. The members are themselves entities, and therefore their provenance can be expressed. Some applications need to be able to express the provenance of the collection itself: e.g. who maintains the collection, which members it contains as it evolves, and how it was assembled. The purpose of Component 6 is to define the types and relations that are useful to express the provenance of collections. In PROV, the concept of Collection is implemented by means of dictionaries, which we introduce in this section.
Figure 10 depicts the sixth component with three new classes (Collection, Dictionary, and Pair) and three associations (insertion, removal, and memberOf).
The intent of these relations and types is to express the history of changes that occurred to a collection. Changes to collections are about the insertion of entities in collections and the removal of members from collections. Indirectly, such history provides a way to reconstruct the contents of a collection.
In PROV, the concept of Collection is provided as an extensibility point for other kinds of collections. Collections are implemented by means of dictionaries, which are introduced next.
PROV-DM defines a specific type of collection: a dictionary, specified as follows.
A dictionary is a collection whose members are indexed by keys.
Conceptually, a dictionary has a logical structure consisting of key-entity pairs. This structure is often referred to as a map, and is a generic indexing mechanism that can abstract commonly used data structures, including associative lists (also known as "dictionaries" in some programming languages), relational tables, ordered lists, and more. The specification of such specialized structures in terms of key-value pairs is out of the scope of this document.
A given dictionary forms a given structure for its members. A different structure (obtained either by insertion or removal of members) constitutes a different dictionary. Hence, for the purpose of provenance, a dictionary entity is viewed as a snapshot of a structure. Insertion and removal operations result in new snapshots, each snapshot forming an identifiable dictionary entity.
PROV-DM defines the following types related to dictionaries:
entity(d0, [ prov:type='prov:EmptyDictionary' ])   // d0 is an empty dictionary
entity(d1, [ prov:type='prov:Dictionary' ])        // d1 is a dictionary, with unknown content
An Insertion relation, written derivedByInsertionFrom(id; d2, d1, {(key_1, e_1), ..., (key_n, e_n)}, attrs), has:
An Insertion relation derivedByInsertionFrom(id; d2, d1, {(key_1, e_1), ..., (key_n, e_n)}) states that d2 is the state of the dictionary following the insertion of pairs (key_1, e_1), ..., (key_n, e_n) into dictionary d1.
entity(d0, [ prov:type='prov:EmptyDictionary' ])   // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [ prov:type='prov:Dictionary' ])
entity(d2, [ prov:type='prov:Dictionary' ])
derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(d2, d1, {("k3", e3)})
From this set of descriptions, we conclude:
Insertion provides an "update semantics" for the keys that are already present in a dictionary, since a new pair replaces an existing pair with the same key in the new dictionary. This is illustrated by the following example.
entity(d0, [ prov:type='prov:EmptyDictionary' ])   // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [ prov:type='prov:Dictionary' ])
entity(d2, [ prov:type='prov:Dictionary' ])
derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(d2, d1, {("k1", e3)})
This is a case of update of e1 to e3 for the same key, "k1".
A Removal relation, written derivedByRemovalFrom(id; d2, d1, {key_1, ... key_n}, attrs), has:
A Removal relation derivedByRemovalFrom(id; d2,d1, {key_1, ..., key_n}) states that d2 is the state of the dictionary following the removal of the set of pairs corresponding to keys key_1...key_n from d1.
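The snapshot view of insertion and removal described above can be sketched with immutable-style operations on Python dictionaries. This is a minimal illustration rather than a normative model: the function names mirror the PROV-N relation names but are otherwise hypothetical, entities are represented by plain strings, and removing a key that is absent is assumed to have no effect.

```python
def derived_by_insertion_from(d1, pairs):
    """Return a new snapshot: d1 plus `pairs`, replacing pairs with the same key."""
    d2 = dict(d1)     # copy, so the earlier snapshot d1 is left untouched
    d2.update(pairs)  # "update semantics" for keys that are already present
    return d2


def derived_by_removal_from(d1, keys):
    """Return a new snapshot: d1 without the pairs whose key is in `keys`."""
    removed = set(keys)
    return {k: e for k, e in d1.items() if k not in removed}


d0 = {}                                                        # empty dictionary
d1 = derived_by_insertion_from(d0, [("k1", "e1"), ("k2", "e2")])
d2 = derived_by_insertion_from(d1, [("k3", "e3")])
d3 = derived_by_removal_from(d2, ["k1", "k3"])
updated = derived_by_insertion_from(d1, [("k1", "e3")])        # update of e1 to e3

print(d3)       # {'k2': 'e2'}
print(updated)  # {'k1': 'e3', 'k2': 'e2'}
```

Each call returns a fresh dictionary, so d0, d1, and d2 remain available as distinct, identifiable snapshots, in line with the view that a different structure constitutes a different dictionary.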
entity(d0, [ prov:type='prov:EmptyDictionary' ])   // d0 is an empty dictionary
entity(e1)
entity(e2)
entity(e3)
entity(d1, [ prov:type='prov:Dictionary' ])
entity(d2, [ prov:type='prov:Dictionary' ])
derivedByInsertionFrom(d1, d0, {("k1", e1), ("k2", e2)})
derivedByInsertionFrom(d2, d1, {("k3", e3)})
derivedByRemovalFrom(d3, d2, {"k1", "k3"})
From this set of descriptions, we conclude:
The insertion and removal relations make insertions and removals explicit as part of the history of a dictionary. This, however, requires explicit mention of the state of the dictionary prior to each operation. The membership relation removes this need, allowing the state of a dictionary c to be expressed without having to introduce a prior state.
The description memberOf(c, {(key_1, e_1), ..., (key_n, e_n)}) states that c is known to include {(key_1, e_1), ..., (key_n, e_n)}, without having to introduce a previous state.
entity(d1, [ prov:type='prov:Dictionary' ])   // d1 is a dictionary, with unknown content
entity(d2, [ prov:type='prov:Dictionary' ])   // d2 is a dictionary, with unknown content
entity(e1)
entity(e2)
memberOf(d1, {("k1", e1), ("k2", e2)})
memberOf(d2, {("k1", e1), ("k2", e2)}, true)
entity(e3)
entity(d3, [ prov:type='prov:Dictionary' ])
derivedByInsertionFrom(d3, d1, {("k3", e3)})
From these descriptions, we conclude:
Thus, the states of d1 and d3 are only partially known.
Further considerations:
A PROV-DM namespace is identified by an IRI [IRI]. In PROV-DM, attributes, identifiers, and values with qualified names as data type can be placed in a namespace using the mechanisms described in this specification.
A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace.
A default namespace declaration consists of a namespace. Every un-prefixed qualified name in the scope of this default namespace declaration refers to this namespace.
The PROV namespace is identified by the URI http://www.w3.org/ns/prov#.
PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.
A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in a namespace declaration. In the absence of prefix, the qualified name refers to the default namespace.
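The mapping from qualified names to IRIs can be illustrated with a short Python sketch. The function name expand_qualified_name and the binding of the prefix ex to http://example.org/ are assumptions made for illustration; only the PROV namespace IRI comes from this specification.

```python
def expand_qualified_name(qname, namespaces, default_namespace=None):
    """Map a qualified name to an IRI by concatenating namespace and local part."""
    prefix, separator, local = qname.partition(":")
    if not separator:
        # No prefix: the name refers to the default namespace.
        if default_namespace is None:
            raise ValueError("no default namespace declared for " + repr(qname))
        return default_namespace + qname
    if prefix not in namespaces:
        raise ValueError("undeclared prefix " + repr(prefix))
    return namespaces[prefix] + local


namespaces = {
    "prov": "http://www.w3.org/ns/prov#",
    "ex": "http://example.org/",   # hypothetical binding, for illustration only
}
prov_type_iri = expand_qualified_name("prov:type", namespaces)
e1_iri = expand_qualified_name("e1", namespaces, default_namespace="http://example.org/")
```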
An identifier is a qualified name.
An attribute is a qualified name.
The PROV data model introduces a pre-defined set of attributes in the PROV namespace, which we define below. This specification does not provide any interpretation for any attribute declared in any other namespace.
Attribute | Allowed In | Value | Section |
prov:label | any construct | xsd:string | Section 5.7.4.1 |
prov:location | Entity, Activity, Usage, and Generation | Value | Section 5.7.4.2 |
prov:role | Usage, Generation, Association, Start, and End | Value | Section 5.7.4.3 |
prov:type | any construct | Value | Section 5.7.4.4 |
prov:value | Entity | Value | Section 5.7.4.5 |
prov:provenance-uri | Provenance Locator | xsd:anyURI | Section 5.7.4.6 |
prov:service-uri | Provenance Locator | xsd:anyURI | Section 5.7.4.7 |
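The "Allowed In" column of the table above can be turned into a lookup, as in the following Python sketch. The names ALLOWED_IN and is_allowed are hypothetical. Accepting attributes from other namespaces and rejecting unlisted prov: attributes are assumptions of the sketch; the specification only says that it gives no interpretation to attributes declared in other namespaces.

```python
ANY_CONSTRUCT = None  # marker for "any construct" in the table

ALLOWED_IN = {
    "prov:label": ANY_CONSTRUCT,
    "prov:location": {"Entity", "Activity", "Usage", "Generation"},
    "prov:role": {"Usage", "Generation", "Association", "Start", "End"},
    "prov:type": ANY_CONSTRUCT,
    "prov:value": {"Entity"},
    "prov:provenance-uri": {"Provenance Locator"},
    "prov:service-uri": {"Provenance Locator"},
}


def is_allowed(attribute, construct):
    """Check whether a reserved attribute may appear in the given construct."""
    if not attribute.startswith("prov:"):
        return True   # assumption: attributes of other namespaces are not checked
    if attribute not in ALLOWED_IN:
        return False  # assumption: unlisted prov: attributes are rejected
    allowed = ALLOWED_IN[attribute]
    return allowed is ANY_CONSTRUCT or construct in allowed
```

For instance, is_allowed("prov:role", "Usage") holds, whereas is_allowed("prov:role", "Entity") does not.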
The attribute prov:label provides a human-readable representation of a PROV-DM element or relation. The value associated with the attribute prov:label must be a string.
The following entity is provided with a label attribute.
entity(ex:e1, [prov:label="This is a label"])
A location can be an identifiable geographic place (ISO 19112), but it can also be a non-geographic place such as a directory, row, or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations, by means of a reserved attribute.
The attribute prov:location is an optional attribute of entity, activity, usage, and generation. The value associated with the attribute prov:location must be a PROV-DM Value, expected to denote a location.
The following expression describes entity Mona Lisa, a painting, with a location attribute.
entity(ex:MonaLisa, [prov:location="Le Louvre, Paris", prov:type="StillImage"])
The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, association, start, and end. The attribute prov:role is allowed to occur multiple times in a list of attribute-value pairs. The value associated with a prov:role attribute must be a PROV-DM Value.
The following activity is associated with an agent acting as the operator.
wasAssociatedWith(a, ag, [prov:role="operator"])
The attribute prov:type provides further typing information for an element or relation. PROV-DM liberally defines a type as a category of things having common characteristics. PROV-DM is agnostic about the representation of types, and only states that the value associated with a prov:type attribute must be a PROV-DM Value. The attribute prov:type is allowed to occur multiple times.
The following describes an agent of type software agent.
agent(ag, [prov:type='prov:SoftwareAgent' ])
The following types are pre-defined in PROV, and are valid values for the prov:type attribute.
Type | Specification | Core concept |
prov:Bundle | Section 5.5.1 | Entity |
prov:Collection | Section 5.6.1 | Entity |
prov:Dictionary | Section 5.6.2 | Entity |
prov:EmptyDictionary | Section 5.6.2 | Entity |
prov:HadOriginalSource | Section 5.3.4 | Derivation |
prov:Organization | Section 5.2.1 | Agent |
prov:Person | Section 5.2.1 | Agent |
prov:Plan | Section 5.2.3 | Entity |
prov:SoftwareAgent | Section 5.2.1 | Agent |
prov:WasQuotedFrom | Section 5.3.3 | Derivation |
prov:WasRevisionOf | Section 5.3.2 | Derivation |
The attribute prov:value provides a Value associated with an entity.
The attribute prov:value is an optional attribute of entity. The value associated with the attribute prov:value must be a PROV-DM Value. The attribute prov:value may occur at most once in a set of attribute-value pairs.
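The cardinality rule for prov:value can be checked mechanically. The Python sketch below assumes an attribute list is represented as a list of (name, value) tuples; the function name value_occurs_at_most_once is hypothetical.

```python
from collections import Counter


def value_occurs_at_most_once(attributes):
    """Return True if prov:value appears zero or one time in the attribute list."""
    counts = Counter(name for name, _ in attributes)
    return counts["prov:value"] <= 1   # a Counter yields 0 for absent names


ok = value_occurs_at_most_once(
    [("prov:value", "abcd"), ("prov:type", "a"), ("prov:type", "b")]
)
bad = value_occurs_at_most_once([("prov:value", 1), ("prov:value", 2)])
```

Repeated prov:type pairs do not affect the check, since prov:type is allowed to occur multiple times.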
The following example illustrates the provenance of the number 4 obtained by an activity that computed the length of an input string "abcd". The input and the output are expressed as entities ex:in and ex:out, respectively. They each have a prov:value attribute associated with the corresponding value.
entity(ex:in, [prov:value="abcd"])
entity(ex:out, [prov:value=4])
activity(ex:len, [prov:type="string-length"])
used(ex:len,ex:in)
wasGeneratedBy(ex:out,ex:len)
wasDerivedFrom(ex:out,ex:in)
The attribute prov:provenance-uri provides an optional IRI in the context of a Provenance Locator; when this IRI is dereferenced, it allows access to provenance descriptions. It is referred to as Provenance-URI in [PROV-AQ].
The attributes prov:service-uri and prov:provenance-uri are mutually exclusive.
According to the following provenance locator, provenance descriptions about ex:report1 can be found in the resource identified by the provided URI.
hasProvenanceIn(ex:report1, [ prov:provenance-uri="http://example.com/some-provenance.pn" %% xsd:anyURI ])
The attribute prov:service-uri provides an optional IRI in the context of a Provenance Locator; this IRI denotes a provenance service from which provenance can be retrieved. It is referred to as Service-URI in [PROV-AQ].
The attributes prov:service-uri and prov:provenance-uri are mutually exclusive.
According to the following provenance locator, provenance descriptions about ex:report1 can be found in bundle bob:bundle1, which is available from the provenance service identified by the provided URI.
hasProvenanceIn(ex:report1, bob:bundle1, -, [ prov:service-uri="http://example.com/service" %% xsd:anyURI ])
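The mutual exclusivity of prov:service-uri and prov:provenance-uri can be expressed as a simple check. This Python sketch assumes the attributes of a provenance locator are given as (name, value) tuples; the function name locator_attributes_are_exclusive is hypothetical.

```python
def locator_attributes_are_exclusive(attributes):
    """Return True unless both prov:provenance-uri and prov:service-uri occur."""
    names = {name for name, _ in attributes}
    both = {"prov:provenance-uri", "prov:service-uri"}
    return not (both <= names)   # violated only when both names are present


only_service = locator_attributes_are_exclusive(
    [("prov:service-uri", "http://example.com/service")]
)
both_given = locator_attributes_are_exclusive(
    [
        ("prov:service-uri", "http://example.com/service"),
        ("prov:provenance-uri", "http://example.com/some-provenance.pn"),
    ]
)
```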
By means of attribute-value pairs, the PROV data model can refer to values such as strings, numbers, time, qualified names, and IRIs. The interpretation of such values is outside the scope of PROV-DM.
Each kind of value is called a datatype. The datatypes are taken from the set of XML Schema Datatypes, version 1.1 [XMLSCHEMA-2] and the RDF specification [RDF-CONCEPTS]. The normative definitions of these datatypes are provided by the respective specifications. Each datatype is identified by its XML xsd:QName.
We note that PROV-DM time instants are defined according to xsd:dateTime [XMLSCHEMA-2].
The following examples are, respectively, the string "abc", the integer number 1, and the IRI "http://example.org/foo".
"abc"
1
"http://example.org/foo" %% xsd:anyURI
The following example shows a value of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in a namespace declaration.
"ex:value" %% xsd:QName
In the following example, the generation time of entity e1 is expressed according to xsd:dateTime [XMLSCHEMA-2].
wasGeneratedBy(e1,a1, 2001-10-26T21:32:52)
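As an illustration, the time instant used in the example above can be parsed with Python's standard library. This sketch only covers the simple form shown here (no time zone, no fractional seconds); xsd:dateTime admits further lexical forms that this call is not claimed to handle.

```python
from datetime import datetime

# Parse the generation time from the example; the result carries no time zone.
generated_at = datetime.fromisoformat("2001-10-26T21:32:52")
```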
The PROV data model provides extensibility points that allow designers to specialize it for specific applications or domains. We summarize these extensibility points here:
The PROV namespace declares a set of reserved attributes catering for extensibility: prov:type, prov:role, prov:location.
In the following example, e2 is a translation of e1, expressed as a sub-type of derivation.
wasDerivedFrom(e2,e1, [prov:type='ex:Translation' ])
In the following example, e is described as a Car, a type of entity.
entity(e, [prov:type='ex:Car' ])
The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure interoperability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in this document and in [PROV-CONSTRAINTS].
The example of section 3 contains identifiers such as tr:WD-prov-dm-20111215, which denotes a specific version of a technical report. On the other hand, a URI such as http://www.w3.org/TR/prov-dm/ denotes the latest version of a document. One needs to ensure that provenance descriptions for the latter resource remain valid as the resource state changes.
To this end, PROV-DM allows asserters to describe "partial states" of entities by means of attributes and associated values. Some further constraints apply to the use of these attributes, since the values associated with them are expected to remain unchanged for some period of time. The constraints associated with attributes allow provenance descriptions to be refined; they can be found in the companion specification [PROV-CONSTRAINTS].
The idea of bundling provenance descriptions is crucial to the PROV approach. Indeed, it allows multiple provenance perspectives to be provided for a given entity. It is also the mechanism by which provenance of provenance can be expressed. Descriptions in bundles are expected to satisfy constraints specified in the companion specification [PROV-CONSTRAINTS].
WG membership to be listed here.