W3C

PROV-DM: The PROV Data Model

WD5 for internal review

W3C Editor's Draft 02 April 2012

This version:
http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-dm.html
Latest published version:
http://www.w3.org/TR/prov-dm/
Latest editor's draft:
http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-dm.html
Previous version:
http://www.w3.org/TR/2012/WD-prov-dm-20120202/
Editors:
Luc Moreau, University of Southampton
Paolo Missier, Newcastle University
Contributors:
Khalid Belhajjame, University of Manchester
Stephen Cresswell, legislation.gov.uk
Yolanda Gil, Invited Expert
Reza B'Far, Oracle Corporation
Paul Groth, VU University of Amsterdam
Graham Klyne, University of Oxford
Jim McCusker, Rensselaer Polytechnic Institute
Simon Miles, Invited Expert
James Myers, Rensselaer Polytechnic Institute
Satya Sahoo, Case Western Reserve University

Abstract

PROV-DM, the PROV data model, is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing. PROV-DM is structured in six components, dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and actities that happened; (3) derivations between entities; (4) properties to link entities that refer to a same thing; (5) collections of entities, whose provenance can itself be tracked; (6) a simple annotation mechanism.

This document introduces the provenance concepts underpinning PROV-DM in natural language, and systematically defines PROV-DM types and relations. PROV data model is domain-agnostic, but is equipped with extensibility points allow domain-specific information to be included.

Two further documents complete the specification of PROV-DM. First, a companion document specifies the set of constraints that well-structured provenance descriptions should follow; these provide an interpretation for provenance descriptions. Second, to be able to provide examples of provenance, a notation is used for expressing instances of PROV-DM for human consumption; the syntactic details of this notation are also kept in a separate document.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

PROV Family of Specifications

This document is part of the PROV family of specifications, a set of specifications aiming to define the various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. The specifications are as follows.

How to read the PROV Family of Specifications

Fourth Public Working Draft

This is the fourth public release of the PROV-DM document. Following feedback, the Working Group has decided to reorganize this document substantially, separating the data model, from its contraints, and the notation used to illustrate it. The PROV-DM release is synchronized with the release of the PROV-O, PROV-PRIMER, PROV-N, PROV-DM-CONSTRAINTS documents. We are now making clear what the entry path to the PROV family of specifications is.

This document was published by the Provenance Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.

Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1. Introduction

For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, where users find information that is often contradictory or questionable, provenance can help those users to make trust judgements.

The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.

Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.

A set of specifications, referred to as the PROV family of specifications, define the various aspects that are necessary to achieve this vision in an interoperable way:

PROV-DM is a domain-agnostic model, but with clear extensibility points allowing further domain-specific and application-specific extensions to be defined. The PROV data model is structured according to six components covering various aspects of provenance:

  1. component 1: entities and activities, and the time at which they were created, used, or ended;
  2. component 2: agents bearing responsibility for entities that were generated and actities that happened;
  3. component 3: derivations between entities;
  4. component 4: properties to link entities that refer to a same thing;
  5. component 5: collections of entities, whose provenance can itself be tracked;
  6. component 6: a simple annotation mechanism.

This specification intentionally presents the key concepts of the PROV Data Model, without drilling down into all its subtleties. Using these key concepts, it becomes possible to write useful provenance descriptions very quickly, and publish or embed them along side the data they relate to.

However, if data changes, then it is challenging to express its provenance precisely, like it would be for any other form of metadata. To address this challenge, a refinement is proposed to enrich simple provenance, with extra-descriptions that help qualify the specific subject of provenance and provenance itself, with attributes and temporal interval, intended to satisfy a comprehensive set of constraints. These aspects are covered in the companion specification [PROV-DM-CONSTRAINTS].

1.1 Structure of this Document

Section 2 provides starting points for the PROV Data Model, listing a set of types and relations, which are allows users to make initial provenance descriptions.

Section 3 illustrates how PROV-DM can be used to express the provenance of a report published on the Web.

Section 4 provides the definition of PROV-DM concepts, structured according to six components.

Section 5 summarizes PROV-DM extensibility points.

Section 6 introduces the idea that constraints can be applied to the PROV data model to refine provenance descriptions; these are covered in the companion specification [PROV-DM-CONSTRAINTS].

1.2 PROV Namespace

The PROV namespace is http://www.w3.org/ns/prov#.

All the concepts, reserved names and attributes introduced in this specification belong to the PROV namespace.

1.3 Conventions

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].

2. PROV-DM Starting Points

This section provides an introduction to the PROV data model by overviewing a set of concepts underpinning provenance descriptions that a novice reader would write in a first instance. In Sections 2.1 to 2.4, these concepts are described in natural language and illustrated by small examples. Since PROV-DM is a conceptual data model, Section 2.5 maps the concepts to various types and relations, which are illustrated graphically in a simplified UML diagram. Section 2.6 then summarizes the PROV notation allowing instances of PROV-DM to be written down.

2.1 Entity and Activity

Things we want to describe the provenance of are called entities in PROV. The term "things" encompasses a broad diversity of notions, including digital objects such as a file or web page, physical things such as a building or a printed book, or a car as well as abstract concepts and ideas. One can regard any Web resource as an example of Entity in this context.

An entity is a thing one wants to provide provenance for. For the purpose of this specification, things can be physical, digital, conceptual, or otherwise; things may be real or imaginary.

An entity may be the document at URI http://www.w3.org/TR/prov-dm/, a file in a file system, a car, or an idea.

An activity is something that occurs over a period of time and acts upon or with entities. This action can take multiple forms: consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities. Activities that operate on digital entities may for example move, copy, or duplicate them.

An activity may be the publishing of a document on the Web, sending a twitter message, extracting metadata embedded in a file, driving a car from Boston to Cambridge, assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, and editing a file.

2.2 Generation, Usage, Derivation

Activities and entities are associated with each other in two different ways: activities are consumers of entities and activities are producers of entities. The act of producing or consuming an entity may have a duration. The term 'generation' refers to the completion of the the act of producing; likewise, the term 'usage' refers to the beginning of the act of consuming entities. Thus, we define the following notions of generation and usage.

Generation is the completion of production of a new entity by an activity. This entity becomes available for usage after this generation. This entity did not exist before generation.

Usage is the beginning of consumption of an entity by an activity. Before usage, the activity had not begun to consume or use this entity and could not have been affected by the entity.

Examples of generation are the completed creation of a file by a program, the completed creation of a linked data set, and the completed publication of a new version of a document.

Usage examples include a procedure beginning to consume an argument, a service starting to read a value on a port, a program beginning to read a configuration file, or the point at which an ingredient, such as eggs, is being added in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix); alternatively, a same entity may be used multiple times, possibly by different activities (e.g. a file on a file system can be read indefinitely).

Activities are consumers of entities and producers of entities. In some case, the consumption of an entity influences the creation of another in some way. This notion is captured by derivations, defined as follows.

A derivation is a transformation of an entity into another, a construction of an entity into another, or an update of an entity, resulting in a new one.

Examples of derivation include the transformation of a relational table into a linked data set, the transformation of a canvas into a painting, the transportation of a work of art from London to New York, and a physical transformation such as the melting of ice into water.

2.3 Agents and Other Types of Entities

The motivation for introducing agents in the model is to denote the agent's responsibility for activities.

An agent is a type of entity that bears some form of responsibility for an activity taking place.

The definition of agent intentionally stays away from using concepts such as enabling, causing, initiating, triggering, affecting, etc, because many entities also enable, cause, initiate, and affect in some way the activities. Concepts such as triggers are themselves defined in relations between entities and activities. So the notion of having some degree of responsibility is really what makes an agent.

An agent is a particular type of Entity. This means that the model can be used to express provenance of the agents themselves.

Software for checking the use of grammar in a document may be defined as an agent of a document preparation activity, and at the same time one can describe its provenance, including for instance the vendor and the version history.

There are some useful types of entities and agents that are commonly encountered in applications making data and documents available on the Web; we introduce them in the rest of this section.

A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals. There exist no prescriptive requirement on the nature of plans, their representation, the actions or steps they consist of, or their intended goals. Since plans may evolve over time, it may become necessary to track their provenance, so plans themselves are entities. Representing the plan explicitly in the provenance can be useful for various tasks: for example, to validate the execution as represented in the provenance record, to manage expectation failures, or to provide explanations.

A plan can be a blog post tutorial for how to set up a web server, a list of instructions for a micro-processor execution, a cook's written recipe for a chocolate cake, or a workflow for a scientific experiment.

Three types of agents are recognized because they are commonly encountered in applications making data and documents available on the Web: persons, software agents, and organizations.

A Web site and service selling books on the Web and the company hosting them are software agents and organizations, respectively.

A collection is an entity that provides a structure to some constituents, which are themselves entities. These constituents are said to be member of the collections. This concept allows for the provenance of the collection, but also of its constituents to be expressed. Such a notion of collection corresponds to a wide variety of concrete data structures, such as a maps, dictionaries, or associative arrays.

An example of collection is an archive of documents. Each document has its own provenance, but the archive itself also has some provenance: who maintained it, which documents it contained at which point in time, how it was assembled, etc.

An account is an entity that contains a bundle of provenance descriptions. Making an account an entity allows for provenance of provenance to be expressed.

For users to decide whether they can place their trust in a resource, they may want to analyze the resource's provenance, but also determine who its provenance is attributed to, and when it was generated. In other words, users need to be able to determine the provenance of provenance. Hence, provenance is also regarded as an entity (of type Account), by which provenance of provenance can then be expressed.

2.4 Attribution, Association, and Responsibility

Agents can be related to entities, activities, and other agents.

Attribution is the ascribing of an entity to an agent.

A blog post can be attributed to an author, a mobile phone to its manufacturer.

Agents are defined as having some kind of responsibility for activities. However, one may want to be more specific about the nature of an agent's responsibility. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow while it would matter a lot which researcher told the programmer to do so.

Furthermore, provenance reflects activities that have occurred. In some cases, those activities reflect the execution of a plan that was designed in advance to guide the execution. A plan may be associated to an activity, which represents what was intended to happen.

An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.

Examples of association between an activity and an agent are:

  • creation of a web page under the guidance of a designer;
  • various forms of participation in a panel discussion, including audience member, panelist, or panel chair;
  • a public event, sponsored by a company, and hosted by a museum;
  • an XSLT transform launched by a user based on an XSL style sheet (a plan).

Responsibility is the fact that an agent is accountable for the actions of a "subordinate" agent, in the context of an activity. The nature of this relation is intended to be broad, including delegation or contractual relation.

A student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity, and it may not matter which student published a web page but it matters a lot that the department told the student to put up the web page.

2.5 Simplified Overview Diagram

So far, we have introduced a series of concepts underpinning provenance. PROV-DM is a conceptual data model consisting of types and relations between these. Table (Mapping of Provenance concepts to types and relations in PROV-DM) shows how provenance concepts can be mapped to types and relations in PROV-DM: the first column lists concepts introduced in this section, the second column indicates whether a concept maps to a type or a relation, whereas the third column contains the corresponding name. We note that names of relations have a verbal form in the past tense to express what happened in the past, as opposed to what may or will happen.

Mapping of Provenance concepts to types and relations in PROV-DM
Provenance ConceptsPROV-DM types or relations
EntityTypes in PROV-DMentity
Activityactivity
Agentagent
GenerationRelations in PROV-DMwasGeneratedBy
Usageused
AttributionwasAttributedTo
AssociationwasAssociatedWith
ResponsibilityactedOnBehalfOf
DerivationwasDerivedFrom

Figure overview-types-and-relations illustrates the three types (entity, activity, and agent) and how they relate to each other. At this stage, all relations are shown to be binary. When examining PROV-DM in details, some relations, while involving two primary elements, are shown to be n-ary.

Simplified  Overview of PROV-DM
Simplified Overview of PROV-DM

Figure overview-types-and-relations is not intended to be complete. It only illustrates types and relations from Section starting-points and exploited in the example discussed in the next section. They will then be explained in detail in Section data-model-components. The third column of Table (Mapping of Provenance concepts to types and relations in PROV-DM) lists names that are part of a textual notation to write instances of the PROV-DM data model. This notation, referred to as the PROV-N notation, is outlined in the next section.

2.6 PROV-N: The Provenance Notation

A key goal of PROV-DM is the specification of a machine-processable data model for provenance so that applications can retrieve provenance and reason about it. As such, representations of PROV-DM are available in RDF and XML.

However, it is important to provide instances of provenance for human consumption, as in this document or elsewhere. To this end, PROV-N is a notation that is designed to write instances of the PROV-DM data model in a compact textual form, without the syntactic baggage and constraints coming with a markup language such as XML or a description framework such as RDF. We outline here some of its key design principles. For full details, the reader is referred to the companion specification [PROV-N].

An activity with identifier a1 and an attribute type with value createFile.

activity(a1, [prov:type="createFile"])
Two entities with identifiers e1 and e2.
entity(e1)
entity(e2)
The activity a1 used e1, and e2 was generated by a1.
used(a1,e1)
wasGeneratedBy(e2,a1)

3. Illustration of PROV-DM by an Example

Section starting-points has introduced some provenance concepts, and how they are expressed as types or relations in the PROV data model. The purpose of this section is to put these concepts into practice in order to express the provenance of some document published on the Web. With this realistic example, PROV-DM constructs are composed together, and a graphical illustration shows a provenance description forming a directed graph, rooted at the entity we want to explain the provenance of, and pointing to the entities, activities, and agents it depended on. This example also shows that, sometimes, multiple provenance descriptions about a same entity can co-exist, which then justifies the need for provenance of provenance.

The World Wide Web Consortium publishes many technical reports. In this example, we consider a technical report, and describe its provenance. Specifically, we consider the second version of the PROV-DM document http://www.w3.org/TR/2011/WD-prov-dm-20111215. Its provenance can be expressed from several perspectives, which we present. In the first one, provenance is concerned with the W3C process, whereas in the second one, it takes the authors' viewpoint; we then provide attribution to these two provenance descriptions.

3.1 The Process View

Description: The World Wide Web Consortium publishes technical reports according to its publication policy. Working drafts are published regularly to reflect the work accomplished by working groups. Every publication of a working draft must be preceded by a "publication request" to the Webmaster. The very first version of a technical report must also preceded by a "transition request" to be approved by the W3C director. All working drafts are made available at a unique URI. In this scenario, we consider two successive versions of a given report, the policy according to which they were published, and the associated requests.

Concretely, in this section, we describe the kind of provenance record that the WWW Consortium could keep for auditors to check that due processes are followed. All entities involved in this example are Web resources, with well defined URIs (some of which locating archived email messages, available to W3C Members).

We now paraphrase some PROV-DM descriptions, and illustrate them with the PROV-N notation, a notation for PROV-DM aimed at human consumption. We then follow them with a graphical illustration. Full details of the provenance record can be found here.

Provenance descriptions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance descriptions. Therefore, it should not be seen as an alternate notation for expressing provenance.

The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and pentagonal shapes, respectively. Usage, Generation, Derivation, and Association are represented as directed edges.

Entities are laid out according to the ordering of their generation. We endeavor to show time progressing from left to right. This means that edges for Usage, Generation, Derivation, Association typically point leftwards

Provenance of a Tech Report
Provenance of a Tech Report

This simple example has shown a variety of PROV-DM constructs, such as Entity, Agent, Activity, Usage, Generation, Derivation, and Association. In this example, it happens that all entities were already Web resources, with readily available URIs, which we used. We note that some of the resources are public, whereas others have restricted access: provenance statements only make use of their identifiers. If identifiers do not pre-exist, e.g. for activities, then they can be generated, for instance ex:act2, occurring in the namespace identified by prefix ex. We note that the URI scheme developed by W3C is particularly suited for expressing provenance of these reports, since each URI denotes a specific version of a report. It then becomes very easy to relate the various versions, with PROV-DM constructs. We note that an Association is a ternary relation (represented by a multi-edge labeled wasAssociatedWith) from an activity to an agent and a plan.

3.2 The Authors View

Description: A technical report is edited by some editor, using contributions from various contributors.

Here, we consider another perspective on technical report http://www.w3.org/TR/2011/WD-prov-dm-20111215. Provenance is concerned with the document editing activity, as perceived by authors. This kind of information could be used by authors in their CV or in a narrative about this document.

Again, we paraphrase some PROV-DM descriptions, and illustrate them with the PROV-N notation. Full details of the provenance record can be found here.

Provenance of a Tech Report (b)
Provenance of a Tech Report (b)

3.3 Attribution of Provenance

The two previous sections provide two different perspectives on the provenance of a technical report. By design, the PROV approach allows for the provenance of a subject to be provided by multiple sources. For users to decide whether they can place their trust in the technical report, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated, etc. In other words, we need to be able to express the provenance of provenance.

No new mechanism is required to support this requirement. PROV-DM makes the assumption that provenance statements have been bundled up, and named, by some mechanism outside the scope of PROV-DM. For instance, in this case, provenance statements were put in a file and exposed on the Web, respectively at ex:w3c-publication1.pn and ex:w3c-publication3.pn. To express their respective provenance, these resources must be seen as entities, and all the constructs of PROV-DM are now available to characterize their provenance. In the example below, ex:w3c-publication1.pn is attributed to the agent w3:Consortium, whereas ex:w3c-publication3.pn to ex:Simon.

entity(ex:w3c-publication1.pn, [prov:type="prov:Account" %% xsd:QName ])
wasAttributedTo(ex:w3c-publication1.pn, w3:Consortium)

entity(ex:w3c-publication3.pn, [prov:type="prov:Account" %% xsd:QName ])
wasAttributedTo(ex:w3c-publication3.pn, ex:Simon)

4. PROV-DM Types and Relations

PROV-DM concepts are structured according to six components that are introduced in this section. Components and their dependencies are illustrated in Figure prov-dm-components. A component that relies on concepts defined in another also sits above it in this figure. PROV-DM consists of the following components.

  1. Component 1: entities and activities. The first component consists of entities, activities, and all concepts linking them, such as generation, usage, start, end. The first component is the only one comprising time-related concepts.
  2. Component 2: agents and responsibility. The second component consists of agents and concepts ascribing responsibility to agents.
  3. Component 3: derivations. The third component is formed with derivations and its derivation subtypes.
  4. Component 4: alternate. The fourth component consists of relations linking entities somehow referring to a same thing.
  5. Component 5: collections. The fifth component is comprised of collections and operations related to collections.
  6. Component 6: annotations. The sixth component is concerned with annotations to PROV-DM concepts.
PROV-DM Components collections alternate annotations activities/entities derivations agents/responsibility
PROV-DM Components

While not all PROV-DM relations are binary, they all involve two primary elements. Hence, Table relations-at-a-glance indexes all relations according to their two primary elements. The table adopts the same color scheme as Figure prov-dm-components, allowing components to be readily identified. Note that for simplicity, this table does not include collection-oriented relations.

PROV-DM Relations At a Glance
EntityActivityAgentNote
EntitywasGeneratedBywasAttributedTohasAnnotation
ActivitywasStartedByActivity
wasInformedBy
wasAssociatedWithhasAnnotation
AgentactedOnBehalfOfhasAnnotation
NotehasAnnotation

Table prov-dm-concepts-and-relations is a complete index of all the concepts and relations in prov-dm, color-coded according to the component they belong too. In the first column, one finds concept names directly linking to their English definition. In the second column, we find their representation in the PROV-N notation, directly linking to the definition of their various constituents.

PROV-DM Concepts and Relations
Entityentity(id, [ attr1=val1, ...])
Activityactivity(id, st, et, [ attr1=val1, ...])
GenerationwasGeneratedBy(id,e,a,t,attrs)
Usageused(id,a,e,t,attrs)
StartwasStartedBy(id,a,e,t,attrs)
EndwasEndedBy(id,a,e,t,attrs)
CommunicationwasInformedBy(id,a2,a1,attrs)
Start by ActivitywasStartedByActivity(id, a2, a1, attrs)
Agentagent(id, [ attr1=val1, ...])
AttributionwasAttributedTo(id,e,ag,attr)
AssociationwasAssociatedWith(id,a,ag,pl,attrs)
ResponsibilityactedOnBehalfOf(id,ag2,ag1,a,attrs)
DerivationwasDerivedFrom(id, e2, e1, a, g2, u1, attrs)
RevisionwasRevisionOf(id,e2,e1,ag,attrs)
QuotationwasQuotedFrom(id,e2,e1,ag2,ag1,attrs)
Original SourcehadOriginalSource(id,e2,e1,attrs)
TraceabilitytracedTo(id,e2,e1,attrs)
AlternatealternateOf(alt1, alt2)
SpecializationspecializationOf(sub, super)
CollectionCollection
InsertionderivedByInsertionFrom(id, c2, c1, {(key_1, e_1), ..., (key_n, e_n)}, attrs)
RemovalderivedByRemovalFrom(id, c2, c1, {key_1, ... key_n}, attrs)
MembershipmemberOf(c, {(key_1, e_1), ..., (key_n, e_n)})
Notenote(id, [ attr1=val1, ...])
AnnotationhasAnnotation(r,n)

In the rest of the section, each concept and relation is defined, in English initially, followed by a more formal definition and some example.

4.1 Component 1: Entities and Activities

The first component of PROV-DM is concerned with entities and activities, and their inter-relations: Usage, Generation, Start, End, Communication, and Start by Activity. Figure figure-component1 overviews the first component, with two "UML classes" and binary associations between them. Associations are not just binary; indeed, Usage, Generation, Start, End are remarkable because they have time attributes, which are placeholders for time information related to provenance.

entities and activities
Entities and Activities Component Overview

4.1.1 Entity

An entity is a thing one wants to provide provenance for. For the purpose of this specification, things can be physical, digital, conceptual, or otherwise; things may be real or imaginary.

An entity, written entity(id, [attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an entity;
  • attributes: an optional set of attribute-value pairs ((attr1, val1), ...) representing this entity's situation in the world.

The following expression

entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])
states the existence of an entity, denoted by identifier tr:WD-prov-dm-20111215, with type document and version number 2. The attributes ex:version is application specific, whereas the attribute type is reserved in the PROV-DM namespace.
The characterization interval of an entity is currently implicit. Making it explicit would allow us to define wasComplementOf more precisely. Beginning and end of characterization interval could be expressed by attributes (similarly to activities). How do we define the end of an entity? This is ISSUE-204.

4.1.2 Activity

An activity is something that occurs over a period of time and acts upon or with entities. This action can take multiple forms: consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities.

An activity, written activity(id, st, et, [attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an activity;
  • startTime: an optional time (st) for the start of the activity;
  • endTime: an optional time (et) for the end of the activity;
  • attributes: an optional set of attribute-value pairs ((attr1, val1), ...) for this activity.

The following expression

activity(a1,2011-11-16T16:05:00,2011-11-16T16:06:00,
        [ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])

states the existence of an activity with identifier a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit. The attribute host is application specific (declared in some namespace with prefix ex). The attribute type is a reserved attribute of PROV-DM, allowing for sub-typing to be expressed.

Further considerations:

  • An activity is not an entity. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

4.1.3 Generation

Generation is the completion of production of a new entity by an activity. This entity becomes available for usage after this generation. This entity did not exist before generation.

Generation, written wasGeneratedBy(id,e,a,t,attrs) in PROV-N, has the following components:
  • id: an optional identifier for a generation;
  • entity: an identifier (e) for a created entity;
  • activity: an optional identifier (a) for the activity that creates the entity;
  • time: an optional "generation time" (t), the time at which the entity was completely created;
  • attributes: an optional set (attrs) of attribute-value pairs that describes the modalities of generation of this entity by this activity.

While each of the components activity, time, and attributes is optional, at least one of them must be present.

The following expressions

  wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1"])
  wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p2"])

state the existence of two generations (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, identified by e1 and e2, are created by an activity, identified by a1. The first one is available on port p1, whereas the other is available on port p2. The semantics of port are application specific.

In some cases, we may want to record the time at which an entity was generated without having to specify the activity that generated it. To support this requirement, the activity component in generation is optional. Hence, the following expression indicates the time at which an entity is generated, without naming the activity that did it.

  wasGeneratedBy(e,-,2001-10-26T21:32:52)

4.1.4 Usage

Usage is the beginning of consumption of an entity by an activity. Before usage, the activity had not begun to consume or use this entity and could not have been affected by the entity.

Usage, written used(id,a,e,t,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for a usage;
  • activity: an identifier (a) for the consuming activity;
  • entity: an identifier (e) for the consumed entity;
  • time: an optional "usage time" (t), the time at which the entity started to be used;
  • attributes: an optional set (attrs) of attribute-value pairs that describe the modalities of usage of this entity by this activity.

A reference to a given entity may appear in multiple usages that share a given activity identifier.

The following usages

  used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"])
  used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])

state that the activity identified by a1 used two entities identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter is application specific.

4.1.5 Start

Start is when an activity is deemed to have started. The activity did not exist before its start. Any usage or generation involving an activity follows its start. A start may refer to an entity, known as trigger, that initiated the activity.

An activity start, written wasStartedBy(id,a,e,t,attrs) in PROV-N, contains:
  • id: an optional identifier for the activity start;
  • activity: an identifier (a) for the started activity;
  • trigger: an optional identifier (e) for the entity triggering the activity;
  • time: the optional time (t) at which the activity was started;
  • attributes: an optional set (attrs) of attribute-value pairs describing modalities according to which the activity was started.

The following example contains the description of an activity a1 (a discussion), which was started at a specific time, and was triggered by an email message e1.

entity(e1,[prov:type="email message"])
activity(a1,[prov:type="Discuss"])
wasStartedBy(a1,e1,2011-11-16T16:05:00)
Furthermore, if the activity happens to consume the message content, then the message would also be regarded as an input to the activity, which we describe as follows:
used(a1,e1,-)

In the following example, a race is started by a bang, and responsibility for this trigger is attributed to an agent ex:DarthVader.

activity(ex:foot_race)
wasStartedBy(ex:foot_race,ex:bang,2012-03-09T08:05:08-05:00)
entity(ex:bang)
agent(ex:DarthVader)
wasAttributedTo(ex:foot_race,ex:DarthVader)

The relations wasStartedBy and used are orthogonal, and thus need to be expressed independently, according to the situation being described.

4.1.6 End

End is when an activity is deemed to have ended. The activity no longer exists after its end. Any usage, generation, or invalidation involving an activity precedes its end. An end may refer to an entity, known as trigger, that terminated the activity.

An activity end, written wasEndedBy(id,a,e,t,attrs) in PROV-N, contains:

  • id: an optional identifier for the activity end;
  • activity: an identifier (a) for the ended activity;
  • trigger: an optional identifier (e) for the entity triggering the activity ending;
  • time: the optional time (t) at which the activity was ended;
  • attributes: an optional set (attrs) of attribute-value pairs describing modalities according to which the activity was ended.

The following example is a description of an activity a1 (editing) that was ended following an approval document e1.

entity(e1,[prov:type="approval document"])
activity(a1,[prov:type="Editing"])
wasEndedBy(a1,e1)

4.1.7 Communication

Communication is the exchange of an entity by two activities, one activity using the entity generated by the other.

A communication implies that activity a2 is dependent on another a1, by way of some unspecified entity that is generated by a1 and used by a2.

A communication, written as wasInformedBy(id,a2,a1,attrs) in PROV-N, contains:
  • id: an optional identifier identifying the relation;
  • informed: the identifier (a2) of the informed activity;
  • informant: the identifier (a1) of the informant activity;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe properties of the relation.

Consider two long running services, which we represent by activities s1 and s2.

activity(s1, [prov:type="service"])
activity(s2, [prov:type="service"])
wasInformedBy(s2,s1)
The last line indicates that some entity was generated by s1 and used by s2.

4.1.8 Start by Activity

Start by Activity is the start of an activity with an implicit trigger generated by another activity.

A start by activity, written as wasStartedByActivity(id, a2, a1, attrs) in PROV-N, contains:
  • id: an optional identifier of the relation;
  • started: the identifier (a2) of the started activity;
  • starter: the identifier (a1) of the activity that started the other;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

Suppose activities a1 and a2 are computer processes that are executed on different hosts, and that a1 started a2. This can be expressed as in the following fragment:

activity(a1,t1,t2,[ex:host="server1.example.org",prov:type="workflow"])
activity(a2,t3,t4,[ex:host="server2.example.org",prov:type="subworkflow"])
wasStartedByActivity(a2,a1)

4.2 Component 2: Agents and Responsibility

The second component of PROV-DM is concerned with agents and the notions of Attribution, Association, Responsibility, relating agents to entities, activities, and agents, respectively. Figure figure-component2 depicts the second component, with four "UML classes" (Entity, Activity, Agent, and Plan) and associations between them. So-called "UML association classes" are used to express n-ary relations.

agents and responsibilities
Agents and Responsibilities Component Overview

4.2.1 Agent

An agent is a type of entity that bears some form of responsibility for an activity taking place.

An agent, noted agent(id, [attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for an agent;
  • attributes: a set of attribute-value pairs ((attr1, val1), ...) representing this agent's situation in the world.

It is useful to define some basic categories of agents from an interoperability perspective. There are three types of agents that are common across most anticipated domains of use:

  • Person: agents of type Person are people.
  • Organization: agents of type Organization are social institutions such as companies, societies etc.
  • SoftwareAgent: a software agent is running software.

It is acknowledged that these types do not cover all kinds of agent.

The following expression is about an agent identified by e1, which is a person, named Alice, with employee number 1234.

agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])

It is optional to specify the type of an agent. When present, it is expressed using the prov:type attribute.

4.2.2 Attribution

Attribution is the ascribing of an entity to an agent.

When an entity e is attributed to agent ag, entity e was generated by some unspecified activity that in turn was associated to agent ag. Thus, this relation is useful when the activity is not known, or irrelevant.

An attribution relation, written wasAttributedTo(id,e,ag,attrs) in PROV-N, contains the following elements:

  • id: an optional identifier for the relation;
  • entity: an entity identifier (e);
  • agent: the identifier (ag) of the agent whom the entity is ascribed to;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

Revisiting the example of Section 3.2, we can ascribe tr:WD-prov-dm-20111215 to some agents without an explicit activity.

agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Paolo, [prov:role="editor"])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Simon, [prov:role="contributor"])

4.2.3 Association

An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.

An activity association, written wasAssociatedWith(id,a,ag,pl,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for the association between an activity and an agent;
  • activity: an identifier (a) for the activity;
  • agent: an optional identifier (ag) for the agent associated with the activity;
  • plan: an optional identifier (pl) for the plan adopted by the agent in the context of this activity;
  • attributes: an optional set (attrs) of attribute-value pairs that describe the modalities of association of this activity with this agent.

In the following example, a designer and an operator agents are associated with an activity. The designer's goals are achieved by a workflow ex:wf.

activity(ex:a, [prov:type="workflow execution"])
agent(ex:ag1, [prov:type="operator"])
agent(ex:ag2, [prov:type="designer"])
wasAssociatedWith(ex:a, ex:ag1, -, [prov:role="loggedInUser", ex:how="webapp"])
wasAssociatedWith(ex:a, ex:ag2, ex:wf,[prov:role="designer", ex:context="project1"])
entity(ex:wf, [prov:type="prov:Plan" %% xsd:QName, ex:label="Workflow 1", 
              ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI])
Since the workflow ex:wf is itself an entity, its provenance can also be expressed in PROV-DM: it can be generated by some activity and derived from other entities, for instance.

In some cases, one wants to indicate a plan was followed, without having to specify which agent was involved.

activity(ex:a,[prov:type="workflow execution"])
wasAssociatedWith(ex:a,-,ex:wf)
entity(ex:wf,[prov:type="prov:Plan"%% xsd:QName, ex:label="Workflow 1", 
              ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI])
In this case, it is assumed that an agent exists, but it has not been specified.

4.2.4 Responsibility

Responsibility is the fact that an agent is accountable for the actions of a "subordinate" agent, in the context of an activity.

PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program and the person are both agents of the activity; furthermore, the mail software agent is running on the person's behalf. In another example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university; all those agents are responsible in some way for the activity to take place but we do not say explicitly who bears responsibility and to what degree.

A responsibility relation, written actedOnBehalfOf(id,ag2,ag1,a,attrs) in PROV-N, has the following constituents:
  • id: an optional identifier for the responsibility chain;
  • subordinate: an identifier (ag2) for the agent associated with an activity, acting on behalf of the responsible agent;
  • responsible: an identifier (ag1) for the agent, on behalf of which the subordinate agent acted;
  • activity: an optional identifier (a) of an activity for which the responsibility chain holds;
  • attributes: an optional set (attrs) of attribute-value pairs that describe the modalities of this relation.

In the following example, a programmer, a researcher and a funder agents are described. The programmer and researcher are associated with a workflow activity. The programmer acts on behalf of the researcher (delegation) encoding the commands specified by the researcher; the researcher acts on behalf of the funder, who has an contractual agreement with the researcher. The terms 'delegation' and 'contact' used in this example are domain specific.

activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
agent(ag3,[prov:type="funder"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser"])
wasAssociatedWith(a,ag2)
actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"])
actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])

4.3 Component 3: Derivations

The third component of PROV-DM is concerned with derivations between entities, and subtypes of derivations Revision, Quotation, Original Source, and Traceability. Figure figure-component3 overviews the third component, with three "UML classes" (Entity, Activity, and Agent) and associations between them. So-called "UML association classes" are used to express n-ary relations.

derivation
Derivation Component Overview

4.3.1 Derivation

A derivation is a transformation of an entity into another, a construction of an entity into another, or an update of an entity, resulting in a new one.

According to Section Starting Points, for an entity to be transformed from, created from, or resulting from an update to another, there must be some underpinning activities performing the necessary actions resulting in such a derivation. A derivation can be described at various levels of precision. In its simplest form, derivation relates two entities. Optionally, attributes can be added to describe modalities of derivation. If the derivation is the result of a single known activity, then this activity can also be optionally expressed. And to provide a completely accurate description of the derivation, the generation and usage of the generated and used entities, respectively, can be provided. The reason for optional information such as activity, generation, and usage to be linked to derivations is to aid analysis of provenance and to facilitate provenance-based reproducibility.

A derivation, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-N, contains:
  • id: an optional identifier for a derivation;
  • generatedEntity: the identifier (ee) of the entity generated by the derivation;
  • usedEntity: the identifier (e1) of the entity used by the derivation;
  • activity: an optional identifier (a) for the activity using and generating the above entities;
  • generation: an optional identifier (g2) for the generation involving the generated entity and activity;
  • usage: an optional identifier (u1) for the usage involving the used entity and activity;
  • attributes: an optional set (attrs) of attribute-value pairs that describe the modalities of this derivation.

The following descriptions state the existence of derivations.

wasDerivedFrom(e2, e1)
wasDerivedFrom(e2, e1, [prov:type="physical transform"])
wasDerivedFrom(e2, e1, a, g2, u1)
wasGeneratedBy(g2, e2, a, -)
used(u1, a, e1, -)

The first and second lines are about derivations between e2 and e1, but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.

The third description expresses that activity a, using the entity e1 according to usage u1, derived the entity e2 and generated it according to generation g2. It is followed by descriptions for generation g2 and usage u1. With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation, it can identify how the original entity e1 was used by the activity (e.g. for instance, which argument it was passed as, if the activity is the result of a function invocation), and which output the derived entity e2 was obtained from (say, for a function returning multiple results).

Emphasize the notion of 'affected by' ISSUE-133.

4.3.2 Revision

A revision is a derivation that revises an entity into a revised version.

Deciding whether something is made available as a revision of something else usually involves an agent who takes responsibility for approving that the former is a due variant of the latter. The agent who is responsible for the revision may optionally be specified. Revision is a particular case of derivation of an entity into its revised version.

A revision relation, written wasRevisionOf(id,e2,e1,ag,attrs) in PROV-N, contains:

  • id: an optional identifier for the relation;
  • newer: the identifier (e2) of the revised entity;
  • older: the identifier (e1) of the older entity;
  • responsibility: an optional identifier (ag) for the agent who approved the newer entity as a variant of the older;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of this relation.

Revisiting the example of Section 3.1, we can now state that the report tr:WD-prov-dm-20111215 is a revision of the report tr:WD-prov-dm-20111018, approved by agent w3:Consortium.

entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
entity(tr:WD-prov-dm-20111018, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasRevisionOf(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018, w3:Consortium)

4.3.3 Quotation

A quotation is the repeat of (some or all of) an entity, such as text or image, by someone other than its original author.

Quotation is a particular case of derivation in which entity e2 is derived from an original entity e1 by copying, or "quoting", some or all of it. A quotation relation, written wasQuotedFrom(id,e2,e1,ag2,ag1,attrs) in PROV-N, contains:

  • id: an optional identifier for the relation;
  • quote: an identifier (e2) for the entity that represents the quote (the partial copy);
  • original: an identifier (e1) for the original entity being quoted;
  • quoterAgent: an optional identifier (ag2) for the agent who performs the quote;
  • originalAgent: an optional identifier (ag1) for the agent to whom the original entity is attributed;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

The following paragraph is a quote from one of the author's blogs.

"During the workshop, it became clear to me that the consensus based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database focused formalizations. I just needed to get over the differences in syntax. This could imply that we could have nice way to trace provenance across systems and through databases and be able to understand the mathematical properties of this interconnection."

If wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/ denotes the original blog by agent ex:Paul, and dm:bl-dagstuhl denotes the above paragraph, then the following descriptions express that the above paragraph is copied by agent ex:Luc from a part of the blog, attributed to the agent ex:Paul.

entity(wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/)
agent(ex:Luc)
agent(ex:Paul)
wasQuotedFrom(dm:bl-dagstuhl,wp:thoughts-from-the-dagstuhl-principles-of-provenance-workshop/,ex:Luc,ex:Paul)

4.3.4 Original Source

An original source refers to the source material that is closest to the person, information, period, or idea being studied.

An original source relation is a particular case of derivation that aims to give credit to the source that originated some information. It is recognized that it may be hard to determine which entity constitutes an original source. This definition is inspired by original-source as defined in http://googlenewsblog.blogspot.com/2010/11/credit-where-credit-is-due.html.

An original source relation, written hadOriginalSource(id,e2,e1,attrs), contains:

  • id: an optional identifier for the relation;
  • derived: an identifier (e2) for the derived entity;
  • source: an identifier (e1) for the original source entity;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

Let us consider the current section dm:term-original-source, and the Google page go:credit-where-credit-is-due.html, where the notion was originally described.

entity(dm:term-original-source)
entity(go:credit-where-credit-is-due.html)
hadOriginalSource(dm:term-original-source,go:credit-where-credit-is-due.html)

4.3.5 Traceability

Traceability is the ability to link back an entity to another by means of derivation or responsibility relations, possibly repeatedly traversed.

A traceability relation between two entities e2 and e1 is a generic dependency of e2 on e1 that indicates either that e1 may have been necessary for e2 to be created, or that e1 bears some responsibility for e2's existence.

Traceability, written tracedTo(id,e2,e1,attrs) in PROV-N, contains:

  • id: an optional identifier identifying the relation;
  • entity: an identifier (e2) for an entity;
  • ancestor: an identifier (e1) for an ancestor entity that the former depends on;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe properties of the relation.

We note that the ancestor is allowed to be an agent since agents are entities.

Derivation and association are particular cases of traceability.

We refer to the example of Section 3.1, and specifically to Figure prov-tech-report. We can see that there is a path from tr:WD-prov-dm-20111215 to w3:Consortium or to pr:rec-advance. This is expressed as follows.

 tracedTo(tr:WD-prov-dm-20111215,w3:Consortium)
 tracedTo(tr:WD-prov-dm-20111215,pr:rec-advance)

4.4 Component 4: Alternate Entities

The fourth component of PROV-DM is concerned with relations specialization and alternate between entities. Figure figure-component4 overviews the component, which consists of a single "UML Class" and two associations.

alternates
Alternates Component Overview

Wherever two people describe the provenance of a same thing, one cannot expect them to coordinate and agree on the identifiers to use to denote that thing.

User Alice writes an article. In its provenance, she wishes to refer to the precise version of the article with a date-specific URI, as she might edit the article later. Alternatively, user Bob refers to the article in general, indepedently of its variants over time.

To allow for identifiers to be chosen freely and independently by each user, the PROV data model introduces relations that allow entities to be linked together. The following two relations are introduced for expressing specialized or alternate entities.

4.4.1 Specialization

An entity is a specialization of another if they refer to some common thing but the former is a more constrained entity than the latter. The common thing do not need to be identified.

Examples of constraints include a time period, an abstraction, and a context associated with the entity.

A specialization relation, written specializationOf(sub, super) in PROV-N, has the following constituents:
  • specializedEntity: an identifier (sub) of the specialized entity;
  • generalEntity: an identifier (super) of the entity that is being specialized.

The BBC news home page on 2012-03-23 ex:bbcNews2012-03-23 is a specialization of the BBC news page in general bbc:news/. This can be expressed as follows.

specializationOf(ex:bbcNews2012-03-23, bbc:news/)
Given that the BBC news does not define a URI for this day's news page, we are creating a qualified name in the namespace ex.

4.4.2 Alternate

An entity is alternate of another if they are both a specialization of some common entity. The common entity does not need to be identified.

An alternate relation, written alternateOf(e1, e2) in PROV-N, has the following constituents:
  • alternate1: an identifier (e1) of the first of the two entities;
  • alternate2: an identifier (e2) of the second of the two entities.

A given news item on the BBC News site bbc:news/science-environment-17526723 for desktop is an alternate of a bbc:news/mobile/science-environment-17526723 for mobile devices.

entity(bbc:news/science-environment-17526723, [ prov:type="a news item for desktop"])
entity(bbc:news/mobile/science-environment-17526723, [ prov:type="a news item for mobile devices"])
alternateOf(bbc:news/science-environment-17526723, bbc:news/mobile/science-environment-17526723)

They are both specialization of an (unspecified) entity.

Considering again the two versions of the technical report tr:WD-prov-dm-20111215 (second working draft) and tr:WD-prov-dm-20111018 (first working draft). They are alternate of each other.

entity(tr:WD-prov-dm-20111018)
entity(tr:WD-prov-dm-20111215)
alternateOf(tr:WD-prov-dm-20111018,tr:WD-prov-dm-20111215)

They are both specialization of the page http://www.w3.org/TR/prov-dm/.

4.5 Component 5: Collections

The fifth component of PROV-DM is concerned with the notion of collections. A collection is an entity that has some members. The members are themselves entities, and therefore their provenance can be expressed. In many applications, it is also of interest to be able to express the provenance of the collection itself: e.g. who maintains the collection, which member it contains at which point in time, and how it was assembled. The purpose of Component 5 is to define the types and relations that are useful to express the provenance of collections.

Figure figure-component5 overviews the component, which consists of two "UML Class" and three associations.

collections
Collections Component Overview

The intent of these relations and types is to express the history of changes that occurred to a collection. Changes to collections are about the insertion of entities to collections and the removal of members from collections. Indirectly, such history provides a way to reconstruct the contents of a collection.

4.5.1 Collection

A collection is an entity that provides a structure to some constituents, which are themselves entities. These constituents are said to be member of the collections.

Conceptually, a collection has a logical structure consisting of key-entity pairs. This structure is often referred to as a map, and is a generic indexing mechanisms that can abstract commonly used data structures, including associative lists (also known as "dictionaries" in some programming languages), relational tables, ordered lists, and more (the specification of such specialized structures in terms of key-value pairs is out of the scope of this document).

A given collection forms a given structure for its members. A different structure (obtained either by insertion or removal of members) constitutes a different collection. Hence, for the purpose of provenance, a collection entity is viewed as a snapshot of a structure. Insertion and removal operations result in new snapshots, each snapshot forming an identifiable collection entity.

PROV-DM defines the following types related to collections:

  • prov:Collection denotes an entity of type collection, i.e. an entity that can participate in relations amongst collections;
  • prov:EmptyCollection denotes an empty collection.
entity(c0, [prov:type="EmptyCollection" %% xsd:QName])  // c0 is an empty collection
entity(c1, [prov:type="Collection"  %% xsd:QName])      // c1 is a collection, with unknown content

4.5.2 Insertion

Insertion is a derivation that transforms a collection into another, by insertion of one or more key-entity pairs.

A Derivation-by-Insertion relation, written derivedByInsertionFrom(id, c2, c1, {(key_1, e_1), ..., (key_n, e_n)}, attrs), contains:

  • id: an optional identifier identifying the relation;
  • after: an identifier (c2) for the collection after insertion;
  • before: an identifier (c1) for the collection before insertion;
  • key-entity-set: the inserted key-entity pairs (key_1, e_1), ..., (key_n, e_n) in which each key_i is a value, and e_i is an identifier for the entity that has been inserted with the key; each key_i is expected to be unique for the key-entity-set;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

A Derivation-by-Insertion relation derivedByInsertionFrom(id, c2, c1, {(key_1, e_1), ..., (key_n, e_n)}) states that c2 is the state of the collection following the insertion of pairs (key_1, e_1), ..., (key_n, e_n) into collection c1.

entity(c0, [prov:type="EmptyCollection" %% xsd:QName])    // c0 is an empty collection
entity(e1)
entity(e2)
entity(e3)
entity(c1, [prov:type="Collection" %% xsd:QName])
entity(c2, [prov:type="Collection" %% xsd:QName])

derivedByInsertionFrom(c1, c0, {("k1", e1), ("k2", e2)})       
derivedByInsertionFrom(c2, c1, {("k3", e3)})    
From this set of descriptions, we conclude:
   c0 = {  }
   c1 = { ("k1", e1), ("k2", e2) }
   c2 = { ("k1", e1), ("k2", e2), ("k3", e3) }
  

Insertion provides an "update semantics" for the keys that are already present in the collection, as illustrated by the following example.

entity(c0, [prov:type="EmptyCollection" %% xsd:QName])    // c0 is an empty collection
entity(e1)
entity(e2)
entity(e3)
entity(c1, [prov:type="Collection" %% xsd:QName])
entity(c2, [prov:type="Collection" %% xsd:QName])

derivedByInsertionFrom(c1, c0, {("k1", e1), ("k2", e2)})       
derivedByInsertionFrom(c2, c1, {("k1", e3)})    
This is a case of update of e1 to e3 for the same key, "k1".
From this set of descriptions, we conclude:
   c0 = {  }
   c1 = { ("k1", e1), ("k2", e2) }
   c2 = { ("k1", e3), ("k2", e2) }
  

4.5.3 Removal

Removal is a derivation that transforms a collection into another, by removing one or more key-entity pairs.

A Derivation-by-Removal relation, written derivedByRemovalFrom(id, c2, c1, {key_1, ... key_n}, attrs), contains:

  • id: an optional identifier identifying the relation;
  • after: an identifier (c2) for the collection after the deletion;
  • before: an identifier (c1) for the collection before the deletion;
  • key-set: a set of deleted keys key_1, ..., key_n, for which each key_i is a value;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

Derivation-by-Removal relation derivedByRemovalFrom(id, c2,c1, {key_1, ..., key_n}) states that c2 is the state of the collection following the removal of the set of pairs corresponding to keys key_1...key_n from c1.

entity(c0, [prov:type="EmptyCollection"])    // c0 is an empty collection
entity(e1)
entity(e2)
entity(e3)
entity(c1, [prov:type="Collection"])
entity(c2, [prov:type="Collection"])

derivedByInsertionFrom(c1, c0, {("k1", e1), ("k2",e2)})       
derivedByInsertionFrom(c2, c1, {("k3", e3)})
derivedByRemovalFrom(c3, c2, {"k1", "k3"})   
From this set of descriptions, we conclude:
   c0 = {  }
   c1 = { ("k1", e1), ("k2", e2)  }
   c2 = { ("k1", e1), ("k2", e2), ("k3", e3) }
   c3 = { ("k2", e2) }
  

4.5.4 Membership

Membership is the belonging of a key-entity pair to collection.

The insertion and removal relations make insertions and removals explicit as part of the history of a collection. This, however, requires explicit mention of the state of the collection prior to each operation. The membership relation removes this needs, allowing the state of a collection c to be expressed without having to introduce a prior state.

A membership relation, written memberOf(id, c, {(key_1, e_1), ..., (key_n, e_n)}, attrs), contains:
  • id: an optional identifier identifying the relation;
  • after: an identifier (c) for the collection whose members are asserted;
  • key-entity-set: a set of key-entity pairs (key_1, e_1), ..., (key_n, e_n) that are members of the collection;
  • attributes: an optional set (attrs) of attribute-value pairs to further describe the properties of the relation.

The description memberOf(c, {(key_1, e_1), ..., (key_n, e_n)}) states that c is known to include (key_1, e_1), ..., (key_n, e_n)}, without having to introduce a previous state.

entity(c, [prov:type="prov:Collection"  %% xsd:QName])    // c is a collection, with unknown content
activity(a)
wasGeneratedBy(c,a)   // a produced c

entity(e1)
entity(e2)
memberOf(c, {("k1", e1), ("k2", e2)} )  

entity(e3)
entity(c1, [prov:type="prov:Collection"  %% xsd:QName])

derivedByInsertionFrom(c1, c, {("k3", e3)})     
From this set of assertions, we conclude:
   c  contains   ("k1", e1), ("k2", e2) 
   c1 contains   ("k1", e1), ("k2", e2), ("k3", v3) 
  
Note that the state of c1 with these relations is only partially known, because the state of c is unknown.

Further considerations:

4.6 Component 6: Annotations

The sixth component of PROV-DM is concerned with notes and annotations.

As provenance descriptions are exchanged between systems, it may be useful to add extra-information to what they are describing. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the entities or agents involved. Likewise, an interactive visualization component may want to enrich a set of provenance descriptions with information helping reproduce their visual representation. To help with interoperability, PROV-DM introduces a simple annotation mechanism allowing anything that is identifiable to be associated with notes. For this, a type and and a relation are introduced.

The annotation mechanism (with note and annotation) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).

4.6.1 Note

A note, noted note(id, [attr1=val1, ...]) in PROV-N, contains:
  • id: an identifier for a note;
  • attributes: a set of attribute-value pairs ((attr1, val1), ...), whose meaning is application specific.

The following note consists of a set of application-specific attribute-value pairs, intended to help the rendering of what it is associated with, by specifying its color and its position on the screen.

note(ex2:n1,[ex2:color="blue", ex2:screenX=20, ex2:screenY=30])
hasAnnotation(tr:WD-prov-dm-20111215,ex2:n1)

The note is associated with the entity tr:WD-prov-dm-20111215. Relation hasAnnotation is discussed in the next section. The note's identifier and attributes are declared in a separate namespace denoted by prefix ex2.

Alternatively, a reputation service may enrich a provenance record with notes providing reputation ratings about agents. In the following fragment, both agents ex:Simon and ex:Paolo are rated "excellent".

note(ex3:n2,[ex3:reputation="excellent"])
hasAnnotation(ex:Simon,ex3:n2)
hasAnnotation(ex:Paolo,ex3:n2)

The note's identifier and attributes are declared in a separate namespace denoted by prefix ex3.

4.6.2 Annotation

An annotation is a link between something that is identifiable and a note referred to by its identifier.

Multiple notes can be associated with a given identified object; symmetrically, multiple objects can be associated with a given note. Since notes have identifiers, they can also be annotated.

An annotation relation, written hasAnnotation(r,n) in PROV-N, has the following constituents:
  • something: the identifier (r) of something being annotated;
  • note: an identifier (n) of a note.

The following expressions

entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
activity(a,t1,t2)
used(u1,a,e1,[ex:file="stdin"])
wasGeneratedBy(e2, a, [ex:file="stdout"])

note(n1,[ex:icon="doc.png"])
hasAnnotation(e1,n1)
hasAnnotation(e2,n1)

note(n2,[ex:style="dotted"])
hasAnnotation(u1,n2)

describe two documents (attribute-value pair: prov:type="document") identified by e1 and e2, and their annotation with a note indicating that the icon (an application specific way of rendering provenance) is doc.png. The example also includes an activity, its usage of the first entity, and its generation of the second entity. The usage is annotated with a style (an application specific way of rendering this edge graphically). To be able to express this annotation, the usage was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).

4.7 Further Elements of PROV-DM

This section introduces further elements of PROV-DM.

4.7.1 Namespace Declaration

A PROV-DM namespace is identified by an IRI [IRI]. In PROV-DM, attributes, identifiers, and values with qualified names as data type can be placed in a namespace using the mechanisms described in this specification.

A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every un-prefixed qualified name in the scope of this default namespace declaration refers to this namespace.

The PROV-DM namespace is http://www.w3.org/ns/prov#.

4.7.2 Qualified Name

A qualified name is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name.

PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.

A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in a namespace declaration. In the absence of prefix, the qualified name refers to the default namespace.

4.7.3 Identifier

An identifier is a qualified name.

4.7.4 Attribute

An attribute is a qualified name.

The PROV data model introduces a pre-defined set of attributes in the PROV-DM namespace, which we define below. The interpretation of any attribute declared in another namespace is out of scope.

4.7.4.1 prov:label

The attribute prov:label provides a human-readable representation of a PROV-DM element or relation. The value associated with the attribute prov:label must be a string.

The following entity is provided with a label attribute.

 entity(ex:e1, [prov:label="This is a label"])
4.7.4.2 prov:location

A location can be an identifiable geographic place (ISO 19112), but it can also be a non-geographic place such as a directory, row, or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, and so forth. This document does not specify how to concretely express locations, but instead provide a mechanism to introduce locations, by means of a reserved attribute.

The attribute prov:location is an optional attribute of entity and activity. The value associated with the attribute prov:location must be a PROV-DM Value, expected to denote a location.

The following expression describes entity Mona Lisa, a painting, with a location attribute.

 entity(ex:MonaLisa, [prov:location="Le Louvres, Paris", prov:type="StillImage"])
4.7.4.3 prov:role

The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, association, start, and end. The attribute prov:role is allowed to occur multiple times in a list of attribute-value pairs. The value associated with a prov:role attribute must be a PROV-DM Value.

The following activity start describes the role of the agent identified by ag in this start relation with activity a.

   wasStartedBy(a,ag, [prov:role="program-operator"])
4.7.4.4 prov:type

The attribute prov:type provides further typing information for an element or relation. PROV-DM liberally defines a type as a category of things having common characteristics. PROV-DM is agnostic about the representation of types, and only states that the value associated with a prov:type attribute must be a PROV-DM Value. The attribute prov:type is allowed to occur multiple times.

The following describes an agent of type software agent.

   agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])

The following types are pre-defined in PROV, and are valid values for the prov:type attribute.

  • prov:Plan
  • prov:Account
  • prov:SoftwareAgent
  • prov:Organization
  • prov:Person

4.7.5 Value

By means of attribute-value pairs, the PROV data model can refer to values such as strings, numbers, time, qualified names, and IRIs. The interpretation of such values is outside the scope of PROV-DM.

Each kind of such values is called a datatype. The datatypes are taken from the set of XML Schema Datatypes, version 1.1 [XMLSCHEMA-2] and the RDF specification [RDF-CONCEPTS]. The normative definitions of these datatypes are provided by the respective specifications. Each datatype is identified by its XML xsd:QName.

We note that PROV-DM time instants are defined according to xsd:dateTime [XMLSCHEMA-2].

PROV-DM Data Types
xsd:decimal xsd:double xsd:dateTime
xsd:integer xsd:float
xsd:nonNegativeInteger xsd:string rdf:XMLLiteral
xsd:nonPositiveIntegerxsd:normalizedString
xsd:positiveInteger xsd:token
xsd:negativeInteger xsd:language
xsd:long xsd:Name
xsd:int xsd:NCName
xsd:short xsd:NMTOKEN
xsd:byte xsd:boolean
xsd:unsignedLong xsd:hexBinary
xsd:unsignedInt xsd:base64Binary
xsd:unsignedShortxsd:anyURI
xsd:unsignedByte xsd:QName

The following examples respectively are the string "abc", the string "abc", the integer number 1, and the IRI "http://example.org/foo".

  "abc"
  1
  "http://example.org/foo" %% xsd:anyURI

The following example shows a value of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in a namespace declaration.

 
  "ex:value" %% xsd:QName

In the following example, the generation time of entity e1 is expressed according to xsd:dateTime [XMLSCHEMA-2].

 
  wasGeneratedBy(e1,a1, 2001-10-26T21:32:52)
We need to check that we are including all xsd types that are accept in the lastest version of RDF.

5. PROV-DM Extensibility Points

The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:

The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure interoperability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in the PROV-DM documents (part 1 to 3).

6. Towards a Refinement of the PROV Data Model

A. Acknowledgements

WG membership to be listed here.

B. References

B.1 Normative references

[IRI]
M. Duerst, M. Suignard. Internationalized Resource Identifiers (IRI). January 2005. Internet RFC 3987. URL: http://www.ietf.org/rfc/rfc3987.txt
[RDF-CONCEPTS]
Graham Klyne; Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. 10 February 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt
[XMLSCHEMA-2]
Paul V. Biron; Ashok Malhotra. XML Schema Part 2: Datatypes Second Edition. 28 October 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

B.2 Informative references

[Logic]
W. E. JohnsonLogic: Part III.1924. URL: http://www.ditext.com/johnson/intro-3.html
[PROV-AQ]
Graham Klyne and Paul Groth (eds.) Luc Moreau, Olaf Hartig, Yogesh Simmhan, James Meyers, Timothy Lebo, Khalid Belhajjame, and Simon Miles Provenance Access and Query. 2011, Working Draft. URL: http://www.w3.org/TR/prov-aq/
[PROV-DM-CONSTRAINTS]
Luc Moreau and Paolo Missier (eds.) ... PROV-DM Constraints. 2011, Working Draft. URL: http://www.w3.org/TR/prov-dm-constraints/
[PROV-N]
Luc Moreau and Paolo Missier (eds.) ... PROV-N ..... 2011, Working Draft. URL: http://www.w3.org/TR/prov-n/
[PROV-O]
Satya Sahoo and Deborah McGuinness (eds.) Khalid Belhajjame, James Cheney, Daniel Garijo, Timothy Lebo, Stian Soiland-Reyes, and Stephan Zednik Provenance Formal Model. 2011, Working Draft. URL: http://www.w3.org/TR/prov-o/
[PROV-PRIMER]
Yolanda Gil and Simon Miles (eds.) Khalid Belhajjame, Helena Deus, Daniel Garijo, Graham Klyne, Paolo Missier, Stian Soiland-Reyes, and Stephan Zednik Prov Model Primer. 2011, Working Draft. URL: http://www.w3.org/TR/prov-primer/
[PROV-SEM]
James Cheney Formal Semantics Strawman. 2011, Work in progress. URL: http://www.w3.org/2011/prov/wiki/FormalSemanticsStrawman