Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
PROV-DM is a data model for provenance that describes the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but is equipped with extensibility points allowing further domain-specific and application-specific extensions to be defined. PROV-DM is accompanied by PROV-N, a technology-independent notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates the mapping of PROV-DM to concrete syntax, and which is used as the basis for a formal semantics of PROV-DM.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is released internally by the Provenance Working Group. This document was published by the Provenance Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, where users find information that is often contradictory or questionable, provenance can help those users to make trust judgements.
The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.
Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.
A set of specifications, referred to as the PROV family of specifications, define the various aspects that are necessary to achieve this vision in an interoperable way:
The PROV-DM data model for provenance consists of a set of core concepts, and a few common relations, based on these core concepts. PROV-DM is a domain-agnostic model, but with clear extensibility points allowing further domain-specific and application-specific extensions to be defined.
This specification intentionally presents the key concepts of the PROV Data Model, without drilling down into all its subtleties. Using these key concepts, it becomes possible to write useful provenance assertions very quickly, and publish or embed them alongside the data they relate to.
However, if data changes, it is challenging to express its provenance precisely, as it would be for any other form of metadata. To address this challenge, an upgrade path is proposed to enrich simple provenance with extra descriptions that help qualify the specific subject of provenance and provenance itself, with attributes and intervals, intended to satisfy a comprehensive set of constraints. These aspects are covered in the companion specification [PROV-DM-CONSTRAINTS].
Section 2 provides an overview of PROV-DM listing its core types and their relations.
In section 3, PROV-DM is applied to a short scenario, encoded in PROV-N, and illustrated graphically.
Section 4 provides the definition of PROV-DM constructs.
Section 5 introduces further relations offered by PROV-DM, including relations for data collections and domain-independent common relations.
Section 6 summarizes PROV-DM extensibility points.
Section 7 introduces constraints that can be applied to the PROV data model and that are covered in [PROV-DM-CONSTRAINTS].
The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).
All the elements, relations, reserved names and attributes introduced in this specification belong to the PROV-DM namespace.
The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].
This section provides an overview of the main concepts found in the PROV data model.
PROV-DM is a data model for describing the provenance of Entities, that is, of things in the world. The term "Things" encompasses a broad diversity of concepts, including digital objects such as a file or web page, physical things such as a building, a printed book, or a car, as well as abstract concepts and ideas. One can regard any Web resource as an example of an Entity in this context.
An entity may be the document at URI http://www.w3.org/TR/prov-dm/, a file in a file system, a car or an idea.
An activity is anything that acts upon or with entities. This action can take multiple forms: consuming, processing, transforming, modifying, relocating, using, generating, or being associated with entities. Activities that operate on digital entities may for example move, copy, or duplicate them.
An activity may be the publishing of a document on the web, sending a twitter message, extracting metadata embedded in a file, driving a car from Boston to Cambridge, assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, or editing a file.
An agent is a type of entity that bears some form of responsibility for an activity taking place.
The motivation for introducing agents in the model is to denote the agent's responsibility for activities. The definition of agent intentionally stays away from using concepts such as enabling, causing, initiating, affecting, etc, because many entities also enable, cause, initiate, and affect in some way the activities. Concepts such as initiating are themselves defined as relations between agent and activities. So the notion of having some degree of responsibility is really what makes an agent.
An agent is a particular type of Entity. This means that the model can be used to express provenance of the agents themselves.
Software for checking the use of grammar in a document may be defined as an agent of a document preparation activity, and at the same time one can describe its provenance, including for instance the vendor and the version history.
Activities and entities are associated with each other in two different ways: activities are consumers of entities and activities are producers of entities. The act of producing or consuming an entity may have a duration. The term 'generation' refers to the completion of the act of producing; likewise, the term 'usage' refers to the beginning of the act of consuming entities. Thus, we define the following notions of generation and usage.
Activities are consumers of entities and producers of entities. In some cases, the consumption of an entity influences the creation of another in some way. This notion is captured by derivations, defined as follows.
A derivation is a transformation of an entity into another, a construction of an entity into another, or an update of an entity, resulting in a new one.
Examples of derivation include the transformation of a relational table into a linked data set, the transformation of a canvas into a painting, the transportation of a work of art from London to New York, and a physical transformation such as the melting of ice into water.
There are some useful types of entities and agents that are commonly encountered in applications making data and documents available on the Web; we introduce them in this section.
A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals. PROV-DM is not prescriptive about the nature of plans, their representation, the actions or steps they consist of, or their intended goals. Since plans may evolve over time, it may become necessary to track their provenance, so plans themselves are entities. Representing the plan explicitly in the provenance can be useful for various tasks: for example, to validate the execution as represented in the provenance record, to manage expectation failures, or to provide explanations.
A plan can be a blog post tutorial for how to set up a web server, a list of instructions for a micro-processor execution, a cook's written recipe for a chocolate cake, or a workflow for a scientific experiment.
A collection is an entity that provides structure to some constituents, which are themselves entities. This concept allows for the provenance of the collection, but also of its constituents, to be expressed. Such a notion of collection corresponds to a wide variety of concrete data structures, such as maps, dictionaries or associative arrays.
An example of collection is an archive of documents. Each document has its own provenance, but the archive itself also has some provenance: who maintained it, which documents it contained at which point in time, how it was assembled, etc.
An accountEntity is an entity that contains a bundle of provenance assertions.
Having found a resource, a user may want to retrieve its provenance. For users to decide whether they can place their trust in that resource, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated. Hence, from the PROV-DM data model, the provenance is regarded as an entity, an AccountEntity, for which provenance can be sought.
Three types of agents are recognized by PROV-DM because they are commonly encountered in applications making data and documents available on the Web: persons, software agents, and organizations.
Even software agents can be assigned some responsibility for the effects they have in the world, so for example if one is using a Text Editor and one's laptop crashes, then one would say that the Text Editor was responsible for crashing the laptop. If one invokes a service to buy a book, that service can be considered responsible for drawing funds from one's bank to make the purchase (the company that runs the service and the web site would also be responsible, but the point here is that we assign some measure of responsibility to software as well).
So when someone models software as an agent for an activity in the PROV-DM model, they mean the agent has some responsibility for that activity.
Agents are defined as having some kind of responsibility for activities. However, one may want to be more specific about the nature of an agent's responsibility. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow while it would matter a lot which researcher told the programmer to do so. So there is some notion of responsibility that needs to be captured.
Provenance reflects activities that have occurred. In some cases, those activities reflect the execution of a plan that was designed in advance to guide the execution. PROV-DM allows associating a plan to an activity, which represents what was intended to happen.
An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.
Examples of association between an activity and agent are:
For an agent, responsibility is the fact of being accountable for the actions of a "subordinate" agent, in the context of an activity. The nature of this relation is intended to be broad, including delegation or a contractual relation.
A student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity, and it may not matter which student published a web page but it matters a lot that the department told the student to put up the web page.
The following diagram summarizes the elements and relations just described.
add a sentence saying that it is not complete coverage of the dm in diagram.
The text should say that we introduce a few relations based on the concepts introduced in section 2.1-2.4, that these relations are used in the example of section 3, and are fully defined in section 4-5.
The note should also say why relations are in past tense (we had something in previous version of prov-dm)
I have the impression that the diagram presented in Section 2.5 would be more useful if placed at the beginning of Section 2 [KB]
There are some comments that the picture does not print well. We need to check.
Add links in the svg so that we can click on the figure.
The World Wide Web Consortium publishes many technical reports. In this example, we consider a technical report, and describe its provenance.
Specifically, we consider the second version of the PROV-DM document http://www.w3.org/TR/2011/WD-prov-dm-20111215. Its provenance can be expressed from several perspectives, which we present. In the first one, provenance is concerned with the W3C process, whereas in the second one, it takes the authors' viewpoint.
Description: The World Wide Web Consortium publishes technical reports according to its publication policy. Working drafts are published regularly to reflect the work accomplished by working groups. Every publication of a working draft must be preceded by a "publication request" to the Webmaster. The very first version of a technical report must also be preceded by a "transition request" to be approved by the W3C director. All working drafts are made available at a unique URI. In this scenario, we consider two successive versions of a given report, the policy according to which they were published, and the associated requests.
Concretely, in this section, we describe the kind of provenance record that the WWW Consortium could keep for auditors to check that due processes are followed. All entities involved in this example are Web resources, with well defined URIs (some of which locate archived email messages, available to W3C Members).
We now paraphrase some PROV-DM descriptions, and illustrate them with the PROV-N notation, a notation for PROV-DM aimed at human consumption. We then follow them with a graphical illustration. Full details of the provenance record can be found here.
entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
activity(ex:act2,,,[prov:type="publish"])
wasGeneratedBy(tr:WD-prov-dm-20111215, ex:act2)
wasDerivedFrom(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018)
used(ex:act2,ar3:0111)
wasAssociatedWith(ex:act2, w3:Consortium @ pr:rec-advance)
Provenance descriptions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance statements. Therefore, it should not be seen as an alternate notation for expressing provenance.
The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and octagonal shapes, respectively. Usage, Generation, Derivation, and Activity Association are represented as directed edges.
Entities are laid out according to the ordering of their generation event. We endeavor to show time progressing from top to bottom. This means that edges for Usage, Generation and Derivation typically point upwards.
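The layout conventions above can be sketched in a few lines of Python. This is only an illustration: the choice of Graphviz DOT as an output format, the function name to_dot, and the data structures are assumptions of this sketch and are not defined by PROV-DM. The statements are those of the first scenario.

```python
# Node shapes follow the convention described above (oval, rectangle, octagon).
# Emitting Graphviz DOT text is an assumption of this sketch, not part of PROV-DM.
NODE_SHAPES = {"entity": "ellipse", "activity": "box", "agent": "octagon"}


def to_dot(nodes, edges):
    """Render nodes {identifier: kind} and edges [(relation, source, target)] as DOT text."""
    # rankdir=BT places edge targets above edge sources, so edges point upwards
    # and time progresses from top to bottom.
    lines = ["digraph provenance {", "  rankdir=BT;"]
    for identifier, kind in nodes.items():
        lines.append(f'  "{identifier}" [shape={NODE_SHAPES[kind]}];')
    for relation, source, target in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Elements and relations taken from the first scenario.
nodes = {
    "tr:WD-prov-dm-20111215": "entity",
    "tr:WD-prov-dm-20111018": "entity",
    "ar3:0111": "entity",
    "ex:act2": "activity",
    "w3:Consortium": "agent",
}
edges = [
    ("wasGeneratedBy", "tr:WD-prov-dm-20111215", "ex:act2"),
    ("wasDerivedFrom", "tr:WD-prov-dm-20111215", "tr:WD-prov-dm-20111018"),
    ("used", "ex:act2", "ar3:0111"),
    ("wasAssociatedWith", "ex:act2", "w3:Consortium"),
]

dot_text = to_dot(nodes, edges)
```

As with the figures, such output shows only the essence of a set of statements: attributes, times and plans are dropped.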
CG: It would be helpful to see the properties labelled in the figure.
This simple example has shown a variety of PROV-DM constructs, such as Entity, Agent, Activity, Usage, Generation, Derivation, and ActivityAssociation. In this example, it happens that all entities were already Web resources, with readily available URIs, which we used. We note that some of the resources are public, whereas others have restricted access: provenance statements only make use of their identifiers. If identifiers do not pre-exist, e.g. for activities, then they can be generated, for instance ex:act2, occurring in the namespace identified by prefix ex. We note that the URI scheme developed by W3C is particularly suited for expressing provenance of these reports, since each URI denotes a specific version of a report. It then becomes very easy to relate the various versions, with PROV-DM constructs.
Description: A technical report is edited by some editor, using contributions from various contributors.
Here, we consider another perspective on technical report http://www.w3.org/TR/2011/WD-prov-dm-20111215. Provenance is concerned with the document editing activity, as perceived by authors. This kind of information could be used by authors in their CV or in a narrative about this document.
Again, we paraphrase some PROV-DM assertions, and illustrate them with the PROV-N notation. Full details of the provenance record can be found here.
entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])
While this description is about the same report tr:WD-prov-dm-20111215, its details differ from the author's perspective: it is a document and it has a version number.
activity(ex:edit1,,,[prov:type="edit"])
wasGeneratedBy(tr:WD-prov-dm-20111215, ex:edit1)
agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
wasAssociatedWith(ex:edit1, ex:Paolo, [prov:role="editor"])
wasAssociatedWith(ex:edit1, ex:Simon, [prov:role="contributor"])
CG: It would be helpful to see the properties labelled in the figure.
simplify the figure (leave just 2 authors (as in the example), or the editors), and label the edges as well.
The two previous sections provide two different perspectives on the provenance of a technical report. By design, the PROV approach allows for the provenance of a subject to be provided by multiple sources. For users to decide whether they can place their trust in the technical report, they may want to analyze its provenance, but also determine who the provenance is attributed to, and when it was generated, etc. In other words, we need to be able to express the provenance of provenance.
No new mechanism is required to support this requirement. PROV-DM makes the assumption that provenance statements have been bundled up, and named, by some mechanism outside the scope of PROV-DM. For instance, in this case, provenance statements were put in a file and exposed on the Web, respectively at ex:prov1 and ex:prov3. To express their respective provenance, these resources must be seen as entities, and all the constructs of PROV-DM are now available to characterize their provenance. In the example below, ex:prov1 is attributed to the agent w3:Consortium, whereas ex:prov3 to ex:Simon.
entity(ex:prov1, [prov:type="prov:AccountEntity" %% xsd:QName ])
wasAttributedTo(ex:prov1,w3:Consortium)
entity(ex:prov3, [prov:type="prov:AccountEntity" %% xsd:QName ])
wasAttributedTo(ex:prov3,ex:Simon)
In this section, we revisit each concept introduced in Section 2, and provide its detailed definition in the PROV data model, in terms of its various constituents.
In PROV-DM, we distinguish elements from relations, which are respectively discussed in Section 4.1 and Section 4.2.
The following expression
entity(tr:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])
states the existence of an entity, denoted by identifier tr:WD-prov-dm-20111215, with type document and version number 2. The attribute ex:version is application specific, whereas the attribute type is reserved in the PROV-DM namespace.
Further considerations:
The following expression
activity(a1,2011-11-16T16:05:00,2011-11-16T16:06:00, [ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])
states the existence of an activity with identifier a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit. The attribute host is application specific (declared in some namespace with prefix ex). The attribute type is a reserved attribute of PROV-DM, allowing for sub-typing to be expressed.
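As a rough illustration of how an application might hold such an activity in memory, the following Python sketch models its identifier, optional start and end times, and attributes. The class name Activity, its field names and the duration helper are assumptions of this sketch; PROV-DM does not prescribe any programming interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Activity:
    """Hypothetical in-memory form of an activity expression."""
    identifier: str
    start: Optional[datetime] = None   # start time, if known
    end: Optional[datetime] = None     # end time, if known
    attributes: dict = field(default_factory=dict)

    def duration(self) -> Optional[timedelta]:
        """Return end minus start, or None when either time is missing."""
        if self.start is None or self.end is None:
            return None
        return self.end - self.start


# The activity a1 from the expression above.
a1 = Activity(
    "a1",
    datetime.fromisoformat("2011-11-16T16:05:00"),
    datetime.fromisoformat("2011-11-16T16:06:00"),
    {"ex:host": "server.example.org", "prov:type": "ex:edit"},
)
```

Here a1.duration() yields one minute, while an activity recorded without times, such as Activity("a2"), yields None.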
Further considerations:
From an interoperability perspective, it is useful to define some basic categories of agents since it will improve the use of provenance by applications. There should be very few of these basic categories to keep the model simple and accessible. There are three types of agents in the model since they are common across most anticipated domains of use:
These types are mutually exclusive, though they do not cover all kinds of agent.
The following expression is about an agent identified by e1, which is a person, named Alice, with employee number 1234.
agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])
It is optional to specify the type of an agent. When present, it is expressed using the prov:type attribute.
As provenance descriptions are exchanged between systems, it may be useful to add extra-information to what they are describing. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the entities or agents involved. Likewise, an interactive visualization component may want to enrich a set of provenance descriptions with information helping reproduce their visual representation. To help with interoperability, PROV-DM introduces a simple annotation mechanism allowing anything that is identifiable to be associated with notes.
A separate PROV-DM relation is used to associate a note with something that is identifiable (see Section on annotation). A given note may be associated with multiple identifiable things.
The following note consists of a set of application-specific attribute-value pairs, intended to help the rendering of what it is associated with, by specifying its color and its position on the screen.
note(ex2:n1,[ex2:color="blue", ex2:screenX=20, ex2:screenY=30])
hasAnnotation(tr:WD-prov-dm-20111215,ex2:n1)
The note is associated with the entity tr:WD-prov-dm-20111215 previously introduced (hasAnnotation is discussed in Section Annotation). The note's identifier and attributes are declared in a separate namespace denoted by prefix ex2.
Alternatively, a reputation service may enrich a provenance record with notes providing reputation ratings about agents. In the following fragment, both agents ex:Simon and ex:Paolo are rated "excellent".
note(ex3:n2,[ex3:reputation="excellent"])
hasAnnotation(ex:Simon,ex3:n2)
hasAnnotation(ex:Paolo,ex3:n2)
The note's identifier and attributes are declared in a separate namespace denoted by prefix ex3.
This section describes all the PROV-DM relations between the elements introduced in Section Element. While these relations are not binary, they all involve two primary elements. They can be summarized as follows.
 | Entity | Activity | Agent | Note |
Entity | wasDerivedFrom alternateOf specializationOf | wasGeneratedBy | — | hasAnnotation |
Activity | used | — | wasStartedBy wasEndedBy wasAssociatedWith | hasAnnotation |
Agent | — | — | actedOnBehalfOf | hasAnnotation |
Note | — | — | — | hasAnnotation |
While each of the components activity, time, and attributes is optional, at least one of them must be present.
The following expressions
wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1", ex:order=1])
wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p1", ex:order=2])
state the existence of two generations (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, identified by e1 and e2, are created by an activity, identified by a1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order are application specific.
In some cases, we may want to record the time at which an entity was generated without having to specify the activity that generated it. To support this requirement, the activity component in generation is optional. Hence, the following expression indicates the time at which an entity is generated, without naming the activity that did it.
wasGeneratedBy(e,,2001-10-26T21:32:52)
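The rule that a generation needs at least one of activity, time, and attributes can be sketched as a small check in Python. The class name Generation, its field names and the function is_well_formed are hypothetical names chosen for this sketch only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Generation:
    """Hypothetical in-memory form of a generation expression."""
    entity: str
    activity: Optional[str] = None   # may be omitted
    time: Optional[str] = None       # may be omitted
    attributes: dict = field(default_factory=dict)


def is_well_formed(generation: Generation) -> bool:
    """True when at least one of activity, time, and attributes is present."""
    return (
        generation.activity is not None
        or generation.time is not None
        or bool(generation.attributes)
    )


# wasGeneratedBy(e,,2001-10-26T21:32:52): time only, no activity named.
time_only = Generation("e", time="2001-10-26T21:32:52")

# A generation naming only the entity carries none of the optional components.
entity_only = Generation("e")
```

With these definitions, is_well_formed(time_only) holds, whereas is_well_formed(entity_only) does not.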
A reference to a given entity may appear in multiple usages that share a given activity identifier.
The following usages
used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"])
used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])
state that the activity identified by a1 consumed two entities identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter is application specific.
A usage record's id is optional. It must be present when annotating usage records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation).
As far as responsibility is concerned, PROV-DM offers two kinds of constructs. The first, introduced in this section, is a relation between an agent, a plan, and an activity; the second, introduced in Section Responsibility, is a relation between agents expressing that an agent was acting on behalf of another, in the context of an activity.
activity(ex:a,[prov:type="workflow execution"])
agent(ex:ag1,[prov:type="operator"])
agent(ex:ag2,[prov:type="designer"])
wasAssociatedWith(ex:a,ex:ag1,[prov:role="loggedInUser", ex:how="webapp"])
wasAssociatedWith(ex:a,ex:ag2,ex:wf,[prov:role="designer", ex:context="project1"])
entity(ex:wf,[prov:type="prov:Plan" %% xsd:QName, ex:label="Workflow 1", ex:url="http://example.org/workflow1.bpel" %% xsd:anyURI])
Since the workflow ex:wf is itself an entity, its provenance can also be expressed in PROV-DM: it can be generated by some activity and derived from other entities, for instance.
An activity start is a representation of an agent starting an activity. An activity end is a representation of an agent ending an activity. Both relations are specialized forms of wasAssociatedWith. They contain attributes describing the modalities of starting/ending activities.
An activity start, written wasStartedBy(id,a,ag,attrs) in PROV-N, contains:
An activity end, written wasEndedBy(id,a,ag,attrs) in PROV-N, contains:
In the following example,
wasStartedBy(a,ag,[ex:mode="manual"])
wasEndedBy(a,ag,[ex:mode="manual"])
there is an activity denoted by a that was started and ended by an agent denoted by ag, in "manual" mode, an application specific characterization of these relations.
PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program is an agent of that activity and the person is also an agent of the activity, but we would also add that the mail software agent is running on the person's behalf. In the other example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university, and all those agents are responsible in some way for the activity to take place but we do not say explicitly who bears responsibility and to what degree.
We could also say that an agent can act on behalf of several other agents (a group of agents). This would also make it possible to indirectly reflect chains of responsibility. This also indirectly reflects control without requiring that control is explicitly indicated. In some contexts there will be a need to represent responsibility explicitly, for example to indicate legal responsibility, and that could be added as an extension to this core model. Similarly with control, since in particular contexts there might be a need to define specific aspects of control that various agents exert over a given activity.
activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
agent(ag3,[prov:type="funder"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser"])
wasAssociatedWith(a,ag2)
actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"])
actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])
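One way an application might follow such a chain of responsibility is sketched below in Python. The function name responsible_agents and the triple representation are assumptions of this sketch; PROV-DM itself does not state who bears responsibility or to what degree, so the result is only the set of agents reachable through actedOnBehalfOf in the context of one activity.

```python
from collections import deque


def responsible_agents(agent, activity, delegations):
    """List agents that `agent` acted on behalf of for `activity`, directly or via a chain.

    delegations is a list of (subordinate, responsible, activity) triples,
    one per actedOnBehalfOf statement.
    """
    found = []
    visited = {agent}
    queue = deque([agent])
    while queue:
        current = queue.popleft()
        for subordinate, responsible, context in delegations:
            if subordinate == current and context == activity and responsible not in visited:
                visited.add(responsible)   # guards against cycles
                found.append(responsible)
                queue.append(responsible)
    return found


# The two actedOnBehalfOf statements of the example above.
delegations = [
    ("ag1", "ag2", "a"),   # programmer acted on behalf of researcher
    ("ag2", "ag3", "a"),   # researcher acted on behalf of funder
]
```

For the example, responsible_agents("ag1", "a", delegations) lists ag2 followed by ag3. Because the traversal explores every matching statement, it also covers the case of an agent acting on behalf of several other agents.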
Further considerations:
According to Section Overview, for an entity to be transformed from, created from, or resulting from an update to another, there must be some underpinning activities performing the necessary actions resulting in such a derivation. A derivation can be described at various levels of precision. In its simplest form, derivation relates two entities. Optionally, attributes can be added to describe modalities of derivation. If the derivation is the result of a single known activity, then this activity can also be optionally expressed. And to provide a completely accurate description of derivation, the generation and usage of the generated and used entities, respectively, can be provided. The reason for optional information such as activity, generation, and usage to be linked to derivations is to aid analysis of provenance and to facilitate provenance-based reproducibility.
Derivation is not defined to be transitive. Domain-specific specializations of derivation may be defined in such a way that the transitivity property holds.
The following descriptions state the existence of derivations.
wasDerivedFrom(e2,e1)
wasDerivedFrom(e2,e1,[prov:type="physical transform"])
wasDerivedFrom(e2,e1,a,g2,u1)
wasGeneratedBy(g2,e2,a)
used(u1,a,e1)
The first and second lines are about derivations between e2 and e1, but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.
The third description expresses that activity a, using the entity e1 according to usage u1, derived the entity e2 and generated it according to generation g2. It is followed by descriptions for generation g2 and usage u1. With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation, it can identify how the original entity e1 was used by the activity (for instance, which argument it was passed as, if the activity is the result of a function invocation), and which output the derived entity e2 was obtained from (say, for a function returning multiple results).
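The kind of analysis described above can be sketched in Python as follows. The dictionary layout, the key names and the function underpinning_activity are hypothetical, and the agreement check between the derivation and its generation and usage is an assumption of this sketch rather than a constraint stated in this section.

```python
def underpinning_activity(derivation, generations, usages):
    """Return the derivation's activity if its generation and usage agree with it, else None."""
    generation = generations.get(derivation["generation"])
    usage = usages.get(derivation["usage"])
    if generation is None or usage is None:
        return None
    consistent = (
        generation["entity"] == derivation["derived"]      # g2 generated e2
        and usage["entity"] == derivation["source"]        # u1 used e1
        and generation["activity"] == usage["activity"] == derivation["activity"]
    )
    return derivation["activity"] if consistent else None


# wasGeneratedBy(g2,e2,a) and used(u1,a,e1), keyed by their identifiers.
generations = {"g2": {"entity": "e2", "activity": "a"}}
usages = {"u1": {"entity": "e1", "activity": "a"}}

# wasDerivedFrom(e2,e1,a,g2,u1)
derivation = {
    "derived": "e2",
    "source": "e1",
    "activity": "a",
    "generation": "g2",
    "usage": "u1",
}
```

For the statements above, underpinning_activity(derivation, generations, usages) yields a. A derivation that refers to an unknown usage or generation identifier yields None.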
The purpose of this section is to introduce relations between two entities that refer to the same thing in the world. Consider for example three entities:
These entities refer to the same real person Bob, either in different contexts, or at different levels of abstraction. Specifically:
The following two relations are introduced for expressing alternative or specialized entities.
The following expressions describe two persons, respectively holder of a Facebook account and a Twitter account, and their relation as alternate.
entity(facebook:ABC, [ prov:type="person with Facebook account" ])
entity(twitter:XYZ, [ prov:type="person with Twitter account" ])
alternateOf(facebook:ABC, twitter:XYZ)
The following expressions describe two persons, the second of which is holder of a Twitter account. The second entity is a specialization of the first.
entity(ex:Bob, [ prov:type="person", ex:name="Bob" ])
entity(twitter:XYZ, [ prov:type="person with Twitter account" ])
specializationOf(twitter:XYZ, ex:Bob)
Multiple notes can be associated with a given identified object; symmetrically, multiple objects can be associated with a given note. Since notes have identifiers, they can also be annotated. The annotation mechanism (with note and annotation) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).
The following expressions
entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
activity(a,t1,t2)
used(u1,a,e1,[ex:file="stdin"])
wasGeneratedBy(e2, a, [ex:file="stdout"])
note(n1,[ex:icon="doc.png"])
hasAnnotation(e1,n1)
hasAnnotation(e2,n1)
note(n2,[ex:style="dotted"])
hasAnnotation(u1,n2)
describe two documents (attribute-value pair: prov:type="document") identified by e1 and e2, and their annotation with a note indicating that the icon (an application-specific way of rendering provenance) is doc.png. The example also includes an activity, its usage of the first entity, and its generation of the second entity. The usage is annotated with a style (an application-specific way of rendering this edge graphically). To be able to express this annotation, the usage was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).
A PROV-DM namespace is identified by an IRI reference [IRI]. In PROV-DM, attributes, identifiers, and literals with qualified names as data type can be placed in a namespace using the mechanisms described in this specification.
A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every un-prefixed qualified name in the scope of this default namespace declaration refers to this namespace.
The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).
An identifier is a qualified name.
A qualified name is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name.
PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.
A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in a namespace declaration. In the absence of prefix, the qualified name refers to the default namespace.
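The mapping from qualified names to IRIs can be sketched in a few lines of Python. This is only an illustration of the concatenation rule stated above: the function name expand_qname, the dictionary of declarations, and the binding of the prefix ex to http://example.org/ are assumptions of the sketch, and the PROV-DM namespace IRI is the provisional one given in this section.

```python
def expand_qname(qname, namespaces, default_namespace=None):
    """Map a qualified name to an IRI by concatenating the namespace IRI
    and the local part. (Hypothetical helper, for illustration only.)"""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No prefix: the name refers to the default namespace.
        if default_namespace is None:
            raise ValueError(f"no default namespace declared for {qname!r}")
        return default_namespace + qname
    if prefix not in namespaces:
        raise ValueError(f"undeclared prefix {prefix!r}")
    return namespaces[prefix] + local


# Namespace declarations: bindings between prefixes and namespaces.
namespaces = {
    "prov": "http://www.w3.org/ns/prov-dm/",  # provisional (TBC) in this draft
    "ex": "http://example.org/",              # assumed binding, for illustration
}
```

Under these declarations, prov:type maps to http://www.w3.org/ns/prov-dm/type, and an un-prefixed name is resolved against whatever default namespace is supplied.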
An attribute is a qualified name.
The PROV data model introduces a pre-defined set of attributes in the PROV-DM namespace, which we define below. The interpretation of any attribute declared in another namespace is out of scope.
The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, activity association, activity start, and activity end. The attribute prov:role is allowed to occur multiple times in a list of attribute-value pairs. The value associated with a prov:role attribute must be a PROV-DM Literal.
The following activity start describes the role of the agent identified by ag in this start relation with activity a.
wasStartedBy(a,ag, [prov:role="program-operator"])
The attribute prov:type provides further typing information for an element or relation. PROV-DM liberally defines a type as a category of things having common characteristics. PROV-DM is agnostic about the representation of types, and only states that the value associated with a prov:type attribute must be a PROV-DM Literal. The attribute prov:type is allowed to occur multiple times.
The following describes an agent of type software agent.
agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])
The attribute prov:label provides a human-readable representation of a PROV-DM element or relation. The value associated with the attribute prov:label must be a string.
A location can be an identifiable geographic place (ISO 19112), but it can also be a non-geographic place such as a directory, row, or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations, by means of attributes.
The attribute prov:location is an optional attribute of entity and activity. The value associated with the attribute prov:location must be a PROV-DM Literal, expected to denote a location.
The following expression describes entity Mona Lisa, a painting, with a location attribute.
entity(ex:MonaLisa, [prov:location="Le Louvre, Paris", prov:type="StillImage"])
A PROV-DM Literal represents a data value such as a particular string or number. A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM.
The following examples are, respectively, the string "abc", the integer number 1, and the IRI "http://example.org/foo".
"abc"
1
"http://example.org/foo" %% xsd:anyURI
The following example shows a literal of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in a namespace declaration.
"ex:value" %% xsd:QName
Time instants are defined according to xsd:dateTime [XMLSCHEMA-2].
Time is optional in usage, generation, and activity.
The following figure summarizes the additional relations described in this section.
A revision is the result of revising an entity into a revised version. Deciding whether something is made available as a revision of something else usually involves an agent who takes responsibility for approving that the former is a due variant of the latter. The agent who is responsible for the revision may optionally be specified. Revision is a particular case of derivation of an entity into its revised version.
A revision relation, written wasRevisionOf(id,e2,e1,ag,attrs) in PROV-N, contains:
Revisiting the example of Section 3.1, we can now state that the report tr:WD-prov-dm-20111215 is a revision of the report tr:WD-prov-dm-20111018, approved by agent w3:Consortium.
entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
entity(tr:WD-prov-dm-20111018, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasRevisionOf(tr:WD-prov-dm-20111215, tr:WD-prov-dm-20111018, w3:Consortium)
Attribution is the ascribing of an entity to an agent. More precisely, when an entity e is attributed to agent ag, entity e was generated by some activity a, which in turn was associated with agent ag. Thus, this relation is useful when the activity is not known, or irrelevant.
An attribution relation, written wasAttributedTo(id,e,ag,attrs) in PROV-N, contains the following elements:
Revisiting the example of Section 3.2, we can ascribe tr:WD-prov-dm-20111215 to some agents without having to make an activity explicit.
agent(ex:Paolo, [ prov:type="Person" ])
agent(ex:Simon, [ prov:type="Person" ])
entity(tr:WD-prov-dm-20111215, [ prov:type="pr:RecsWD" %% xsd:QName ])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Paolo, [prov:role="editor"])
wasAttributedTo(tr:WD-prov-dm-20111215, ex:Simon, [prov:role="contributor"])
The following relations express dependencies amongst activities.
An information flow ordering relation, written as wasInformedBy(id,a2,a1,attrs) in PROV-N, contains:
Relation wasInformedBy is not transitive.
Consider two long running services, which we represent by activities s1 and s2.
activity(s1,,,[prov:type="service"])
activity(s2,,,[prov:type="service"])
wasInformedBy(s2,s1)
The last line indicates that some entity was generated by s1 and used by s2.
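The reading given for wasInformedBy(s2,s1) can be turned around into a simple check: given generation and usage records, an application might look for pairs of activities that share an entity in this way. The Python sketch below does so under stated assumptions: the function name informed_by_pairs and the tuple layout are hypothetical, the entity e is invented for the illustration, and treating every such pair as a candidate wasInformedBy is a choice of this sketch, not a rule stated in this section.

```python
def informed_by_pairs(generations, usages):
    """Return the set of (a2, a1) pairs such that a2 used an entity
    that a1 generated.

    generations: iterable of (entity, activity) pairs
    usages:      iterable of (activity, entity) pairs
    (Hypothetical helper, for illustration only.)
    """
    generated_by = {}
    for entity, activity in generations:
        generated_by.setdefault(entity, set()).add(activity)

    pairs = set()
    for activity, entity in usages:
        for generator in generated_by.get(entity, ()):
            pairs.add((activity, generator))
    return pairs


# A hypothetical entity e, generated by service s1 and used by service s2.
generations = [("e", "s1")]
usages = [("s2", "e")]
```

Here informed_by_pairs(generations, usages) contains the single pair (s2, s1), matching the last line of the example.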
A control ordering relation, written as wasStartedBy(id, a2, a1, attrs) in PROV-N, contains:
Suppose activities a1 and a2 are computer processes that are executed on different hosts, and that a1 started a2. This can be expressed as in the following fragment:
activity(a1,t1,t2,[ex:host="server1.example.org",prov:type="workflow"])
activity(a2,t3,t4,[ex:host="server2.example.org",prov:type="subworkflow"])
wasStartedBy(a2,a1)
A traceability relation between two entities e2 and e1 is a generic dependency of e2 on e1 that indicates either that e1 was necessary for e2 to be created, or that e1 bears some responsibility for e2's existence.
A traceability relation, written tracedTo(id,e2,e1,attrs) in PROV-N, contains:
We note that the ancestor is allowed to be an agent since agents are entities.
We refer to the example of Section 3.1, and specifically to Figure prov-tech-report. We can see that there is a path from tr:WD-prov-dm-20111215 to w3:Consortium or to pr:rec-advance. This is expressed as follows.
tracedTo(tr:WD-prov-dm-20111215,w3:Consortium) tracedTo(tr:WD-prov-dm-20111215,pr:rec-advance)
Derivation and association are particular cases of traceability.
A quotation is the repetition of an entity (such as text or image) by someone other than its original author. Quotation is a particular case of derivation in which entity e2 is derived from entity e1 by copying, or "quoting", parts of it.
A quotation relation, written wasQuotedFrom(id,e2,e1,ag2,ag1,attrs) in PROV-N, contains:
An original source relation is a particular case of derivation that states that an entity e2 (derived) was originally part of some other entity e1 (the original source).
An original source relation, written hadOriginalSource(id,e2,e1,attrs), contains:
Collection relations address the need to describe the evolution of entities that have a collection structure, that is, which may contain other entities. Specifically, this section exploits the built-in type for entities, called collection, and two relations to describe the effect of adding elements to, and removing elements from, a collection entity. The intent of these relations and entity types is to capture the history of changes that occurred to a collection.
A collection is an entity that has a logical internal structure consisting of key-value pairs, often referred to as a map. More precisely, the following entity types are introduced:
entity(c, [prov:type="EmptyCollection"])      // c is an empty collection
entity(v1)
entity(v2)
entity(c1, [prov:type="Collection"])
entity(c2, [prov:type="Collection"])
CollectionAfterInsertion(c1, c, "k1", v1)     // c1 = { ("k1",v1) }
CollectionAfterInsertion(c2, c1, "k2", v2)    // c2 = { ("k1",v1), ("k2", v2) }
CollectionAfterRemoval(c3, c2, "k1")          // c3 = { ("k2",v2) }
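The sequence of collection states in the example above can be replayed with ordinary maps. In the following Python sketch, each collection entity is modelled as a separate dictionary, so that an insertion or removal yields a new collection and leaves the earlier one unchanged. The function names after_insertion and after_removal are assumptions of this sketch, and the values v1 and v2 are represented by plain strings.

```python
def after_insertion(coll_before, key, value):
    """Return a new collection equal to coll_before plus (key, value)."""
    coll_after = dict(coll_before)   # copy: coll_before remains a distinct entity
    coll_after[key] = value
    return coll_after


def after_removal(coll_before, key):
    """Return a new collection equal to coll_before without the given key."""
    coll_after = dict(coll_before)
    coll_after.pop(key, None)        # removing an absent key leaves the copy unchanged
    return coll_after


c = {}                                   # the empty collection
c1 = after_insertion(c, "k1", "v1")      # c1 = { ("k1", v1) }
c2 = after_insertion(c1, "k2", "v2")     # c2 = { ("k1", v1), ("k2", v2) }
c3 = after_removal(c2, "k1")             # c3 = { ("k2", v2) }
```

Keeping every intermediate state as its own value mirrors the intent of capturing the history of changes: c, c1, c2, and c3 all remain available for inspection after the last operation.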
A relation CollectionAfterInsertion, written CollectionAfterInsertion(collAfter, collBefore, key, value), contains:
A relation CollectionAfterDeletion, written CollectionAfterDeletion(collAfter, collBefore, key), contains:
Further considerations:
The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:
The PROV-DM namespace declares a set of reserved attributes catering for extensibility: type, location.
To this end, the PROV-DM namespace declares a reserved attribute: role.
The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure interoperability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in the PROV-DM documents (parts 1 to 3).
The example of section 3 contains identifiers such as tr:WD-prov-dm-20111215, which denotes a specific version of a technical report. On the other hand, a URI such as http://www.w3.org/TR/prov-dm/ points to the latest version of a document. One needs to ensure that provenance descriptions for the latter document remain valid as denoted resources change.
To this end, PROV-DM allows asserters to describe "partial states" of entities by means of attributes and associated values. Some further constraints apply to the use of these attributes, since the values associated with them are expected to remain unchanged for some period of time. The constraints associated with attributes are also specified in the companion specification [PROV-DM-CONSTRAINTS].
Even though a mechanism for bundling up provenance descriptions and naming them is not part of PROV-DM, the idea of a bundle of descriptions is crucial to the PROV approach. Indeed, it allows multiple provenance perspectives to be provided for a given entity. It is also the mechanism by which provenance of provenance can be expressed. Such a named bundle is referred to as an account and is regarded as an AccountEntity so that its provenance can be expressed. The notion of account is specified in the companion specification [PROV-DM-CONSTRAINTS], as are the constraints that structurally well-formed descriptions are expected to satisfy.