This document aims to provide an intuitive guide to the Prov Data Model, with worked examples.

This is a document for internal discussion, which will ultimately evolve into the first Public Working Draft of the Primer.

Introduction

The Prov Data Model (Prov-DM) is used to describe the provenance of things, i.e. how something came to be, from what sources, its history, etc. As such, Prov-DM data consists of assertions about the past. These assertions are not assessments, e.g. as to something's authenticity, but the plain facts from which such assessments might be derived.

This guide aims to ease the adoption of the standard by providing:

Provenance

Provenance has many meanings, depending on what one is interested in with regard to the object or resource in question. Different people may have different perspectives, focusing on different types of information that might be captured in a provenance record.

One perspective might focus on entity-centered provenance, that is, what entities were involved in generating or manipulating the information in question. Examples of entities include author, editor, publisher, curator, etc.

A second perspective might focus on document-centered provenance, tracing the origins of portions of a document to other documents. Examples include referring to other news sources or quoting statistics from reports by government or non-government agencies.

A third perspective might focus on process-centered provenance, capturing the actions and steps taken to generate the information in question (e.g., a data transformation or an edit). An example is the record of the execution of processes as workflows of web services.

Provenance as data

Explains the contexts in which the reader may see or create Prov-DM data.

Intuitive overview of Prov-DM

This section provides an intuitive explanation of the concepts in Prov-DM. As with the rest of this document, it should be treated as a starting point for understanding the model, and not normative in itself. The model specification provides the precise definitions and constraints to be followed in using Prov-DM.

Entities

An intuitive overview of how to think about entities and their characterising attributes in Prov-DM.

Process Executions

An intuitive overview of how to think about process executions in Prov-DM.

Used and WasGeneratedBy

An intuitive overview of how to think about use and generation events in Prov-DM.

Agents

An intuitive overview of how to think about agents in Prov-DM.

Accounts

An intuitive overview of how to think about accounts in Prov-DM.

Roles

An intuitive description of how the roles of entities in processes are expressed.

Revision

An intuitive overview of how to think about revision relations in Prov-DM.

Complementarity

Several asserted entities can characterize the same thing, in particular when entities are asserted by different accounts or over different time periods. If two such entities have overlapping lifespans, and the first entity has some attributes that have not been asserted (and are not necessarily always true) for the second entity, then the first entity is said to complement the second entity. That is, the first entity helps form a more detailed description of the second entity, at least for the duration of the overlapping lifespan.

In addition, if :A prov:wasComplementOf :B, then all the attributes of the entity :A which can be mapped to compatible attributes of :B MUST match for the continuous duration of the overlap of the lifespans of :A and :B. It is out of scope for PROV to specify or assert the nature of the compatibility mapping and matching; the exact interpretation of these is left to the asserter of wasComplementOf.

If :B also has some attributes which are not asserted (or not always true) about :A, then this MAY be asserted using the inverse relation :B prov:wasComplementOf :A. If two entities both complement each other in this manner, both MUST have some attributes the other does not have, although those attributes might not have been asserted in the provenance. Note that the lack of such an inverse assertion does not necessarily mean that :B did not have any additional attributes for :A in the timespan, only that this has not been asserted.

In the simplest case, both entities are described using the same attributes, in which case matching means the values SHOULD literally be the same (matching by identity). On the other hand, an attribute like ex1:speed_in_mph can be mapped to a compatible ex2:speed_in_kmh attribute. Not all attributes are mappable in both directions; for instance, ex1:city can be mapped to ex2:country, but not vice versa.
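As a rough illustration of matching by identity versus matching through a compatibility mapping, the following Python sketch may help. PROV deliberately leaves the mapping unspecified, so the COMPATIBILITY table, the CITY_TO_COUNTRY lookup, and the helper names are all hypothetical; only the attribute names come from the text above.

```python
import math

# Hypothetical lookup used for the one-directional city -> country mapping.
CITY_TO_COUNTRY = {"Manchester": "England", "Leeds": "England"}

# Hypothetical compatibility table: attribute of the first entity ->
# (compatible attribute of the second entity, conversion function).
# There is deliberately no entry mapping ex2:country back to ex1:city.
COMPATIBILITY = {
    "ex1:speed_in_mph": ("ex2:speed_in_kmh", lambda mph: mph * 1.609344),
    "ex1:city": ("ex2:country", lambda city: CITY_TO_COUNTRY[city]),
}


def same(x, y):
    """Compare two values, tolerating rounding error for floats."""
    if isinstance(x, float) and isinstance(y, float):
        return math.isclose(x, y, rel_tol=1e-9)
    return x == y


def matches(attrs_a, attrs_b):
    """Return True if every mappable attribute of A matches its counterpart in B."""
    for name, value in attrs_a.items():
        if name in attrs_b:
            # Same attribute on both entities: match by identity.
            if not same(value, attrs_b[name]):
                return False
        elif name in COMPATIBILITY:
            target, convert = COMPATIBILITY[name]
            # Mapped attribute: compare after conversion, if B states it.
            if target in attrs_b and not same(convert(value), attrs_b[target]):
                return False
        # Attributes with no counterpart place no constraint on matching.
    return True


car_a = {"ex1:speed_in_mph": 60.0, "ex1:city": "Manchester"}
car_b = {"ex2:speed_in_kmh": 96.56064, "ex2:country": "England"}
print(matches(car_a, car_b))
```

Attributes that have no counterpart are simply ignored here, which reflects the idea that only mappable attributes are required to match.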

Note that it is out of scope for PROV to assert or explain any mapping of compatible attributes. This is merely a conclusion that can be drawn from the assertion that the two entities both described the same thing in the overlapping time spans. Also note that asserting a complementary relationship does not detail how the two entity timespans overlap; this could be anything from a complete one-to-one match (where all attributes are always true for both entities) to merely touching overlaps.

Derivation

An intuitive overview of how to think about the different kinds of derivation relation in Prov-DM.

Worked Examples

In the following sections, we show how Prov-DM can be used to model provenance in specific examples.

We include examples of how the formal ontology can be used to represent the Prov-DM assertions as RDF triples. These are shown using the Turtle notation. In the latter depictions, the namespace prefix prov denotes terms from the Prov ontology, while ex1, ex2, etc. denote terms specific to the example.

We also provide a representation of the examples in the Abstract Syntax Model used in the conceptual model document. The full ASM data is included in the appendix.

Entities

An online newspaper publishes an article making use of data (GovData) provided through a government portal in England. The article includes a chart based on GovData. A blogger, Betty, looking at the chart, spots what she thinks to be an error. Betty retrieves the provenance of the chart, to determine from where the facts presented derive.

The Prov data includes the assertions:

ex1:chart1 a prov:Entity .
ex1:dataSet1 a prov:Entity .

These statements, in order, assert that the chart (ex1:chart1) is an entity and that the data set (ex1:dataSet1) is an entity.

Process Executions

Further, the Prov data asserts that there was a process execution (ex1:compiled) denoting the compilation of the chart from the data set:

ex1:compiled a prov:ProcessExecution .

Used and WasGeneratedBy

Finally, the Prov data asserts that the chart was generated by this compilation process, the compilation process made use of GovData, and the chart was derived from the data set (more on derivation below).

ex1:chart1 prov:wasGeneratedBy ex1:compiled .
ex1:compiled prov:used ex1:dataSet1 .
ex1:chart1 prov:wasDerivedFrom ex1:dataSet1 .

From this information Betty can see that the mistake could have been in the original data set or else was introduced in the compilation process, and sets out to discover which.
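Betty's reasoning can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the assertions from this example are held as plain (subject, predicate, object) tuples without any RDF library, and the helper names objects and possible_error_sources are hypothetical.

```python
# The assertions from the example, as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex1:chart1", "rdf:type", "prov:Entity"),
    ("ex1:dataSet1", "rdf:type", "prov:Entity"),
    ("ex1:compiled", "rdf:type", "prov:ProcessExecution"),
    ("ex1:chart1", "prov:wasGeneratedBy", "ex1:compiled"),
    ("ex1:compiled", "prov:used", "ex1:dataSet1"),
    ("ex1:chart1", "prov:wasDerivedFrom", "ex1:dataSet1"),
]


def objects(subject, predicate):
    """Return all objects asserted for the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def possible_error_sources(entity):
    """List the process executions that generated the entity and their inputs."""
    sources = []
    for process in objects(entity, "prov:wasGeneratedBy"):
        sources.append(process)                         # the error may be in the process
        sources.extend(objects(process, "prov:used"))   # or in what it used
    return sources


print(possible_error_sources("ex1:chart1"))
```

The result lists the compilation process and the data set it used, which are the two places Betty has to investigate.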

Agents

Suggested example: Digging deeper, Betty wants to know who compiled the chart. This turns out to be an independent analyst, Derek.

Accounts

Suggested example: The analyst provides his own record of how he compiled GovData to create the chart, which provides more detail than in the newspaper's provenance data. Specifically, the analyst's account separates compilation into two stages: aggregating data by region and then producing the graphic. Therefore, there are two separate accounts of the same events.

Roles

Suggested example: For Betty to know where the error lies, she needs to understand what other information the compilation process was based on. The aggregation step of the process used a list of regions not present in the original data, but determined by the analyst. How does she distinguish the roles played by the two inputs to the aggregation process?

Revision

Suggested example: After looking at the detail of the compilation process, there appears to be nothing wrong, so Betty concludes the error is in GovData. She contacts the government, and a new revision of GovData is created. How does the provenance document that the new revision is a revision of the old revision?

Complementarity

Betty lets Derek know that a new revision of the data set exists, and he looks at the provenance of the new data to understand what he needs to reanalyse.

In addition to specifying that ex1:dataSet2 is a new revision of ex1:dataSet1, the provenance from DataGov also asserts that both of these entities were a complement of another entity ex1:dataSet.

     ex1:dataSet1 prov:wasComplementOf ex1:dataSet .
     ex1:dataSet2 prov:wasComplementOf ex1:dataSet .
     

This assertion means that ex1:dataSet1 at some point shared its characterising attributes with ex1:dataSet, and the same holds for ex1:dataSet2. Thus the entity ex1:dataSet1 did at some point represent the same thing as characterized by the entity ex1:dataSet. The same is true for ex1:dataSet2, but not necessarily at the same point in time.

The term "was complement of" here means that ex1:dataSet1 provides additional details that add to those of ex1:dataSet (complementing it), and that both of these entities represented the same thing. The characterizing attributes of ex1:dataSet are thereby asserted to have been compatible with the attributes of ex1:dataSet1 and ex1:dataSet2. Compatible here means that some kind of mapping can be established between the attributes; they don't necessarily have to match directly.

Derek then looks at the characterization of ex1:dataSet to find these compatible attributes:

     ex1:dataSet a ex1:DataSet ;
         ex1:regions ( ex1:North ex1:NorthWest ex1:East ) ;
         dc:creator ex1:DataGov ;
         dc:title "Regional incidence dataset 2011" .
     

From this, Derek can deduce that both datasets had at some point the same creator and title. Derek then compares this to the attributes for each of the complementing entities:

     ex1:dataSet1 a ex1:DataSet ;
         ex1:postCodes ( "N1" "N2" "NW1" "E1" "E2" ) ;
         ex1:totalIncidents 141 ;
         dc:creator ex1:DataGov ;
         dc:title "Regional incidence dataset 2011" .
     

Derek sees that the creator and title are directly mappable and equal between these entities. He also knows (from his region aggregation method) that the ex1:postCodes N1 and N2 are in the region ex1:North, and so on, and can confirm that although this regional characterisation of the data is not expressed using the same attributes in the two entities, they are compatible.
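Derek's compatibility check on the regional attributes could be sketched as follows. The text only states that N1 and N2 are in ex1:North "and so on", so the AREA_TO_REGION table is an assumption standing in for his aggregation method, and treating "compatible" as "covers exactly the same regions" is likewise a simplification.

```python
# Attribute values taken from the characterisations of ex1:dataSet and ex1:dataSet1.
DATASET_REGIONS = ("ex1:North", "ex1:NorthWest", "ex1:East")
DATASET1_POSTCODES = ("N1", "N2", "NW1", "E1", "E2")

# Hypothetical mapping from postcode area to region (assumed, not asserted in the provenance).
AREA_TO_REGION = {"N": "ex1:North", "NW": "ex1:NorthWest", "E": "ex1:East"}


def region_of(postcode):
    """Map a postcode such as "NW1" to its region via its letter prefix."""
    area = postcode.rstrip("0123456789")
    return AREA_TO_REGION[area]


def regions_covered(postcodes):
    """Return the set of regions that a list of postcodes falls into."""
    return {region_of(code) for code in postcodes}


# The two attributes are treated as compatible if they describe the same regions.
compatible = regions_covered(DATASET1_POSTCODES) == set(DATASET_REGIONS)
print(compatible)
```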

Derek notes that ex1:totalIncidents is not stated for ex1:dataSet, and is not mappable to any of the other existing attributes. Thus this could be one of the complementing attributes that makes ex1:dataSet1 more specific than ex1:dataSet. From the assertion ex1:dataSet1 prov:wasComplementOf ex1:dataSet, Derek can see that ex1:dataSet did have 141 incidents when its characterization interval overlapped that of ex1:dataSet1, but not necessarily throughout its lifetime. Note that in this example the provenance assertions do not provide any direct description of the characterization interval of the entities.

Due to the open world assumption (more information might be added later), he cannot conclude from this alone that ex1:dataSet at any point did not have 141 incidents. He therefore does not know for sure that ex1:totalIncidents is a complementing attribute which ex1:dataSet does not have in its characterisation.

Derek finally compares the newer revision ex1:dataSet2 with ex1:dataSet:

     ex1:dataSet2 a ex1:DataSet ;
         ex1:postCodes ( "N1" "N2" "NW1" "NW2" "E1" "E2" ) ;
         ex1:totalIncidents 158 ;
         dc:creator ex1:DataGov ;
         dc:title "Regional incidence dataset 2011" .
     

In this revision, the new postcode NW2 appears; this is still compatible with the region ex1:NorthWest of ex1:dataSet. On the other hand, the attribute ex1:totalIncidents has gone up to 158.

From the prov:wasComplementOf assertion Derek knows that ex1:dataSet2 also provides additional attributes for ex1:dataSet. Because the total incidents can't both be 141 and 158, the attribute ex1:totalIncidents is a complementing attribute that changes over the characterisation interval (lifespan) of ex1:dataSet, and is thus not one of its characterising attributes. He also now knows that ex1:dataSet is a common characterisation of the dataset that spans (parts of) both revisions. It has however not been asserted explicitly that ex1:dataSet is a somewhat more general characterisation, just that it allows mutability on the ex1:totalIncidents attribute and overlapped (parts of) the timespans of the two revisions.

From this Derek concludes that he can still use the regions North, North West and East in the diagram layout, but as the ex1:totalIncidents values differ, something in the raw data has changed. He can't tell from this provenance assertion alone whether that is merely due to the addition of the post code NW2, or whether data for the other post codes has changed as well. Derek decides to redo the aggregation by region using ex1:dataSet2 and regenerate the graphics using the same layout.
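The comparison Derek performs across the three characterisations can be illustrated with a small Python sketch. The attribute values are those given in this example; the dictionaries and the helper names stable_attributes and varying_attributes are hypothetical, and the sketch compares values literally rather than through any compatibility mapping.

```python
# Attribute values from the three characterisations in the example.
DATASET = {
    "ex1:regions": ("ex1:North", "ex1:NorthWest", "ex1:East"),
    "dc:creator": "ex1:DataGov",
    "dc:title": "Regional incidence dataset 2011",
}
DATASET1 = {
    "ex1:postCodes": ("N1", "N2", "NW1", "E1", "E2"),
    "ex1:totalIncidents": 141,
    "dc:creator": "ex1:DataGov",
    "dc:title": "Regional incidence dataset 2011",
}
DATASET2 = {
    "ex1:postCodes": ("N1", "N2", "NW1", "NW2", "E1", "E2"),
    "ex1:totalIncidents": 158,
    "dc:creator": "ex1:DataGov",
    "dc:title": "Regional incidence dataset 2011",
}


def stable_attributes(*entities):
    """Attributes stated by every entity with one and the same value."""
    shared = set(entities[0]).intersection(*entities[1:])
    return {name for name in shared if len({e[name] for e in entities}) == 1}


def varying_attributes(*entities):
    """Attributes stated by more than one entity with differing values."""
    names = set().union(*entities)
    return {
        name
        for name in names
        if len({e[name] for e in entities if name in e}) > 1
    }


print(sorted(stable_attributes(DATASET, DATASET1, DATASET2)))
print(sorted(varying_attributes(DATASET1, DATASET2)))
```

The creator and title come out as stable across all three entities, while the post codes and the total number of incidents differ between the two revisions, which is what prompts Derek to redo the aggregation.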

Derivation

Suggested example: Derek creates a new chart based on the revised data, using the same compilation process as before. Betty checks the article again at a later point, and wants to know if it is based on the old or new GovData. The newspaper's provenance data says that the article is "derived from" the updated GovData, while the analyst's provenance data says it is "eventually derived from" the same. How should she interpret this?

Frequently asked questions

Abstract Syntax Notation for Examples

Acknowledgements

WG membership to be listed here.