This document aims to provide an intuitive guide to the Prov Data Model, with worked examples.
This is a document for internal discussion, which will ultimately evolve into the first Public Working Draft of the Primer.
The Prov Data Model (Prov-DM) is used to describe the provenance of things, i.e. how something came to be, from what sources, its history, etc. As such, Prov-DM data consists of assertions about the past. These assertions are not assessments, e.g. as to something's authenticity, but the plain facts from which such assessments might be derived.
This guide aims to ease the adoption of the standard by providing:
Provenance has many meanings depending on what one is interested in with regard to the object or resource in question. Different people may have different perspectives, focusing on different types of information that might be captured in a provenance record.
One perspective might focus on entity-centered provenance, that is, what entities were involved in generating or manipulating the information in question. Examples of entities include author, editor, publisher, curator, etc.
A second perspective might focus on document-centered provenance, tracing the origins of portions of a document to other documents. Examples include referring to other news sources, quoting statistics from reports by government or non-government agencies, etc.
A third perspective might focus on process-centered provenance, capturing the actions and steps taken to generate the information in question (e.g., a data transformation, an edit, etc.). An example is the record of execution of processes as workflows of web services.
Explains the contexts in which the reader may see or create Prov-DM data.
This section provides an intuitive explanation of the concepts in Prov-DM. As with the rest of this document, it should be treated as a starting point for understanding the model, and not normative in itself. The model specification provides the precise definitions and constraints to be followed in using Prov-DM.
An intuitive overview of how to think about entities and their characterising attributes in Prov-DM.
An intuitive overview of how to think about provenance executions in Prov-DM.
An intuitive overview of how to think about use and generation events in Prov-DM.
An intuitive overview of how to think about agents in Prov-DM.
An intuitive overview of how to think about accounts in Prov-DM.
An intuitive description of how the roles of entities in processes are expressed.
An intuitive overview of how to think about revision relations in Prov-DM.
Several asserted entities can characterize the same thing, in particular when entities are asserted in different accounts or over different time periods. If two such entities have overlapping lifespans, and the first entity has some attributes that have not been asserted (and are not necessarily always true) for the second entity, then the first entity is said to complement the second entity; that is, the first entity helps form a more detailed description of the second entity, at least for the duration of the overlapping lifespan.
In addition, if :A prov:wasComplementOf :B, then all of the attributes of the entity :A which can be mapped to compatible attributes of :B MUST match for the continuous duration of the overlap of :A and :B's lifespans. It is out of scope for PROV to specify or assert the nature of the compatibility mapping and matching; the exact interpretation of these is left to the asserter of wasComplementOf.
If :B also has some attributes which are not asserted (or not always true) about :A, then this MAY be asserted using the inverse relation :B prov:wasComplementOf :A. If two entities both complement each other in this manner, both MUST have some attributes the other does not have, although those attributes MAY not have been asserted in the provenance. Note that the lack of such an inverse assertion does not necessarily mean that :B did not have any additional attributes for :A in the timespan, only that this has not been asserted.
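The mutual case above can be sketched as two inverse assertions, using the :A and :B placeholders from the text (an illustrative fragment, not a complete provenance record):

```turtle
# :A provides details beyond :B during their overlapping lifespans...
:A prov:wasComplementOf :B .
# ...and :B likewise provides details beyond :A
:B prov:wasComplementOf :A .
```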
In the simplest case, both entities are described using the same attributes, in which case matching means the values SHOULD literally be the same (matching by identity). On the other hand, an attribute like ex1:speed_in_mph can be mapped to a compatible ex2:speed_in_kmh attribute. Not all attributes might be mappable in both directions, for instance ex1:city to ex2:country, but not vice versa.
Note that it is out of scope for PROV to assert or explain any mapping of compatible attributes. This is merely a conclusion that can be drawn from the assertion that the two entities both described the same thing in the overlapping time spans. Also note that asserting a complementary relationship does not detail how the two entities' timespans overlap; this could be anything from a complete one-to-one match (where all attributes are always true for both entities) to a merely touching overlap.
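The unit-mapping case above can be sketched with two hypothetical entities describing the same vehicle (the ex1:car and ex2:car identifiers and their attribute values are invented here for illustration):

```turtle
# Two asserted entities characterizing the same car
ex1:car a prov:Entity ;
    ex1:speed_in_mph 50 .
ex2:car a prov:Entity ;
    ex2:speed_in_kmh 80 .   # compatible with 50 mph after unit conversion

# The asserter judges the mapped attributes to be matching
ex1:car prov:wasComplementOf ex2:car .
```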
An intuitive overview of how to think about the different kinds of derivation relation in Prov-DM.
In the following sections, we show how Prov-DM can be used to model provenance in specific examples.
We include examples of how the formal ontology can be used to represent the Prov-DM assertions as RDF triples. These are shown using the Turtle notation. In the latter depictions, the namespace prefix prov denotes terms from the Prov ontology, while ex1, ex2, etc. denote terms specific to the example.
We also provide a representation of the examples in the Abstract Syntax Model used in the conceptual model document. The full ASM data is included in the appendix.
An online newspaper publishes an article making use of data (GovData) provided through a government portal in England. The article includes a chart based on GovData. A blogger, Betty, looking at the chart, spots what she thinks to be an error. Betty retrieves the provenance of the chart, to determine from where the facts presented derive.
The Prov data includes the assertions:
ex1:chart1 a prov:Entity .
ex1:dataSet1 a prov:Entity .
These statements assert that the chart (ex1:chart1) and the data set (ex1:dataSet1) are entities.
Further, the Prov data asserts that there was a process execution (ex1:compiled) denoting the compilation of the chart from the data set:
ex1:compiled a prov:ProcessExecution .
Finally, the Prov data asserts that the chart was generated by this compilation process, the compilation process made use of GovData, and the chart was derived from the data set (more on derivation below).
ex1:chart1 prov:wasGeneratedBy ex1:compiled .
ex1:compiled prov:used ex1:dataSet1 .
ex1:chart1 prov:wasDerivedFrom ex1:dataSet1 .
From this information Betty can see that the mistake could have been in the original data set or else was introduced in the compilation process, and sets out to discover which.
Suggested example: Digging deeper, Betty wants to know who compiled the chart. This turns out to be an independent analyst, Derek.
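One way this step might be captured is by asserting an agent and linking it to the process execution. A hypothetical sketch (the ex1:derek identifier is invented here, and the agent-to-process relation prov:wasControlledBy is taken from early draft vocabulary; later drafts may name it differently):

```turtle
ex1:derek a prov:Agent .
# Derek controlled the compilation of the chart
ex1:compiled prov:wasControlledBy ex1:derek .
```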
Suggested example: The analyst provides his own record of how he compiled GovData to create the chart, which provides more detail than the newspaper's provenance data. Specifically, the analyst's account separates compilation into two stages: aggregating data by region and then producing the graphic. Therefore, there are two separate accounts of the same events.
Suggested example: For Betty to know where the error lies, she needs to understand what other information the compilation process was based on. The aggregation step of the process used a list of regions not present in the original data, but determined by the analyst. How does she distinguish the roles played by the two inputs to the aggregation process?
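One possible way to distinguish the inputs is to attach a role to each usage. The sketch below uses the qualified-usage pattern of the PROV ontology (prov:qualifiedUsage, prov:hadRole); the ex1:aggregated process execution, ex1:regionList entity, and both role identifiers are invented here for illustration:

```turtle
# The aggregation step used two inputs in different roles
ex1:aggregated prov:qualifiedUsage [
    a prov:Usage ;
    prov:entity ex1:dataSet1 ;
    prov:hadRole ex1:rawData ] .
ex1:aggregated prov:qualifiedUsage [
    a prov:Usage ;
    prov:entity ex1:regionList ;
    prov:hadRole ex1:regionSpecification ] .
```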
Suggested example: After looking at the detail of the compilation process, there appears to be nothing wrong, so Betty concludes the error is in GovData. She contacts the government, and a new revision of GovData is created. How does the provenance document that the new revision is a revision of the old revision?
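A sketch of how this might be asserted, assuming a prov:wasRevisionOf relation between the new entity and the old (ex1:dataSet2 denotes the corrected data set, as in the text below):

```turtle
# The corrected data set is a revision of the original
ex1:dataSet2 a prov:Entity .
ex1:dataSet2 prov:wasRevisionOf ex1:dataSet1 .
```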
Betty lets Derek know that a new revision of the data set exists, and he looks at the provenance of the new data to understand what he needs to reanalyse.
In addition to specifying that ex1:dataSet2 is a new revision of ex1:dataSet1, the provenance from DataGov also asserts that both of these entities were a complement of another entity, ex1:dataSet.
ex1:dataSet1 prov:wasComplementOf ex1:dataSet .
ex1:dataSet2 prov:wasComplementOf ex1:dataSet .
This assertion means that ex1:dataSet1 at some point shared its characterising attributes with ex1:dataSet, and the same holds for ex1:dataSet2. Thus the entity ex1:dataSet1 did at some point represent the same thing as characterized by the entity ex1:dataSet. The same is true for ex1:dataSet2, but not necessarily at the same point in time.
The term was complement of here means that ex1:dataSet1 provides additional details that add to the details of ex1:dataSet (complementing it), and that both of these entities represented the same thing.
The characterizing attributes of ex1:dataSet are thus asserted to have been compatible with the attributes of ex1:dataSet1 and ex1:dataSet2. Compatible here means that some kind of mapping can be established between the attributes; they don't necessarily have to match directly.
Derek then looks at the characterization of ex1:dataSet to find these compatible attributes:
ex1:dataSet a ex1:DataSet ;
    ex1:regions ( ex1:North ex1:NorthWest ex1:East ) ;
    dc:creator ex1:DataGov ;
    dc:title "Regional incidence dataset 2011" .
Derek can from this deduce that both datasets had at some point the same creator and title. Derek then compares this to the attributes for each of the complementing entities:
ex1:dataSet1 a ex1:DataSet ;
    ex1:postCodes ( "N1" "N2" "NW1" "E1" "E2" ) ;
    ex1:totalIncidents 141 ;
    dc:creator ex1:DataGov ;
    dc:title "Regional incidence dataset 2011" .
Derek sees that the creator and title are directly mappable and equal between these entities. He also knows (from his region aggregation method) that the ex1:postCodes N1 and N2 are in the region ex1:North, and so on, and can confirm that although this regional characterisation of the data is not expressed using the same attributes in the two entities, they are compatible.
Derek notes that ex1:totalIncidents is not stated for ex1:dataSet, and is not mappable to any of the other existing attributes. Thus this could be one of the complementing attributes that makes ex1:dataSet1 more specific than ex1:dataSet.
From the assertion ex1:dataSet1 prov:wasComplementOf ex1:dataSet, Derek can see that ex1:dataSet did have 141 incidents when its characterization interval overlapped that of ex1:dataSet1, but not necessarily throughout its lifetime. Note that in this example the provenance assertions do not provide any direct description of the characterization intervals of the entities.
Due to the open world assumption (more information might be added later), he cannot conclude from this alone that ex1:dataSet at any point did not have 141 incidents. He therefore does not know for sure that ex1:totalIncidents is a complementing attribute which ex1:dataSet does not have in its characterisation.
Derek finally compares the newer revision ex1:dataSet2 with ex1:dataSet:
ex1:dataSet2 a ex1:DataSet ;
    ex1:postCodes ( "N1" "N2" "NW1" "NW2" "E1" "E2" ) ;
    ex1:totalIncidents 158 ;
    dc:creator ex1:DataGov ;
    dc:title "Regional incidence dataset 2011" .
In this revision, the new postcode NW2 appears; this is still compatible with the region ex1:NorthWest of ex1:dataSet. On the other hand, the attribute ex1:totalIncidents has gone up to 158.
From the prov:wasComplementOf assertion Derek knows that ex1:dataSet2 also provides additional attributes for ex1:dataSet, but because the total incidents can't both be 141 and 158, the attribute ex1:totalIncidents is a complementing attribute, and changes over the characterisation interval (lifespan) of ex1:dataSet; it is thus not one of its characterising attributes. He also now knows that ex1:dataSet is a common characterisation of the dataset that spans (parts of) both revisions. It has however not been asserted explicitly that ex1:dataSet is a somewhat more general characterisation, just that it allows mutability on the ex1:totalIncidents attribute and overlapped (parts of) the timespans of the two revisions.
From this Derek concludes that he can still use the regions North, North West and East in the diagram layout, but as the ex1:totalIncidents values differ, something in the raw data has changed. He can't tell from this provenance assertion alone whether that is merely from the addition of the post code NW2, or if data for the other post codes has changed as well. Derek decides to redo the aggregation by region using ex1:dataSet2 and regenerate the graphics using the same layout.
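The redone analysis would produce provenance analogous to the original chart's, mirroring the assertions shown earlier. A hypothetical sketch (the ex1:chart2 and ex1:compiled2 identifiers are invented here):

```turtle
ex1:chart2 a prov:Entity .
ex1:compiled2 a prov:ProcessExecution .

# The new chart was compiled from the revised data set
ex1:chart2 prov:wasGeneratedBy ex1:compiled2 .
ex1:compiled2 prov:used ex1:dataSet2 .
ex1:chart2 prov:wasDerivedFrom ex1:dataSet2 .
```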
Suggested example: Derek creates a new chart based on the revised data, using the same compilation process as before. Betty checks the article again at a later point, and wants to know if it is based on the old or new GovData. The newspaper's provenance data says that the article is "derived from" the updated GovData, while the analyst's provenance data says it is "eventually derived from" the same. How should she interpret this?
WG membership to be listed here.