Rationale for inferences and constraints

This section collects all of the explanatory material that I was not certain how to interpret as an unambiguous inference or constraint. Some of these observations may need to be folded into the explanatory text in respective sections (for example for events, accounts or collections). Editing is also needed to decrease redundancy.

Entities and Attributes

When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc. However, to write precise descriptions of the provenance of things that change over time, we need ways of disambiguating which versions of things we are talking about.

To describe the provenance of things that can change over time, PROV-DM uses the concept of entities with fixed attributes. From a provenance viewpoint, it is important to identify a partial state of something, i.e. something with some aspects that have been fixed, so that it becomes possible to express its provenance (i.e. what caused the thing with these specific aspects). An entity encompasses a part of a thing's history during which some of the attributes are fixed. An entity can thus be thought of as a part of a thing with some associated partial state. Attributes in PROV-DM are used to fix certain aspects of entities.

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some fixed attributes. An entity has a characterization interval, or lifetime, defined as the period between its generation event and its invalidation event. An entity's attributes are established when the entity is created and describe the entity's situation and (partial) state during an entity's lifetime.

A different entity (perhaps representing a different user or system perspective) may fix other aspects of the same thing, and its provenance may be different. Different entities that are aspects of the same thing are called alternate, and the PROV-DM relations of specialization and alternate can be used to link such entities.

Different users may take different perspectives on a resource with a URL. A provenance record might use one (or more) different entities to talk about different perspectives, such as:

a report available at a URL: fixes the nature of the thing, i.e. a document, and its location;
the version of the report available there today: fixes its version number, contents, and its date;
the report independent of where it is hosted and of its content over time: fixes the nature of the thing as a conceptual artifact.

The provenance of these three entities may differ, and may be along the following lines:

the provenance of a report available at a URL may include: the act of publishing it and making it available at a given location, possibly under some license and access control;
the provenance of the version of the report available there today may include: the authorship of the specific content, and reference to imported content;
the provenance of the report independent of where it is hosted over time may include: the motivation for writing the report, the overall methodology for producing it, and the broad team involved in it.

We do not assume that any entity is a better or worse description of reality than any other. That is, we do not assume an absolute ground truth with respect to which we can judge correctness or completeness of descriptions. In fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity with attribute-value pairs that fix some aspects of the report appropriately.

Besides entities, a variety of other PROV-DM objects have attributes, including activity, generation, usage, start, end, communication, attribution, association, responsibility, and derivation. Each object has an associated duration interval (which may be a single time point), and attribute-value pairs for a given object are expected to be descriptions that hold for the object's duration.

However, the attributes of entities have special meaning because they are considered to be fixed aspects of underlying, changing things. This motivates constraints on alternateOf and specializationOf relating the attribute values of different entities.

TODO: Constraints on alternateOf/specializationOf for this?

TODO: Further discussion of entities moved from the old "Definitional constraints" section. Should merge with the surrounding discussion to avoid repetition.

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some attribute-value pairs. An entity's attribute-value pairs are established as part of the entity statement and their values remain unchanged for the lifetime of the entity. An entity's attribute-value pairs are expected to describe the entity's situation and (partial) state during an entity's characterization interval.

If an entity's situation or state changes, this may result in its statement being invalid, because one or more attribute-value pairs no longer hold. In that case, from the PROV viewpoint, there exists a new entity, which needs to be given a distinct identifier, and associated with the attribute-value pairs that reflect its new situation or state.

Further considerations:

In order to describe the provenance of something during an interval over which relevant attributes of the thing are not fixed, it is required to create multiple entities, each with its own identifier, characterization interval, and fixed attributes, and express dependencies between the various entities using events. For example, if we want to describe the provenance of several versions of a document, involving attributes such as authorship that change over time, we need different entities for the versions linked by appropriate generation, usage, revision, and invalidation events.
There is no assumption that the set of attributes is complete, nor that the attributes are independent or orthogonal of each other.
There is no assumption that the attributes of an entity uniquely identify it. Two different entities that are aspects of different things can have the same attributes.
A characterization interval may collapse into a single instant.

Activities

TODO: Further discussion of activities moved from old "Definitional constraints and inferences" section. Edit to avoid repeating information.

An activity is delimited by its start and its end events; hence, it occurs over an interval delimited by two instantaneous events. However, an activity record need not mention start or end time information, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its interval, i.e. an interval between two instantaneous events, namely its start event and its end event.

Further considerations:

An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics that makes it identifiable. In contrast, an activity is something that occurs, happens, unfolds, or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [[Logic]].

Description, Assertion, and Inference

PROV-DM is a provenance data model designed to express descriptions of the world.

A file at some point during its lifecycle, which includes multiple edits by multiple people, can be described by its type, its location in the file system, a creator, and content.

The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, PROV-DM descriptions are intended to be interpreted as what has happened, as opposed to what may or will happen.

This specification does not prescribe the means by which descriptions can be arrived at; for example, descriptions can be composed on the basis of observations, reasoning, or any other means.

Sometimes, inferences about the world can be made from descriptions conformant to the PROV-DM data model. This specification defines some such inferences, allowing new descriptions to be inferred from existing ones. Hence, descriptions of the world can result either from direct assertion or from inference by application of inference rules defined by this specification.

Events and Time

Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong with such a provenance claim.

Although time is critical, we should also recognize that provenance can be used in many different contexts within individual systems and across the Web. Different systems may use different clocks which may not be precisely synchronized, so when provenance records are combined by different systems, we may not be able to align the times involved to a single global timeline. Hence, PROV-DM is designed to minimize assumptions about time.

Hence, to talk about the constraints on valid PROV-DM data, we refer to instantaneous events that correspond to interactions between activities and entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [[CSP]], events represent communications or interactions; they are assumed to be atomic and instantaneous.

Event Ordering

The following paragraphs are unclear and need to be revised, to address review concerns: if we aren't saying anything about how events and time relate, and time is the only concrete information about event ordering in PROV-DM, then how can implementations check that event ordering constraints are satisfied?

How the precedes partial order is implemented in practice is beyond the scope of this specification. This specification only assumes that each instantaneous event can be mapped to an instant in some form of timeline. The actual mapping is not in scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport-style clocks is also beyond this specification. The follows and precedes orderings of events should be consistent with the ordering of their associated times over these timelines.

This specification defines event ordering constraints between instantaneous events associated with provenance descriptions. PROV-DM data MUST satisfy such constraints.

PROV-DM also allows for time observations to be inserted in specific statements, for each recognized instantaneous event introduced in this specification. The presence of a time observation for a given instantaneous event fixes the mapping of this instantaneous event to the timeline. It can also help with the verification of associated ordering constraints (though, again, this verification is outside the scope of this specification).

Types of Events

Five kinds of instantaneous events are used for the PROV-DM data model. The activity start and activity end events delimit the beginning and the end of activities, respectively. The entity usage, entity generation, and entity invalidation events apply to entities, and the generation and invalidation events delimit the characterization interval of an entity. More specifically:

An activity start event is the instantaneous event that marks the instant an activity starts.

An activity end event is the instantaneous event that marks the instant an activity ends.

An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity. Before this instant the entity had not begun to be used by the activity.

An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it is available for use. The entity did not exist before this event.

An entity invalidation event is the instantaneous event that marks the initial instant of the destruction, invalidation, or cessation of an entity, after which the entity is no longer available for use. The entity no longer exists after this event.