PROV-DM PROPOSAL on Derivation

TO EDIT

PROV-DM is a provenance data model for building representations of the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but is equipped with extensibility points allowing further domain-specific and application-specific extensions to be defined. PROV-DM is accompanied by PROV-ASN, a technology-independent abstract syntax notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates the mapping of PROV-DM to concrete syntax, and which is used as the basis for a formal semantics.

TO EDIT

This document is part of the PROV family of specifications, a set of specifications that define the various aspects necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. This document defines the PROV-DM data model for provenance, accompanied by a notation to express instances of that data model for human consumption. Other documents are:

PROV-DM-CONSTRAINTS: a set of constraints applying to the PROV-DM data model;
PROV-ASN: a notation for provenance aimed at human consumption;
PROV-O: the provenance ontology, which, by means of a mapping of PROV-DM to the OWL2 Web Ontology Language, provides a normative serialization of PROV-DM in RDF;
PROV-AQ: provenance access and query, the mechanisms for accessing and querying provenance;
PROV-PRIMER: a primer for the PROV-DM provenance data model;
PROV-SEM: a formal semantics for the PROV-DM provenance data model.

Derivation

According to Section Overview, for an entity to be transformed from, created from, or resulting from an update to another, there must be some underpinning activities performing the necessary actions resulting in such a derivation. A derivation can be described at various levels of precision. In its simplest form, derivation relates two entities. Optionally, attributes can be added to describe modalities of derivation. If the derivation is the result of a single known activity, then this activity can also be optionally expressed. To describe a derivation fully, the generation of the generated entity and the usage of the used entity can also be provided. Optional information such as activity, generation, and usage is linked to derivations to aid analysis of provenance and to facilitate provenance-based reproducibility.

A derivation, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-ASN, contains:

id: an OPTIONAL identifier for a derivation;
generatedEntity: the identifier (e2) of the entity generated by the derivation;
usedEntity: the identifier (e1) of the entity used by the derivation;
activity: an OPTIONAL identifier (a) for the activity using and generating the above entities;
generation: an OPTIONAL identifier for the generation (g2) involving the generated entity and activity;
usage: an OPTIONAL identifier for the usage (u1) involving the used entity and activity;
attributes: an OPTIONAL set of attribute-value pairs that describe the modalities of this derivation.
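
As an illustration only, the components listed above could be held in a small record type. The following Python sketch is not part of PROV-DM: the class name Derivation, its field names, and the to_asn helper are hypothetical. The helper reproduces only the three textual forms shown in the examples of this section; it ignores the optional identifier and emits the activity, generation, and usage only when all three are present, which is an assumption of this sketch rather than a rule of PROV-ASN.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Derivation:
    """Hypothetical record mirroring wasDerivedFrom(id, e2, e1, a, g2, u1, attrs)."""
    generated_entity: str                 # e2 (required)
    used_entity: str                      # e1 (required)
    id: Optional[str] = None              # OPTIONAL identifier of the derivation
    activity: Optional[str] = None        # a  (OPTIONAL)
    generation: Optional[str] = None      # g2 (OPTIONAL)
    usage: Optional[str] = None           # u1 (OPTIONAL)
    attributes: Tuple[Tuple[str, str], ...] = ()  # OPTIONAL attribute-value pairs

    def to_asn(self) -> str:
        """Render one of the three forms used in this section's examples."""
        args = [self.generated_entity, self.used_entity]
        # Assumption: a, g2 and u1 are written only when all three are known.
        if self.activity and self.generation and self.usage:
            args += [self.activity, self.generation, self.usage]
        if self.attributes:
            pairs = ",".join(f'{k}="{v}"' for k, v in self.attributes)
            args.append(f"[{pairs}]")
        return f"wasDerivedFrom({','.join(args)})"


print(Derivation("e2", "e1").to_asn())
print(Derivation("e2", "e1", activity="a", generation="g2", usage="u1").to_asn())
```

Making the record immutable (frozen) lets derivations be stored in sets, which is convenient when descriptions are treated as a collection of assertions.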

Derivation is not defined to be transitive. Domain-specific specializations of derivation may be defined in such a way that the transitivity property holds.

The following descriptions state the existence of derivations.

wasDerivedFrom(e2,e1)
wasDerivedFrom(e2,e1,[prov:type="physical transform"])
wasDerivedFrom(e2,e1,a,g2,u1)
  wasGeneratedBy(g2,e2,a)
  used(u1,a,e1)

The first and second lines are about derivations of e2 (the generated entity) from e1 (the used entity), but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.

The third description expresses that activity a, using the entity e1 according to usage u1, derived the entity e2 and generated it according to generation g2. It is followed by descriptions for generation g2 and usage u1. With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation. It can also identify how the original entity e1 was used by the activity (for instance, which argument it was passed as, if the activity is the result of a function invocation), and which output the derived entity e2 was obtained from (say, for a function returning multiple results).

Several points were raised about the attribute steps, including its name and its default value. ISSUE-180. ISSUE-179.

Is imprecise-1 derivation necessary? Can we just use precise-1 and imprecise-n? ISSUE-249.

Emphasize the notion of 'affected by' ISSUE-133.

TO GO in PART II: Derivation

Section Derivation-Relation edited as follows.

A derivation is more informative if it contains a reference to an activity, generation, and usage. Hence, the following implication holds.

Given two entities denoted by e1 and e2, if the assertion wasDerivedFrom(e2, e1, a, g2, u1, attrs) holds for some activity a, generation g2, usage u1, and set of attribute-value pairs attrs, then wasDerivedFrom(e2, e1, attrs) also holds.
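
The implication above can be sketched as a function that drops the activity, generation, and usage while keeping the two entities and the attributes. This is a minimal Python illustration, not a normative procedure; the names Derivation, weaken, and close_under_weakening are hypothetical.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Derivation(NamedTuple):
    """Hypothetical tuple form of wasDerivedFrom(e2, e1, a, g2, u1, attrs)."""
    generated: str                        # e2
    used: str                             # e1
    activity: Optional[str] = None        # a
    generation: Optional[str] = None      # g2
    usage: Optional[str] = None           # u1
    attrs: Tuple[Tuple[str, str], ...] = ()


def weaken(d: Derivation) -> Derivation:
    """Return wasDerivedFrom(e2, e1, attrs), implied by the fuller assertion d."""
    return Derivation(d.generated, d.used, attrs=d.attrs)


def close_under_weakening(assertions: Set[Derivation]) -> Set[Derivation]:
    """Add the implied two-entity derivation for every assertion in the set."""
    return assertions | {weaken(d) for d in assertions}


precise = Derivation("e2", "e1", "a", "g2", "u1")
print(close_under_weakening({precise}))
```

The implication runs in one direction only: nothing in the sketch reconstructs an activity, generation, or usage from a two-entity derivation.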

For the interpretation of a derivation, see derivation-usage-generation-ordering and derivation-generation-generation-ordering.

Note that derivation cannot, in general, be inferred from usage and generation. Indeed, when a generation wasGeneratedBy(g, e2, a, attrs2) precedes a usage used(u, a, e1, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer the derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2, e1), since e2 cannot possibly be derived from e1, given that the creation of e2 precedes the use of e1.
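
The ordering argument can be made concrete with a small Python sketch. It assumes, purely for illustration, that each generation and usage carries an integer position on a shared timeline; PROV-DM does not prescribe such a field, and the names Generation, Usage, and candidate_derivations are hypothetical. The function only filters out pairs that the ordering rules out. The pairs it returns are candidates, not asserted derivations.

```python
from typing import List, NamedTuple, Tuple


class Generation(NamedTuple):
    entity: str      # e2
    activity: str    # a
    position: int    # hypothetical position on a shared timeline


class Usage(NamedTuple):
    activity: str    # a
    entity: str      # e1
    position: int    # hypothetical position on a shared timeline


def candidate_derivations(gens: List[Generation],
                          uses: List[Usage]) -> List[Tuple[str, str]]:
    """Pairs (e2, e1) for the same activity that ordering does not rule out."""
    return [
        (g.entity, u.entity)
        for g in gens
        for u in uses
        # A pair is ruled out when e2 is generated before e1 is used.
        if g.activity == u.activity and not g.position < u.position
    ]


gens = [Generation("e2", "a", 1), Generation("e3", "a", 5)]
uses = [Usage("a", "e1", 3)]
print(candidate_derivations(gens, uses))
```

Here e2 is generated at position 1, before e1 is used at position 3, so the pair (e2, e1) is excluded, while (e3, e1) remains only a possibility.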

See derivation-use for a structural constraint on derivations.

Several points were raised about the attribute steps, including its name and its default value. ISSUE-180. ISSUE-179.

Emphasize the notion of 'affected by' ISSUE-133.

Simplify derivation ISSUE-249.

TO GO in PART 2: PROV-DM Account Constraints

To replace derivation-use.

A further inference is permitted from derivations with an explicit activity and no usage:

Given an activity a, entities denoted by e1 and e2, and sets of attribute-value pairs dAttrs, gAttrs, if wasDerivedFrom(e2,e1, a, dAttrs) and wasGeneratedBy(e2,a,gAttrs) hold, then used(a,e1,uAttrs) also holds for some set of attribute-value pairs uAttrs.

This inference is justified by the fact that the entity denoted by e2 is generated by at most one activity in a given account (see generation-uniqueness). Hence, this activity is also the one referred to by the usage of e1.
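
A minimal Python sketch of this inference follows. It is an illustration under stated assumptions, not a normative algorithm: the tuple types Derived, Generated, and Used and the function infer_usages are hypothetical, and attribute sets are left out because the inferred usage holds only for some unspecified uAttrs.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Derived(NamedTuple):
    generated: str               # e2
    used: str                    # e1
    activity: Optional[str]      # a, or None when no activity is given


class Generated(NamedTuple):
    entity: str                  # e2
    activity: str                # a


class Used(NamedTuple):
    activity: str                # a
    entity: str                  # e1


def infer_usages(derivations: Iterable[Derived],
                 generations: Iterable[Generated]) -> Set[Used]:
    """Infer used(a, e1) from wasDerivedFrom(e2, e1, a) and wasGeneratedBy(e2, a)."""
    known = set(generations)
    return {
        Used(d.activity, d.used)
        for d in derivations
        if d.activity is not None and Generated(d.generated, d.activity) in known
    }


print(infer_usages([Derived("e2", "e1", "a")], [Generated("e2", "a")]))
```

A derivation without an explicit activity, or one whose generation by that activity is not asserted, yields no usage in this sketch.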