* Renamed to dataset and toplevel instance
authorJames Cheney <jcheney@inf.ed.ac.uk>
Thu, 09 Aug 2012 17:59:33 +0100
changeset 4325 1d34bb12f1a7
parent 4324 2e942c753bd4
child 4326 5bb7ffe45480
* Renamed to dataset and toplevel instance
model/prov-constraints.html
--- a/model/prov-constraints.html	Thu Aug 09 17:28:07 2012 +0100
+++ b/model/prov-constraints.html	Thu Aug 09 17:59:33 2012 +0100
@@ -432,7 +432,9 @@
 over provenance.  They can also be used to <a title="normal
 form">normalize</a> PROV instances to forms that can easily be
 compared in order to determine whether two PROV instances are
-<a>equivalent</a>.</p>
+<a>equivalent</a>.  Validity and equivalence are also defined for PROV
+bundles (that is, named instances) and datasets (that is, a toplevel
+instance together with zero or more bundles).</p>
 
 </section>
 
@@ -517,9 +519,10 @@
 
 <p>The PROV Data Model, PROV-DM, is a conceptual data model for provenance, which is
 realizable using different representations such as PROV-N and PROV-O.
-A PROV <dfn>instance</dfn> is a set of PROV statements,
-possibly including <a>bundles</a>, or named sets of statements. For
-example, such a PROV instance could be a .provn document, the result
+A PROV <dfn>dataset</dfn> is a set of PROV statements,
+together with zero or more <a>bundles</a>, or named sets of
+statements.  For
+example, a PROV dataset could be a .provn document, the result
 of a query, a triple store containing PROV statements in RDF, etc. The
 PROV-DM specification [[PROV-DM]] imposes minimal requirements upon
 PROV instances. A <a>valid</a> PROV instance corresponds to a
@@ -546,14 +549,16 @@
 form">normalize</a> PROV instances, and event ordering, typing, and
 impossibility constraints can be checked on the normal form to determine
 <a title="valid">validity</a>.  Equivalence of two PROV 
-instances can be determined by comparing their normal forms.
+instances can be determined by comparing their normal forms.  For PROV
+datasets, validity and equivalence amount to checking the validity or
+pairwise equivalence of their respective datasets.
 </p>
 <p>
 This document outlines an algorithm for validity checking based on
 <a title="normal form">normalization</a>.  Applications MAY implement
 validity and equivalence checking using normalization, as suggested
 here, or in
-any other way as long as the same instances are considered valid or
+any other way as long as the same instances or datasets are considered valid or
 equivalent, respectively.
 </p>
 
@@ -563,9 +568,9 @@
 to support them at all.  
 Applications producing provenance SHOULD ensure that it is
 <a>valid</a>, and similarly applications consuming provenance MAY reject provenance that is not <a>valid</a>.  Applications
-which are determining whether PROV instances convey the same
+which are determining whether PROV instances or datasets convey the same
 information SHOULD check equivalence as specified here, and SHOULD
-treat equivalent instances in the same way.
+treat equivalent instances or datasets in the same way.
 </p>
 </section>
 <section>
@@ -857,7 +862,7 @@
   <p>
   For the purpose of constraint checking, we view PROV statements
   (possibly involving existential variables) as
-  <dfn>formulas</dfn>.  An instance can be viewed as "theory" in
+  <dfn>formulas</dfn>.  An instance is analogous to a "theory" in
   logic, that is, a set of formulas all thought to describe the same
   situation.  The set can also be thought of a single, large formula:
   the conjunction of all of the atomic formulas.
@@ -899,6 +904,12 @@
   definitions, inferences, and constraints as being satisfied or not
   satisfied in a PROV instance.  
   </p>
+  <ol><li>Definitions</li>
+  <li>Inferences</li>
+  <li>Uniqueness constraints</li>
+  <li>Key constraints</li>
+  <li>Ordering constraints</li>
+</ol>  
 
 <h2>Merging</h2>
 
@@ -1043,6 +1054,25 @@
   those obtained by the algorithm in this specification.
   </p>
 
+<h2>From Instances to Datasets</h2>
+
+<p>PROV datasets can contain multiple instances: a <a>toplevel
+instance</a>, the set of statements not appearing within a bundle, and
+zero or more named instances called <a>bundle</a>s.  For the purpose
+of this document, these instances are treated independently: that is,
+a PROV dataset is valid provided that each instance in it is valid
+and the names of its bundles are distinct; a PROV dataset is
+equivalent to another if their toplevel instances are equivalent, they
+have the same number of bundles with the same names, and the instances
+of their corresponding bundles are equivalent.  As for blank nodes in
+RDF, the scope of an existential variable in PROV is the instance
+level, so existential variables with the same name occurring in
+different instances do not necessarily denote the same term.  This
+is a consequence of the fact that the instances of two equivalent
+datasets only need to be  pairwise isomorphic; this is a weaker
+property than requiring that there be a single isomorphism that works
+for all of the corresponding instances.
+</p>
 </section>
 
 <section>
@@ -1328,13 +1358,13 @@
   <ol><li>When processing provenance, an
     application MAY apply the inferences and definitions in <a
     href="#inferences" class='sectionRef'></a>.</li>
-   <li>If determining whether a PROV instance is <a>valid</a>, an
+   <li>If determining whether a PROV instance or dataset is <a>valid</a>, an
     application MUST check that all of the
     constraints of <a href="#constraints" class="sectionRef"></a> are
-  satisfied  on the <a>normal form</a> of the instance.  </li>
+  satisfied  on the <a>normal form</a> of the instance or dataset.  </li>
    <li> If producing provenance meant for other applications to
     use, the application SHOULD produce <a>valid</a> provenance, as specified in <a href="#normalization-validity-equivalence" class="sectionRef"></a>. </li>
-    <li>If determining whether two PROV instances are
+    <li>If determining whether two PROV instances or datasets are
   <a>equivalent</a>, an application MUST determine whether their
   normal forms are equal, as specified in <a href="#normalization-validity-equivalence" class="sectionRef"></a>.
   </ol>
@@ -2008,7 +2038,7 @@
 <span
 class="name">e1</span> may be used multiple times by 
  <span class="name">a</span>, usage  <span class="name">use</span>
-may not be involved in the derivation  (for instance, it may well have taken place after 
+may not be involved in the derivation  (for example, it may well have taken place after 
 the generation of <span
 expanclass="name">e2</span>).</p>
  </div>
@@ -3819,12 +3849,13 @@
 
   <p>We define the notions of <a title="normal form">normalization</a>, <a
 title="valid">validity</a> and
-<a title="equivalence">equivalence</a> of PROV instances.  We first define these concepts
-for PROV instances that consist of a single, unnamed <a>bundle</a> of
-statements, called the <dfn>toplevel bundle</dfn>.</p>
-
+<a title="equivalence">equivalence</a> of PROV datasets and instances.  We first define these concepts
+for PROV instances and then extend them to PROV datasets.</p>
+
+<section id="instance">
+  <h2>Instances</h2>
+  
 <div class="remark">
-
   Implementations should decide up front what reasoning about
   co-reference should be applied, and rewrite the instance (by
   replacing co-referent identifiers with a single common identifier) to
@@ -3950,27 +3981,29 @@
 original order of statements in a file and avoid expanding
 inferences.)  </p>
 
+</section>
 
 <section id="bundle-constraints">
-<h2>Bundles</h2>
+<h2>Bundles and Datasets</h2>
 
 
 <p>The definitions, inferences, and constraints, and
 the resulting notions of normalization, validity and equivalence,
-assume a PROV instance that consists of exactly one <a>bundle</a>, the <a>toplevel
-bundle</a>, containing all PROV statements in the top level of the
-bundle (that is, not enclosed in a named <a>bundle</a>).  In this section, we describe how to deal with PROV
-instances consisting of multiple named bundles.  Briefly, each bundle is
+assume a PROV dataset that consists only of a <a>toplevel
+instance</a>, containing all PROV statements in the top level of the
+dataset (that is, not enclosed in a named <a>bundle</a>).  In this
+section, we describe how to deal with general PROV
+datasets, possibly including multiple named bundles.  Briefly, each bundle is
 handled independently; there is no interaction between bundles from
 the perspective of applying definitions, inferences, or constraints,
 computing normal forms, or checking validity or equivalence.</p>
 
-<p> We model a general PROV instance, containing <span class="name">n</span> named bundles
+<p> We model a general PROV dataset, containing <span class="name">n</span> named bundles
 <span class="name">b<sub>1</sub>...b<sub>n</sub></span>, as a tuple
 <span
 class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span>
 where <span class="name">B<sub>0</sub></span> is the set of
-statements of the <a>toplevel bundle</a>, and for each <span class="name">i</span>, <span class="name">B<sub>i</sub></span> is the set of
+statements of the <a>toplevel instance</a>, and for each <span class="name">i</span>, <span class="name">B<sub>i</sub></span> is the set of
 statements of bundle <span class="name">b<sub>i</sub></span>.  Names <span class="name">b<sub>1</sub>...b<sub>n</sub></span> are assumed to be distinct.  This notation is shorthand for the
 following PROV-N syntax:</p>
 <pre>
@@ -3984,15 +4017,15 @@
 endBundle
 </pre>
 
-<p> The <a>normal form</a> of a general PROV instance
+<p> The <a>normal form</a> of a PROV dataset
 <span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,[b<sub>n</sub>=B<sub>n</sub>])</span> is <span class="name">(B'<sub>0</sub>,[b<sub>1</sub>=B'<sub>1</sub>,...,b<sub>n</sub>=B'<sub>n</sub>])</span>
 where <span class="name">B'<sub>i</sub></span> is the normal
 form of <span class="name">B<sub>i</sub></span> for each <span class="name">i</span> between 0 and <span class="name">n</span>.  </p>
 
-<p>A general PROV instance is <a>valid</a> if each of the bundles <span class="name">B<sub>0</sub></span>,
+<p>A PROV dataset is <a>valid</a> if each of the bundles <span class="name">B<sub>0</sub></span>,
 ..., <span class="name">B<sub>n</sub></span> are valid and none of the bundle identifiers <span class="name">b<sub>i</sub></span> are repeated.</p>
 
-<p>Two (valid) general PROV instances <span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span> and
+<p>Two (valid) PROV datasets <span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span> and
 <span class="name">(B'<sub>0</sub>,[b<sub>1</sub>'=B'<sub>1</sub>,...,b'<sub>m</sub>=B'<sub>m</sub>])</span> are <a>equivalent</a> if <span class="name">B<sub>0</sub></span> is
 equivalent to <span class="name">B'<sub>0</sub></span> and <span class="name">n = m</span> and
 there exists a permutation <span class="name">P : {1..n} -> {1..n}</span> such that for each <span class="name">i</span>, <span class="name">b<sub>i</sub> =