--- a/model/prov-constraints.html Thu Aug 09 17:28:07 2012 +0100
+++ b/model/prov-constraints.html Thu Aug 09 17:59:33 2012 +0100
@@ -432,7 +432,9 @@
over provenance. They can also be used to <a title="normal
form">normalize</a> PROV instances to forms that can easily be
compared in order to determine whether two PROV instances are
-<a>equivalent</a>.</p>
+<a>equivalent</a>. Validity and equivalence are also defined for PROV
+bundles (that is, named instances) and datasets (that is, a toplevel
+instance together with zero or more bundles).</p>
</section>
@@ -517,9 +519,10 @@
<p>The PROV Data Model, PROV-DM, is a conceptual data model for provenance, which is
realizable using different representations such as PROV-N and PROV-O.
-A PROV <dfn>instance</dfn> is a set of PROV statements,
-possibly including <a>bundles</a>, or named sets of statements. For
-example, such a PROV instance could be a .provn document, the result
+A PROV <dfn>dataset</dfn> is a set of PROV statements,
+together with zero or more <a>bundles</a>, or named sets of
+statements. For
+example, a PROV dataset could be a .provn document, the result
of a query, a triple store containing PROV statements in RDF, etc. The
PROV-DM specification [[PROV-DM]] imposes minimal requirements upon
PROV instances. A <a>valid</a> PROV instance corresponds to a
@@ -546,14 +549,16 @@
form">normalize</a> PROV instances, and event ordering, typing, and
impossibility constraints can be checked on the normal form to determine
<a title="valid">validity</a>. Equivalence of two PROV
-instances can be determined by comparing their normal forms.
+instances can be determined by comparing their normal forms. For PROV
+datasets, validity and equivalence amount to checking the validity or
+pairwise equivalence of their respective datasets.
</p>
<p>
This document outlines an algorithm for validity checking based on
<a title="normal form">normalization</a>. Applications MAY implement
validity and equivalence checking using normalization, as suggested
here, or in
-any other way as long as the same instances are considered valid or
+any other way as long as the same instances or datasets are considered valid or
equivalent, respectively.
</p>
@@ -563,9 +568,9 @@
to support them at all.
Applications producing provenance SHOULD ensure that it is
<a>valid</a>, and similarly applications consuming provenance MAY reject provenance that is not <a>valid</a>. Applications
-which are determining whether PROV instances convey the same
+which are determining whether PROV instances or datasets convey the same
information SHOULD check equivalence as specified here, and SHOULD
-treat equivalent instances in the same way.
+treat equivalent instances or datasets in the same way.
</p>
</section>
<section>
@@ -857,7 +862,7 @@
<p>
For the purpose of constraint checking, we view PROV statements
(possibly involving existential variables) as
- <dfn>formulas</dfn>. An instance can be viewed as "theory" in
+ <dfn>formulas</dfn>. An instance is analogous to a "theory" in
logic, that is, a set of formulas all thought to describe the same
situation. The set can also be thought of a single, large formula:
the conjunction of all of the atomic formulas.
@@ -899,6 +904,12 @@
definitions, inferences, and constraints as being satisfied or not
satisfied in a PROV instance.
</p>
+ <ol><li>Definitions</li>
+ <li>Inferences</li>
+ <li>Uniqueness constraints</li>
+ <li>Key constraints</li>
+ <li>Ordering constraints</li>
+</ol>
<h2>Merging</h2>
@@ -1043,6 +1054,25 @@
those obtained by the algorithm in this specification.
</p>
+<h2>From Instances to Datasets</h2>
+
+<p>PROV datasets can contain multiple instances: a <a>toplevel
+instance</a>, the set of statements not appearing within a bundle, and
+zero or more named instances called <a>bundle</a>s. For the purpose
+of this document, these instances are treated independently: that is,
+a PROV dataset is valid provided that each instance in it is valid
+and the names of its bundles are distinct; a PROV dataset is
+equivalent to another if their toplevel instances are equivalent, they
+have the same number of bundles with the same names, and the instances
+of their corresponding bundles are equivalent. As for blank nodes in
+RDF, the scope of an existential variable in PROV is the instance
+level, so existential variables with the same name occurring in
+different instances do not necessarily denote the same term. This
+is a consequence of the fact that the instances of two equivalent
+datasets only need to be pairwise isomorphic; this is a weaker
+property than requiring that there be a single isomorphism that works
+for all of the corresponding instances.
+</p>
</section>
<section>
@@ -1328,13 +1358,13 @@
<ol><li>When processing provenance, an
application MAY apply the inferences and definitions in <a
href="#inferences" class='sectionRef'></a>.</li>
- <li>If determining whether a PROV instance is <a>valid</a>, an
+ <li>If determining whether a PROV instance or dataset is <a>valid</a>, an
application MUST check that all of the
constraints of <a href="#constraints" class="sectionRef"></a> are
- satisfied on the <a>normal form</a> of the instance. </li>
+ satisfied on the <a>normal form</a> of the instance or dataset. </li>
<li> If producing provenance meant for other applications to
use, the application SHOULD produce <a>valid</a> provenance, as specified in <a href="#normalization-validity-equivalence" class="sectionRef"></a>. </li>
- <li>If determining whether two PROV instances are
+ <li>If determining whether two PROV instances or datasets are
<a>equivalent</a>, an application MUST determine whether their
normal forms are equal, as specified in <a href="#normalization-validity-equivalence" class="sectionRef"></a>.
</ol>
@@ -2008,7 +2038,7 @@
<span
class="name">e1</span> may be used multiple times by
<span class="name">a</span>, usage <span class="name">use</span>
-may not be involved in the derivation (for instance, it may well have taken place after
+may not be involved in the derivation (for example, it may well have taken place after
the generation of <span
expanclass="name">e2</span>).</p>
</div>
@@ -3819,12 +3849,13 @@
<p>We define the notions of <a title="normal form">normalization</a>, <a
title="valid">validity</a> and
-<a title="equivalence">equivalence</a> of PROV instances. We first define these concepts
-for PROV instances that consist of a single, unnamed <a>bundle</a> of
-statements, called the <dfn>toplevel bundle</dfn>.</p>
-
+<a title="equivalence">equivalence</a> of PROV datasets and instances. We first define these concepts
+for PROV instances and then extend them to PROV datasets.</p>
+
+<section id="instance">
+ <h2>Instances</h2>
+
<div class="remark">
-
Implementations should decide up front what reasoning about
co-reference should be applied, and rewrite the instance (by
replacing co-referent identifiers with a single common identifier) to
@@ -3950,27 +3981,29 @@
original order of statements in a file and avoid expanding
inferences.) </p>
+</section>
<section id="bundle-constraints">
-<h2>Bundles</h2>
+<h2>Bundles and Datasets</h2>
<p>The definitions, inferences, and constraints, and
the resulting notions of normalization, validity and equivalence,
-assume a PROV instance that consists of exactly one <a>bundle</a>, the <a>toplevel
-bundle</a>, containing all PROV statements in the top level of the
-bundle (that is, not enclosed in a named <a>bundle</a>). In this section, we describe how to deal with PROV
-instances consisting of multiple named bundles. Briefly, each bundle is
+assume a PROV dataset that consists only of a <a>toplevel
+instance</a>, containing all PROV statements in the top level of the
+dataset (that is, not enclosed in a named <a>bundle</a>). In this
+section, we describe how to deal with general PROV
+datasets, possibly including multiple named bundles. Briefly, each bundle is
handled independently; there is no interaction between bundles from
the perspective of applying definitions, inferences, or constraints,
computing normal forms, or checking validity or equivalence.</p>
-<p> We model a general PROV instance, containing <span class="name">n</span> named bundles
+<p> We model a general PROV dataset, containing <span class="name">n</span> named bundles
<span class="name">b<sub>1</sub>...b<sub>n</sub></span>, as a tuple
<span
class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span>
where <span class="name">B<sub>0</sub></span> is the set of
-statements of the <a>toplevel bundle</a>, and for each <span class="name">i</span>, <span class="name">B<sub>i</sub></span> is the set of
+statements of the <a>toplevel instance</a>, and for each <span class="name">i</span>, <span class="name">B<sub>i</sub></span> is the set of
statements of bundle <span class="name">b<sub>i</sub></span>. Names <span class="name">b<sub>1</sub>...b<sub>n</sub></span> are assumed to be distinct. This notation is shorthand for the
following PROV-N syntax:</p>
<pre>
@@ -3984,15 +4017,15 @@
endBundle
</pre>
-<p> The <a>normal form</a> of a general PROV instance
+<p> The <a>normal form</a> of a PROV dataset
<span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,[b<sub>n</sub>=B<sub>n</sub>])</span> is <span class="name">(B'<sub>0</sub>,[b<sub>1</sub>=B'<sub>1</sub>,...,b<sub>n</sub>=B'<sub>n</sub>])</span>
where <span class="name">B'<sub>i</sub></span> is the normal
form of <span class="name">B<sub>i</sub></span> for each <span class="name">i</span> between 0 and <span class="name">n</span>. </p>
-<p>A general PROV instance is <a>valid</a> if each of the bundles <span class="name">B<sub>0</sub></span>,
+<p>A PROV dataset is <a>valid</a> if each of the bundles <span class="name">B<sub>0</sub></span>,
..., <span class="name">B<sub>n</sub></span> are valid and none of the bundle identifiers <span class="name">b<sub>i</sub></span> are repeated.</p>
-<p>Two (valid) general PROV instances <span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span> and
+<p>Two (valid) PROV datasets <span class="name">(B<sub>0</sub>,[b<sub>1</sub>=B<sub>1</sub>,...,b<sub>n</sub>=B<sub>n</sub>])</span> and
<span class="name">(B'<sub>0</sub>,[b<sub>1</sub>'=B'<sub>1</sub>,...,b'<sub>m</sub>=B'<sub>m</sub>])</span> are <a>equivalent</a> if <span class="name">B<sub>0</sub></span> is
equivalent to <span class="name">B'<sub>0</sub></span> and <span class="name">n = m</span> and
there exists a permutation <span class="name">P : {1..n} -> {1..n}</span> such that for each <span class="name">i</span>, <span class="name">b<sub>i</sub> =