> Dear PROV working group,
>
> I'm sorry to send my comments so late, but I hope you will pay close attention to what I'm saying here.
>
> I've read the PROV data model and PROV constraints and, while I find PROV-DM satisfying, I have concerns about PROV Constraints.
>
> The document tries to define a kind of semantics for PROV-DM, but it does not actually define a proper semantics at all. And since it does not truly define a formal semantics, it has to redefine the terms that are borrowed from logic, and subtly diverges from the standard notions found in logic. For someone who knows logic, this makes the document larger, more complicated than it should be, more difficult to read, and more difficult to implement. For someone who does not know logic, it is not made any simpler; on the contrary.

We would like to clarify that we are not attempting to define a semantics (in the sense of model theory or programming language semantics) for PROV in PROV-CONSTRAINTS. We may do this in a future version of PROV-SEM. What PROV-CONSTRAINTS does is define a subset of PROV documents, currently called "valid" by analogy with the notion of validity in other Web standards such as XML, CSS, and so on. We agree that it would be best to avoid redefining standard terminology from logic in nonstandard ways, and you are correct that "valid" means something different in logic than the sense in which it is usually used in Web standards.

Possible actions:
- change to consistent/consistency checking
- keep as is, for compatibility with previous standards, but flag the terminological difference

> To give an example before getting into the details, let us examine the notion of PROV-equivalence in Section 2.4. Equivalence is a well known and understood notion in maths and logic, which is necessarily reflexive. In PROV, equivalence isn't reflexive! For X to be PROV-equivalent to Y, it is required that both X and Y are PROV-valid, which is the PROV term used to say "consistent".

This issue was already discussed within the group, and we could not come to an agreement on how equivalence should behave on invalid instances. Therefore, we decided not to define equivalence on invalid instances. From a mathematical point of view, we only define equivalence as a relation over valid documents/instances, not all instances. This avoids the problem of deciding what to do with equivalence for invalid instances.

By analogy, consider a typed programming language. An expression 1 + "foo" is not well-typed; technically one could consider a notion of equivalence on such expressions, so that, for example, 1 + "foo" would be equivalent to 1 + 1 + "foo". But these ill-typed expressions are (by the definition of the language) not allowed. Similarly, for applications that care about validity, invalid PROV documents can be ignored, so (to us) there seems to be no negative consequence to defining equivalence to hold only on this subset of documents, or to defining all invalid documents to be equivalent (as would follow from the logical definition of equivalence). However, for other applications, such as information retrieval, it is not safe to assume that an invalid instance is equivalent to "false"; we can imagine scenarios where an application wants to search for documents similar to an existing (possibly invalid) document. If the definition of equivalence considered all invalid documents equivalent, there would be a huge number of matches with no (intuitive) similarity to the query document.
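To make that last point concrete, here is a brief sketch in ordinary first-order terms (not text from the spec), writing $\Sigma$ for the PROV constraint theory and $\mathrm{Th}(I)$ for the formulas asserted by an instance $I$:

$$\text{If } \mathrm{Th}(I) \cup \Sigma \models \bot \text{ and } \mathrm{Th}(J) \cup \Sigma \models \bot, \text{ then } \mathrm{Th}(I) \cup \Sigma \models \varphi \text{ and } \mathrm{Th}(J) \cup \Sigma \models \varphi \text{ for every } \varphi.$$

So, under the purely logical reading, any two invalid instances I and J come out equivalent modulo the constraint theory, however different their content.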
We propose to resolve your concern as follows:
- specify that equivalence is an equivalence relation on *all* documents
- specify that no invalid document is equivalent to a valid one
- specify equivalence between valid documents as already done
- leave it up to implementations how (if at all) to test equivalence between different invalid documents.

> If a proper semantics were defined, there would be no way the document could diverge from established logical concepts. There would be fewer chances for mistakes in the definitions and in the rules later. There would be more flexibility in the way implementations support the semantics.
>
> I see two ways out of this:
> 1) define an ad hoc model theory for PROV-DM, along the lines of what is sketched in PROV-SEM;
> 2) rely on a well known logic (such as FOL), to which all PROV statements are mapped, and add FOL axioms to model the constraints.
>
> 1) is ambitious and I understand there is not enough time to investigate it properly. But 2) is really easy. PROV statements already look a lot like FOL predicates, and PROV Constraints already relies heavily on FOL notions and notations.

This is intentional; the goal was to reuse as much of the standard techniques from logic, and particularly database theory, as possible. However, our audience (as reflected by the composition of the WG) is not expected to already be familiar with first-order logic, so we cannot presuppose that background. Moreover, writing an arbitrary FOL axiomatization has its own problems: since there is currently no standard way to do this, we would have to restate a lot of the standard definitions in order to make the specification self-contained (as we have already done). In addition, an arbitrary FOL theory is not guaranteed to be decidable, even over finite models. We resolved that the constraints document had to demonstrate decidability/computability, as a basic prerequisite for implementability. Simply giving a set of FOL axioms on its own would not be enough to do this, and would leave (the vast majority of) implementors not familiar with FOL theorem proving, databases, or constraint solving at sea with respect to implementation.

Proposed solution: Add non-normative material to an appendix / sec. 2 giving a FOL axiomatization, possibly with a proof of soundness/completeness with respect to the algorithm in the spec.

> Definitions and inferences of 4.1 are not needed if the semantics is defined on the abstract syntax. The abstract syntax is not PROV-N and does not have "syntactic sugar" or "syntactic shortcuts". Meaning, relations in the abstract syntax always contain all the arguments, optional or not, possibly with existential variables. The short forms only exist in the surface syntax.

This is not true; PROV-DM and PROV-N nowhere specify how missing arguments are to be expanded. You are correct that Definition 1 (which expands short forms) is in a sense implicit in PROV-DM, which only discusses the long forms and their optional arguments, but it is not said explicitly in either PROV-DM or PROV-N how the PROV-N short forms are to be expanded to PROV-DM. Furthermore, Definitions 2-4 deal with special cases concerning optional/implicit parameters which are definitely not explicit anywhere else. We recognize that there is a certain amount of PROV-N centrism in these definitions, but since PROV-N is formally specified and the abstract syntax is not, we feel it is important to make fully clear how arbitrary PROV-N can be translated to the subset of PROV-N that corresponds to the abstract syntax of PROV-DM. This is to ensure that there is no room for misinterpretation among multiple readers, who may expect different conventions for expansion/implicit parameters (even if the rules we specified seem "obvious").
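As a hedged illustration of the kind of expansion Definition 1 performs (the argument positions and names below are schematic, not quoted from the spec), a short form with omitted optional arguments is completed with fresh existential variables:

$$\mathit{wasGeneratedBy}(e) \;\rightsquigarrow\; \mathit{wasGeneratedBy}(\mathit{gen};\, e,\, \mathit{a},\, \mathit{t},\, [\,])$$

where gen, a, and t are fresh existential variables standing for the unspecified identifier, activity, and time, and [] is the empty attribute list.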
> Inferences in 4.2, 4.3, 4.4 and 4.5 are all writable as FOL axioms.

In fact, they already are, modulo trivial syntactic distinctions, as explained in sec. 2. We plan to address this by giving the axiomatization explicitly as a non-normative appendix.

> Constraints require equality and an infinite total order (Time with order "precedes"). This does not make the logic more complicated than FOL.

Agreed; this is consistent with the specification as is. (However, in fact it is fine to limit attention to a finite partial order, because event ordering does not necessarily correspond to a fixed timeline ordering, to avoid e.g. time-zone centrism.)

> Detailed comments:
> -----------------
>
> Section 1.2:
> """
> We define validity and equivalence in terms of a concept called normalization.
> """
>
> This is not required and suggests that implementations have to rely on normalisation. The choice of formalisation, being entirely procedural, imposes that normalisation is explicitly mentioned in almost all definitions, which in turn requires repeatedly noting that implementations are not forced to produce normal forms. This obfuscates the document, which could be smaller and simpler.

Just saying that we *define* validity and equivalence in terms of a normalization procedure does not require that all implementations explicitly perform normalization. We discussed this issue extensively, and one consequence is that the implementation criteria for the constraints document will only test the extensional behavior of validity/equivalence checks; implementations only need to classify documents as valid/invalid/equivalent etc. in the same way as the reference implementation, they do not have to "be" the reference implementation. Nevertheless, you are correct that, as written, it is difficult to see how else one could implement the specification. In fact, you are correct that there is a simple, declarative specification, via a FOL theory, of what the normalization algorithm does, which could be used as a starting point for people with a formal background or those who wish to implement the specification in some other way.

However, we disagree that it would improve the specification to adopt the declarative view as normative. Making the document smaller and simpler in this way would detract from its usefulness to implementors who are not already experts in computational logic. In other words, we recognize that some implementors may want to check the constraints in other ways, but we believe that the algorithm we used to specify validity and equivalence is a particularly good default, because it sits within a well-understood formalism known from database constraints and finite first-order model theory. The normal forms are essentially "universal instances" in the sense of Fagin et al. 2005, and the algorithm we outline is easily seen to be in polynomial time; in contrast, simply giving a FOL theory on its own gives no guarantee of efficiency or even decidability.

Proposed solution:
- Present the declarative specification as a (non-normative) appendix, building on the discussion in sec. 2. Logicians ought to be able to understand the specification just by reading sec. 2 and the appendix.
- Keep the procedural specification as is, but clarify that it is just one of many possible ways of implementing the declarative specification, and that any other way that gives the same answers to validity/equivalence checks is allowed.
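To give a flavour of what an entry in such an appendix could look like, here is a hedged sketch of one inference (roughly, that communication between activities involves some entity generated by the informant and used by the informed activity) written as a first-order tuple-generating dependency; the argument lists are abbreviated and schematic rather than quoted from the spec:

$$\forall a_2, a_1.\; \mathit{wasInformedBy}(a_2, a_1) \;\Rightarrow\; \exists e, t_1, t_2.\; \mathit{wasGeneratedBy}(e, a_1, t_1) \wedge \mathit{used}(a_2, e, t_2)$$

Axioms of this shape (tuple-generating dependencies), together with the equality-generating dependencies arising from the uniqueness/key constraints, are exactly the class for which the chase-based normalization used in the document is well understood.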
> """
> Definitions, inferences, and uniqueness constraints can be applied to normalize PROV instances, and event ordering, typing, and impossibility constraints can be checked on the normal form to determine validity. Equivalence of two PROV instances can be determined by comparing their normal forms.
> """
>
> This also strongly suggests normalising by "applying" constraints, as later defined, although it is said later that normalisation is not required in implementations.

We disagree that this strongly suggests normalization; we discussed the wording extensively within the group, since several members also felt it was important to avoid "specifying the algorithm". Simply saying that something can be done in a certain way does not say that it MUST be done that way; we were very careful not to make the adoption of the algorithm a formal requirement.

By analogy, one can define greatest common divisors procedurally by giving Euclid's algorithm, or declaratively by a formula that specifies the greatest-common-divisor property. The latter is mathematically cleaner but gives little indication of how one can efficiently compute the GCD. Similarly, specifications such as XQuery describe the semantics of expressions by giving an algorithm that serves as a reference implementation for the language, but this is not required or even encouraged as the only or best way to implement XQuery. To facilitate implementation, we chose to specify in terms of an algorithm that presents one (reasonably efficient, but in any case clearly defined) default way to determine validity or equivalence, without specifying that it is the only (or best) way. Moreover, it is true that normalization is a somewhat extraneous concept, but it is a totally natural one from the point of view of finite model theory; we are working within a class of constraints that admits universal solutions, and the normal form is a universal solution.

However, this issue arose relatively late in the process, and we did miss some places where the document gives a misleading impression that normalization is required to implement the spec.

Proposed solution: Revise this and other paragraphs to make sure that normalization is presented as an auxiliary notion used to specify validity and equivalence. This may be tied in with the non-normative declarative specification, which implementors may use as a starting point for alternative implementations.

> """
> For PROV documents, validity and equivalence amount to checking the validity or pairwise equivalence of their respective documents.
> """
>
> should be: "... of their respective instances."

Thanks. Proposed solution: adopt the change.

> """
> This specification defines how validity and equivalence are to be checked, if an application elects to support them at all.
> """
>
> It was said before that implementations can choose the way they check validity or equivalence. Again, this is a very strong suggestion that implementations should follow the presented procedures.

Thanks, we agree this is too strong; it escaped our notice when revising to make it clearer that we are not specifying the algorithm. Proposed solution: Rewrite to "This specification defines validity and equivalence. If an application elects to support validity and equivalence checking, then it must define these concepts as specified here."
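As a concrete rendering of the GCD analogy above, here is a minimal Python sketch (illustrative only, not part of any spec); both functions compute the same thing, but only the second tells an implementor how to compute it sensibly:

    def gcd_declarative(a, b):
        # Declarative specification: the largest d that divides both a and b.
        return max(d for d in range(1, max(a, b) + 1) if a % d == 0 and b % d == 0)

    def gcd_euclid(a, b):
        # Procedural specification: Euclid's algorithm.
        while b:
            a, b = b, a % b
        return a

    # Extensionally the two agree; an implementation is judged only on its answers.
    assert gcd_declarative(48, 36) == gcd_euclid(48, 36) == 12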
> """
> Applications [...] should treat equivalent instances or documents in the same way.
> """
>
> Does it mean that an instance in a bundle can be safely replaced by another equivalent instance inside the bundle? This is important because it may be understood that a bundle identifier identifies the exact set of PROV statements given in a PROV document, or that the bundle identifier simply identifies the logical content of the bundle, or even any content that logically "implies" the content of the bundle in the said document. The definitions of equivalence and inference suggest that it is the latter option. Note that this is something that may have to be understood when writing PROV documents as RDF datasets, given that RDF 1.1 does not specify any particular relationship between a graph IRI and the content of a named graph.

Since validity and equivalence are optional, this is not a formal requirement but a guideline; what it means for an application to treat equivalent instances/documents "in the same way" is application specific, and there are natural settings where it makes sense for an application (even one that cares about validity) to have different behavior for equivalent documents. We give one example of formatting/pretty printing. You give some additional examples; digital signing is a third. Because we have no way of circumscribing what applications might do, or what it means for an application to treat documents "in the same way", we just leave this as a guideline.

Proposed solution: Make it clearer that this is a guideline, not a formal requirement; give additional examples such as the above.

> Section 2.4:
>
> """
> Merging
>
> Merging is an operation that takes two terms and compares them to see if they are equal, or can be made equal by substituting an existential variable with another term.
> """
>
> Normally, merging is an operation that takes two terms and produces a single term. I do not understand what this definition has to do with "merging". Afterwards, it is said:
>
> """
> Merging two terms t,t' results in either substitution S such that S(t) = S(t'), or failure indicating that there is no substitution that can be applied to both t and t' to make them equal.
> """
>
> Now, merging results in a substitution. Again, this is at odds with standard definitions of "merging".

The high-level description we give first is consistent with what comes later, but can be rewritten to be clearer. For terms, "merging" is exactly unification in the usual first-order logic / logic programming sense, as we state in a remark. For predicates that carry attribute lists, things are more complicated, since key constraints require the attribute lists to be unioned, not unified in the usual sense.

Proposed solution:
- Use "unification" for "merging" at the level of terms
- Declaratively describe unification as producing "either failure or a substitution that makes both sides equal", as well as giving the (standard) algorithm
- Retain "merging" for the nonstandard operation on predicates that unifies the term arguments and unions the lists of attributes.
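For what it is worth, here is a minimal Python sketch of term-level unification in the sense of the second bullet above. It is illustrative only: it assumes terms are either constants or existential variables (as in PROV instances) and ignores the attribute-list merging that the spec handles separately.

    class Var:
        # An existential variable, identified by name.
        def __init__(self, name):
            self.name = name
        def __repr__(self):
            return "?" + self.name

    def walk(t, subst):
        # Follow the substitution until reaching a constant or an unbound variable.
        while isinstance(t, Var) and t.name in subst:
            t = subst[t.name]
        return t

    def unify(args1, args2):
        # Return a substitution making the two argument lists equal, or None on failure.
        if len(args1) != len(args2):
            return None
        subst = {}
        for s, t in zip(args1, args2):
            s, t = walk(s, subst), walk(t, subst)
            if isinstance(s, Var) and isinstance(t, Var) and s.name == t.name:
                continue                      # already the same variable
            if isinstance(s, Var):
                subst[s.name] = t
            elif isinstance(t, Var):
                subst[t.name] = s
            elif s != t:                      # two distinct constants: no unifier
                return None
        return subst

    # e.g. unifying the argument lists of used(a, e1, t) and used(a, e2, "2012-01-01")
    print(unify(["a", Var("e1"), Var("t")], ["a", Var("e2"), "2012-01-01"]))
    # -> {'e1': ?e2, 't': '2012-01-01'}

Failure (None) here plays the same role as in the spec: it signals that no substitution can reconcile the two statements.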
> """
> Applying definitions, inferences, and constraints
> """
>
> This section is really not needed if all is defined in terms of FOL. This section very strongly suggests one way of implementing PROV-CONSTRAINTS, which is not good. The description, and what is presented later in the normative sections, almost imposes inference materialisation as the normal way to implement all the logical operations on PROV statements (even though the spec says later, in passing, that implementations that lead to the same results are OK).
>
> Using the notion of "failure" again strongly suggests that the spec is implemented by applying the operations indicated in this section and in Section 6, when in fact any consistency checker would be alright too.

As noted above, we disagree that rewriting everything in terms of pure first-order logic would lead to a satisfactory specification (as opposed to a satisfactory research paper, say). The goal of the non-normative section here is essentially to link the (implicit) declarative semantics of the first-order theory, which we described informally earlier, with the procedural way in which normalization handles this behavior.

Proposed solution: As noted above, give the declarative specification as an annex; make it more explicit that the goal of the specification is to give a procedural implementation that is definitely correct. Alternatively, we could make the declarative specification itself the normative part, and make the rest non-normative, with a comment that it is one correct way to implement.

> """
> Checking ordering, typing, and impossibility constraints
>
> The ordering, typing, and impossibility constraints are checked rather than applied. This means that they do not generate new formulas expressible in PROV,
> """
>
> This sentence again strongly suggests that constraints are "applied", in the sense defined previously, indicating that implementations should follow the operations mentioned previously, when in fact there is no reason to force any materialisation or "normalisation" at all. This is again supported by the following sentence:
>
> "Checking such constraints follows a saturation strategy similar to that for normalization"
>
> What follows is again dictating an implementation when everything is in fact just logical formulas that have to be consistency-checked.

See above; same response.

> PROV then uses the word "valid" where logic would use "consistent". In logic, a formula is valid iff it is necessarily true (it's a tautology). The use of a different term does not help non-logicians (they would not be disturbed by the term "consistent") but may be a problem for logicians who are used to other terms.

See above, where we have discussed the reasons for the use of the term "valid" in this context.

> """
> Equivalence and Isomorphism
>
> [...] This is similar to the notion of equivalence used in [RDF]
> """
>
> RDF does not define its own notion of equivalence. Equivalence in RDF is simply imposed by its formal semantics, following the standard notion of equivalence in all logics.

OK. It is not clear what needs to change here, provided that our refinement of the notion of equivalence (to be an equivalence relation on all instances, and to point out that it is the same as "theory equivalence" on valid instances) is acceptable.

> """
> [...] to be considered equivalent, the two instances must also be valid
> """
>
> This is very strange, as "equivalence" is defined as a non-reflexive relationship. Again, the definition of equivalence would simply follow from a proper formalisation in FOL or model theory.
>
> Moreover, again, the paragraph strongly suggests relying on "applying" normalisation to check equivalence.
>
> See comments on Section 6.1.

These restate issues already discussed above. Briefly, we disagree that the notion of equivalence should be aligned with that of standard model theory, but we agree that it is better if we can make it an equivalence relation over all instances, even if we only care about its behavior on valid ones.
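To spell out the "theory equivalence" reading mentioned above (a sketch in the spirit of Fagin et al. 2005, not normative text), with $\Sigma$ the PROV constraint theory and $\mathrm{NF}$ the normal form:

$$\text{for valid } I, J:\qquad I \equiv J \;\iff\; \mathrm{Th}(I) \cup \Sigma \text{ and } \mathrm{Th}(J) \cup \Sigma \text{ are logically equivalent} \;\iff\; \mathrm{NF}(I) \cong \mathrm{NF}(J)$$

The last step holds because the normal forms are universal instances for the (weakly acyclic) PROV dependencies, as discussed further below.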
> """
> From Instances to Bundles and Documents
>
> [...] a toplevel instance consisting of the set of statements not appearing within a bundle
> """
>
> This /seems/ to indicate that a statement in the top-level instance cannot be repeated in a bundle, while the PROV data model does not have this constraint.

This is unclear. Proposed solution: rewrite to "a toplevel instance consisting of statements that appear outside of any bundle", to make it clear that appearing in the toplevel instance does not rule out appearing inside a bundle too.

> """
> Analogously to blank nodes in [RDF], the scope of an existential variable in PROV is the instance level
> """
>
> This is true for RDF graphs, but RDF doesn't say anything about the scope of bnodes in complex structures formed out of multiple RDF graphs, such as RDF datasets. This is therefore a constraint that must be indicated and highlighted in PROV.

We believe that it is clear that any complex structure built up out of RDF graphs would, at least, have the behavior we describe, because the blank nodes are scoped at the level of each individual graph within it. To do otherwise would cause huge problems with compositionality, but it does appear that this has not been specified (nor seems likely to be in the near future) in RDF. To avoid confusion, we propose to revise this to avoid reference to RDF blank nodes, and just say "the scope of an existential variable in PROV is at the instance level". This is implicit in the notion of equivalence of instances (where bijective renaming of existential variables preserves equivalence) and of documents (where the existential variables in different instances are considered independently).

> Section 3:
>
> This section uses vocabulary that is not yet defined normatively at this position in the document.
>
> """
> If determining whether two PROV instances or documents are equivalent, an application must determine whether their normal forms are equal
> """
>
> This is once more dictating a way of implementing, while relying on normal forms should be completely irrelevant, as indicated by the following sentence.

This sentence merely defines equivalence to mean that the normal forms of the two documents/instances are equal (actually, this should be isomorphic). This in turn (declaratively) means that the two documents/instances are valid and logically equivalent with respect to the declarative theory. Any implementation that determines equivalence has therefore also determined that their normal forms are isomorphic, whether or not these normal forms are explicitly constructed. Proposed solution: Rewrite this to make the distinction between specification and implementation clearer here.

> Section 4.1
>
> This section defines equivalence of syntactic constructs that are purely a PROV-N issue. There is no reason to put this in the logic of PROV structures.

This is discussed above; briefly, the syntactic desugaring rules need to be stated somewhere, and do not seem to belong in PROV-DM or PROV-N. An alternative would be for PROV-DM to give a formal abstract syntax and for one of PROV-DM or PROV-N to specify the expansion/desugaring rules.
This might have advantages, but the group feels that this additional material would detract from the simplicity of the other documents, for a purpose that only becomes evident for constraint checking.

> Section 5.1
>
> Merging is defined in a strange way, with a procedure, which makes it difficult to grasp what notion it is supposed to convey.

Merging/unification is discussed above; briefly, we plan to add its declarative specification and clarify that this is basically standard unification. (There is already a remark saying this.)

> Section 5.2
>
> It is never specified explicitly that "strictly-precedes" is irreflexive. Therefore, the algorithm given in Section 6.1 for checking validity, which relies on this assumption, provides strictly more information than the definition of validity.

Good point; since section 6.1 is also normative, what is there is correct but unclear. Proposed solution: Make irreflexivity of strictly-precedes a formal constraint.

> Section 6.1
>
> """
> A PROV instance is valid if its normal form exists and satisfies all of the validity constraints
> """
>
> Here, it is not clear what "satisfies" means. Satisfaction is a standard term in logic, provided that a proper semantics is defined. Here, there is no proper semantics, so the notion of satisfaction would have to be defined. Assuming that PROV statements are indeed mapped to FOL, and constraints expressed as FOL axioms, then the standard notion of "satisfaction" would not work here. For example, the empty document doesn't satisfy any of the constraints (an empty FOL theory only entails tautologies, a.k.a. valid formulas, in the usual logical sense of "valid").
>
> There wouldn't be any problem, and it would be very much shorter, with a proper first-order semantics.

We disagree that making the specification shorter and more declarative would be appropriate, but agree that this passage is unclear about what "satisfies" means. Proposed solution: Align terminology and make sure that we either define "satisfies" or use a clearer term for the constraint-checking process.

> In this section, "equivalence" does not require validity, as opposed to Section 2.4.

Good point; this and other discussion needs to be revised anyway following the discussion above.

> Equivalence is defined according to the notion of isomorphic normal form, which is never normatively defined (the only definition being in the non-normative Section 2.4). In any case, there is no reason to rely on isomorphisms at all, if only the logic of PROV were properly defined.

It is also the case that equivalence (for valid instances) is the same as logical equivalence of theories. This is because the normal form is a universal instance, and two theories are equivalent (modulo the PROV FOL axiomatization) iff they have isomorphic universal instances. For perhaps not very good reasons, such as conciseness and a perceived need to avoid making the constraints document too formalism-heavy, the spec does not include the formal details explicitly linking the procedural and declarative specifications. We plan to address this by including the declarative specification as a non-normative appendix.

> Unfortunately, since equivalence is given in terms of normal forms, and normal forms are given purely in procedural terms, there can be no sensible justification for two instances to be non-equivalent. The justification would be "because the procedure gave us different results", which isn't satisfactory at all.
> This is again a drawback of not providing an appropriate logical account of PROV statements.

By definition, checking equivalence (for valid instances) means *determining* whether the normal forms are isomorphic. This does not require materializing the normal forms or computing them in a specific way, only producing the same answers as if this had been done. Any algorithm that determines that two instances are not equivalent has essentially determined that their normal forms are different. This is just because any inequivalent (valid) instances have different universal instances, obtained by the chase procedure of Fagin et al. This is true for any FOL theory consisting of tuple-generating and functional dependencies, and has nothing to do with any specific algorithm. It would not change (for valid instances) if we took the definition of equivalence to be equivalence modulo the FOL axiomatization, because it is a theorem (Fagin et al. 2005) that two finite structures are equivalent modulo a weakly acyclic TGD+EGD theory precisely when their universal instances are isomorphic, i.e., precisely when their normal forms are isomorphic.

Since the normal forms are essentially the "meaning" of the instance either way we define equivalence, we chose to give the definition in procedural terms that we believe will be more immediately accessible to the expected audience of PROV-CONSTRAINTS. This partly reflects the process the group followed in developing and agreeing on the constraints; we are assuming that the target audience for PROV-CONSTRAINTS is primarily implementors of (Semantic) Web tools who may not have a formal background, and the group consensus is that presenting the definitions of validity and equivalence in easily grasped (and easily implemented) procedural terms is preferable for making the constraints document accessible to this audience. However, we agree it would be helpful to make it clear that (in)equivalence (for valid instances) corresponds to logical (in)equivalence, as a starting point for any alternative implementations of the declarative specification. The idea of making the declarative specification normative and providing the normalization algorithm as a non-normative suggested reference implementation is attractive, but ultimately we feel that doing so would be misleading to developers who may not be able to work out how to implement the FOL specification efficiently without reading the non-normative materials, making PROV-CONSTRAINTS inaccessible to most of its intended audience.

> Best regards,
> --
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél: +33(0)4 77 42 66 03
> Fax: +33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/