> Dear PROV working group,
>
> I'm sorry to send my comments so late, but I hope you will pay close attention to what I'm saying here.
>
> I've read the PROV data model and PROV constraints and, while I find PROV-DM satisfying, I have concerns about PROV Constraints.
>
> The document tries to define a kind of semantics for PROV-DM, but it does not actually define a proper semantics at all. And since it does not truly define a formal semantics, it has to redefine the terms that are borrowed from logic, and subtly diverges from the standard notions found in logic. For someone who knows logic, this makes the document larger, more complicated than it should be, more difficult to read, and more difficult to implement. For someone who does not know logic, it is not made any simpler; on the contrary.

We would like to clarify that we are not attempting to define a semantics (in the sense of model theory or programming language semantics) for PROV in PROV-CONSTRAINTS. We may do this in a future version of PROV-SEM. What PROV-CONSTRAINTS does is define a subset of PROV documents, currently called "valid" by analogy with the notion of validity in other Web standards such as XML, CSS, and so on. We agree that it would be best to avoid redefining standard terminology from logic in nonstandard ways, and you are correct that "valid" means something different in logic than the sense in which it is usually used in Web standards.

Possible actions:
- change to consistent/consistency checking
- keep as is, for compatibility with previous standards, but flag the terminological difference

> To give an example before getting into the details, let us examine the notion of PROV-equivalence in Section 2.4. Equivalence is a well known and understood notion in maths and logic, which is necessarily reflexive. In PROV, equivalence isn't reflexive! For X to be PROV-equivalent to Y, it is required that both X and Y are PROV-valid, which is the PROV term used to say "consistent".

This issue was already discussed within the group, and we could not come to an agreement on how equivalence should behave on invalid instances. Therefore, we decided not to define equivalence on invalid instances. From a mathematical point of view, we only define equivalence as a relation over valid documents/instances, not all instances. This avoids the problem of deciding what to do with equivalence for invalid instances.

By analogy, consider a typed programming language. An expression 1 + "foo" is not well-typed; technically one could consider a notion of equivalence on such expressions, so that, for example, 1 + "foo" would be equivalent to 1 + 1 + "foo". But these ill-typed expressions are (by the definition of the language) not allowed. Similarly, for applications that care about validity, invalid PROV documents can be ignored, so (to us) there seems to be no negative consequence to defining equivalence to hold only on this subset of documents, or to defining all invalid documents to be equivalent (as would follow from the logical definition of equivalence). However, for other applications, such as information retrieval, it is not safe to assume that an invalid instance is equivalent to "false"; we can imagine scenarios where an application wants to search for documents similar to an existing (possibly invalid) document. If the definition of equivalence considered all invalid documents equivalent, there would be a huge number of matches with no (intuitive) similarity to the query document.
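To make that last point concrete, here is a brief sketch in ordinary first-order terms (not text from the spec), writing $\Sigma$ for the PROV constraint theory and $\mathrm{Th}(I)$ for the formulas asserted by an instance $I$:

$$\text{If } \mathrm{Th}(I) \cup \Sigma \models \bot \text{ and } \mathrm{Th}(J) \cup \Sigma \models \bot, \text{ then } \mathrm{Th}(I) \cup \Sigma \models \varphi \text{ and } \mathrm{Th}(J) \cup \Sigma \models \varphi \text{ for every } \varphi.$$

So, under the purely logical reading, any two invalid instances I and J come out equivalent modulo the constraint theory, however different their content.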
We propose to resolve your concern as follows:
- specify that equivalence is an equivalence relation on *all* documents
- specify that no invalid document is equivalent to a valid one
- specify equivalence between valid documents as already done
- leave it up to implementations how (if at all) to test equivalence between different invalid documents.

> If a proper semantics were defined, there would be no way the document could diverge from established logical concepts. There would be fewer chances for mistakes in the definitions and in the rules later. There would be more flexibility in the way implementations support the semantics.
>
> I see two ways out of this:
> 1) define an ad hoc model theory for PROV-DM, along the lines of what is sketched in PROV-SEM;
> 2) rely on a well known logic (such as FOL), to which all PROV statements are mapped, and add FOL axioms to model the constraints.
>
> 1) is ambitious and I understand there is not enough time to investigate it properly. But 2) is really easy. PROV statements already look a lot like FOL predicates, and PROV Constraints already relies heavily on FOL notions and notations.

This is intentional; the goal was to reuse as much of the standard techniques from logic, and particularly database theory, as possible. However, our audience (as reflected by the composition of the WG) is not expected to already be familiar with first-order logic, so we cannot presuppose that background. Moreover, writing an arbitrary FOL axiomatization has its own problems: since there is currently no standard way to do this, we would have to restate a lot of the standard definitions in order to make the specification self-contained (as we have already done). In addition, an arbitrary FOL theory is not guaranteed to be decidable, even over finite models. We resolved that the constraints document had to demonstrate decidability/computability, as a basic prerequisite for implementability. Simply giving a set of FOL axioms on its own would not be enough to do this, and would leave (the vast majority of) implementors not familiar with FOL theorem proving, databases, or constraint solving at sea with respect to implementation.

Proposed solution: Add non-normative material to an appendix / sec. 2 giving a FOL axiomatization, possibly with a proof of soundness/completeness with respect to the algorithm in the spec.

> Definitions and inferences of 4.1 are not needed if the semantics is defined on the abstract syntax. The abstract syntax is not PROV-N and does not have "syntactic sugar" or "syntactic shortcuts". Meaning, relations in the abstract syntax always contain all the arguments, optional or not, possibly with existential variables. The short forms only exist in the surface syntax.

This is not true; PROV-DM and PROV-N nowhere specify how missing arguments are to be expanded. You are correct that Definition 1 (which expands short forms) is in a sense implicit in PROV-DM, which only discusses the long forms and their optional arguments, but it is not said explicitly in either PROV-DM or PROV-N how the PROV-N short forms are to be expanded to PROV-DM. Furthermore, Definitions 2-4 deal with special cases concerning optional/implicit parameters which are definitely not explicit anywhere else. We recognize that there is a certain amount of PROV-N centrism in these definitions, but since PROV-N is formally specified and the abstract syntax is not, we feel it is important to make fully clear how arbitrary PROV-N can be translated to the subset of PROV-N that corresponds to the abstract syntax of PROV-DM. This is to ensure that there is no room for misinterpretation among multiple readers, who may expect different conventions for expansion/implicit parameters (even if the rules we specified seem "obvious").
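As a hedged illustration of the kind of expansion Definition 1 performs (the argument positions and names below are schematic, not quoted from the spec), a short form with omitted optional arguments is completed with fresh existential variables:

$$\mathit{wasGeneratedBy}(e) \;\rightsquigarrow\; \mathit{wasGeneratedBy}(\mathit{gen};\, e,\, \mathit{a},\, \mathit{t},\, [\,])$$

where gen, a, and t are fresh existential variables standing for the unspecified identifier, activity, and time, and [] is the empty attribute list.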
> Inferences in 4.2, 4.3, 4.4 and 4.5 are all writable as FOL axioms.

In fact, they already are, modulo trivial syntactic distinctions, as explained in sec. 2. We plan to address this by giving the axiomatization explicitly as a non-normative appendix.

> Constraints require equality and an infinite total order (Time with order "precedes"). This does not make the logic more complicated than FOL.

Agreed; this is consistent with the specification as is. (However, in fact it is fine to limit attention to a finite partial order, because event ordering does not necessarily correspond to a fixed timeline ordering, to avoid e.g. time-zone centrism.)

> Detailed comments:
> -----------------
>
> Section 1.2:
> """
> We define validity and equivalence in terms of a concept called normalization.
> """
>
> This is not required and suggests that implementations have to rely on normalisation. The choice of formalisation, being entirely procedural, imposes that normalisation is explicitly mentioned in almost all definitions, which in turn requires repeatedly noting that implementations are not forced to produce normal forms. This obfuscates the document, which could be smaller and simpler.

Just saying that we *define* validity and equivalence in terms of a normalization procedure does not require that all implementations explicitly perform normalization. We discussed this issue extensively, and one consequence is that the implementation criteria for the constraints document will only test the extensional behavior of validity/equivalence checks; implementations only need to classify documents as valid/invalid/equivalent etc. in the same way as the reference implementation, they do not have to "be" the reference implementation. Nevertheless, you are correct that, as written, it is difficult to see how else one could implement the specification. In fact, you are correct that there is a simple, declarative specification, via a FOL theory, of what the normalization algorithm does, which could be used as a starting point for people with a formal background or those who wish to implement the specification in some other way.

However, we disagree that it would improve the specification to adopt the declarative view as normative. Making the document smaller and simpler in this way would detract from its usefulness to implementors who are not already experts in computational logic. In other words, we recognize that some implementors may want to check the constraints in other ways, but we believe that the algorithm we used to specify validity and equivalence is a particularly good default, because it sits within a well-understood formalism known from database constraints and finite first-order model theory. The normal forms are essentially "universal instances" in the sense of Fagin et al. 2005, and the algorithm we outline is easily seen to be in polynomial time; in contrast, simply giving a FOL theory on its own gives no guarantee of efficiency or even decidability.

Proposed solution:
- Present the declarative specification as a (non-normative) appendix, building on the discussion in sec. 2. Logicians ought to be able to understand the specification just by reading sec. 2 and the appendix.
- Keep the procedural specification as is, but clarify that it is just one of many possible ways of implementing the declarative specification, and that any other way that gives the same answers to validity/equivalence checks is allowed.
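To give a flavour of what an entry in such an appendix could look like, here is a hedged sketch of one inference (roughly, that communication between activities involves some entity generated by the informant and used by the informed activity) written as a first-order tuple-generating dependency; the argument lists are abbreviated and schematic rather than quoted from the spec:

$$\forall a_2, a_1.\; \mathit{wasInformedBy}(a_2, a_1) \;\Rightarrow\; \exists e, t_1, t_2.\; \mathit{wasGeneratedBy}(e, a_1, t_1) \wedge \mathit{used}(a_2, e, t_2)$$

Axioms of this shape (tuple-generating dependencies), together with the equality-generating dependencies arising from the uniqueness/key constraints, are exactly the class for which the chase-based normalization used in the document is well understood.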
> """
> Definitions, inferences, and uniqueness constraints can be applied to normalize PROV instances, and event ordering, typing, and impossibility constraints can be checked on the normal form to determine validity. Equivalence of two PROV instances can be determined by comparing their normal forms.
> """
>
> This also strongly suggests normalising by "applying" constraints, as later defined, although it is said later that normalisation is not required in implementations.

We disagree that this strongly suggests normalization; we discussed the wording extensively within the group, since several members also felt it was important to avoid "specifying the algorithm". Simply saying that something can be done in a certain way does not say that it MUST be done that way; we were very careful not to make the adoption of the algorithm a formal requirement.

By analogy, one can define greatest common divisors procedurally by giving Euclid's algorithm, or declaratively by a formula that specifies the greatest-common-divisor property. The latter is mathematically cleaner but gives little indication of how one can efficiently compute the GCD. Similarly, specifications such as XQuery describe the semantics of expressions by giving an algorithm that serves as a reference implementation for the language, but this is not required or even encouraged as the only or best way to implement XQuery. To facilitate implementation, we chose to specify in terms of an algorithm that presents one (reasonably efficient, but in any case clearly defined) default way to determine validity or equivalence, without specifying that it is the only (or best) way. Moreover, it is true that normalization is a somewhat extraneous concept, but it is a totally natural one from the point of view of finite model theory; we are working within a class of constraints that admits universal solutions, and the normal form is a universal solution.

However, this issue arose relatively late in the process, and we did miss some places where the document gives a misleading impression that normalization is required to implement the spec.

Proposed solution: Revise this and other paragraphs to make sure that normalization is presented as an auxiliary notion used to specify validity and equivalence. This may be tied in with the non-normative declarative specification, which implementors may use as a starting point for alternative implementations.

> """
> For PROV documents, validity and equivalence amount to checking the validity or pairwise equivalence of their respective documents.
> """
>
> should be: "... of their respective instances."

Thanks. Proposed solution: adopt the change.

> """
> This specification defines how validity and equivalence are to be checked, if an application elects to support them at all.
> """
>
> It was said before that implementations can choose the way they check validity or equivalence. Again, this is a very strong suggestion that implementations should follow the presented procedures.

Thanks, we agree this is too strong; it escaped our notice when revising to make it clearer that we are not specifying the algorithm. Proposed solution: Rewrite to "This specification defines validity and equivalence. If an application elects to support validity and equivalence checking, then it must define these concepts as specified here."
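As a concrete rendering of the GCD analogy above, here is a minimal Python sketch (illustrative only, not part of any spec); both functions compute the same thing, but only the second tells an implementor how to compute it sensibly:

    def gcd_declarative(a, b):
        # Declarative specification: the largest d that divides both a and b.
        return max(d for d in range(1, max(a, b) + 1) if a % d == 0 and b % d == 0)

    def gcd_euclid(a, b):
        # Procedural specification: Euclid's algorithm.
        while b:
            a, b = b, a % b
        return a

    # Extensionally the two agree; an implementation is judged only on its answers.
    assert gcd_declarative(48, 36) == gcd_euclid(48, 36) == 12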
> """
> Applications [...] should treat equivalent instances or documents in the same way.
> """
>
> Does it mean that an instance in a bundle can be safely replaced by another equivalent instance inside the bundle? This is important because it may be understood that a bundle identifier identifies the exact set of PROV statements given in a PROV document, or that the bundle identifier simply identifies the logical content of the bundle, or even any content that logically "implies" the content of the bundle in the said document. The definitions of equivalence and inference suggest that it is the latter option. Note that this is something that may have to be understood when writing PROV documents as RDF datasets, given that RDF 1.1 does not specify any particular relationship between a graph IRI and the content of a named graph.

Since validity and equivalence are optional, this is not a formal requirement but a guideline; what it means for an application to treat equivalent instances/documents "in the same way" is application specific, and there are natural settings where it makes sense for an application (even one that cares about validity) to have different behavior for equivalent documents. We give one example of formatting/pretty printing. You give some additional examples; digital signing is a third. Because we have no way of circumscribing what applications might do, or what it means for an application to treat documents "in the same way", we just leave this as a guideline.

Proposed solution: Make it clearer that this is a guideline, not a formal requirement; give additional examples such as the above.

> Section 2.4:
>
> """
> Merging
>
> Merging is an operation that takes two terms and compares them to see if they are equal, or can be made equal by substituting an existential variable with another term.
> """
>
> Normally, merging is an operation that takes two terms and produces a single term. I do not understand what this definition has to do with "merging". Afterwards, it is said:
>
> """
> Merging two terms t,t' results in either substitution S such that S(t) = S(t'), or failure indicating that there is no substitution that can be applied to both t and t' to make them equal.
> """
>
> Now, merging results in a substitution. Again, this is at odds with standard definitions of "merging".

The high-level description we give first is consistent with what comes later, but can be rewritten to be clearer. For terms, "merging" is exactly unification in the usual first-order logic / logic programming sense, as we state in a remark. For predicates that carry attribute lists, things are more complicated, since key constraints require the attribute lists to be unioned, not unified in the usual sense.

Proposed solution:
- Use "unification" for "merging" at the level of terms
- Declaratively describe unification as producing "either failure or a substitution that makes both sides equal", as well as giving the (standard) algorithm
- Retain "merging" for the nonstandard operation on predicates that unifies the term arguments and unions the lists of attributes.
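For what it is worth, here is a minimal Python sketch of term-level unification in the sense of the second bullet above. It is illustrative only: it assumes terms are either constants or existential variables (as in PROV instances) and ignores the attribute-list merging that the spec handles separately.

    class Var:
        # An existential variable, identified by name.
        def __init__(self, name):
            self.name = name
        def __repr__(self):
            return "?" + self.name

    def walk(t, subst):
        # Follow the substitution until reaching a constant or an unbound variable.
        while isinstance(t, Var) and t.name in subst:
            t = subst[t.name]
        return t

    def unify(args1, args2):
        # Return a substitution making the two argument lists equal, or None on failure.
        if len(args1) != len(args2):
            return None
        subst = {}
        for s, t in zip(args1, args2):
            s, t = walk(s, subst), walk(t, subst)
            if isinstance(s, Var) and isinstance(t, Var) and s.name == t.name:
                continue                      # already the same variable
            if isinstance(s, Var):
                subst[s.name] = t
            elif isinstance(t, Var):
                subst[t.name] = s
            elif s != t:                      # two distinct constants: no unifier
                return None
        return subst

    # e.g. unifying the argument lists of used(a, e1, t) and used(a, e2, "2012-01-01")
    print(unify(["a", Var("e1"), Var("t")], ["a", Var("e2"), "2012-01-01"]))
    # -> {'e1': ?e2, 't': '2012-01-01'}

Failure (None) here plays the same role as in the spec: it signals that no substitution can reconcile the two statements.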
> """
> Applying definitions, inferences, and constraints
> """
>
> This section is really not needed if all is defined in terms of FOL. This section very strongly suggests one way of implementing PROV-CONSTRAINTS, which is not good. The description, and what is presented later in the normative sections, almost imposes inference materialisation as the normal way to implement all the logical operations on PROV statements (even though the spec says later, in passing, that implementations that lead to the same results are OK).
>
> Using the notion of "failure" again strongly suggests that the spec is implemented by applying the operations indicated in this section and in Section 6, when in fact any consistency checker would be alright too.

As noted above, we disagree that rewriting everything in terms of pure first-order logic would lead to a satisfactory specification (as opposed to a satisfactory research paper, say). The goal of the non-normative section here is essentially to link the (implicit) declarative semantics of the first-order theory, which we described informally earlier, with the procedural way in which normalization handles this behavior.

Proposed solution: As noted above, give the declarative specification as an annex; make it more explicit that the goal of the specification is to give a procedural implementation that is definitely correct. Alternatively, we could make the declarative specification itself the normative part, and make the rest non-normative, with a comment that it is one correct way to implement.

> """
> Checking ordering, typing, and impossibility constraints
>
> The ordering, typing, and impossibility constraints are checked rather than applied. This means that they do not generate new formulas expressible in PROV,
> """
>
> This sentence again strongly suggests that constraints are "applied", in the sense defined previously, indicating that implementations should follow the operations mentioned previously, when in fact there is no reason to force any materialisation or "normalisation" at all. This is again supported by the following sentence:
>
> "Checking such constraints follows a saturation strategy similar to that for normalization"
>
> What follows is again dictating an implementation when everything is in fact just logical formulas that have to be consistency-checked.

See above; same response.

> PROV then uses the word "valid" where logic would use "consistent". In logic, a formula is valid iff it is necessarily true (it's a tautology). The use of a different term does not help non-logicians (they would not be disturbed by the term "consistent") but may be a problem for logicians who are used to other terms.

See above, where we have discussed the reasons for the use of the term "valid" in this context.

> """
> Equivalence and Isomorphism
>
> [...] This is similar to the notion of equivalence used in [RDF]
> """
>
> RDF does not define its own notion of equivalence. Equivalence in RDF is simply imposed by its formal semantics, following the standard notion of equivalence in all logics.

OK. It is not clear what needs to change here, provided that our refinement of the notion of equivalence (to be an equivalence relation on all instances, and to point out that it is the same as "theory equivalence" on valid instances) is acceptable.

> """
> [...] to be considered equivalent, the two instances must also be valid
> """
>
> This is very strange, as "equivalence" is defined as a non-reflexive relationship. Again, the definition of equivalence would simply follow from a proper formalisation in FOL or model theory.
>
> Moreover, again, the paragraph strongly suggests relying on "applying" normalisation to check equivalence.
>
> See comments on Section 6.1.

These restate issues already discussed above. Briefly, we disagree that the notion of equivalence should be aligned with that of standard model theory, but we agree that it is better if we can make it an equivalence relation over all instances, even if we only care about its behavior on valid ones.
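To spell out the "theory equivalence" reading mentioned above (a sketch in the spirit of Fagin et al. 2005, not normative text), with $\Sigma$ the PROV constraint theory and $\mathrm{NF}$ the normal form:

$$\text{for valid } I, J:\qquad I \equiv J \;\iff\; \mathrm{Th}(I) \cup \Sigma \text{ and } \mathrm{Th}(J) \cup \Sigma \text{ are logically equivalent} \;\iff\; \mathrm{NF}(I) \cong \mathrm{NF}(J)$$

The last step holds because the normal forms are universal instances for the (weakly acyclic) PROV dependencies, as discussed further below.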
> """
> From Instances to Bundles and Documents
>
> [...] a toplevel instance consisting of the set of statements not appearing within a bundle
> """
>
> This /seems/ to indicate that a statement in the top-level instance cannot be repeated in a bundle, while the PROV data model does not have this constraint.

This is unclear. Proposed solution: rewrite to "a toplevel instance consisting of statements that appear outside of any bundle", to make it clear that appearing in the toplevel instance does not rule out appearing inside a bundle too.

> """
> Analogously to blank nodes in [RDF], the scope of an existential variable in PROV is the instance level
> """
>
> This is true for RDF graphs, but RDF doesn't say anything about the scope of bnodes in complex structures formed out of multiple RDF graphs, such as RDF datasets. This is therefore a constraint that must be indicated and highlighted in PROV.

We believe that it is clear that any complex structure built up out of RDF graphs would, at least, have the behavior we describe, because the blank nodes are scoped at the level of each individual graph within it. To do otherwise would cause huge problems with compositionality, but it does appear that this has not been specified (nor seems likely to be in the near future) in RDF. To avoid confusion, we propose to revise this to avoid reference to RDF blank nodes, and just say "the scope of an existential variable in PROV is at the instance level". This is implicit in the notion of equivalence of instances (where bijective renaming of existential variables preserves equivalence) and of documents (where the existential variables in different instances are considered independently).

> Section 3:
>
> This section uses vocabulary that is not yet defined normatively at this position in the document.
>
> """
> If determining whether two PROV instances or documents are equivalent, an application must determine whether their normal forms are equal
> """
>
> This is once more dictating a way of implementing, while relying on normal forms should be completely irrelevant, as indicated by the following sentence.

This sentence merely defines equivalence to mean that the normal forms of the two documents/instances are equal (actually, this should be isomorphic). This in turn (declaratively) means that the two documents/instances are valid and logically equivalent with respect to the declarative theory. Any implementation that determines equivalence has therefore also determined that their normal forms are isomorphic, whether or not these normal forms are explicitly constructed. Proposed solution: Rewrite this to make the distinction between specification and implementation clearer here.

> Section 4.1
>
> This section defines equivalence of syntactic constructs that are purely a PROV-N issue. There is no reason to put this in the logic of PROV structures.

This is discussed above; briefly, the syntactic desugaring rules need to be stated somewhere, and do not seem to belong in PROV-DM or PROV-N. An alternative would be for PROV-DM to give a formal abstract syntax and for one of PROV-DM or PROV-N to specify the expansion/desugaring rules.
This might have advantages, but the group feels that this additional material would detract from the simplicity of the other documents, for a purpose that only becomes evident for constraint checking.

> Section 5.1
>
> Merging is defined in a strange way, with a procedure, which makes it difficult to grasp what notion it is supposed to convey.

Merging/unification is discussed above; briefly, we plan to add its declarative specification and clarify that this is basically standard unification. (There is already a remark saying this.)

> Section 5.2
>
> It is never specified explicitly that "strictly-precedes" is irreflexive. Therefore, the algorithm given in Section 6.1 for checking validity, which relies on this assumption, provides strictly more information than the definition of validity.

Good point; since section 6.1 is also normative, what is there is correct but unclear. Proposed solution: Make irreflexivity of strictly-precedes a formal constraint.

> Section 6.1
>
> """
> A PROV instance is valid if its normal form exists and satisfies all of the validity constraints
> """
>
> Here, it is not clear what "satisfies" means. Satisfaction is a standard term in logic, provided that a proper semantics is defined. Here, there is no proper semantics, so the notion of satisfaction would have to be defined. Assuming that PROV statements are indeed mapped to FOL, and constraints expressed as FOL axioms, then the standard notion of "satisfaction" would not work here. For example, the empty document doesn't satisfy any of the constraints (an empty FOL theory only entails tautologies, a.k.a. valid formulas, in the usual logical sense of "valid").
>
> There wouldn't be any problem, and it would be very much shorter, with a proper first-order semantics.

We disagree that making the specification shorter and more declarative would be appropriate, but agree that this passage is unclear about what "satisfies" means. Proposed solution: Align terminology and make sure that we either define "satisfies" or use a clearer term for the constraint-checking process.

> In this section, "equivalence" does not require validity, as opposed to Section 2.4.

Good point; this and other discussion needs to be revised anyway following the discussion above.

> Equivalence is defined according to the notion of isomorphic normal form, which is never normatively defined (the only definition being in the non-normative Section 2.4). In any case, there is no reason to rely on isomorphisms at all, if only the logic of PROV were properly defined.

It is also the case that equivalence (for valid instances) is the same as logical equivalence of theories. This is because the normal form is a universal instance, and two theories are equivalent (modulo the PROV FOL axiomatization) iff they have isomorphic universal instances. For perhaps not very good reasons, such as conciseness and a perceived need to avoid making the constraints document too formalism-heavy, the spec does not include the formal details explicitly linking the procedural and declarative specifications. We plan to address this by including the declarative specification as a non-normative appendix.

> Unfortunately, since equivalence is given in terms of normal forms, and normal forms are given purely in procedural terms, there can be no sensible justification for two instances to be non-equivalent. The justification would be "because the procedure gave us different results", which isn't satisfactory at all.
> This is again a drawback of not providing an appropriate logical account of PROV statements.

By definition, checking equivalence (for valid instances) means *determining* whether the normal forms are isomorphic. This does not require materializing the normal forms or computing them in a specific way, only producing the same answers as if this had been done. Any algorithm that determines that two instances are not equivalent has essentially determined that their normal forms are different. This is just because any inequivalent (valid) instances have different universal instances, obtained by the chase procedure of Fagin et al. This is true for any FOL theory consisting of tuple-generating and functional dependencies, and has nothing to do with any specific algorithm. It would not change (for valid instances) if we took the definition of equivalence to be equivalence modulo the FOL axiomatization, because it is a theorem (Fagin et al. 2005) that two finite structures are equivalent modulo a weakly acyclic TGD+EGD theory precisely when their universal instances are isomorphic, i.e., precisely when their normal forms are isomorphic.

Since the normal forms are essentially the "meaning" of the instance either way we define equivalence, we chose to give the definition in procedural terms that we believe will be more immediately accessible to the expected audience of PROV-CONSTRAINTS. This partly reflects the process the group followed in developing and agreeing on the constraints; we are assuming that the target audience for PROV-CONSTRAINTS is primarily implementors of (Semantic) Web tools who may not have a formal background, and the group consensus is that presenting the definitions of validity and equivalence in easily grasped (and easily implemented) procedural terms is preferable for making the constraints document accessible to this audience. However, we agree it would be helpful to make it clear that (in)equivalence (for valid instances) corresponds to logical (in)equivalence, as a starting point for any alternative implementations of the declarative specification. The idea of making the declarative specification normative and providing the normalization algorithm as a non-normative suggested reference implementation is attractive, but ultimately we feel that doing so would be misleading to developers who may not be able to work out how to implement the FOL specification efficiently without reading the non-normative materials, making PROV-CONSTRAINTS inaccessible to most of its intended audience.

> Best regards,
> --
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél: +33(0)4 77 42 66 03
> Fax: +33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/