Initial draft of Dataset Semantics
author AZ
Mon, 28 Jan 2013 09:48:16 +0100
changeset 590 93b224780b7f
parent 571 53c8e57a67c4
child 591 4aaffc81773f
rdf-dataset/index.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/rdf-dataset/index.html	Mon Jan 28 09:48:16 2013 +0100
@@ -0,0 +1,337 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+    <title>RDF Dataset Semantics</title>
+    <style type="text/css">
+        h5 {
+            margin-bottom: 0;
+            padding-bottom: 0.15em;
+            border-bottom: 1px solid #ccc;
+            background: transparent;
+            color: #005a9c;
+            font-weight: normal;
+            text-align: left;
+            font-family: sans-serif;
+            margin-left: 0;
+			margin-left: 1em
+        }
+    </style>
+    <script src='../ReSpec.js/js/respec.js' class='remove'></script>
+    <script class='remove'>
+      var respecConfig = {
+          // specification status (e.g. WD, LCWD, NOTE, etc.). If in doubt use ED.
+          specStatus:           "ED",
+          
+          // the specification's short name, as in http://www.w3.org/TR/short-name/
+          shortName:            "rdf-datasets",
+
+          // if your specification has a subtitle that goes below the main
+          // formal title, define it here
+          // subtitle   :  "an excellent document",
+
+          // if you wish the publication date to be other than today, set this
+          publishDate:  "2013-01-15",
+
+          // if the specification's copyright date is a range of years, specify
+          // the start date here:
+          //          copyrightStart: "2013",
+
+          // if there is a previously published draft, uncomment this and set its YYYY-MM-DD date
+          // and its maturity status
+//          previousPublishDate:  "2004-02-10",
+//          previousMaturity:  "REC",
+
+          // if there is a publicly available Editor's Draft, this is the link
+//
+          edDraftURI:           "http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-datasets/index.html",
+
+          // if this is a LCWD, uncomment and set the end of its review period
+          // lcEnd: "2009-08-05",
+
+          // if there is an earlier version of this specification at the Recommendation level,
+          // set this to the shortname of that version. This is optional and not usually
+          // necessary.
+          //          prevRecShortname: "rdf-concepts",
+
+          // if you want to have extra CSS, append them to this list
+          // it is recommended that the respec.css stylesheet be kept
+          extraCSS:             ["http://dvcs.w3.org/hg/rdf/raw-file/default/ReSpec.js/css/respec.css"],
+
+          // editors, add as many as you like
+          // only "name" is required
+          editors:  [
+              { name: "Antoine Zimmermann", url: "http://www.emse.fr/~zimmermann/",
+                company: "École Nationale Supérieure des Mines de Saint-Étienne", companyURL: "http://www.emse.fr/",
+              },
+          ],
+          //otherContributors: {
+          //    "Contributor": [
+          //{ name: "Sandro Hawke", url:"http://www.w3.org/People/Sandro",
+          //company:"W3C", companyURL: "http://www.w3.org", note:"Initial text"},
+          //]
+          //},
+
+          // authors, add as many as you like. 
+          // This is optional, uncomment if you have authors as well as editors.
+          // only "name" is required. Same format as editors.
+
+          //authors:  [
+          //    { name: "Your Name", url: "http://example.org/",
+          //      company: "Your Company", companyURL: "http://example.com/" },
+          //],
+          
+          // name of the WG
+          wg:           "RDF Working Group",
+          
+          // URI of the public WG page
+          wgURI:        "http://www.w3.org/2011/rdf-wg/",
+          
+          // name (without the @w3c.org) of the public mailing list to which comments are due
+          wgPublicList: "public-rdf-comments",
+          
+          // URI of the patent status for this WG, for Rec-track documents
+          // !!!! IMPORTANT !!!!
+          // This is important for Rec-track documents, do not copy a patent URI from a random
+          // document unless you know what you're doing. If in doubt ask your friendly neighbourhood
+          // Team Contact.
+          wgPatentURI:  "http://www.w3.org/2004/01/pp-impl/46168/status",
+
+          // if this parameter is set to true, ReSpec.js will embed various RDFa attributes
+          // throughout the generated specification. The triples generated use vocabulary items
+          // from the dcterms, foaf, and bibo. The parameter defaults to false.
+          doRDFa: true,
+      };
+
+//  A number of references have been patched into the local berjon.biblio and need to be added to the global biblio in CVS:
+    </script>
+  </head>
+
+  <body>
+
+<section id="abstract">
+  <p>RDF defines the concept of RDF datasets, a structure composed of a distinguished RDF graph and zero or more named graphs, each being a pair comprising an IRI and an RDF graph. While RDF graphs have a formal model-theoretic semantics that determines what arrangements of the world make an RDF graph true, no agreed formal semantics exists for RDF datasets. This document presents the issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF 1.1 Working Group, and specifies several semantics in terms of model theory, each corresponding to a certain design choice for RDF datasets.</p>
+</section>
+
+<section id="sec-introduction">
+    <h2>Introduction</h2>
+
+    <p>The <a href="http://www.w3.org/TR/rdf11-concepts/">Resource Description Framework (RDF)</a> version 1.1 defines the concept of RDF datasets, a notion first introduced by the SPARQL specification [[RDF-SPARQL-QUERY]].  A dataset is defined as a collection of <a title="RDF graph">RDF graphs</a> in which all but one are <a title="named graph">named graphs</a>, each associated with an <a>IRI</a>, and the remaining one is the unnamed default graph [[RDF-CONCEPTS]].  Given that RDF is a data model equipped with a formal semantics [[RDF-MT]], it is natural to try to define what the semantics of datasets should be.</p>
+
+    <p>The RDF 1.1 Working Group was initially chartered to provide such semantics in its recommendation:</p>
+    <blockquote cite="http://www.w3.org/2011/01/rdf-wg-charter">
+        <h5>Required features</h5>
+        <ul><li id="ng">Standardize a model and semantics for multiple graphs and graphs stores [...]</li></ul>
+    </blockquote>
+
+	<p>However, discussions within the Working Group revealed that practitioners hold very different assumptions, each using RDF datasets according to their own intuition of what the datasets mean.  Defining the semantics of RDF datasets requires an understanding of the following two issues:</p>
+	<ul>
+		<li>what the named graph IRIs denote;</li>
+		<li>how the triples in the named graph influence the meaning of the dataset.</li>
+	</ul>
+	
+	<p>Possible choices for the denotation of graph IRIs are:</p>
+	<ul>
+		<li>it denotes the RDF graph in the (name,graph) pair;</li>
+		<li>it denotes the pair itself;</li>
+		<li>it denotes a supergraph of the graph inside the pair;</li>
+		<li>it denotes a container for the RDF graph, that is, a mutable element;</li>
+		<li>it denotes the information resource that can be obtained by dereferencing the IRI (if such resource exists);</li>
+		<li>it denotes an arbitrary resource that is constrained to be in a special relationship with the graph inside the pair;</li>
+		<li>it denotes an unconstrained resource.</li>
+	</ul>
+	
+	<p>Possible choices for the meaning of the triples in the named graphs include:</p>
+	<ul>
+		<li>all the triples in the named graphs and the default graph contribute to the truth of the dataset in the same way triples contribute to the truth of a single graph;</li>
+		<li>the triples of the named graphs are considered part of the knowledge of the default graph;</li>
+		<li>different named graphs indicate different "contexts", or different "worlds", and the triples inside a named graph are assumed to be true in the associated context only; in this case, the default graph can be interpreted as yet another context, or be considered as a "global context" which must hold in all contexts;</li>
+		<li>the named graphs are considered as "hypothetical graphs" which bear the same consequences as their RDF graphs, but they do not participate in the truth of the dataset; this is similar to the "context" option above but it allows a graph to contain contradictions without making the dataset contradictory;</li>
+		<li>the triples are merely quoted without any indication of what they mean; they do not participate in the truth of a dataset.</li>
+	</ul>
+	
+	<p>Depending on the assumptions made with respect to these two issues, the formalization of the semantics of RDF datasets can vary considerably.</p>
+	<p>In this Working Group Note, we examine the proposals that were put forward by Working Group members in the course of a year-and-a-half debate.</p>
+</section>
+	
+<section id="sec-existing-work">
+	<h2>Existing Work</h2>
+
+	<p>We first take a look at existing specifications that could shed light on how the semantics of datasets should be defined. There are three important documents that closely relate to the issue:</p>
+	<ul>
+		<li>the RDF semantics, as standardised in 2004 [[RDF-MT]];</li>
+		<li>the article <i>Named Graphs</i> by Carroll et al., which first introduced the term "named graph" and contains a section on formal semantics;</li>
+		<li>the SPARQL specification [[RDF-SPARQL-QUERY]], which defines RDF datasets and how to query them.</li>
+	</ul>
+	
+	<section id="rdf-semantics">
+		<h3>The RDF semantics</h3>
+		
+		<p>The RDF semantics defines the meaning of a set of RDF graphs: <q cite="http://www.w3.org/TR/rdf-mt/#entail">a set of graphs can be treated as equivalent to its merge, i.e. a single graph, as far as the model theory is concerned</q>.</p>
+		<p>So, a first intuition could be that an RDF dataset, being presented as a collection of graphs, should mean exactly what the set of its named graphs and default graph means. However, this has both formal and conceptual drawbacks.</p>
+		<p>Formally, the semantics of RDF defines a notion of interpretation for a set of triples (i.e., an RDF graph), which then extends easily to a set of RDF graphs. A dataset is neither a set of triples nor a set of RDF graphs. It is a set of <em>pairs</em> (name,graph) together with a distinguished RDF graph. Consequently, defining interpretation and entailment for RDF datasets would require at least an extension of the RDF semantics.</p>
+		<p>Conceptually, it is problematic since one of the reasons for separating triples into distinct (named) graphs is to avoid propagating the knowledge of one graph to the entire triple base. Sometimes, contradicting graphs need to coexist in a store. Sometimes, named graphs are not endorsed by the system as a whole; they are merely quoted.</p>
+	</section>
+	
+	<section id="named-graph-paper">
+		<h3>The Named Graphs paper</h3>
+		
+		<p>In Carroll et al., a named graph is simply defined as a pair comprising an IRI and an RDF graph. The notion of RDF interpretation is extended to named graphs by saying that the graph IRI in the pair must denote the pair itself. This unambiguously answers the question of what the graph IRI denotes. Additionally, ...</p>
+	</section>
+	
+	<section id="sparql">
+		<h3>The SPARQL specification</h3>
+
+		<p>RDF 1.1 defines the notion of RDF dataset identically to SPARQL, which introduced it first. So, in order to understand the semantics of datasets, it is worth looking at how SPARQL uses them. SPARQL defines what the answers to queries posed against a dataset are, but it never defines the notions that are key to a model-theoretic formal semantics: it presents neither interpretations nor entailment. Still, it is worth noticing that an ASK query that only contains a basic graph pattern without variables yields the same result as asking whether the RDF graph in the query is entailed by the default graph. Based on this observation, one may extrapolate that an ASK query containing no variables and only GRAPH graph patterns would yield the same result as dataset entailment.</p>
+		<p>This can be used to define a formal semantics for datasets, as can be seen in Section ?.</p>
+	</section>
+	
+</section>
+    
+
+<section id="options">
+
+	<h2>Formal definitions</h2>
+
+	<p>This section presents the different options proposed, together with their formal definitions. For each option, we include a discussion of the merits of the choice and some of its properties.</p>
+	<p>Each subsection here describes the option informally, before presenting the formal definitions. As far as the formal part is concerned, one has to be familiar with the definitions given in RDF Semantics. We rely a lot on the notion of interpretation and entailment, which are key in model theory.</p>
+	<p>All proposed options share some commonalities:</p>
+	<ul>
+		<li>they behave identically on datasets that do not contain named graphs; precisely, entailment between datasets having no named graph is carried out in the same way as entailment between RDF graphs;</li>
+		<li>they define notions of interpretation and entailment in terms of the corresponding notions in RDF Semantics.</li>
+	</ul>
+
+	<p>In fact, the dependency on RDF semantics is such that most of the dataset semantics below reuse RDF semantics as a black box.  The purpose of a formal semantics for datasets is to determine under what circumstances a dataset can be said to be true or false.  The formalization below indicates that the truth of an RDF dataset can be determined as a function of the truth of an RDF graph, no matter how the latter is determined.  Therefore, instead of giving a precise definition of RDF graph interpretations and entailment, we use the more abstract notion of <a>entailment regime</a>.  Indeed, RDF Semantics does not define a single formal semantics, but multiple ones, depending on which standard vocabularies are endorsed by an application.  Consequently, we will parameterize most of the definitions below with an unspecified entailment regime E.  RDF 1.1 defines the following entailment regimes: simple entailment, LV entailment, RDFS-entailment, D-entailment.  Additionally, OWL defines two other entailment regimes, based on the OWL 2 direct semantics and the OWL 2 RDF-based semantics.</p>
+	<p>For an entailment regime E, we will say E-interpretation, E-entailment, E-equivalence, E-consistency to describe the notions of interpretations, entailment, equivalence and consistency associated with the regime E. Similarly, we will use the terms dataset-interpretation, dataset-entailment, dataset-equivalence, dataset-consistency for the corresponding notions in dataset semantics.</p>
+
+	<section>
+		<h3>Named graphs have no meaning</h3>
+		<p>The simplest semantics defines an interpretation of a dataset as an RDF interpretation of the default graph. The dataset is true, according to the interpretation, if and only if the default graph is true. In this case, any datasets that have equivalent default graphs are dataset-equivalent.</p>
+		<p>This means that the named graphs in a dataset are irrelevant to determining the truth of a dataset. Therefore, arbitrary modifications of the named graphs in a graph store always yield an equivalent dataset, according to this semantics.</p>
+		<h4 class="formal">Formalization</h4>
+		<p>Considering an entailment regime E, a dataset-interpretation of a vocabulary V with respect to E is an E-interpretation of V. Given an interpretation I of V and a dataset D = (G, NG), I(D) is true if and only if I(G) is true.</p>
+
+		<h4 class="ex">Examples of entailments and non-entailments</h4>
+		<p>The following dataset:</p>
+		<pre>{ :s :p :o . }
+:g1 { :a :b :c }</pre>
+		<p>does not entail:</p>
+		<pre>{ :s :p :o .
+:a :b :c .}</pre>
+		<p>but entails:</p>
+		<pre>{}  # empty default graph
+:g2 { :x :y :z }</pre>
+	</section>
+
+	<section>
+		<h3>Default graph as union or as merge</h3>
+		<p>It is sometimes assumed that named graphs are simply a convenient way of sorting the triples, but that all the triples participate in a united knowledge base that takes the place of the default graph.  More precisely, a dataset is considered to be true if all the triples in all the graphs, named or default, are true together.  This description allows two formalizations of dataset semantics, depending on how blank nodes spanning several named graphs are treated.</p>
+
+		<h4 class="formal">Formalization: first version</h4>
+		<p>We define a dataset-interpretation of a vocabulary V with respect to an entailment regime E as an E-interpretation of V. Given a dataset-interpretation I and a dataset D = (G, NG), I(D) is true if and only if I(G) is true and, for every named graph (n, Gn) in NG, I(Gn) is true.</p>
+		<p>Equivalently, I(D) is true if and only if I(H) is true, where H is the merge of all the RDF graphs, named or default, appearing in D.</p>
+		
+		<h4 class="formal">Formalization: second version</h4>
+		<p>We define a dataset-interpretation of a vocabulary V with respect to an entailment regime E as an E-interpretation of V. Given a dataset-interpretation I and a dataset D = (G, NG), I(D) is true if and only if I(H) is true where H is the union of all the RDF graphs, named or default, appearing in D.</p>
+		<p>An alternative presentation of this variant is the following: define I+A to be an extended interpretation which is like I except that it uses A to give the interpretation of blank nodes; define blank(D) to be the set of blank nodes in D. Then I(D) = true if [I+A](D) = true for some mapping A from blank(D) to the set of resources in I; otherwise I(D) = false.</p>
+
+		<h4 class="ex">Examples</h4>
+		
+	</section>
+
+	<section>
+		<h3>The graph IRI denotes the associated graph</h3>
+		<p></p>
+		<h4 class="formal">Formalization</h4>
+		<h4>Examples</h4>
+	</section>
+
+	<section>
+		<h3>Each named graph defines its own "context"</h3>
+		<p>Sometimes, the separation of triples into different named graphs is used to indicate truth in different contexts. Each graph describes a "world".</p>
+		<p>In substance, the formalization says that each RDF graph in a dataset is interpreted separately.  This models the fact that different RDF graphs may hold in different contexts.  This way, graphs that have been put in different "named graph pairs" can contradict each other without making the dataset inconsistent.</p>
+	
+		<h4 class="formal">Formalization</h4>
+		<p>Like RDF interpretations, a dataset-interpretation is relative to a vocabulary V.  Moreover, dataset-interpretations are defined with respect to an entailment regime E.  Let KE be the set of all E-interpretations.  A dataset-interpretation of a vocabulary V is a pair (IG,Con) where IG is an E-interpretation of V and Con is a mapping from V to KE.</p>
+		<p>The truth of a dataset for a dataset-interpretation I = (IG,Con) is defined as follows:</p>
+		<ul>
+			<li>for a named graph pair ng = (n,G), I(ng) is true if Con(n) is defined and Con(n)(G) is true;</li>
+			<li>for a dataset D = (G,NG), I(D) is true if IG(G) is true and for every named graph ng in NG, I(ng) is true;</li>
+			<li>I(D) is false otherwise.</li>
+		</ul>
+		<p>Following standard definitions, we say that a dataset D1 entails a dataset D2 if every dataset-interpretation I that makes D1 true also makes D2 true.</p>
+	</section>
+	
+	<section>
+		<h3>Each named graph is a hypothetical theory</h3>
+		<p></p>
+		<h4 class="formal">Formalization</h4>
+		<p>A dataset-interpretation of a vocabulary V is a pair (IG,IGEXT) where IG is an E-interpretation of V and IGEXT is a mapping from V to the set of RDF graphs.</p>
+		<p>The truth of a dataset for a dataset-interpretation I = (IG,IGEXT) is defined as follows:</p>
+		<ul>
+			<li>for a named graph pair ng = (n,G), I(ng) is true if IGEXT(n) is defined and IGEXT(n) E-entails G;</li>
+			<li>for a dataset D = (G,NG), I(D) is true if IG(G) is true and for every named graph ng in NG, I(ng) is true;</li>
+			<li>I(D) is false otherwise.</li>
+		</ul>
+		<p></p>
+
+		<h4 class="ex">Examples</h4>
+	</section>
+
+
+	<section>
+		<h3>Named graphs as contexts, and the default graph is universal truth</h3>
+		<p>In this case, the named graphs are used to hold statements that are only true in certain circumstances, while the default graph holds in all cases. For instance, terminological knowledge (such as a hierarchy of classes or properties) may be considered universal, while assertional knowledge (facts about instances) changes over time or from source to source.</p>
+
+		<h4 class="formal">Formalization</h4>
+		
+		<h4 class="ex">Examples</h4>
+	</section>
+	
+	<section>
+		<h3>Quad semantics</h3>
+		<p>This approach consists in considering named graphs as sets of quadruples, having the subject, predicate and object of the triples as the first three components, and the graph IRI as the fourth element.  Each quadruple is interpreted similarly to a triple in RDF, except that the predicate denotes a ternary relation rather than a binary one.</p>
+		<p>This semantics extends the semantics of RDF rather than simply reusing it.</p>
+
+		<h4 class="formal">Formalization</h4>
+		<p>A quad-interpretation of a vocabulary V is a tuple (IR,IP,IEXT,IS,IL,LV) where IR, IP, IS, IL and LV are defined as in RDF and IEXT is a mapping from IP into the powerset of (IR &times; IR) union (IR &times; IR &times; IR).</p>
+
+		<p>This option modifies the notion of simple-interpretation, which is the basis for all E-interpretations in any entailment regime E, instead of simply referring to it. It is therefore not clear how it can be extended to arbitrary entailment regimes.</p>
+		
+		<h4 class="ex">Examples</h4>
+
+	</section>
+
+	<section>
+	</section>
+</section>
+
+<section id="declaring">
+	<h2>Declaring the intended semantics</h2>
+	
+	<p>In spite of the RDF Working Group's mission to define a semantics for a multiple graph data model, none of the semantics presented above could obtain consensus. Choosing any one of these proposals would have gone against deployed implementations. Therefore, the Working Group discussed the possibility of defining several semantics, among which an implementation could choose, and of providing the means to declare which semantics is adopted.</p>
+	<p>This approach was eventually not retained, because of the lack of experience and, potentially, the lack of utility, so there is no definitive option for this. Nonetheless, for completeness, we describe possible solutions here.</p>
+
+	<h3>Using vocabularies</h3>
+	<p>A dataset can be described in RDF using vocabularies like voiD [[VOID]] and the SPARQL service description vocabulary. VoiD is used to describe how a collection of RDF triples is organized in a web site or across web sites, giving information about the size of the datasets, the location of the dump files, the IRI of the query endpoints, and so on. The notion of dataset in voiD is a broader and more informal concept than that of RDF dataset. However, an RDF dataset and the graphs in it can be described as voiD datasets, and the information can be completed with the SPARQL service description.</p>
+	<p></p>
+	
+	
+</section>
+
+<section id="references">
+<p>@@@ tbd</p>
+</section>
+
+<section class="appendix informative" id="changes">
+  <h2>Changes</h2>
+  <ul>
+    <li>2013-01-24:  A first version published as an editor's draft.</li>
+  </ul>
+</section>
+
+
+
+</body>
+</html>
+