This specification introduces the notion of RDF spaces (places to store RDF triples) and defines a set of mechanisms for expressing and manipulating information about them. Examples of RDF spaces include: an HTML page with embedded RDFa or microdata, a file containing RDF/XML or Turtle data, and a SQL database viewable as RDF using R2RML. RDF spaces are a generalization of SPARQL's named graphs, providing a standard model with formal semantics for systems which manage multiple collections of RDF data.

Editor's Draft Status

Closing in on FPWD IMHO, but definitely not there yet. The "@@@" flags mark the places where I'm pretty sure something is needed before FPWD.

This text might be refactored into the other RDF documents. The Use Cases and Detailed Example would probably end up in a WG Note.

Introduction

The Resource Description Framework (RDF) provides a simple declarative way to store and transmit information. It also provides a trivial but effective way to combine information from multiple sources: graph merging. This allows information from different people, different organizations, different units within an organization, different servers, different algorithms, etc., to all be combined and used together, without any special processing or understanding of the relationships among the providers.

For some applications, the basic RDF merge operation is overly simplistic, as extra processing and an understanding of the relationships among the providers may be useful. This document specifies a way to conveniently handle information coming from multiple sources, by modeling each one as a separate space, and using RDF to express information about these spaces. In addition to this important concept, we provide a pair of languages (extensions to existing RDF syntaxes) which can be used to store or transmit in one document the contents of multiple spaces as well as information about them.

This approach allows for a variety of use cases (immediately below) to be addressed in a straightforward manner, as shown in .

Use Cases

Each of these use cases is initially described in terms of the following scenario. Details of how each use case is handled by the RDF spaces design are in .

The Example Foundation is a large organization with more than ten thousand employees and volunteers, spread out over five continents. It has branches in 25 different countries, and those divisions have considerable autonomy; they are only loosely controlled by the parent organization (called "headquarters" or "HQ") in Geneva.

HQ wants to help the divisions work together better, and decides a first step is to provide a simple but complete directory of all the Example personnel. Until now, each division has maintained its own directory, using its own technology. HQ would like to be able to find someone's phone number, mailing address, and job title, knowing only their name or email addresses. Later, they hope to extend the system to allow finding people based on their areas of interest and expertise.

HQ decides to use RDF with the vcard-rdf vocabulary. They ask each division to put an up-to-date directory somewhere on the Web, and mail kelly@hq.example.org the URL. They say: "Just tell Kelly the username/password if there is one, or make it only available to the IP address of dir.hq.example.org." Kelly maintains a file which lists the URLs and any username/password combinations she is given.

For the first iteration of the design of their directory, HQ builds a "harvester" which uses Kelly's file for input and fetches the content from each of the provided URLs. It operates behind a caching Web proxy, so that if a division sets the right HTTP headers (e.g., Expires and Last-Modified) the load on its servers will be minimal, even if HQ runs the harvester every few minutes.

The harvester parses the RDF from each data source and loads it into an in-memory triplestore, merging each new graph. Once it's done with all the harvesting, the harvester writes out the merged graph into a Turtle file. The file is published (with access control) where it can be used by several different clients providing directory search services.

Although HQ provides a Web-based client, they also make this raw merged data available. They know people will want many different kinds of clients, including mobile clients, SMS-based clients, command-line clients on different operating systems, and possibly even clients that do something more sophisticated than just looking up a phone number. By making the raw data available, they empower the divisions to build all these other applications.

This "version 1" system is functional, but it has several shortcomings stemming from its use of simple graph merging. The following sections each discuss a shortcoming which can potentially be addressed by the proper modeling of RDF spaces. Some sections include more scenarios (not involving the Example Foundation's federated phonebook) which illustrate the use case. Each section also links to an appendix where a detailed solution is provided.

Minimizing Reloads

An obvious drawback of version 1 is that for any data change in a division database to show through to the users, the harvester must be re-run, to again fetch and merge all the data. HTTP caching can reduce the load on the division servers, but HQ still needs to parse 25 data feeds, and all the clients need to reload the merged data feed.

At first, HQ runs the harvester once a day and explains to users that it takes a day for changes to propagate. Users, however, are still confused and unhappy. A user corrects her phone number in the division database, then sees it still wrong in the HQ database. She's not interested in hearing about "propagation delay"; she wants her phone number to be correct.

Several different technologies are needed to fully provide this feature, but for a start, it would help if the harvester could maintain its state between runs and only replace those parts of the output that had changed. Just storing the merged set of triples is not enough; it needs to store them in such a way that it can replace just the ones coming from a given source.
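As a rough illustration of the kind of storage this paragraph calls for, the following Python sketch keeps triples grouped by the source they came from, so that one source can be replaced without touching the others. The class name `SourceIndexedStore`, its methods, and the representation of triples as plain tuples are assumptions made for this sketch only; blank nodes and change detection are not handled.

```python
from collections import defaultdict


class SourceIndexedStore:
    """Hypothetical store that remembers which source each triple came from."""

    def __init__(self):
        # Maps a source identifier (e.g. a division's URL) to its set of triples.
        self._by_source = defaultdict(set)

    def replace_source(self, source, triples):
        """Swap in the latest triples for one source, leaving the others untouched."""
        self._by_source[source] = set(triples)

    def merged(self):
        """Return the plain set union of all triples currently held.

        Blank nodes are ignored in this sketch, so this is a simple union
        rather than a full RDF merge.
        """
        result = set()
        for triples in self._by_source.values():
            result |= triples
        return result
```

With this shape, a harvester run that finds only one division's feed changed would call `replace_source` once for that division and then republish `merged()`.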

For a discussion of how this use case could be addressed, see

Showing Provenance

@@@ the released db shows which dept supplies each part of the information

Maintaining Derived Data

@@@ namefill is needed, and its results need their own provenance

Distributed Harvesting

@@@ divisions gather from departments who might gather from individuals

Loading Untrusted Datasets

@@@ what if one of the divisions gives you bad quads? It better not mess up provenance. Maybe suggest GSP-style name mangling...?

Showing Revision History

@@@ we want to be able to see all the changes, for auditing, to see what the DB said about anyone at any point in time. (transaction time)

Expressing Past or Future States

@@@ we want to be able to express when someone started and stopped having a particular role various ways, which might not be the time we put this into the db.

Vendor-Neutral SPARQL Backup

@@@ we want to be able to dump the database and load it in a different system

Concepts

Space

The term "space" might change. The final terminology has not yet been selected by the Working Group. Other candidates include "g-box", "data space", "graph space", "(data) surface", "(data) layer", "sheet", and "(data) page".

An RDF space is anything that can reasonably be said to explicitly contain zero or more RDF triples and has an identity distinct from the triples it contains. Examples include:

Examples of things that are not spaces:

Quad and Quadset

We define an RDF quad as the 4-tuple (subject, predicate, object, space).

Informally, a quad should be understood as a statement that the RDF triple (subject, predicate, object) is in the space space.

We define an RDF quadset as a set containing (zero or more) RDF Quads and (zero or more) RDF Triples. A quadset is thus an extension to the concept of an RDF Graph (a set containing zero or more RDF triples) to also potentially include statements about triples being in particular spaces.
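One possible in-memory reading of these definitions is sketched below in Python. The type names `Triple` and `Quad` and the helper `triples_in_space` are illustrative assumptions, and RDF terms are simplified to strings.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    space: str


def triples_in_space(quadset, space):
    """Return the triples that the quadset states are in the given space.

    A quadset is modeled here as a Python set mixing Triple and Quad values;
    bare triples say nothing about any space and are skipped.
    """
    return {
        Triple(q.subject, q.predicate, q.object)
        for q in quadset
        if isinstance(q, Quad) and q.space == space
    }
```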

Dataset

A dataset is defined by SPARQL 1.1 as a structure consisting of:

  1. A distinguished RDF Graph called the default graph
  2. A set of (name, graph) pairs, where name is an IRI and the graph is an RDF Graph. No two pairs in a dataset may have the same name.

This definition forms the basis of the SPARQL Query semantics; each query is performed against the information in a specific dataset.

Although the term is sometimes used more loosely, a dataset is a pure mathematical structure, like an RDF Graph or a set of integers, with no identity apart from its contents. Two datasets with the same contents are in fact the same dataset, and one dataset cannot change over time.

The word "default" in the term "default graph" refers to the fact that in SPARQL, this is the graph a server uses to perform a query when the client does not specify which graph to use. The term is not related to the idea of a graph containing default (overridable) information. The role and purpose of the default graph in a dataset varies with application.

Named Graph

SPARQL formally defines a named graph, following [Carroll], to be any of the (name, graph) pairs in a dataset.

In practice, the term is often used to refer to the graph part of those pairs. This is the usage we follow in this document, saying that a graph is a named graph in some dataset if and only if it appears as the graph part of a (name, graph) pair in that dataset. Note that "named graph" is a relation, not a class: we say that something is a named graph of a dataset, not simply that it is a named graph.

The term is also sometimes used to refer to the slot part of the (name, slot) pairs in a graph store. For example, the text of SPARQL 1.1 Update says, "This example copies triples from one named graph to another named graph". For clarity, we avoid calling these "named graphs" and instead call them "named slots" of the graph store.

Quadset/Dataset Relationship

A quad-equivalent dataset is a dataset with no empty named graphs. A non-quad-equivalent dataset is a dataset in which one or more of its named graphs is empty. Every non-quad-equivalent dataset has a corresponding quad-equivalent dataset formed by removing the (name, graph) pairs where the graph is empty.

Quadsets and quad-equivalent datasets are isomorphic, and given identical declarative semantics in . The isomorphism is:

The phrasing quads in a dataset is thus shorthand for: quads in some quadset which is isomorphic to a given dataset. If the dataset is a non-quad-equivalent dataset, then the isomorphism is to the dataset produced by removing all its empty named graphs.

In order to promote interoperability and flexibility in implementation techniques, so that datasets and quadsets can be used interchangeably, systems which handle datasets SHOULD NOT give significance to empty named graphs.
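The correspondence between quadsets and quad-equivalent datasets can be sketched in Python as follows. This assumes the natural reading in which bare triples correspond to the default graph and each quad corresponds to a triple in the named graph for its space; the function names and the tuple and dict representation are assumptions of this sketch.

```python
def dataset_to_quadset(default_graph, named_graphs):
    """Flatten (default graph, {name: graph}) into a set of 3- and 4-tuples.

    Empty named graphs contribute nothing, so they are not preserved.
    """
    quadset = set(default_graph)
    for name, graph in named_graphs.items():
        for (s, p, o) in graph:
            quadset.add((s, p, o, name))
    return quadset


def quadset_to_dataset(quadset):
    """Rebuild the quad-equivalent dataset from a quadset."""
    default_graph, named_graphs = set(), {}
    for item in quadset:
        if len(item) == 3:
            default_graph.add(item)
        else:
            s, p, o, name = item
            named_graphs.setdefault(name, set()).add((s, p, o))
    return default_graph, named_graphs
```

Round-tripping a dataset through these two functions drops any empty named graphs, which is why a system that attaches meaning to them cannot treat datasets and quadsets interchangeably.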

Graph Store

SPARQL 1.1 Update defines a mutable (time-dependent) structure corresponding to a dataset, called graph store. It is defined as:

  1. A distinguished slot for an RDF Graph
  2. A set of (name, slot) pairs, where the slot holds an RDF Graph and the name is an IRI. No two pairs in a graph store may have the same name.

A "slot" in this definition is an RDF space.

A dataset can be thought of as the state of a graph store, just like an RDF graph can be thought of as the state of a space.
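The state/value distinction can be illustrated with a small Python sketch, in which a mutable store produces immutable snapshots. The class `GraphStore` and its methods are hypothetical names for this illustration, not part of SPARQL 1.1 Update.

```python
class GraphStore:
    """Hypothetical mutable structure: one default slot plus named slots."""

    def __init__(self):
        self.default_slot = set()
        self.named_slots = {}

    def insert(self, triple, name=None):
        """Add a triple to the default slot, or to the slot with the given name."""
        if name is None:
            slot = self.default_slot
        else:
            slot = self.named_slots.setdefault(name, set())
        slot.add(triple)

    def snapshot(self):
        """Return the current state as an immutable dataset value."""
        return (
            frozenset(self.default_slot),
            frozenset((n, frozenset(g)) for n, g in self.named_slots.items()),
        )
```

Two snapshots taken at different times are simply different dataset values; the store is the thing that changed, not either dataset.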

Union and Merge

RDF graphs are usually combined in one of two ways:

This difference is not noticeable when graphs are being expressed in an ordinary RDF syntax, like RDF/XML, RDFa, or Turtle, because they provide no mechanism for transmitting two graphs which have a blank node in common. The difference can appear, however, in systems and languages which handle datasets or in APIs which allow blank nodes to be shared between graphs.

We define a union dataset to be a dataset whose default graph is the union of all its named graphs. Some systems provide special, simplified handling of union datasets.

We define a merge dataset to be a dataset whose default graph is the merge of all its named graphs.
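The difference between the two operations can be sketched in Python, using the usual RDF reading in which a union keeps shared blank nodes shared while a merge first renames blank nodes apart. The representation is a toy assumption: terms are strings, and any term starting with "_:" is treated as a blank node label.

```python
def is_blank(term):
    """Toy convention for this sketch: blank nodes are strings starting with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def union(graphs):
    """Set union: a blank node shared by two graphs stays a single node."""
    result = set()
    for g in graphs:
        result |= g
    return result


def merge(graphs):
    """Rename blank nodes apart per graph, then take the union."""
    result = set()
    for i, g in enumerate(graphs):
        for triple in g:
            # Prefix every blank node label with the index of its graph.
            result.add(tuple(f"_:g{i}_{t[2:]}" if is_blank(t) else t for t in triple))
    return result
```

For two graphs that both mention `_:x`, `union` yields triples about one node and `merge` yields triples about two distinct nodes.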

Semantics

This section specifies a declarative semantics for quads, quadsets, and datasets, allowing them to be used to express knowledge, especially knowledge about spaces. This makes the languages defined in suitable for conveying knowledge about spaces and providing a foundation for addressing the challenges described in .

@@@ the section needs some revision by someone with a good ear for formal semantics, and probably some references to the old and/or new versions of RDF Semantics.

The fundamental notion of RDF spaces is that they can contain triples. This is formalized with the relation CT(S, T), which is informally understood to hold true for any triple T and space S such that S explicitly contains T.

The basic declarative meaning (that is, the truth condition) of RDF quads is this:

The RDF quad (s, p, o, sp) is true in I if and only if CT(I(sp), triple(s, p, o)).

The declarative meaning of a quadset is to simply read the quadset as a conjunction of its quads and its triples. Given the structural mapping between quadsets and datasets, the truth condition for datasets follows:

The RDF dataset (DG, (N1,G1),... (Ni,Gi), ...(Nn,Gn)) is true in I if and only if:

  1. DG is true in I, and
  2. For every (Ni,Gi) (1<=i<=n):
    • For every triple T in Gi:
      • CT(I(Ni),T)
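The dataset truth condition can be read operationally, as in the Python sketch below. The interpretation, the CT relation, and ordinary triple truth are all passed in as parameters, since this document does not say how they would be computed; the function name and parameter names are assumptions. The sketch treats the default graph as true when every one of its triples is true, which ignores the existential reading of blank nodes.

```python
def dataset_is_true(default_graph, named_graphs, interp, ct, triple_true):
    """Evaluate the dataset truth condition under a given interpretation.

    interp:      maps a graph name (IRI) to the space it denotes, i.e. I(Ni)
    ct:          ct(space, triple) -> bool, standing in for the CT relation
    triple_true: triple_true(triple) -> bool, ordinary RDF truth in I
    """
    # Condition 1: the default graph DG is true in I.
    if not all(triple_true(t) for t in default_graph):
        return False
    # Condition 2: every triple of every named graph is contained in the
    # space denoted by that graph's name.
    return all(
        ct(interp[name], t)
        for name, graph in named_graphs.items()
        for t in graph
    )
```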

Some implications of these truth conditions:

We say nothing here about the fact that the truth value of a quad is likely to change over time. Time is orthogonal to RDF semantics, and quads present no fundamentally different issue here. When the world changes state, the truth value of RDF triples or quads might change. This occurs when a triple is put in or taken out of a space, but it also occurs with "normal" RDF when, for instance, someone changes their address and different vcard triples about them become true. Some approaches to handling change-over-time are discussed in and .

Do the named graphs in a dataset include all the triples in the spaces with those names, or only some of them? Aka partial-graph or complete-graph semantics. Assuming partial, but maybe we can say something about how things SHOULD be done?

Dataset Languages

This section contains specifications of languages for serializing quad-equivalent datasets. N-Quads documents and Trig documents have identical semantics, since they each serialize the same structure and follow .

Dataset information may also be conveyed and manipulated using SPARQL or using RDF triple-based tools and languages as per .

N-Quads

The syntax of N-Quads is the same as the syntax of N-Triples, except that a fourth term, identifying an RDF space, may optionally be included on each line, after the "object" term.

Formally, the N-Quads grammar is the N-Triples Grammar modified by removing productions [1] and [2], and adding the following productions:

[1q]    nquadsDoc    ::=    statement? (EOL statement)* EOL?
[2q]    statement    ::=    subject predicate object space? "."
[3q]    space    ::=    IRIREF

The grammar symbols EOL, subject, predicate, object, and IRIREF are defined in the N-Triples Grammar.
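To make the shape of the statement production concrete, here is a deliberately restricted Python sketch that recognizes only statements whose terms are all IRIs. It is not a conforming N-Quads parser: literals, blank nodes, comments, and escape sequences are not handled, and the function name `parse_iri_only_line` is an assumption of this sketch.

```python
import re

# One IRI term in angle brackets (simplified: no escape handling).
IRI = r"<([^<>\s]*)>"

# subject predicate object space? "."
LINE = re.compile(rf"^\s*{IRI}\s*{IRI}\s*{IRI}\s*(?:{IRI}\s*)?\.\s*$")


def parse_iri_only_line(line):
    """Return a 3-tuple (triple) or 4-tuple (quad) for an IRI-only statement."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"not an IRI-only N-Quads statement: {line!r}")
    s, p, o, space = m.groups()
    return (s, p, o) if space is None else (s, p, o, space)
```

A line with three terms yields a triple for the quadset, and a line with a fourth term yields a quad naming the space.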

The following example shows a quadset consisting of two triples and two quads. The quads both use the same triple, but express the fact that it is in two spaces, "space1" and "space2".

<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object2>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1> <http://example.org/space1>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1> <http://example.org/space2>.

Trig

The syntax of Trig is the same as the syntax of Turtle except that (name, graph) pairs can be specified by giving an optional GRAPH keyword, a "name" term, and a nested Turtle graph expression in curly braces.

Formally, the Trig grammar is the Turtle Grammar modified by removing productions [1] and [2], and adding the following productions:

[1g]    trigDoc    ::=    statement*
[2g]    statement    ::=    directive "." | triples "." | naming | wrappedDefault
[3g]    naming    ::=    GRAPH? IRI_REF "{" triples "."? "}"
[4g]    wrappedDefault    ::=    "{" triples "."? "}"
[5g]    GRAPH    ::=    "GRAPH"

The grammar symbols directive, triples, and IRI_REF are defined in the Turtle Grammar.

Parsing a Trig document is like parsing a Turtle document except:

  1. The result is a dataset, not an RDF Graph
  2. The triples generated during parsing of the naming production go into a named graph, with the name being the IRI_REF.
  3. The triples generated during other parsing go into the default graph.

Note that the grammar forbids directives between curly braces and empty curly-brace expressions. Also, note that blank node processing is not affected by curly braces, so conceptually blank node identifiers are scoped to the entire document.

There is no requirement that named graph names be unique in a document, or that triples in the default graph be contiguous. For example, these two Trig documents parse to exactly the same dataset:

# Trig Example 1
    @prefix : <http://example.org/>.
    :a :b 1.
    :s1 { :a :b 10 }
    :s2 { :a :b 20 }
    :s1 { :a :b 11 }
    :s2 { :a :b 21 }
    :a :b 2.
# Trig Example 2
    @prefix : <http://example.org/>.
    :a :b 1,2.
    :s1 { :a :b 10,11. }
    :s2 { :a :b 20,21. }

The same dataset could be expressed in N-Quads as:

# N-Quads for Trig Example 1 and 2
<http://example.org/a> <http://example.org/b> "1"^^<http://www.w3.org/2001/XMLSchema#integer>.
<http://example.org/a> <http://example.org/b> "2"^^<http://www.w3.org/2001/XMLSchema#integer>.
<http://example.org/a> <http://example.org/b> "10"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s1>.
<http://example.org/a> <http://example.org/b> "11"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s1>.
<http://example.org/a> <http://example.org/b> "20"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s2>.
<http://example.org/a> <http://example.org/b> "21"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s2>.

There are several open issues concerning Trig syntax:

  • Should we call this something other than Trig, since it's a bit different? Qurtle? Mugr (multi-graph-rdf)? Turtle2?
  • Are braces around default-graph triples required, optional, or disallowed? Assuming "optional" for now.
  • Is the name prefixed by a keyword? If so, is the keyword "@graph" or "GRAPH"? Assuming optional "GRAPH" for now.
  • Are blank node labels scoped to the document, the curly-brace expression, or the graph name? Assuming document-scope for now. This is Issue-21.
  • Can blank node labels be used as space names? Assuming not, for now.

Conformance

@@@ what to say here? What kind of thing might conform or not conform to this spec?

Detailed Example

This section presents a design for using spaces in constructing a federated information system. It is intended to help explain and help motivate the designs specified in this document.

The example covers the same federated phonebook scenario used in , with each specific use case having an example here.

@@@ Most of this "Detailed Example" section is older and needs re-writing to be synchronized with changes made in the Use Cases.

See the May 10 Version for a now-obsolete run-through of the example.

How to Minimize Reloads

To address the needs described in ... @@@

Folding

This section is experimental.

This section specifies a mechanism and an RDF vocabulary for conveying quads/datasets using ordinary RDF Graphs instead of special syntaxes and/or interfaces. The mechanism is somewhat similar to reflection or reification. The idea is to express each quad as a set of triples in a specialized vocabulary.

Folding allows quads and thus datasets to be conveyed and manipulated using normal triple-based RDF machinery, including RDF/XML, Turtle, and RDFa, but at the cost of some complexity, storage space, and performance. In general, in systems where languages or APIs are available which directly support datasets, folding is neither required nor useful.

As an example, the dataset

@prefix : <http://example.org/>.
:space { :subject :predicate :object }

would fold to these triples:

@prefix : <http://example.org/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
:space rdf:containsTriple [
   a rdf:Triple;
   rdf:subjectIRI "http://example.org/subject";
   rdf:predicateIRI "http://example.org/predicate";
   rdf:objectIRI "http://example.org/object"
].

The terms in the triple are encoded (turned into literal strings, in this example) to provide referential opacity. In the semantics of quads, it does not follow from (a b c d) and a=aa that (aa b c d). Without this encoding of terms as strings, that conclusion would erroneously follow from the folded quad.

Terms in this vocabulary:

rdf:Triple
The class of RDF Triples, each of which is just a triple (3-tuple) of three RDF terms, called its "subject", "predicate", and "object". Triples have no identity apart from their three components.
rdf:subjectIRI
A predicate expressing the relationship to the triple's subject term, when the subject term is an IRI. The value is the IRI (a string) which is the subject-term part of the triple.
rdf:subjectNode
A predicate expressing the relationship to the triple's subject term, when the subject term is a blank node. The value is any RDF Resource; it simply serves as a placeholder, representing the blank node which serves as the subject-term part of the triple.
rdf:predicateIRI
A predicate expressing the relationship to the triple's predicate term. The value is the IRI (a string) which serves as the predicate-term part of the triple.
rdf:objectIRI
A predicate expressing the relationship to the triple's object term, when the object term is an IRI. The value is the IRI (a string) which is the object-term part of the triple.
rdf:objectNode
A predicate expressing the relationship to the triple's object term, when the object term is a blank node. The value is any RDF Resource; it simply serves as a placeholder, representing the blank node which serves as the object-term part of the triple.
rdf:objectValue
A predicate expressing the relationship to the triple's object term, when the object term is a literal. The value is the value which serves as the object-term part of the triple.
rdf:containsTriple
A predicate expressing the relationship between an RDF space and a triple which it contains.

This vocabulary is used in a specific template form, always matching this SPARQL graph pattern:

?sp rdf:containsTriple [ 
   a rdf:Triple;
   rdf:subjectIRI|rdf:subjectNode ?s;
   rdf:predicateIRI ?p;
   rdf:objectIRI|rdf:objectNode|rdf:objectValue ?o 
]

This one template uses SPARQL 1.1 property paths, with alternation using the "|" character. It could also be expressed as six different SPARQL 1.0 (non-property-path) graph patterns.

The terms in this vocabulary only have fully-defined meaning when they occur in the template pattern. When they do, the set of triples matching the template has the same meaning as the quad [ ?s ?p ?o ?sp ].

Folding a dataset is the act of completely conveying the facts in a dataset in RDF triples, using this vocabulary. The procedure is: (1) check for occurrences of the fold template in the default graph; if any occur, abort, since folding is not defined for this dataset; (2) copy the triples in the default graph of the input to the output; (3) for each quad in the input, generate a matching instance of the fold template and put the resulting five triples in the output.

Unfolding a dataset is the act of turning an RDF graph into a dataset, using this vocabulary. The procedure is: (1) make a mutable copy of the input graph, (2) for each match of the fold template, add the resulting quad to the output dataset and delete the five triples which matched the template, (3) copy the remaining triples to the output as the default graph of the dataset.

The fold and unfold functions are inverses of each other. That is, for all datasets D on which fold is defined, D = unfold(fold(D)), and for all graphs G, G = fold(unfold(G)).
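The fold and unfold procedures can be sketched in Python for the simplest case, where every term of every quad is an IRI. Several things here are assumptions of the sketch: the `rdf:` terms are taken to live in the standard RDF namespace, literals are modeled as `("literal", value)` tuples, fresh blank nodes are labeled `_:t0`, `_:t1`, and so on (assumed not to clash with labels already in the default graph), and the step (1) check is approximated by looking for any use of `rdf:containsTriple`. The unfold function assumes its input was produced by fold.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTAINS, TYPE, TRIPLE = RDF + "containsTriple", RDF + "type", RDF + "Triple"
SUBJ, PRED, OBJ = RDF + "subjectIRI", RDF + "predicateIRI", RDF + "objectIRI"


def lit(value):
    """Toy encoding of a string literal, distinct from an IRI term."""
    return ("literal", value)


def fold(default_graph, quads):
    """Return a single graph conveying the default graph plus all quads."""
    out = set(default_graph)
    # Step 1 (approximated): refuse if the fold vocabulary is already in use.
    if any(p == CONTAINS for (_, p, _) in out):
        raise ValueError("fold is not defined for this dataset")
    counter = itertools.count()
    for (s, p, o, sp) in sorted(quads):
        node = f"_:t{next(counter)}"  # fresh blank node for this triple
        out |= {(sp, CONTAINS, node), (node, TYPE, TRIPLE),
                (node, SUBJ, lit(s)), (node, PRED, lit(p)), (node, OBJ, lit(o))}
    return out


def unfold(graph):
    """Return (default_graph, quads) recovered from a folded graph."""
    remaining, quads = set(graph), set()
    for (sp, p, node) in graph:
        if p != CONTAINS:
            continue
        props = {pp: oo for (ss, pp, oo) in graph if ss == node}
        if props.get(TYPE) != TRIPLE or not all(k in props for k in (SUBJ, PRED, OBJ)):
            continue  # not a full template match; leave these triples alone
        s, pr, o = (props[k][1] for k in (SUBJ, PRED, OBJ))
        quads.add((s, pr, o, sp))
        # Delete the five triples which matched the template.
        remaining -= {(sp, CONTAINS, node), (node, TYPE, TRIPLE),
                      (node, SUBJ, props[SUBJ]), (node, PRED, props[PRED]),
                      (node, OBJ, props[OBJ])}
    return remaining, quads
```

Under these assumptions, folding a dataset with one default-graph triple and one quad yields six triples, and unfolding that graph returns the original default graph and quad.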

The functions cannot be composed with themselves (called recursively), since for each of them the domain and range are disjoint. If we were to implicitly convert graphs to datasets (with the graph as the default graph), then fold(fold(D)) would either be an error (if D had any named graphs) or be the same as fold(D). If we were to define unfold2 as an unfold operating on datasets using their default graphs, unfold2(D) = union(D, unfold(default_graph(D))), then unfold2 would be idempotent: unfold2(D) = unfold2(unfold2(D)).

@@@ tbd

Changes