This specification introduces the notion of RDF spaces—places to store RDF triples—and defines a set of mechanisms for expressing and manipulating information about them. Examples of RDF spaces include: an HTML page with embedded RDFa or microdata, a file containing RDF/XML or Turtle data, and a SQL database viewable as RDF using R2RML. RDF spaces are a generalization of SPARQL's named graphs, providing a standard model with formal semantics for systems which manage multiple collections of RDF data.
Closing in on FPWD IMHO, but not there yet. The "@@@" flags mark the places where I'm pretty sure something is needed before FPWD.
This text might be re-factored into the other RDF documents. The Use Cases and Example material would probably end up in a WG Note.
The Resource Description Framework (RDF) provides a simple declarative way to store and transmit information. It also provides a trivial but effective way to combine information from multiple sources, with graph merging. This allows information from different people, different organizations, different units within an organization, different servers, different algorithms, etc., to all be combined and used together, without any special processing or understanding of the relationships among the providers.
For some applications, the basic RDF merge operation is overly simplistic, as extra processing and an understanding of the relationships among the providers may be useful. This document specifies a way to conveniently handle information coming from multiple sources, by modeling each one as a separate space, and using RDF to express information about these spaces. In addition to this important concept, we provide a pair of languages—extensions to existing RDF syntaxes—which can be used to store or transmit, in one document, the contents of multiple spaces as well as information about them.
This approach allows for a variety of use cases (immediately below) to be addressed in a straightforward manner, as shown in .
Each of these use cases is initially described in terms of the following scenario. Details of how each use case might be addressed using the technologies specified in this document are in .
The Example Foundation is a large organization with more than ten thousand employees and volunteers, spread out over five continents. It has branches in 25 different countries, and those divisions have considerable autonomy; they are only loosely controlled by the parent organization (called "headquarters" or "HQ") in Geneva.
HQ wants to help the divisions work together better. It decides a first step is to provide a simple but complete directory of all the Example personnel. Until now, each division has maintained its own directory, using its own technology. HQ wants to gather them all together, building a federated phonebook. They want to be able to find someone's phone number, mailing address, and job title, knowing only their name or email address. Later, they hope to extend the system to allow finding people based on their areas of interest and expertise.
HQ understands that people will want access to the phonebook in many different computing environments and with different languages, social norms, and application styles. Users are going to want at least one Web based user interface (UI), but they will also want mobile UIs for different platforms, desktop UIs for different platforms, and even to look up information via text messaging. HQ does not have the resources to build all of these, so they intend to provide direct access to the data so that the divisions can do it themselves as needed.
Each of the sections below, after the first, contains a new requirement, something additional that users in this scenario want the system to do. Each of these will motivate the features of the technologies specified in the rest of this document.
As a starting point, HQ needs to gather data from each division and re-publish it, in one place, for use by the different UIs.
This is a general use case for RDF, with no specific need for spaces or datasets. It simply involves divisions publishing RDF data on the web (with some common vocabulary and with access control), then HQ merging it and putting it on their website (with access control).
For an example of how this baseline could be implemented, see
A user says: I'm looking at an incorrect phonebook entry. It has the name of the person I'm looking for, but it's missing most of the record. I can't even tell which division the person works for. I need to know who is responsible for this information, so I can get it corrected.
While this might be addressed by including a "report-errors-to" field in each phonebook entry, HQ is looking ahead to the day when other information is in the phonebook — like which projects the person has worked on — which might come from a variety of other sources, possibly other divisions.
For a discussion of how this use case could be addressed, see
It turns out different divisions are using somewhat different vocabularies for publishing their data. HQ writes a program to translate, but they need the output of that program to be correctly attributed, in case it turns out to be wrong.
This use case motivates sharing of blank nodes between named graphs, as seen in the example.
For a discussion of how this use case could be addressed, see
It turns out some divisions do not have centralized phonebooks. Division 3 has twelve different departments, each with its own phonebook. Division 3 can do the harvesting from its departments, but it does not want to be in the loop for corrections; it wants those to go straight back to the relevant department.
For a discussion of how this use case could be addressed, see
A user reports: There's information here that says it's from our department, but it's not. Somehow your provenance information is wrong. We need to see the provenance of the provenance!
For a discussion of how this use case could be addressed, see
Division 14's legal department says: "We're doing an investigation and we need to be able to connect people's names and phone numbers as they used to be. Can you include archival data in the data feed, so we can search the phonebook as it was on each day of September, last year?"
For a discussion of how this use case could be addressed, see
Division 5 says: "We're planning a major move in three months, to a neighboring city. Everybody's office and phone number will have to change. Can we start putting that information in the phonebook now, but mark it as not effective until 20 July? After the move, we'll also need to see the old (no-longer-in-effect) data for a while, until we get everything straightened out."
This use case, contrasted with the previous one, shows the difference between Transaction Time and Valid Time in bitemporal databases. After Division 5's move, the "old" phone numbers are not just the old state of the database; they reflect the old state of the world. It is possible that some time after the move, an error in the pre-move data might need to be corrected. This would require a new transaction time, even though the valid-time has already ended.
Use case sightings:
For a discussion of how this use case could be addressed, see
@@@ we want to be able to dump the database and load it in a different system
@@@ This doesn't seem to belong here. Maybe we have Federated Phonebook use cases, and *other* ones, too?
The term "space" might change. The final terminology has not yet been selected by the Working Group. Other candidates include "g-box", "data space", "graph space", "(data) surface", "(data) layer", "sheet", and "(data) page".
An RDF space is anything that can reasonably be said to explicitly contain zero or more RDF triples and has an identity distinct from the triples it contains. Examples include: an HTML page with embedded RDFa or microdata, a file containing RDF/XML or Turtle data, and a SQL database viewable as RDF using R2RML.
Examples of things that are not spaces:
We define an RDF quad as the 4-tuple (subject, predicate, object, space).
Informally, a quad should be understood as a statement that the RDF triple (subject, predicate, object) is in the space space.
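For example, the quad (<http://example.org/s>, <http://example.org/p>, <http://example.org/o>, <http://example.org/sp>) says that the triple (<http://example.org/s>, <http://example.org/p>, <http://example.org/o>) is in the space denoted by <http://example.org/sp>. (The IRIs here are, of course, arbitrary.)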
We define an RDF quadset as a set containing (zero or more) RDF Quads and (zero or more) RDF Triples. A quadset is thus an extension to the concept of an RDF Graph (a set containing zero or more RDF triples) to also potentially include statements about triples being in particular spaces.
A dataset is defined by SPARQL 1.1 as a structure consisting of exactly one default graph (an RDF graph, which may be empty and which has no name) and zero or more (name, graph) pairs, where each name is an IRI and each graph is an RDF graph.
This definition forms the basis of the SPARQL Query semantics; each query is performed against the information in a specific dataset.
Although the term is sometimes used more loosely, a dataset is a pure mathematical structure, like an RDF Graph or a set of integers, with no identity apart from its contents. Two datasets with the same contents are in fact the same dataset, and one dataset cannot change over time.
The word "default" in the term "default graph" refers to the fact that in SPARQL, this is the graph a server uses to perform a query when the client does not specify which graph to use. The term is not related to the idea of a graph containing default (overridable) information. The role and purpose of the default graph in a dataset varies with application.
SPARQL formally defines a named graph, following [Carroll], to be any of the (name, graph) pairs in a dataset.
In practice, the term is often used to refer to the graph part of those pairs. This is the usage we follow in this document, saying that a graph is a named graph in some dataset if and only if it appears as the graph part of a (name, graph) pair in that dataset. Note that "named graph" is a relation, not a class: we say that something is a named graph of a dataset, not simply that it is a named graph.
The term is also sometimes used to refer to the slot part of the (name, slot) pairs in a graph store. For example, the text of SPARQL 1.1 Update says, "This example copies triples from one named graph to another named graph". For clarity, we avoid calling these "named graphs" and instead call them "named slots" of the graph store.
A quad-equivalent dataset is a dataset with no empty named graphs. A non-quad-equivalent dataset is a dataset in which one or more of its named graphs is empty. Every non-quad-equivalent dataset has a corresponding quad-equivalent dataset formed by removing the (name, graph) pairs where the graph is empty.
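For example, a dataset consisting of a default graph DG and the pairs (n1, G1) and (n2, G2), where G2 is the empty graph, is non-quad-equivalent; its corresponding quad-equivalent dataset consists of DG and the single pair (n1, G1).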
Quadsets and quad-equivalent datasets are isomorphic, and are given identical declarative semantics in . The isomorphism is: each quad (s, p, o, sp) in the quadset corresponds to the triple (s, p, o) occurring in the graph paired with the name sp in the dataset, and each bare triple in the quadset corresponds to a triple in the dataset's default graph.
The phrasing quads in a dataset is thus shorthand for: quads in some quadset which is isomorphic to a given dataset. If the dataset is a non-quad-equivalent dataset, then the isomorphism is to the dataset produced by removing all its empty named graphs.
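As an illustration (anticipating the TriG and N-Quads syntaxes specified later in this document), the dataset written in TriG as:

@prefix : <http://example.org/>.
:a :b :c.
:sp1 { :d :e :f }

is isomorphic to the quadset written in N-Quads as:

<http://example.org/a> <http://example.org/b> <http://example.org/c>.
<http://example.org/d> <http://example.org/e> <http://example.org/f> <http://example.org/sp1>.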
In order to promote interoperability and flexibility in implementation techniques — to allow datasets and quadsets to be used interchangeably — systems which handle datasets SHOULD NOT give significance to empty named graphs.
Can we take a stronger stand against non-quad-equivalent datasets? Maybe we can use the terms "proper" and "improper", or something like that. Improper datasets might also include ones which use the same name in more than one pair. Combining these, like removing empty named graphs, is how you convert an improper dataset to a proper one.
SPARQL 1.1 Update defines a mutable (time-dependent) structure corresponding to a dataset, called a graph store. It is defined as a collection of slots, each holding an RDF graph: one unnamed slot, holding the default graph, and zero or more slots named by IRIs.
A "slot" in this definition is an RDF space.
A dataset can be thought of as the state of a graph store, just like an RDF graph can be thought of as the state of a space.
RDF graphs are usually combined in one of two ways: by union, in which blank nodes shared between the input graphs remain shared in the result, or by merge, in which the blank nodes of each input graph are first renamed apart, so that no blank node in the result comes from more than one input.
This difference is not noticeable when graphs are being expressed in an ordinary RDF syntax, like RDF/XML, RDFa, or Turtle, because those syntaxes provide no mechanism for transmitting two graphs which have a blank node in common. The difference can appear, however, in systems and languages which handle datasets, or in APIs which allow blank nodes to be shared between graphs.
We define a union dataset to be a dataset whose default graph is the union of all its named graphs. Some systems provide special, simplified handling of union datasets.
We define a merge dataset to be a dataset whose default graph is the merge of all its named graphs.
We define the union of quadsets (and thus of datasets) as the set union of their constituent triples and quads; the merge is the same, except that any blank nodes shared between the inputs are first renamed apart.
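As a sketch of the difference, consider a quadset, written here in N-Quads, in which two named graphs share a blank node:

_:b <http://example.org/name> "Alice" <http://example.org/g1>.
_:b <http://example.org/age> "32" <http://example.org/g2>.

The union of the two named graphs keeps the shared node, yielding a graph in which a single node has both properties:

_:b <http://example.org/name> "Alice".
_:b <http://example.org/age> "32".

The merge renames the blank nodes apart first, so the result has two unconnected nodes, losing the information that the name and the age belong to the same entity:

_:b1 <http://example.org/name> "Alice".
_:b2 <http://example.org/age> "32".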
Renaming the graphs in a dataset creates another dataset which differs from the first only in that all the IRIs used as graph names are replaced by fresh "Skolem" IRIs. This replacement occurs in the name slot of the (name, graph) pairs, and in the triples in the default graph, but not in the triples in the named graphs.
Logically, this operation is equivalent to partially un-labeling an RDF Graph (turning some IRIs into blank nodes), then Skolemizing those blank nodes. As an operation, it discards some of the information and adds more true information; it is a sound but not complete reasoning step. It can be made complete by recording the relationship between the old graph names and the new ones, using some vocabulary such as owl:sameAs.
For example, a recording graph_rename operation might take as input:
@prefix : <http://example.com/>.
:g1 { :a :b :c }
:d :e :f.
and produce:
@prefix : <http://example.com/>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
:fe2b9765-ba1d-4644-a335-80a8c3786c8d { :a :b :c }
:d :e :f.
:fe2b9765-ba1d-4644-a335-80a8c3786c8d owl:sameAs :g1.
Given the semantics of datasets, informally described above and formally stated in , and the semantics of OWL, where { ?a owl:sameAs ?b } means that the terms ?a and ?b both denote the same thing, the second dataset above entails the first and includes only additional information that is known to be true. (Slight caveat: the new information is only true if the assumptions of the name-generation function are correct: that the name is previously unused and that this naming agent has the right to claim it.)
A related operation, sequestering the default graph, is to create a new dataset which differs from the first only in that the triples in the default graph of the input appear instead in a new, freshly-named, named graph of the output. Sequestering returns both the new dataset and the name generated for the new graph: sequester(D1) -> (D2, generatedIRI).
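For example, sequestering the dataset

@prefix : <http://example.com/>.
:g1 { :a :b :c }
:d :e :f.

might return the generated IRI :b51c2843-39a2-4b1c-a345-11e0736208cc (the name is, of course, arbitrary and freshly generated) together with the dataset:

@prefix : <http://example.com/>.
:g1 { :a :b :c }
:b51c2843-39a2-4b1c-a345-11e0736208cc { :d :e :f }

whose default graph is now empty.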
Used together, the operations of renaming the graphs, sequestering the default graphs, and then merging datasets constitute an untrusting merge of datasets. This operation provides the functionality required for addressing the use case described in and is illustrated in . It uses quads to address some—perhaps all—of the need for quints or nested graphs.
More precisely:
function untrusted_merge(D1, ..., Dn):
    for i in 1..n:
        RDi = rename_graphs(Di)
        (SRDi, DGNi) = sequester(RDi)
    return (merge(SRD1, ..., SRDn), (DGN1, ..., DGNn))
Here, untrusted_merge returns a single dataset and a list of the names of the graphs (in that dataset) which contain the triples that were in the default graphs, possibly augmented with recording triples. Whether recording is done or not is hidden inside the rename_graphs function, and is application-dependent.
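As a sketch, suppose D1 and D2 are the following datasets (in TriG):

# D1
@prefix : <http://example.com/>.
:g1 { :a :b 1 }
:a :b 2.

# D2
@prefix : <http://example.com/>.
:g1 { :a :b 3 }
:a :b 4.

With recording, and with :n1 through :n4 standing for freshly generated Skolem IRIs (hypothetical names here), untrusted_merge(D1, D2) could return the list (:n2, :n4) together with a dataset like:

@prefix : <http://example.com/>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
:n1 { :a :b 1 }
:n2 { :a :b 2 }
:n2 { :n1 owl:sameAs :g1 }
:n3 { :a :b 3 }
:n4 { :a :b 4 }
:n4 { :n3 owl:sameAs :g1 }

Both inputs used the name :g1, but after renaming their contents are kept apart, and each input's default-graph triples (including the recording triples added by rename_graphs) are held in :n2 and :n4, so every claim remains attributable to its source.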
This section specifies a declarative semantics for quads, quadsets, and datasets, allowing them to be used to express knowledge, especially knowledge about spaces. This makes the languages defined in suitable for conveying knowledge about spaces and providing a foundation for addressing the challenges described in .
@@@ the section needs some revision by someone with a good ear for formal semantics, and probably some references to the old and/or new versions of RDF Semantics.
The fundamental notion of RDF spaces is that they can contain triples. This is formalized with the relation CT(S, T), which is informally understood to hold true for any triple T and space S such that S explicitly contains T.
The basic declarative meaning (that is, the truth condition) of RDF quads is this:
The RDF quad (s, p, o, sp) is true in I if and only if CT(I(sp), triple(s, p, o)).
The declarative meaning of a quadset is to simply read the quadset as a conjunction of its quads and its triples. Given the structural mapping between quadsets and datasets, the truth condition for datasets follows:
The RDF dataset (DG, (N1, G1), ... (Ni, Gi), ... (Nn, Gn)) is true in I if and only if: DG is true in I, and, for each pair (Ni, Gi), CT(I(Ni), T) holds for every triple T in Gi.
Some implications of these truth conditions:
A dataset with no named graphs has the same declarative meaning as its default graph. A quadset with no quads has the same declarative meaning as the RDF graph consisting of the triples in the quadset.
This fits the intuition that datasets and quadsets are extensions of RDF Graphs, and it applies to the syntax as well: a TriG document without any named graphs is syntactically and semantically a Turtle document; an N-Quads document without any quads is syntactically and semantically an N-Triples document.
The empty named graphs in a non-quad-equivalent dataset have no effect on its meaning. Replacing such a dataset with its equivalent without the empty named graphs does not change its meaning.
We say nothing here about the fact that the truth value of a quad is likely to change over time. Time is orthogonal to RDF semantics, and quads present no fundamentally different issue here. When the world changes state, the truth value of RDF triples or quads might change. This occurs when a triple is put in or taken out of a space, but it also occurs with "normal" RDF when, for instance, someone changes their address and different vcard triples about them become true. Some approaches to handling change-over-time are discussed in and .
@@@ explain why we use partial-graph semantics, and how in most applications its bad to drop information, but sometimes it's necessary, and sometimes you only have incomplete information.
This section contains specifications of languages for serializing quad-equivalent datasets. N-Quads documents and TriG documents have identical semantics, since they each serialize the same structure and follow .
Dataset information may also be conveyed and manipulated using SPARQL or using RDF triple-based tools and languages as per .
The syntax of N-Quads is the same as the syntax of N-Triples, except that a fourth term, identifying an RDF space, may optionally be included on each line, after the "object" term.
Formally, the N-Quads grammar is the N-Triples Grammar modified by removing productions [1] and [2], and adding the following productions:
[1q] nquadsDoc  ::= statement? (EOL statement)* EOL?
[2q] statement  ::= subject predicate object space? "."
[3q] space      ::= IRIREF
The grammar symbols EOL, subject, predicate, object, and IRIREF are defined in the N-Triples Grammar.
The following example shows a quadset consisting of two triples and two quads. The quads both use the same triple, but express the fact that it is in two spaces, "space1" and "space2".
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object2>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1> <http://example.org/space1>.
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object1> <http://example.org/space2>.
The syntax of TriG is the same as the syntax of Turtle, except that (name, graph) pairs can be specified by giving an optional GRAPH keyword, a "name" term, and a nested Turtle graph expression in curly braces.
Formally, the TriG grammar is the Turtle Grammar modified by removing productions [1] and [2], and adding the following productions:
[1g] trigDoc        ::= statement*
[2g] statement      ::= directive "." | triples "." | naming | wrappedDefault
[3g] naming         ::= "GRAPH"? spaceName ( "," spaceName )* "{" triples "."? "}"
[4g] spaceName      ::= iri | "DEFAULT"
[5g] wrappedDefault ::= "{" triples "."? "}"
The grammar symbols directive, triples, and iri are defined in the Turtle Grammar.
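For illustration, the following document (a sketch; the data is arbitrary) exercises each of the naming forms allowed by these productions:

@prefix : <http://example.org/>.
GRAPH :s1 { :a :b 1 }
:s2 { :a :b 2 }
:s3, :s4 { :a :b 3 }
DEFAULT { :a :b 4 }
{ :a :b 5 }

The first two forms put a triple into a single named graph, with and without the optional GRAPH keyword; the third puts the same triple into two named graphs at once; the last two put triples into the default graph, by naming it explicitly and by using the wrappedDefault form.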
Parsing a TriG document is like parsing a Turtle document, except that the triples given inside the curly braces of a naming production go into each named graph and/or the default graph as given in its spaceName terms, and the triples of a wrappedDefault production go into the default graph.
Note that the grammar forbids directives between curly braces and empty curly-brace expressions. Also, note that blank node processing is not affected by curly braces, so conceptually blank node identifiers are scoped to the entire document.
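For example, in the following document (a sketch), the blank node label _:x denotes the same node in the default graph and in the named graph, so the dataset records that the very same node is named "Alice" and has a phone number in :s1:

@prefix : <http://example.org/>.
_:x :name "Alice".
:s1 { _:x :phone "+1-555-0100" }

This document-wide scoping is what allows a blank node to be shared between graphs.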
There is no requirement that named graph names be unique in a document, or that triples in the default graph be contiguous. For example, these two TriG documents parse to exactly the same dataset:
# TriG Example 1
@prefix : <http://example.org/>.
:a :b 1.
:s1 { :a :b 10 }
:s2 { :a :b 20 }
:s1 { :a :b 11 }
:s2 { :a :b 21 }
:a :b 2.
# TriG Example 2
@prefix : <http://example.org/>.
:a :b 1,2.
:s1 { :a :b 10,11. }
:s2 { :a :b 20,21. }
The same dataset could be expressed in N-Quads as:
# N-Quads for TriG Examples 1 and 2
<http://example.org/a> <http://example.org/b> "1"^^<http://www.w3.org/2001/XMLSchema#integer>.
<http://example.org/a> <http://example.org/b> "2"^^<http://www.w3.org/2001/XMLSchema#integer>.
<http://example.org/a> <http://example.org/b> "10"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s1>.
<http://example.org/a> <http://example.org/b> "11"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s1>.
<http://example.org/a> <http://example.org/b> "20"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s2>.
<http://example.org/a> <http://example.org/b> "21"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/s2>.
There are several open issues concerning TriG syntax:
@@@ what to say here? What kind of thing might conform or not conform to this spec?
This section presents a design for using spaces in constructing a federated information system. It is intended to help explain and motivate the designs specified in this document.
The example covers the same federated phonebook scenario used in , with each specific use case having an example here.
@@@ An obsolete but complete version was in the May 10 Version.
@@@ uses renaming the graphs.
To keep versions, as required by , we simply copy the old data into a new named graph and record some metadata about it.
In this example, we handle this by defining the following vocabulary (illustrated in the dataset below): transt:Snapshot, the class of events in which the contents of one space are copied into another; transt:source, relating a snapshot to the space whose contents were copied; transt:result, relating a snapshot to the space holding the copy; and transt:starts and transt:ends, giving the bounds of the transaction-time interval during which the result matched the source.
If Marvin, rather absurdly, changes his email address every day to include the date, we might have a dataset like this:
@prefix transt: <http://example.org/ns/transaction-time>.
@prefix hq: <http://example.org/ns/phonebook>.
@prefix v: <http://www.w3.org/2006/vcard/ns#>.
@prefix xs: <http://www.w3.org/2001/XMLSchema#>.
@prefix : <>.

:g32201 {
    #... various data, then:
    [] a v:VCard;
       v:fn "Marvin Mover";
       v:email "marvin-0101@example.org".
    #... more data from other people
}
[] a transt:Snapshot;
   transt:source <http://div14.example.org/phonefeed>;
   transt:result :g32201;
   transt:starts "2012-01-01T00:00:00"^^xs:dateTime;
   transt:ends "2012-01-02T00:00:00"^^xs:dateTime.

:g32202 {
    #... various data, then:
    [] a v:VCard;
       v:fn "Marvin Mover";
       v:email "marvin-0102@example.org".
    #... more data from other people
}
[] a transt:Snapshot;
   transt:source <http://div14.example.org/phonefeed>;
   transt:result :g32202;
   transt:starts "2012-01-02T00:00:00"^^xs:dateTime;
   transt:ends "2012-01-03T00:00:00"^^xs:dateTime.

# the current data
<http://div14.example.org/phonefeed> {
    #... various data, then:
    [] a v:VCard;
       v:fn "Marvin Mover";
       v:email "marvin-0103@example.org".
    #... more data from other people
}
@@@ or should we put the data directly into a genid graph, so that metadata about it is less likely to change or be wrong...? On the other hand, there's ALSO some nice potential for metadata about the feed space.
The challenge expressed in is to segregate some of the triples, marking them as being in-effect only at certain times. The study of how to do this is part of the field of temporal databases.
In this example, we handle this by defining the following vocabulary: vt:starts and vt:ends, which give the bounds of the time range during which the triples in the subject space are in effect.
This "valid-time" vocabulary allows a data publisher to express a time range during which the triples in some space are considered valid. This acts like a time-dependent version of owl:import, where the import is only made during the given time range.
In general, these two predicates need to be used together, providing both vt:starts and vt:ends values for a space. In this case, { ?sp vt:starts ?t1; vt:ends ?t2 } claims that all the triples in ?sp are in effect for all points in time t such that t1 <= t < t2. A consumer who only knows one of the two times is unable to make use of the data; there are no default values.
These predicates say nothing about the validity (or "truth") of the triples in ?sp outside of the valid-time range. Each of the triples might or might not hold outside of the range — these vt triples simply make no claim about them.
Given this definition, it is almost trivial for Division 5 to share their "before" and "after" phonebooks:
@prefix vt: <http://example.org/ns/valid-time>.
@prefix hq: <http://example.org/ns/phonebook>.
@prefix xs: <http://www.w3.org/2001/XMLSchema#>.
@prefix : <>.

:pre-move {
    # all the pre-move data ...
}
:post-move {
    # all the post-move data ...
}
:pre-move vt:starts "2010-01-01T00:00:00"^^xs:dateTime;
          vt:ends "2012-07-12T00:00:00"^^xs:dateTime.
:post-move vt:starts "2012-07-12T00:00:00"^^xs:dateTime;
           vt:ends "2020-01-01T00:00:00"^^xs:dateTime.
This design requires every client to be modified to understand and use the valid-time vocabulary. There may be designs that do not require this.
This section is experimental.
This section specifies a mechanism and an RDF vocabulary for conveying quads/datasets using ordinary RDF Graphs instead of special syntaxes and/or interfaces. The mechanism is somewhat similar to reflection or reification. The idea is to express each quad as a set of triples in a specialized vocabulary.
Folding allows quads and thus datasets to be conveyed and manipulated using normal triple-based RDF machinery, including RDF/XML, Turtle, and RDFa, but at the cost of some complexity, storage space, and performance. In general, in systems where languages or APIs are available which directly support datasets, folding is neither required nor useful.
As an example, the dataset
@prefix : <http://example.org/>.
:space { :subject :predicate :object }

would fold to these triples:
@prefix : <http://example.org/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
:space rdf:containsTriple [
    a rdf:Triple;
    rdf:subjectIRI "http://example.org/subject";
    rdf:predicateIRI "http://example.org/predicate";
    rdf:objectIRI "http://example.org/object"
].
The terms in the triple are encoded (turned into literal strings, in this example) to provide referential opacity. In the semantics of quads, it does not follow from (a, b, c, d) and a=aa that (aa, b, c, d). Without this encoding of terms as strings, that conclusion would erroneously follow from the folded quad.
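For instance, if the folded triples above were combined with the triple { :subject owl:sameAs :subject2 } (a hypothetical synonym), nothing would follow about :subject2 being in :space: the value of rdf:subjectIRI is the string "http://example.org/subject", which denotes a string rather than the resource, so the owl:sameAs substitution never reaches it.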
Terms in this vocabulary: rdf:containsTriple, which relates a space to a folded triple it contains; rdf:Triple, the class of folded triples; rdf:subjectIRI and rdf:subjectNode, which give the subject of the folded triple as a string-encoded IRI or as a blank node; rdf:predicateIRI, which gives the predicate as a string-encoded IRI; and rdf:objectIRI, rdf:objectNode, and rdf:objectValue, which give the object as a string-encoded IRI, a blank node, or a literal value.
This vocabulary is used in a specific template form, always matching this SPARQL graph pattern:
?sp rdf:containsTriple [
    a rdf:Triple;
    rdf:subjectIRI|rdf:subjectNode ?s;
    rdf:predicateIRI ?p;
    rdf:objectIRI|rdf:objectNode|rdf:objectValue ?o
]
This one template uses SPARQL 1.1 property paths, with alternation using the "|" character. It could also be expressed as six different SPARQL 1.0 (non-property-path) graph patterns.
The terms in this vocabulary only have fully-defined meaning when they occur in the template pattern. When they do, the set of triples matching the template has the same meaning as the quad (?s, ?p, ?o, ?sp).
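For example, a consumer could recover the quads held in a folded graph with a query built around the template (a sketch using this section's experimental vocabulary):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?sp ?s ?p ?o
WHERE {
    ?sp rdf:containsTriple [
        a rdf:Triple;
        rdf:subjectIRI|rdf:subjectNode ?s;
        rdf:predicateIRI ?p;
        rdf:objectIRI|rdf:objectNode|rdf:objectValue ?o
    ]
}

Each solution row corresponds to one quad, though note that when ?s or ?o is bound via rdf:subjectIRI or rdf:objectIRI, the binding is the string encoding of the IRI, not the IRI itself.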
Folding a dataset is the act of completely conveying the facts in a dataset in RDF triples, using this vocabulary. The procedure is: (1) check for occurrences of the fold template in the default graph; if any occur, abort, since folding is not defined for this dataset; (2) copy the triples in the default graph of the input to the output; (3) for each quad in the input, generate a matching instance of the fold template and put the resulting five triples in the output.
Unfolding a dataset is the act of turning an RDF graph into a dataset, using this vocabulary. The procedure is: (1) make a mutable copy of the input graph, (2) for each match of the fold template, add the resulting quad to the output dataset and delete the five triples which matched the template, (3) copy the remaining triples to the output as the default graph of the dataset.
The fold and unfold functions are inverses of each other. That is, for all datasets D on which fold is defined, D = unfold(fold(D)), and for all graphs G, G = fold(unfold(G)).
The functions cannot be composed with themselves (called recursively), since for each of them the domain and range are disjoint. If we were to implicitly convert graphs to datasets (with the graph as the default graph), then fold(fold(D)) would either be an error (if D had any named graphs) or be the same as fold(D). If we were to define unfold2 as an unfold operating on datasets using their default graphs, unfold2(D) = union(D, unfold(default_graph(D))), then unfold2 would be idempotent: unfold2(D) = unfold2(unfold2(D)).
@@@ tbd