RDF Named Graphs: Use Cases

This document explores some of the motivations for having Named Graphs and Datasets in RDF.

Editor's Draft Status

This text came from RDF Spaces and Datasets (15 May 2012) and is now being updated here.

Introduction

The Resource Description Framework (RDF) provides a simple declarative way to store and transmit information. It also provides a trivial but effective way to combine information from multiple sources: graph merging. This allows information from different people, different organizations, different units within an organization, different servers, different algorithms, etc., to all be combined and used together, without any special processing or understanding of the relationships among the providers.

For some applications, the basic RDF merge operation is overly simplistic, as extra processing and an understanding of the relationships among the providers may be useful. This document enumerates some of these applications.

Use Cases

Each of these use cases is initially described in terms of the following scenario. Details of how each use case might be addressed are in the Detailed Example section.

The Example Foundation is a large organization with more than ten thousand employees and volunteers, spread out over five continents. It has branches in 25 different countries, and those divisions have considerable autonomy; they are only loosely controlled by the parent organization (called "headquarters" or "HQ") in Geneva.

HQ wants to help the divisions work together better. It decides a first step is to provide a simple but complete directory of all the Example personnel. Until now, each division has maintained its own directory, using its own technology. HQ wants to gather them all together, building a federated phonebook. They want to be able to find someone's phone number, mailing address, and job title, knowing only their name or email address. Later, they hope to extend the system to allow finding people based on their areas of interest and expertise.

HQ understands that people will want access to the phonebook in many different computing environments and with different languages, social norms, and application styles. Users are going to want at least one Web-based user interface (UI), but they will also want mobile UIs for different platforms, desktop UIs for different platforms, and even the ability to look up information via text messaging. HQ does not have the resources to build all of these, so they intend to provide direct access to the data so that the divisions can build UIs themselves as needed.

The first section below describes a minimal, baseline approach. Each section after that describes a new system requirement, a new thing the users in this scenario want the federated phonebook to do.

Baseline Solution (Just Triples)

As a starting point, HQ needs to gather data from each division and re-publish it, in one place, for use by the different UIs.

This is a general use case for RDF, with no specific need for using named graphs. It simply involves divisions publishing RDF data on the web (with some common vocabulary and with access control), then HQ merging it and putting it on their website (with access control).
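
As a minimal sketch of this baseline, suppose two divisions publish entries like the ones below and HQ merges them into one graph. The people, addresses, and the split into two feeds are invented for illustration; only the vCard namespace is taken from the examples later in this document.

```trig
@prefix v: <http://www.w3.org/2006/vcard/ns#>.

# Entry taken from one division's feed (hypothetical data)
[] a v:VCard ;
   v:fn "Marvin Mover" ;
   v:email "marvin@example.org" .

# Entry taken from another division's feed (hypothetical data)
[] a v:VCard ;
   v:fn "Alice Example" ;
   v:email "alice@example.org" .
```

The merged graph is simply the union of the two sets of triples, with blank nodes from different feeds kept distinct. Nothing in it records which division supplied which entry, and that gap is what the use cases below build on.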

For an example of how this baseline could be implemented, see Showing Triples (v1).

Showing Provenance

A user says: I'm looking at an incorrect phonebook entry. It has the name of the person I'm looking for, but it's missing most of the record. I can't even tell which division the person works for. I need to know who is responsible for this information, so I can get it corrected.

While this might be addressed by including a "report-errors-to" field in each phonebook entry, HQ is looking ahead to the day when other information is in the phonebook, such as which projects the person has worked on, which might come from a variety of other sources, possibly other divisions.

For a discussion of how this use case could be addressed, see Showing Web Provenance (v2).

Maintaining Derived Data

It turns out different divisions are using somewhat different vocabularies for publishing their data. HQ writes a program to translate, but they need the output of that program to be correctly attributed, in case it turns out to be wrong.

This use case motivates sharing of blank nodes between named graphs, as seen in the example.

For a discussion of how this use case could be addressed, see Showing Process Provenance (v3).

Distributed Harvesting

It turns out some divisions do not have centralized phonebooks. Division 3 has twelve different departments, each with its own phonebook. Division 3 can do the harvesting from its departments, but it does not want to be in the loop for corrections; it wants those to go straight back to the relevant department.

For a discussion of how this use case could be addressed, see Showing Reported Provenance (v4).

Loading Untrusted Datasets

A user reports: There's information here that says it's from our department, but it's not. Somehow your provenance information is wrong. We need to see the provenance of the provenance!

For a discussion of how this use case could be addressed, see Showing Untrusted Quads (v5).

Showing Revision History

Division 14's legal department says: "We're doing an investigation and we need to be able to connect people's names and phone numbers as they used to be. Can you include archival data in the data feed, so we can search the phonebook as it was on each day of September, last year?"

For a discussion of how this use case could be addressed, see Showing Change History (v6).

Expressing Past or Future States

Division 5 says: "We're planning a major move in three months, to a neighboring city. Everybody's office and phone number will have to change. Can we start putting that information in the phonebook now, but mark it as not effective until 20 July? After the move, we'll also need to see the old (no-longer-in-effect) data for a while, until we get everything straightened out."

This use case, contrasted with the previous one, shows the difference between Transaction Time and Valid Time in bitemporal databases. After Division 5's move, the "old" phone numbers are not just the old state of the database; they reflect the old state of the world. It is possible that some time after the move, an error in the pre-move data might need to be corrected. This would require a new transaction time, even though the valid-time has already ended.
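
The distinction can be sketched by attaching both kinds of time to the same graph, borrowing the transt: and vt: vocabularies used in the Detailed Example. This is only an illustration under stated assumptions: the graph name, the dates, and the idea that a still-current snapshot simply omits transt:ends are not taken from any agreed design.

```trig
@prefix transt: <http://example.org/ns/transaction-time#>.
@prefix vt:     <http://example.org/ns/valid-time#>.
@prefix xs:     <http://www.w3.org/2001/XMLSchema#>.
@prefix :       <>.

# Pre-move data, as corrected some time after the move (contents elided)
:pre-move-corrected {
   # ... corrected pre-move entries
}

# Valid time: when the data was true in the world (this range has already ended)
:pre-move-corrected vt:starts "2010-01-01T00:00:00"^^xs:dateTime ;
                    vt:ends   "2012-07-20T00:00:00"^^xs:dateTime .

# Transaction time: when this version entered the phonebook (after the move)
[] a transt:Snapshot ;
   transt:result :pre-move-corrected ;
   transt:starts "2012-08-01T00:00:00"^^xs:dateTime .
```

Here the valid-time range describes the world before the move, while the transaction time records that the correction was entered afterwards.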

Use case sightings:

Temporal Scope for RDF Triples, Jeni Tennison's report of attempting to solve this problem in UK Government data.
Vocab terms for owner, validFrom and validUntil, Manu Sporny reports PaySwarm wants to record ownership information for particular time ranges.

For a discussion of how this use case could be addressed, see Showing Past and Future States (v7).

Detailed Example

This section presents a design for using named graphs in constructing a federated information system. It is intended to help explain and help motivate the designs @@@.

The example covers the same federated phonebook scenario used in Use Cases, with each specific use case having an example here.

@@@ An obsolete but complete version was in the May 10 Version.

Showing Triples (v1)

@@@ Shows the baseline in Baseline Solution (Just Triples).

Showing Web Provenance (v2)

@@@ Shows how to address Showing Provenance.
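
One possible sketch, not necessarily the design intended for this section, is for HQ to keep the triples fetched from each feed in a graph named by the URL they were fetched from. The Division 14 feed URL appears elsewhere in this document; the Division 5 URL and all entry data are assumptions made for illustration.

```trig
@prefix v: <http://www.w3.org/2006/vcard/ns#>.

# Triples fetched from Division 14's feed, kept in a graph named by the feed URL
<http://div14.example.org/phonefeed> {
   [] a v:VCard ;
      v:fn "Marvin Mover" ;
      v:email "marvin@example.org" .
}

# Triples fetched from a hypothetical Division 5 feed
<http://div5.example.org/phonefeed> {
   [] a v:VCard ;
      v:fn "Alice Example" ;
      v:email "alice@example.org" .
}
```

A UI that shows an incomplete entry can then report the name of the graph containing it, which tells the user where the data was fetched from and therefore whom to ask for a correction.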

Showing Process Provenance (v3)

@@@ Shows how to address Maintaining Derived Data.
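
As a rough sketch of derived data with attribution, HQ's translator could write its output into a separate graph and describe that graph in the default graph. The div7: vocabulary, the Division 7 feed URL, the translator URL, and the hq:derivedFrom and hq:producedBy properties are all hypothetical names introduced here for illustration.

```trig
@prefix v:    <http://www.w3.org/2006/vcard/ns#>.
@prefix hq:   <http://example.org/ns/phonebook#>.
@prefix div7: <http://div7.example.org/ns#>.
@prefix :     <>.

# Data as published by a division, in its own (hypothetical) vocabulary
<http://div7.example.org/phonefeed> {
   _:p1 div7:fullName "Marvin Mover" .
}

# Output of HQ's translation program, kept in a separate graph.
# The label _:p1 denotes the same blank node as in the source graph.
:translated-div7 {
   _:p1 a v:VCard ;
        v:fn "Marvin Mover" .
}

# Attribution for the derived graph (hypothetical hq: properties)
:translated-div7 hq:derivedFrom <http://div7.example.org/phonefeed> ;
                 hq:producedBy  <http://hq.example.org/tools/translator> .
```

Because the person is identified only by a blank node, the translated triples can be tied to the original entry only if that blank node is shared between the two graphs. If the translation turns out to be wrong, the error is attributable to the translator rather than to the division.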

Showing Reported Provenance (v4)

@@@ Shows how to address Distributed Harvesting.

Showing Untrusted Quads (v5)

@@@ Shows how to address Loading Untrusted Datasets.

@@@ uses renaming the graphs.
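
A minimal sketch of what renaming might look like: when HQ loads a dataset it does not fully trust, it stores each graph under a name HQ mints itself and records the name the publisher claimed as ordinary data. The graph name :g40017, the URLs, and the hq:loadedFrom and hq:claimedName properties are assumptions made for this illustration.

```trig
@prefix hq: <http://example.org/ns/phonebook#>.
@prefix v:  <http://www.w3.org/2006/vcard/ns#>.
@prefix :   <>.

# A graph from the untrusted dataset, stored under an HQ-minted name
:g40017 {
   [] a v:VCard ;
      v:fn "Marvin Mover" ;
      v:email "marvin@example.org" .
}

# HQ's own record of what was loaded (hypothetical hq: properties)
:g40017 hq:loadedFrom  <http://div3.example.org/dataset> ;
        hq:claimedName <http://dept2.div3.example.org/phonefeed> .
```

With this arrangement, the statement that the triples came from a particular department is visibly a claim made by the dataset at the hq:loadedFrom address, not something HQ asserts itself.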

Showing Change History (v6)

To keep versions, as required by the Showing Revision History use case, we simply copy the old data into a new named graph and record some metadata about it.

In this example, we handle this by defining the following vocabulary:

@@@ tdb can we define each property separately with any sense, or just the block, together?

If Marvin, rather absurdly, changes his email address every day to include the date, we might have a dataset like this:

@prefix transt: <http://example.org/ns/transaction-time#>.
@prefix hq: <http://example.org/ns/phonebook#>.
@prefix v:  <http://www.w3.org/2006/vcard/ns#>.
@prefix xs: <http://www.w3.org/2001/XMLSchema#>.
@prefix : <>.

:g32201 {  
   #... various data, then:
   [] a v:VCard ;
      v:fn "Marvin Mover" ;
      v:email "marvin-0101@example.org". 
   #... more data from other people
}
[] a transt:Snapshot;
   transt:source <http://div14.example.org/phonefeed>;
   transt:result :g32201;
   transt:starts "2012-01-01T00:00:00"^^xs:dateTime;
   transt:ends "2012-01-02T00:00:00"^^xs:dateTime.

:g32202 {  
   #... various data, then:
   [] a v:VCard ;
      v:fn "Marvin Mover" ;
      v:email "marvin-0102@example.org". 
   #... more data from other people
}
[] a transt:Snapshot;
   transt:source <http://div14.example.org/phonefeed>;
   transt:result :g32202;
   transt:starts "2012-01-02T00:00:00"^^xs:dateTime;
   transt:ends "2012-01-03T00:00:00"^^xs:dateTime.

# the current data
<http://div14.example.org/phonefeed> {
   #... various data, then:
   [] a v:VCard ;
      v:fn "Marvin Mover" ;
      v:email "marvin-0103@example.org". 
   #... more data from other people
}

@@@ or should we put the data directly into a genid graph, so that metadata about it is less likely to change or be wrong...? On the other hand, there's ALSO some nice potential for metadata about the feed space.

Showing Past and Future States (v7)

The challenge expressed in the Expressing Past or Future States use case is to segregate some of the triples, marking them as being in effect only at certain times. The study of how to do this is part of the field of temporal databases.

In this example, we handle this by defining the following vocabulary:

This "valid-time" vocabulary allows a data publisher to express a time range during which the triples in some space are considered valid. This acts like a time-dependent version of owl:imports, where the import is only made during the given time range.

(rdf:space Sp) vt:starts (xs:dateTime T1): Claims that all the triples in Sp are valid starting at T1, ending at some unspecified time.
(rdf:space Sp) vt:ends (xs:dateTime T2): Claims that all the triples in Sp are valid until just before T2, starting at some unspecified time.

In general, these two predicates need to be used together, providing both vt:starts and vt:ends values for a space. In this case, { ?sp vt:starts ?t1; vt:ends ?t2 } claims that all the triples in ?sp are in effect for all points in time t such that t1 <= t < t2. A consumer who only knows one of the two times is unable to make use of the data; there are no default values.

These predicates say nothing about the validity (or "truth") of the triples in Sp outside of the valid-time range. Each of the triples might or might not hold outside of the range — these vt triples simply make no claim about them.

Given this definition, it is almost trivial for Division 5 to share their "before" and "after" phonebooks:

@prefix vt: <http://example.org/ns/valid-time#>.
@prefix hq: <http://example.org/ns/phonebook#>.
@prefix xs: <http://www.w3.org/2001/XMLSchema#>.
@prefix : <>.

:pre-move {  
    # all the pre-move data  
    ...
}
:post-move {
    # all the post-move data  
    ...
}

:pre-move  vt:starts "2010-01-01T00:00:00"^^xs:dateTime;
           vt:ends   "2012-07-20T00:00:00"^^xs:dateTime.
:post-move vt:starts "2012-07-20T00:00:00"^^xs:dateTime;
           vt:ends   "2020-01-01T00:00:00"^^xs:dateTime.

This design requires every client to be modified to understand and use the valid-time vocabulary. There may be designs that do not require this.

@@@ tbd

Changes

2012-08-20: Started fresh Use Cases document, using some text from RDF Spaces and Datasets.