This document describes the use of existing mechanisms for accessing and querying provenance data about resources on the web.

Introduction

@@TODO Introductory text

Concepts

In defining the specification below, we make use of the following concepts.

Provenance information
refers to provenance represented in some fashion.
Provenance-URI
A URI denoting some provenance information.
Target
an entity about which one wants to know the provenance.
Target-URI
a URI denoting a target, which allows the target to be found in some provenance information.
Provenance service
a service that provides a Provenance-URI or provenance information given a Target-URI.
Service-URI
the URI of a Provenance Service.
Resource
a web resource. A resource may be associated with multiple targets.

A key notion within these concepts is that a resource may not be the same as a target. Within provenance information, the provenance for a resource may be described with respect to a restricted view of that resource (e.g. the resource at a particular time). This restricted view is termed a target and a Target-URI allows one to locate that target within provenance information. Therefore, the Target-URI connects this restricted view of a resource with the resource itself.

Are we treading too much on the model territory here? How can we explain that only a target identifies an Entity in the provenance model?

Accessing provenance information

A general expectation is that web applications may access provenance information in the same way as any web resource, by dereferencing its URI. Typically, this will be by performing an HTTP GET operation. Thus, any provenance information may be associated with a URI, and may be accessed by dereferencing that URI using normal web mechanisms.

This specification thus RECOMMENDS that if a publisher wishes to make provenance information available, it is published as a normal web resource, and provision is made for the Provenance-URI to be discoverable using one or more of the mechanisms described later in .

This presumption of using web retrieval to access provenance information does not preclude use of other mechanisms. In particular, alternative mechanisms may be needed if there is no URI associated with some particular provenance information. A possible mechanism is suggested in .

Locating provenance information

On the presumption that provenance information is a resource that can be accessed using normal web retrieval, one needs to know a Provenance-URI to dereference. The Provenance-URI may be known in advance, in which case there is nothing more to specify. If a Provenance-URI is not known, then a mechanism to discover one must be based on some information that is available to the would-be accessor. We also wish to allow that provenance information could be provided by parties other than the provider of the original resource. Indeed, provenance information for a resource may be provided by several different parties, at different URIs, each with different concerns. It is quite possible that different parties may provide contradictory provenance information.

Once provenance information is retrieved, one needs to know how to identify the view of that resource within that provenance information. This view is known as the target and is identified by a Target-URI.

We start by considering mechanisms for the resource provider to indicate a Provenance-URI along with a Target-URI. Because the resource provider controls the response when the resource is accessed, direct indication of these URIs is possible. Three mechanisms are described here: one for a resource accessed by HTTP, one for a resource presented as HTML, and one for a resource presented as RDF.

These particular cases are selected as corresponding to primary current web protocol and data formats. Finally, in , we discuss the case of a resource in an unspecified format which has been provided by some means other than HTTP.

Resource accessed by HTTP

For a document accessible using HTTP, POWDER [[POWDER-DR]] describes a mechanism for associating metadata with a resource using an HTTP Link header field. The Link header field is included in the HTTP response to a GET or HEAD operation (other HTTP operations are not excluded, but are not considered here). Since the POWDER specification was published, the HTTP linking draft has been approved by the IETF as RFC 5988 [[LINK-REL]].

The same basic mechanism can be used for referencing provenance information, for which two new link relation types are registered according to the template in :

  Link: <provenance-URI>; rel="provenance"
  Link: <target-URI>; rel="target"

When used in conjunction with an HTTP success response code (2xx), this HTTP header indicates that provenance-URI is the URI of some provenance for the requested resource and that resource's associated target is identified by the target-URI.

If no target link is provided then the target-URI is assumed to be the URI of the resource. It is RECOMMENDED that this only be done when the resource is static.

At this time, the meaning of these links returned with other HTTP response codes is not defined: future revisions of this specification may define interpretations for these.

An HTTP response MAY include multiple provenance link headers, indicating a number of different resources that are known to the responding server, each providing provenance about the accessed resource. Likewise, an HTTP response MAY include multiple target link headers, indicating that the resource may be identified within provenance information using any of these target-URIs.
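As a rough illustration of how a client might read these headers, the following Python sketch groups Link header targets by relation type and falls back to the resource URI when no target link is present. The function names are hypothetical, the parser handles only the common `<URI>; rel="..."` form rather than the full RFC 5988 grammar, and the caller is assumed to have already checked for a 2xx response.

```python
import re

# One link-value: <URI> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^\s;,=]+(?:\s*=\s*(?:"[^"]*"|[^\s;,"]*))?)*)'
)
_PARAM_RE = re.compile(r';\s*([^\s;,=]+)(?:\s*=\s*(?:"([^"]*)"|([^\s;,"]*)))?')


def links_by_relation(link_headers):
    """Group link target URIs by relation type.

    link_headers is a list of Link header field values; one value may
    carry several comma-separated links.
    """
    found = {}
    for header in link_headers:
        for match in _LINK_RE.finditer(header):
            uri, params = match.group(1), match.group(2)
            for name, quoted, bare in _PARAM_RE.findall(params):
                if name.lower() == "rel":
                    # A rel value may list several space-separated types.
                    for rel in (quoted or bare).split():
                        found.setdefault(rel.lower(), []).append(uri)
    return found


def provenance_and_targets(resource_uri, link_headers):
    """Return (provenance-URIs, target-URIs) for a successful response."""
    links = links_by_relation(link_headers)
    provenance_uris = links.get("provenance", [])
    # With no target link, the target-URI defaults to the resource URI.
    target_uris = links.get("target", [resource_uri])
    return provenance_uris, target_uris
```

Returning lists rather than single values reflects that both relation types may appear more than once in a response.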

The presence of a provenance link in an HTTP response does not preclude the possibility that other publishers may offer provenance information about the same resource. In such cases, discovery of the additional provenance information must use other means (e.g. see ).

Are the provenance resources indicated in this way to be considered authoritative? I.e. if the client trusts information returned by the server (e.g. is prepared to act on inferences based on the returned data), should it also trust the provenance data, or should trust in the linked provenance data be determined separately? If the linked data is to be trusted, then the data from multiple linked provenance resources MUST be consistent if it is to be meaningful. I favour an approach whereby trust in the provenance resources is established independently, which is similar to the situation for any other resource; e.g. based on the domain that serves it, or an associated digital signature.

Resource presented as HTML

Addresses ISSUE 46 with target link-relation.

For a document presented as HTML or XHTML, without regard for how it has been obtained, POWDER [[POWDER-DR]] describes a mechanism for associating metadata with a resource by adding a <link> element to the HTML <head> section.

The same basic mechanism can be used for referencing provenance information, for which two new link relation types are registered according to the template in :

  <html xmlns="http://www.w3.org/1999/xhtml">
     <head>
        <meta name="wdr.issuedby" content="http://authority.example.org/company.rdf#me"/>
        <link rel="provenance" href="provenance-URI"/>
        <link rel="target" href="target-URI"/>
        <title>Welcome to example.com</title>
     </head>
     <body>
        ...
     </body>
  </html>
            
The provenance link element gives the Provenance-URI for the document. The target link element gives the target-URI, which identifies the presented document view and is used within the provenance information when referring to this document.

An HTML document header MAY include multiple provenance link elements, indicating a number of different resources that are known to the creator of the document, each providing provenance about the document.

Likewise, the header MAY include multiple target link elements indicating that the document can be identified in the provenance information with multiple target-URIs.

If no target link element is provided then the target-URI is assumed to be the URI of the document. It is RECOMMENDED that this only be done when the document is static.
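A minimal Python sketch of how a client might extract these link elements from a document is shown here. It uses the standard library HTML parser, the class and function names are hypothetical, and it looks at every link element without checking that it appears inside the head section.

```python
from html.parser import HTMLParser


class ProvenanceLinkParser(HTMLParser):
    """Collect href values of provenance and target link elements."""

    def __init__(self):
        super().__init__()
        self.links = {"provenance": [], "target": []}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closed elements such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # A rel attribute may hold several space-separated relation types.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in self.links and href:
                self.links[rel].append(href)


def html_provenance_links(html_text, document_uri):
    """Return {"provenance": [...], "target": [...]} for an HTML document."""
    parser = ProvenanceLinkParser()
    parser.feed(html_text)
    parser.close()
    links = parser.links
    if not links["target"]:
        # No target link element: fall back to the URI of the document.
        links["target"] = [document_uri]
    return links
```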

Specifying Provenance Services

This is a new proposal. It needs to be checked as to whether it is useful. GK/PG to review nature of provenance-service-URI.

The document creator may specify that the provenance information about the document is provided by a provenance service. This is done through the use of a third link relation type following the same pattern as above:

  <html xmlns="http://www.w3.org/1999/xhtml">
     <head>
        <meta name="wdr.issuedby" content="http://authority.example.org/company.rdf#me"/>
        <link rel="provenance-service" href="provenance-service-URI"/>
        <link rel="target" href="target-URI"/>
        <title>Welcome to example.com</title>
     </head>
     <body>
        ...
     </body>
  </html>
              

The provenance-service link element identifies the service URI. Dereferencing this URI yields a service description that provides further information to enable a client to determine a Provenance-URI for a target; see for more details. There may be multiple provenance-service link elements, and these MAY appear in the same document as target and provenance link elements (though, in practice, there may be little point in providing both provenance and provenance-service links).

Is this next paragraph useful? It seems out of place here: I'm not sure what it is intended to clarify. #g

See in particular Appendix A, "Notes on Using the Link Header with the HTML4 Format", of RFC 5988 for further notes about using link relation types in HTML.

An alternative option would be to use an HTML <meta> element to present provenance links. The <link> element is preferred as it reflects more closely the intended goal, and has been defined with somewhat consistent applicability across HTTP, HTML and potentially RDF data. A specification to use <meta> for this would miss this opportunity to build on the existing specification and registry.

The POWDER specification also adds: "Documents MAY also include any of the attribution data from the POWDER document in meta tags. In particular, the issuedby field is likely to be useful to user agents deciding whether or not to fetch the full POWDER document. Any attribution data encoded in meta tags within an HTML document should be the same as that in the POWDER document. In case of discrepancy, the POWDER document should be taken as more authoritative." Is there a parallel we should add here for provenance? I'm not seeing any compelling case for this.

Resource presented as RDF

Addresses ISSUE 46 ???.

If a resource is presented as RDF (in any of its recognized syntaxes, including RDFa), it may contain references to its own provenance using additional RDF statements.

For this purpose a new RDF property, prov:hasProvenance, is defined as a relation between two resources, where the object of the property is a resource that provides provenance information about the subject resource. Multiple prov:hasProvenance assertions may be made about a subject resource.

Another new RDF property, prov:hasTarget, is defined to allow the RDF content to specify one or more target-URIs of the RDF document for the purpose of provenance information (similar to the use of the target link relation in HTML).

@@TODO: needs to be completed.

@@TODO: example

@@TODO: document namespace. Check naming style. Use provenance model namespace? Define as part of model?

Arbitrary data

We have so far decided not to try and define a common mechanism for arbitrary data, because it's not clear to us what the correct choice would be. Is this a reasonable position, or is there a real need for a generic solution for provenance discovery for arbitrary, non-web-accessible data objects?

If a resource is presented using a data format other than HTML or RDF, and no URI for the resource is known, provenance discovery becomes trickier to achieve. This specification does not define a specific mechanism for such arbitrary resources, but this section discusses some of the options that might be considered.

For formats which have provision for including metadata within the file (e.g. JPEG images, PDF documents, etc.), use the format-specific metadata to include a Target-URI and/or Provenance-URI.

Use a generic packaging format that can combine an arbitrary data file with a separate metadata file in a known format, such as RDF. At this time, it is not clear what format that should be, but some possible candidates are:

Provenance discovery services

Propose simple HTTP interface for discovery. cf ISSUE 53. This should be properly RESTful, per http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven. Have I properly interpreted the principles indicated here?

An earlier proposal was a simple HTTP-based service, with URI query parameters used in the construction of a specific URI for provenance information, and did not depend on a service description resource. Early but limited feedback suggested that an approach that does not prescribe URI formats (in line with REST style) would be preferable. Do other reviewers agree?

This section describes a REST API [[REST-APIs]] for a provenance discovery and retrieval service, which can be implemented independently of the original resource delivery channels (e.g. by a third party service).

@@TODO: Review functions provided; review terminology use and property naming; review structure - move different format examples to a separate area?; review correspondence between JSON and RDF-based formats (use JRON?);

Using the provenance discovery API

This section describes general procedures for using the provenance discovery service API. Later sections describe the resources presented by the API, and their representations using RDF, Turtle and JSON. Normal HTTP content negotiation mechanisms may be used to retrieve representation formats convenient for the client application.

Do we need/want both of the following cases?

Retrieve Provenance-URIs for a resource

To use the provenance discovery service to retrieve a list of provenance-URIs for a resource, starting with the discovery service URI (service-URI) and the URI of the target resource (target-URI):

  1. Dereference service-URI to obtain a representation of the service description in one of the formats described below.
  2. Extract the provenance location template from the service description.
  3. Use the provenance location template with target-URI for template variable uri to form provenance-location-URI.
  4. Dereference provenance-location-URI to obtain a provenance locations resource in one of the formats described below.

Any or all of the URIs in the returned provenance locations may be used to retrieve provenance information, per .
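The four steps above can be sketched in Python as follows. This is a minimal sketch, not a normative client. It assumes the JSON representation and the key names location_template and provenance used in the JSON examples in this document, and it handles only the simple {uri} template expression. The function names and the injectable fetch_json parameter are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def http_get_json(uri):
    """Dereference a URI, asking for JSON through content negotiation."""
    request = Request(uri, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def discover_provenance_uris(service_uri, target_uri, fetch_json=http_get_json):
    """Return the list of Provenance-URIs a service knows for target_uri."""
    # Step 1: dereference the service-URI to get the service description.
    description = fetch_json(service_uri)
    # Step 2: extract the provenance location template.
    template = description["location_template"]
    # Step 3: expand the template, percent-encoding target-URI as "uri".
    location_uri = template.replace("{uri}", quote(target_uri, safe=""))
    # Step 4: dereference the result to get the provenance locations.
    locations = fetch_json(location_uri)
    return locations["provenance"]
```

Passing the retrieval function as a parameter keeps the procedure independent of the transport, so the same steps can be exercised against canned responses.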

Retrieve Provenance information for a resource

To use the provenance discovery service to retrieve provenance information for a resource, starting with the discovery service URI (service-URI) and the URI of the resource (target-URI):

  1. Dereference service-URI to obtain a representation of the service description in one of the formats described below.
  2. Extract the provenance information template from the service description.
  3. Use the provenance information template with target-URI for template variable uri to form provenance-URI.
  4. Dereference provenance-URI to obtain provenance information as described by the provenance model document [[PROV-MODEL]] @@TODO: fix up name, reference.

Resources presented and representations used

Service description

Describes the provenance discovery and retrieval service and, in particular, provides URI templates [[URI-template]] for URIs to access Provenance-URIs and/or provenance information. Dereferencing the service URI returns a representation of this service description. The service description MAY contain additional metadata about the service beyond that described here: API clients are expected to ignore any metadata elements they do not understand.

JSON example of service description

This example uses JSON format [[RFC4627]], presented using MIME content type application/json.

                {
                  "provenance_service_uri": "http://example.info/provenance_service/",
                  "location_template":      "http://example.info/provenance_service/location/?uri={uri}",
                  "provenance_template":    "http://example.info/provenance_service/provenance/?uri={uri}"
                }
              

RDF Turtle example of service description

This example uses the RDF Turtle format [[TURTLE]], presented using MIME content type text/turtle.

                @prefix provds: <http://www.w3.org/2011/provenance_discovery/@@TBD@@#> .
                <http://example.info/provenance_service/> a provds:Service_description ;
                  provds:provenance_service_uri  <http://example.info/provenance_service/> ;
                  provds:location_template       "http://example.info/provenance_service/location/?uri={uri}" ;
                  provds:provenance_template     "http://example.info/provenance_service/provenance/?uri={uri}" ;
                  .
              

The provenance URI templates are encoded in RDF as plain string literals, not as resource URIs.

The provds:provenance_service_uri property is redundant given the service description node itself is specified. I've included it for discussion, as it allows the RDF/Turtle form to be very similar to the JSON form of the service description, which may or may not be an advantage. I am personally in favour of using JRON (http://decentralyze.com/2010/06/04/from-json-to-rdf-in-six-easy-steps-with-jron/ Hi sandro :)) for the JSON format, which would be a very small change: just use "__iri" for the "provenance_service_uri" property, and would allow all formats to be closely based on the RDF model, while affording developers the convenience of using simple JSON handling code where appropriate.

RDF/XML example of service description

This is essentially the same as the Turtle example above, but encoded in RDF/XML [[RDF-SYNTAX-GRAMMAR]].

                <rdf:RDF
                  xmlns:rdf    = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:rdfs   = "http://www.w3.org/2000/01/rdf-schema#"
                  xmlns:provds = "http://www.w3.org/2011/provenance_discovery/@@TBD@@#"
                >
                  <provds:Service_description rdf:about="http://example.info/provenance_service/">
                    <provds:provenance_service_uri  rdf:resource="http://example.info/provenance_service/" />
                    <provds:location_template>http://example.info/provenance_service/location/?uri={uri}</provds:location_template>
                    <provds:provenance_template>http://example.info/provenance_service/provenance/?uri={uri}</provds:provenance_template>
                  </provds:Service_description>
                </rdf:RDF>
              

Provenance locations

A resource that enumerates one or more Provenance-URIs associated with a target resource.

The examples below are for a target resource URI http://example.info/qdata/. Using the service description example above, the URI of the provenance locations resource would be http://example.info/provenance_service/location/?uri=http%3A%2F%2Fexample.info%2Fqdata%2F.

The template might use ?uri={+uri} rather than just ?uri={uri}, and thereby avoid %-escaping the : and / characters in the target resource URI, but this could cause difficulties for target URIs containing query parameters and/or fragment identifiers. In this case, the client application would need to ensure that any such characters were %-escaped before being passed into a URI-template expansion processor.
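The difference between the two template forms can be illustrated with a short Python sketch. It approximates simple ({uri}) and reserved ({+uri}) expansion with urllib.parse.quote rather than a full URI-template processor, and the function names are hypothetical. The last function shows what the service would actually receive as the uri query parameter.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Characters that reserved expansion leaves unescaped.
RESERVED = ":/?#[]@!$&'()*+,;="


def expand_simple(template, value):
    """Approximate {uri}: percent-encode everything but unreserved characters."""
    return template.replace("{uri}", quote(value, safe=""))


def expand_reserved(template, value):
    """Approximate {+uri}: reserved characters pass through unescaped."""
    return template.replace("{+uri}", quote(value, safe=RESERVED))


def uri_seen_by_service(expanded_uri):
    """Return the value of the uri query parameter as the server would parse it."""
    return parse_qs(urlsplit(expanded_uri).query)["uri"][0]
```

For a target URI such as http://example.info/qdata/?id=1#top, the simple form round-trips the whole value, while the reserved form loses the fragment because the # character ends the query component of the expanded URI.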

JSON example of provenance locations

This example uses JSON format [[RFC4627]], presented using MIME content type application/json.

                {
                  "target_uri": "http://example.info/qdata/",
                  "provenance": [
                    "http://source1.example.info/provenance/qdata/",
                    "http://source2.example.info/prov/qdata/",
                    "http://source3.example.com/prov?id=qdata"
                  ]
                }
              

RDF Turtle example of provenance locations

This example uses the RDF Turtle format [[TURTLE]], presented using MIME content type text/turtle.

                @prefix prov: <http://www.w3.org/2011/provenance/@@TBD@@#> .
                <http://example.info/qdata/> a prov:Entity ;
                  prov:hasProvenance  <http://source1.example.info/provenance/qdata/> ;
                  prov:hasProvenance  <http://source2.example.info/prov/qdata/> ;
                  prov:hasProvenance  <http://source3.example.com/prov?id=qdata>
                  .
              

NOTE: The namespace URI used here for the provenance properties is different from that used in the service description. I am anticipating that it will be defined as part of the provenance model. If it is not defined as part of the provenance model, then a property name should be allocated in the provenance discovery service namespace.

@@TODO: revise to conform with Provenance Model vocabulary

RDF/XML example of provenance locations

This is essentially the same as the Turtle example above, but encoded in RDF/XML [[RDF-SYNTAX-GRAMMAR]], and presented with MIME content type application/rdf+xml.

                <rdf:RDF
                  xmlns:rdf    = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:rdfs   = "http://www.w3.org/2000/01/rdf-schema#"
                  xmlns:prov   = "http://www.w3.org/2011/provenance/@@TBD@@#"
                >
                  <prov:Entity rdf:about="http://example.info/qdata/">
                    <prov:hasProvenance  rdf:resource="http://source1.example.info/provenance/qdata/" />
                    <prov:hasProvenance  rdf:resource="http://source2.example.info/prov/qdata/" />
                    <prov:hasProvenance  rdf:resource="http://source3.example.com/prov?id=qdata" />
                  </prov:Entity>
                </rdf:RDF>
              

@@TODO: revise to conform with Provenance Model vocabulary

Provenance information

Provenance information about a resource or resources may be returned in any format. It is recommended that the format be one defined by the Provenance Model specification [[PROV-MODEL]].

Assuming a target resource URI http://example.info/qdata/, and using the service description example above, the provenance URI would be http://example.info/provenance_service/provenance/?uri=http%3A%2F%2Fexample.info%2Fqdata%2F.

Querying provenance information

This section proposes use of SPARQL queries to address requirements that are not covered by the simple retrieval and discovery services proposed above.

There are circumstances where simply identifying and retrieving provenance information as a web resource may not best fit the requirements of a particular application or service, e.g.:

For such circumstances, a provenance query service provides an alternative way to access provenance information and/or Provenance-URIs.

We assume that the requesting application has the URI of a provenance query service, and some information about the resource for which provenance information is required that can be used as the basis for a query. A query service is potentially a very general capability that can, in principle, subsume the provenance discovery service described in , but which may be more complex to deploy and use for simple provenance discovery cases.

The details of a provenance query service are an implementation choice, to be agreed between provider and users of the service, but for ease of interoperability between different providers and users we recommend use of SPARQL [[RDF-SPARQL-PROTOCOL]] [[RDF-SPARQL-QUERY]]. The query service URI would then be the URI of a SPARQL endpoint (or, to use the SPARQL specification language, a SPARQL protocol service). A query service can potentially be used in many different ways, limited only by the available information and capabilities of the SPARQL query language; the following subsections provide examples for what are considered to be some plausible common scenarios.

Find Provenance-URI given Target-URI of resource

If the requester has a Target-URI for the original resource, they might issue a simple SPARQL query for the URI(s) of any associated provenance information; e.g., if the original resource has URI http://example.org/resource,

              PREFIX prov: <@@TBD>
              SELECT ?provenance_uri WHERE
              {
                <http://example.org/resource> prov:hasProvenance ?provenance_uri
              }
            

@@TODO: specific provenance namespace and property to be determined by the model specification?
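As a sketch of how a client might issue such a query, the following Python builds the query text and the URI for a SPARQL protocol GET request, which carries the query in a query parameter. Because the provenance namespace is not yet determined, it is passed in as a parameter. The function names, the endpoint URI and the namespace value in the usage note are placeholders, not values defined by this specification.

```python
from urllib.parse import urlencode


def provenance_uri_query(target_uri, prov_namespace):
    """Build a SELECT query for the Provenance-URIs of target_uri."""
    # Reject characters that cannot appear inside a SPARQL <IRI> term.
    if any(ch in target_uri for ch in '<>" {}|\\^`'):
        raise ValueError("target_uri is not usable as an IRI reference")
    return (
        f"PREFIX prov: <{prov_namespace}>\n"
        "SELECT ?provenance_uri WHERE\n"
        "{\n"
        f"  <{target_uri}> prov:hasProvenance ?provenance_uri\n"
        "}"
    )


def sparql_get_uri(endpoint_uri, query):
    """Build the URI for a SPARQL protocol query sent with HTTP GET."""
    separator = "&" if "?" in endpoint_uri else "?"
    return endpoint_uri + separator + urlencode({"query": query})
```

For example, sparql_get_uri("http://example.org/sparql", provenance_uri_query("http://example.org/resource", "http://example.org/prov-placeholder#")) yields a URI that can be dereferenced with a normal HTTP GET.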

Find Provenance-URI given identifying information about a resource

If the requester has identifying information that is not the URI of the original resource, then they will need to construct a more elaborate query to locate the target resource and obtain its Provenance-URI(s). The nature of identifying information that can be used in this way will depend upon the third party service used, further definition of which is out of scope for this specification. For example, a query for a document identified by a DOI, say 1234.5678, using the PRISM vocabulary [[PRISM]] recommended by FaBio [[FABIO]], might look like this:

              PREFIX prov: <@@TBD>
              PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/>
              SELECT ?provenance_uri WHERE
              {
                [ prism:doi "1234.5678" ] prov:hasProvenance ?provenance_uri
              }
            

@@TODO: specific provenance namespace and property to be determined by the model specification?

Obtain provenance information directly given Target-URI of a resource

This scenario retrieves provenance information directly given the URI of a resource, and may be useful where the provenance information has not been assigned a specific URI, or when the calling application is interested only in specific elements of provenance information.

If the original resource has URI http://example.org/resource, a SPARQL query for provenance information might look like this:

              PREFIX prov: <@@TBD>
              CONSTRUCT
              {
                <http://example.org/resource> ?p ?v
              }
              WHERE
              {
                <http://example.org/resource> ?p ?v
              }
            
This query essentially extracts all properties and values available from the query service that are directly about the specified resource, and returns them as an RDF graph. This may be fine if the service contains only provenance information about the indicated resource, or if the non-provenance information is also of interest. A more complex query using specific provenance vocabulary terms may be needed to selectively retrieve just provenance information when other kinds of information are also available.

@@TODO: specific provenance namespace and property to be determined by the model specification? The above query pattern assumes provenance information is included in direct properties about the target resource. When an RDF provenance vocabulary is formulated, this may well turn out to not be the case. A better example would probably be one that retrieves specific provenance information when the vocabulary terms have been defined.

Provenance service discovery

(How to discover provenance services. There is nothing particular about provenance in this respect, and this section will discuss some of the available options without adding any new normative specification.)

@@TODO

IANA considerations

This document requests registration of new link relations, per section 6.2.1 of RFC 5988. @@TODO At an appropriate time (??), the following templates should be submitted to link-relations@ietf.org:

Registration template for link relation: "provenance"

Relation Name:
provenance
Description:
the resource identified by the target URI of the link provides provenance information about the resource identified by the context URI of the link
Reference:
@@this spec, @@provenance-model-spec
Notes:
...
Application Data:
...

Registration template for link relation: "target"

The name "target" is unfortunate, as it has specific meaning in the context of a link relation (I think). Reconsider?

Relation Name:
target
Description:
the resource identified by the target URI of the link is one for which provenance information is provided. This may be used, for example, to extract relevant information from a referenced document that contains provenance information for several targets.
Reference:
@@this spec, @@provenance-model-spec
Notes:
...
Application Data:
...

Security considerations

Provenance is central to establishing trust in data. If provenance information is corrupted, it may lead agents (human or software) to draw inappropriate and possibly harmful conclusions. Therefore, care is needed to ensure that the integrity of provenance data is maintained.

When using HTTP to access provenance information, or to determine a provenance URI, secure HTTP (https) SHOULD be used.

When retrieving a provenance URI from a document, steps SHOULD be taken to ensure the document itself is an accurate copy of the original whose author is being trusted (e.g. signature checking, or verifying its checksum against an author-provided secure web service).
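The two checks above might be sketched in Python as follows. This is illustrative only: the specification does not name a digest algorithm, so SHA-256 is an assumption here, as are the function names and the idea that the author publishes the checksum as a hexadecimal string.

```python
import hashlib
import hmac
from urllib.parse import urlsplit


def uses_secure_http(uri):
    """True when the URI would be dereferenced over https."""
    return urlsplit(uri).scheme.lower() == "https"


def matches_published_checksum(document_bytes, published_sha256_hex):
    """Compare the document's SHA-256 digest with an author-published value."""
    actual = hashlib.sha256(document_bytes).hexdigest()
    expected = published_sha256_hex.strip().lower()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected)
```

A client would extract and follow a provenance URI from the document only after the checksum comparison succeeds.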

@@TODO ... privacy, access control to provenance (from Edinburgh meeting). In particular, note that the fact that a resource is openly accessible does not mean that its provenance information should also be.

@@TODO ... more, probably

Acknowledgements

Many thanks to Robin Berjon for making our lives so much easier with his cool ReSpec tool.

Motivating scenario

I propose to remove this appendix on publication.

This scenario was selected by the provenance working group as a touchstone for evaluating any provenance access proposal. This appendix evaluates the foregoing proposals against the requirements implied by that scenario.

Gap analysis

There are clearly a number of capabilities needed for a provenance-aware application that are not covered by the mechanisms described above. But most of these amount to implementation details and decisions for a particular application, and as such are beyond the scope of this document to specify.

One feature not covered above that might be a candidate for specification is a common format for a data package that combines original content along with provenance-related metadata or data. At this stage, it is not clear what format that might take, but some possible candidates are discussed in . In any case, it seems to me that a specification that is specific for provenance to the exclusion of other metadata is unlikely to obtain traction, as provenance is just part of a wider landscape of information quality, trust, preservation and more.