This document is an experimental work in progress. The concepts described herein are intended to help provide guidance for a future working group. Implementations of this specification, either producers or consumers, should note that it is likely to change significantly prior to any publication as a Working Draft.

Introduction

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes transformation directly to RDF [[RDF-CONCEPTS]].

There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis. However, the HTML Data TF recommends the adoption of a single method of mapping in which every vocabulary is treated as if:

For background on the trade-offs between these options, see http://www.w3.org/wiki/Mapping_Microdata_to_RDF.

Background

Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The HTML DOM is extended to provide an API for accessing microdata information, and the microdata specification defines how to generate a JSON representation from microdata markup.

Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.

Microdata's data model does not align neatly with RDF.

Thus, in some places the needs of RDF consumers violate requirements of the microdata specification. This specification highlights where such violations occur and the reasons for them.

This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URI prefixes with specific rules: itemtype values are matched against the registered URI prefixes to determine a vocabulary and, potentially, vocabulary-specific processing rules.

This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.

Use Cases

During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:

Issues

Decisions or open issues in the specification are tracked on the Task Force Issue Tracker. These include the following:

ISSUE 1
Vocabulary specific parsing for Microdata
ISSUE 2
Should Microdata-RDF generate XMLLiteral values? This issue has been closed with no change, as this would violate microdata's data model.
ISSUE 3
Should the registry allow property datatype specification?
ISSUE 4
Should the registry allow a name or URL to be used as an alias for itemid?

The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:

Attributes and Syntax

The microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted. The microdata DOM API provides methods and attributes for retrieving microdata from the HTML DOM.

For reference, the attributes used for specifying and retrieving HTML microdata are listed here:

itemid
An attribute containing a URL used to identify the subject of triples associated with this item. (See Items in [[!MICRODATA]]).
itemprop
An attribute used to identify one or more names of an item. An itemprop contains a space-separated list of names, which may be either absolute URLs or terms associated with the type of the item as defined by the referencing item's item type. (See Items in [[!MICRODATA]]).
itemref
An additional attribute on an element that references additional elements containing property definitions to be applied to the referencing item. (See Items in [[!MICRODATA]]).
itemscope
A boolean attribute identifying an element as an item. (See Items in [[!MICRODATA]]).
itemtype
An additional attribute on an element used to specify one or more types of an item. The item type of an item is the first value returned from element.itemType on the element. The item type is also used to resolve non-URL names to absolute URLs. Available through the Microdata DOM API as element.itemType. (See Items in [[!MICRODATA]]).

In RDF, it is common for people to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. Throughout this document, assume that the following vocabulary prefixes have been defined:

dc: http://purl.org/dc/terms/
md: http://www.w3.org/ns/md#
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
xsd: http://www.w3.org/2001/XMLSchema#

Vocabulary Registry

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry is located at the namespace defined for microdata: http://www.w3.org/ns/md in a variety of formats.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:


This structure associates mappings for two URIs, http://schema.org/ and http://microformats.org/profile/hcard. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. This mapping currently defines two rules, propertyURI and multipleValues, with values to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no current type or the registry contains no URI prefix matching current type, a conforming processor MUST use the default values defined for these rules.
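The prefix matching and default handling described above can be sketched in Python. This is a minimal illustration, not a normative format: the REGISTRY contents, the DEFAULT_RULES name, and the helper functions find_uri_prefix and rules_for are assumptions made for this example. The default values follow the defaults stated later in this document (vocabulary and unordered).

```python
# Hypothetical registry contents. The keys mirror the rule names used in this
# document; the specific values chosen here are illustrative assumptions.
REGISTRY = {
    "http://schema.org/": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
        "properties": {
            "tracks": {"multipleValues": "list"},
        },
    },
    "http://microformats.org/profile/hcard": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
    },
}

# Defaults used when no registered URI prefix matches the current type.
DEFAULT_RULES = {"propertyURI": "vocabulary", "multipleValues": "unordered"}


def find_uri_prefix(item_type, registry=REGISTRY):
    """Return the registered URI prefix that item_type starts with, or None."""
    if not item_type:
        return None
    for prefix in registry:
        # Character-for-character match up to the length of the prefix.
        if item_type.startswith(prefix):
            return prefix
    return None


def rules_for(item_type, registry=REGISTRY):
    """Return the effective top-level rules for item_type, applying defaults."""
    rules = dict(DEFAULT_RULES)
    prefix = find_uri_prefix(item_type, registry)
    if prefix is not None:
        entry = registry[prefix]
        for key in DEFAULT_RULES:
            if entry.get(key) is not None:
                rules[key] = entry[key]
    return rules
```

Because a vocabulary URI is not allowed to be a prefix of another vocabulary URI, at most one entry should match, so returning the first match is sufficient in this sketch.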

The concept of a registry, including a hypothetical format, location and updating rules is presented as an abstract concept useful for describing the function of a microdata processor. There are issues surrounding update frequency, URL naming, and how updates are authorized. This spec just considers the semantic content of such a registry and how it can be used to affect processing without defining its representation or update policies.

Richard Cyganiak has pointed out that "Registry" may be the wrong term, as the proposed registry doesn't assign identifiers or manage namespaces; it simply provides a mapping between URI prefixes and processor behavior. He suggests the term "Whitelist". As more than two values are required, and the registry describes more than binary behavior, this term isn't appropriate either.

Property URI Generation

For names which are not absolute URLs, the propertyURI rule defines the algorithm for generating an absolute URL given an evaluation context including a current type, current name and current vocabulary.

The procedure for generating property URIs is defined in Generate Predicate URI.

Possible values for propertyURI are the following:

contextual
The contextual URI generation scheme guarantees that generated property URIs are unique based on the value of current name. This is required as the microdata data model requires that names are associated with specific items and do not have a global scope. (See Step 5 in Generate Predicate URI).

URI creation uses a base URI with query parameters to indicate the in-scope type and name list. Consider the following example:


        

The first name n generates the URI http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n. However, the nested name given-name belongs to an untyped item. The inherited property URI is used to create a new property URI: http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n.given-name.

This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [[RDF-SYNTAX-GRAMMAR]], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:


        

Looking at another example:


        

This would generate http://www.w3.org/ns/md?type=http://schema.org/Person&prop=name.

vocabulary
The vocabulary URI generation scheme appends names that are not absolute URLs to the URI prefix. When generating property URIs, if the URI prefix does not end with a '/' or '#', a '#' is appended to the URI prefix. (See Step 4 in Generate Predicate URI.)

URI creation appends each name to the URI prefix. Consider the following example:


        

Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.

Looking at another example:


        

Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.

If the registry contains no match for current type, implementations act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (see [[!RFC3986]]).

Deconstructing the itemtype URL to create or identify a vocabulary URI is a violation of the microdata specification which is necessary to support the use of existing vocabularies designed for use with RDF, and shared or inherited properties within all vocabularies.

The default value of propertyURI is vocabulary.


  

In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://schema.org/. As there is no explicit propertyURI, the default vocabulary is used, and the resulting property URI would be http://schema.org/title.
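The fallback described above can be sketched as follows. This is an illustrative sketch only: the function names derive_uri_prefix and vocabulary_property_uri are hypothetical, the item type http://schema.org/Book is an assumed value chosen to match the prose, and fragment-escaping of the name is omitted for brevity. The cut point follows the algorithm step that removes everything after the last "/" or "#".

```python
def derive_uri_prefix(item_type):
    """Strip everything following the last '/' or '#' in item_type."""
    cut = max(item_type.rfind("/"), item_type.rfind("#"))
    if cut < 0:
        # No separator found; leave the value unchanged (assumption).
        return item_type
    return item_type[: cut + 1]


def vocabulary_property_uri(uri_prefix, name):
    """Append name to uri_prefix, adding '#' unless it ends in '/' or '#'."""
    if uri_prefix.endswith(("/", "#")):
        return uri_prefix + name
    return uri_prefix + "#" + name


# Assumed item type with no matching registry entry.
prefix = derive_uri_prefix("http://schema.org/Book")
title_uri = vocabulary_property_uri(prefix, "title")
```

With these assumed inputs, prefix is http://schema.org/ and title_uri is http://schema.org/title, matching the result described in the text.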

Value Ordering

For items having multiple values for a given property, the multipleValues rule defines the algorithm for serializing these values. Microdata uses document order when generating property values, as defined in Microdata DOM API as element.itemValue. However, many RDF vocabularies expect multiple values to be generated as triples sharing a common subject and predicate. In some cases, it may be useful to retain value ordering.

The procedure for generating property values is defined in Generate Property Values.

Possible values for multipleValues are the following:

unordered
Values are serialized without ordering using a common subject and predicate. (See Step 8 in Generate Property Values).
list
Multi-valued itemprops are serialized using an RDF Collection. (See Step 9 in Generate Property Values).

An example of how this might be specified in a registry is the following:


  

Additionally, some vocabularies may wish to specify this on a per-property basis. For example, within http://schema.org/MusicPlaylist the tracks property might depend on the order of values to reproduce associated MusicRecording values.


  

The properties key takes a JSON Object as a value, which in turn has keys for each property that is to be given alternate semantics. Each name is implicitly expanded to its URI representation as defined in Generate Predicate URI, so that the behavior is the same whether or not the name is listed as an absolute URL.

The default value of multipleValues is unordered.

Value Typing

In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.

In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:

Using information about the content of the document where the microdata is marked up might be a violation of the spirit of the microdata specification, though it does not explicitly say in normative text that consumers cannot use other information from the HTML DOM to interpret microdata.

Additionally, one possible use of a registry would allow vocabularies to be marked with datatype information, so that a dc:time value, for example, would be understood to represent a literal with datatype xsd:date. This could be done by adding information for each property in the vocabulary requiring special treatment.

This might be represented using a syntax such as the following:


  

The datatype identifies a URI to be used in constructing a typed literal.

In most cases, the relevant datatype for a value can be derived from knowledge of what property the value is for and the syntax of the value itself. Thus, values can be given datatypes in a post-processing step after the mapping of microdata to RDF described by this specification. However, where there is information in the HTML markup, such as knowledge of what element was used to mark up the value, which can help with determining its datatype, that information is used by this specification.

This concept is not explored further at this time, but could be developed further in a future revision of this document.

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

absolute URL
The term absolute URL is defined in [[!HTML5]].
blank node
A blank node is a node in a graph that is neither a URI reference nor a literal. Items without a global identifier have a blank node allocated to them. (See [[RDF-CONCEPTS]]).
document base
The base address of the document being processed, as defined in Resolving URLs in [[!HTML5]].
evaluation context
A data structure including the following elements:
memory
a mapping of items to subjects, initially empty;
current name
an absolute URL for the in-scope name, used for generating URIs for properties of items without an item type;
current name is required for the contextual property URI generation scheme. Without this scheme, this evaluation context component would not be required.
current type
an absolute URL for the current type, used when an item does not contain an item type;
current vocabulary
an absolute URL for the current vocabulary, from the registry.
item
An item is described by an element containing an itemscope attribute. The list of top-level microdata items may be retrieved using the Microdata DOM API document.getItems method.
item properties
The mechanism for finding the properties of an item. The list of item properties may be retrieved using the Microdata DOM API element.properties attribute.
fragment-escape
The term fragment-escape is defined in [[!HTML5]]. This involves transforming elements added to URLs to ensure that the result remains a valid URL. The following characters are subject to percent escaping:
  • U+0022 QUOTATION MARK character (")
  • U+0023 NUMBER SIGN character (#)
  • U+0025 PERCENT SIGN character (%)
  • U+003C LESS-THAN SIGN character (<)
  • U+003E GREATER-THAN SIGN character (>)
  • U+005B LEFT SQUARE BRACKET character ([)
  • U+005C REVERSE SOLIDUS character (\)
  • U+005D RIGHT SQUARE BRACKET character (])
  • U+005E CIRCUMFLEX ACCENT character (^)
  • U+007B LEFT CURLY BRACKET character ({)
  • U+007C VERTICAL LINE character (|)
  • U+007D RIGHT CURLY BRACKET character (})
global identifier
The value of an item's itemid attribute, if it has one. (See Items in [[!MICRODATA]]).
literal
Literals are values such as strings and dates, including typed literals and plain literals. (See [[RDF-CONCEPTS]]).
property
Each name identifies a property of an item. An item may have multiple elements sharing the same name, creating a multi-valued property.
property names
The tokens of an element's itemprop attribute. Each token is a name. (See Names: the itemprop attribute in [[!MICRODATA]]).
property value
The property value of a name-value pair added by an element with an itemprop attribute depends on the element.
If the element has no itemprop attribute
The value is null and no triple should be generated.
If the element creates an item (by having an itemscope attribute)
The value is the URI reference or blank node returned from generate the triples for that item.
If the element is a URL property element (a, area, audio, embed, iframe, img, link, object, source, track or video)
The value is a URI reference created from element.itemValue. (See relevant attribute descriptions in [[!HTML5]]).
If the element is a time element.
The value is a literal made from element.itemValue.
If the value has the lexical form of xsd:date [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
If the value has the lexical form of xsd:time [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
If the value has the lexical form of xsd:dateTime [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
Otherwise
The value is a plain literal created from the value with language information set from the lang IDL attribute of the property element.

See The time element in [[!HTML5]].

The content model of the time element is subject to change, and may include more content types, such as xsd:duration, xsd:gYear, xsd:gYearMonth and xsd:monthDay in the future.
Otherwise
The value is a plain literal created from element.itemValue with language information set from the lang IDL attribute of the property element.
top-level item
An item which does not contain an itemprop attribute. Available through the Microdata DOM API as document.getItems. (See Associating names with items in [[!MICRODATA]]).
URI reference
URI references are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [[RDF-CONCEPTS]]).

The HTML5/microdata content model for @href, @src, @data, itemtype, itemprop and itemid is that of a URL, not a URI or IRI.

A proposed mechanism for specifying the range of property values to be URI reference or IRI could allow these to be specified as subject or object using a @content attribute.

vocabulary
A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.
This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.
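The datatype selection for time elements, described under property value above, can be sketched as a lexical check. This is a simplified illustration: the regular expressions only approximate the XML Schema lexical forms (they do not validate ranges such as month 13), and the function name time_value_datatype is an assumption made for this example. A return value of None stands for the "Otherwise" case, where a plain literal is produced.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified approximations of the XML Schema lexical forms (assumption).
_TZ = r"(?:Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"

_PATTERNS = [
    (re.compile(_DATE + "T" + _TIME + _TZ), XSD + "dateTime"),
    (re.compile(_DATE + _TZ), XSD + "date"),
    (re.compile(_TIME + _TZ), XSD + "time"),
]


def time_value_datatype(value):
    """Return the datatype URI for a time element value, or None for a plain literal."""
    for pattern, datatype in _PATTERNS:
        # fullmatch requires the whole value to have the lexical form.
        if pattern.fullmatch(value):
            return datatype
    return None
```

A processor following this sketch would emit a typed literal when a datatype URI is returned, and otherwise a plain literal carrying the language of the property element.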

RDF Conversion Algorithm

An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:

Set item list to an empty list.

  1. For each element that is also a top-level item run the following algorithm:
    1. Generate the triples for an item item, using the evaluation context. Let result be the (URI reference or blank node) subject returned.
    2. Append result to item list.
  2. Generate an RDF Collection from item list, the ordered list of subjects collected above. Set value to the value returned from generate an RDF Collection.
  3. Generate the following triple:
    subject
    Document base
    predicate
    http://www.w3.org/ns/md#item
    object
    value

Generate the triples

When the user agent is to Generate triples for an item item, given an Evaluation Context, it must run the following steps:

This algorithm has undergone substantial change from the original microdata specification [[!MICRODATA]].

  1. If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URL, let subject be that global identifier. Otherwise, let subject be a new blank node.
  2. Add a mapping from item to subject in memory.
  3. For each type returned from element.itemType of the element defining the item.
    1. If type is an absolute URL, generate the following triple:
      subject
      subject
      predicate
      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
      object
      type (as a URI reference)
  4. Set type to the first value returned from element.itemType of the element defining the item.
  5. If type is not an absolute URL, set it to current type from the Evaluation Context if not empty.
  6. If the registry contains a URI prefix that is a character for character match of type up to the length of the URI prefix, set vocab as that URI prefix.
  7. Otherwise, if type is not empty, construct vocab by removing everything following the last SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from type.
  8. Update evaluation context setting current vocabulary to vocab.
  9. Set property list to an empty array mapping properties to one or more values as established below.
  10. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of the item, run the following substep:
    1. For each name in the element's property names, run the following substeps:
      1. Let context be a copy of evaluation context with current type set to type.
      2. Let predicate be the result of generate predicate URI using context and name. Update context by setting current name to predicate.
      3. Let value be the property value of element.
      4. If value is an item, then generate the triples for value using context. Replace value by the subject returned from those steps.
      5. Add value to property list for predicate.
  11. For each predicate in property list:
    1. Generate property values using subject, predicate and the list of values associated with predicate from property list as values.
  12. Return subject
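The steps above can be sketched in a much reduced form. This is not a conforming implementation: the Item class is a hypothetical stand-in for a DOM item, predicate generation is limited to the default vocabulary scheme without a registry or fragment-escaping, all values are emitted unordered, and plain strings stand in for URI references, blank nodes and literals alike. As in the steps as written, an item already present in memory is processed again, so the sketch collects triples in a set and does not guard against cyclic item structures.

```python
import itertools

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
_ids = itertools.count()


class Item:
    """Hypothetical stand-in for a microdata item."""

    def __init__(self, types=(), itemid=None, properties=()):
        self.types = list(types)
        self.itemid = itemid
        # Sequence of (name, value); value is a string or another Item.
        self.properties = list(properties)


def derive_vocab(item_type):
    """Remove everything following the last '/' or '#' (step 7)."""
    cut = max(item_type.rfind("/"), item_type.rfind("#"))
    return item_type[: cut + 1]


def predicate_uri(name, vocab, document_base):
    """Vocabulary-scheme predicate generation only (simplifying assumption)."""
    if "://" in name:  # crude absolute URL test (assumption)
        return name
    if vocab is None:
        return document_base + "#" + name
    sep = "" if vocab.endswith(("/", "#")) else "#"
    return vocab + sep + name


def generate_triples(item, memory, triples, current_type=None,
                     document_base="http://example.org/doc"):
    key = id(item)
    if key in memory:                      # step 1: reuse a known subject
        subject = memory[key]
    elif item.itemid:
        subject = item.itemid
    else:
        subject = "_:b{}".format(next(_ids))
    memory[key] = subject                  # step 2
    for item_type in item.types:           # step 3
        triples.add((subject, RDF_TYPE, item_type))
    first_type = item.types[0] if item.types else current_type  # steps 4-5
    vocab = derive_vocab(first_type) if first_type else None    # step 7
    property_list = {}                     # step 9
    for name, value in item.properties:    # step 10
        predicate = predicate_uri(name, vocab, document_base)
        if isinstance(value, Item):
            value = generate_triples(value, memory, triples,
                                     first_type, document_base)
        property_list.setdefault(predicate, []).append(value)
    for predicate, values in property_list.items():  # step 11, unordered only
        for value in values:
            triples.add((subject, predicate, value))
    return subject                         # step 12
```

Sharing one memory mapping across calls is what lets two items that reference the same nested item resolve to a single subject.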

Generate Predicate URI

Predicate URI generation makes use of current type, current name, and current vocabulary from an evaluation context context along with name.

  1. If name is an absolute URL, return name as a URI reference.
  2. If current type from context is null, there can be no current vocabulary. Return the URI reference that is the document base with its fragment set to the fragment-escaped value of name.

    This rule is intended to allow for the case where no type is set, and therefore there is no vocabulary from which to extract rules. For example, if there is a document base of http://example.org/doc and an itemprop of 'title', a URI will be constructed to be http://example.org/doc#title.
  3. Otherwise, if current vocabulary from context is not null and registry has an entry for current vocabulary having a propertyURI entry that is not null, set that as scheme. Otherwise, set scheme to vocabulary.
  4. If scheme is vocabulary return the URI reference constructed by appending the fragment-escaped value of name to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
  5. If scheme is contextual, return the URI reference constructed as follows:
    1. Let s be current type from context.
    2. If http://www.w3.org/ns/md?type= is a prefix of s, return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
    3. Otherwise, return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of s, the string &prop=, and the fragment-escaped value of name.
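The steps above can be sketched in Python as follows. This is a hedged illustration rather than a reference implementation: the function and parameter names are assumptions, the registry is modeled as a plain dict keyed by URI prefix, the absolute URL test simply checks for a URL scheme, and fragment_escape handles only the characters listed under fragment-escape in this document.

```python
from urllib.parse import urldefrag, urlsplit

MD_TYPE_PREFIX = "http://www.w3.org/ns/md?type="

# Characters listed under "fragment-escape" in this document.
_ESCAPED = set('"#%<>[\\]^{|}')


def fragment_escape(text):
    """Percent-escape only the characters listed in this document."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in _ESCAPED else ch for ch in text
    )


def is_absolute_url(name):
    """Crude test (assumption): a name with a URL scheme is absolute."""
    return bool(urlsplit(name).scheme)


def generate_predicate_uri(name, current_type, current_vocabulary,
                           registry, document_base):
    # Step 1: absolute URLs are used as-is.
    if is_absolute_url(name):
        return name
    # Step 2: no type, so fall back to the document base plus a fragment.
    if current_type is None:
        return urldefrag(document_base).url + "#" + fragment_escape(name)
    # Step 3: look up the scheme, defaulting to "vocabulary".
    scheme = "vocabulary"
    if current_vocabulary is not None:
        entry = registry.get(current_vocabulary) or {}
        scheme = entry.get("propertyURI") or "vocabulary"
    # Step 4: append the name to the vocabulary URI.
    if scheme == "vocabulary":
        if current_vocabulary is None:
            raise ValueError("vocabulary scheme requires a current vocabulary")
        sep = "" if current_vocabulary.endswith(("#", "/")) else "#"
        return current_vocabulary + sep + fragment_escape(name)
    # Step 5: build a URI that encodes the in-scope type and name path.
    if scheme == "contextual":
        s = current_type
        if s.startswith(MD_TYPE_PREFIX):
            return s + "." + fragment_escape(name)
        return (MD_TYPE_PREFIX + fragment_escape(s)
                + "&prop=" + fragment_escape(name))
    raise ValueError("unknown propertyURI scheme: {!r}".format(scheme))
```

For example, under these assumptions a name of title with no current type and a document base of http://example.org/doc yields http://example.org/doc#title, as in the note under step 2.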

Generate Property Values

Property value serialization makes use of subject, predicate and values.

  1. If the registry contains a URI prefix that is a character for character match of predicate up to the length of the URI prefix, set vocab as that URI prefix. Otherwise set vocab to null.
  2. If vocab is not null and registry has an entry for vocab that is a JSON Object, let registry object be that value. Otherwise set registry object to null.
  3. If registry object is not null and registry object contains key properties which has a JSON Object value, let properties be that value. Otherwise, set properties to null.
  4. If properties is not null, and properties contains a key, which after Generate Predicate URI expansion has a value which is a JSON Object, let property override be that value. Otherwise, set property override to null.
  5. If property override contains the key multipleValues, set that as method.
  6. Otherwise, if registry object contains the key multipleValues, set that as method.
  7. Otherwise, set method to unordered.
  8. If method is unordered, for each value in values, generate the following triple:
    subject
    subject
    predicate
    predicate
    object
    value
  9. Otherwise, if method is list:
    1. Set value to the value returned from generate an RDF Collection.
    2. Generate the following triple:
      subject
      subject
      predicate
      predicate
      object
      value
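The method resolution and triple generation above can be sketched as follows. The names resolve_multiple_values, generate_property_values and make_collection are assumptions for this example; the registry is modeled as a dict keyed by URI prefix, property keys are expanded using the vocabulary scheme only, and the RDF Collection builder is passed in as a callable returning a (head, triples) pair so that the sketch stays independent of any particular collection implementation.

```python
def _expand(name, vocab):
    """Expand a registry property key using the vocabulary scheme (assumption)."""
    if "://" in name:  # crude absolute URL test (assumption)
        return name
    sep = "" if vocab.endswith(("/", "#")) else "#"
    return vocab + sep + name


def resolve_multiple_values(predicate, registry):
    """Steps 1-7: pick 'unordered' or 'list' for the given predicate."""
    # Step 1: find a registered URI prefix matching the predicate.
    vocab = next((p for p in registry if predicate.startswith(p)), None)
    # Step 2: the registry entry must be an object.
    entry = registry.get(vocab) if vocab is not None else None
    if not isinstance(entry, dict):
        entry = None
    # Step 3: the optional per-property overrides.
    properties = entry.get("properties") if entry else None
    if not isinstance(properties, dict):
        properties = None
    # Step 4: find an override whose expanded key equals the predicate.
    override = None
    if properties:
        for key, value in properties.items():
            if _expand(key, vocab) == predicate and isinstance(value, dict):
                override = value
                break
    # Steps 5-7: property override, then vocabulary rule, then the default.
    if override and "multipleValues" in override:
        return override["multipleValues"]
    if entry and "multipleValues" in entry:
        return entry["multipleValues"]
    return "unordered"


def generate_property_values(subject, predicate, values, registry,
                             make_collection):
    """Steps 8-9: return the triples for one predicate as a list of tuples."""
    method = resolve_multiple_values(predicate, registry)
    if method == "unordered":
        return [(subject, predicate, value) for value in values]
    if method == "list":
        head, triples = make_collection(values)
        return triples + [(subject, predicate, head)]
    return []
```

Resolving the per-property override before the vocabulary-wide rule mirrors the order of steps 5 and 6, so a single property can opt into list serialization while the rest of the vocabulary stays unordered.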

Generate RDF Collection

An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (See RDF Collections in [[!RDF-SCHEMA]]). As the RDF data-model is that of an unordered graph, a linking method using properties rdf:first and rdf:rest is required to be able to specify a particular order.

In the microdata to RDF mapping, RDF Collections are used when an item has more than one value associated with a given property to ensure that the original document order is maintained. The following procedure should be used to generate triples when an item property has more than one value (contained in list):

  1. Create a new array array containing a blank node for every value in list.
  2. For each pair of bnode from array and value from list the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#first
    object
    value
  3. For each bnode in array the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
    object
    next bnode in array or, if that does not exist, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
  4. Return the first blank node from array.
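The procedure above can be sketched as follows, representing triples as Python tuples and blank nodes as generated strings. The function names and the blank node labeling scheme are assumptions for this example. The steps do not say what to return for an empty list; returning rdf:nil in that case is an additional assumption of this sketch.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_counter = itertools.count()


def new_blank_node():
    """Return a fresh blank node label (hypothetical labeling scheme)."""
    return "_:b{}".format(next(_counter))


def generate_rdf_collection(values):
    """Return (head, triples) for an RDF Collection holding values in order."""
    if not values:
        # Assumption: an empty list is represented by rdf:nil.
        return RDF + "nil", []
    # Step 1: one blank node per value.
    array = [new_blank_node() for _ in values]
    triples = []
    # Step 2: link each blank node to its value with rdf:first.
    for bnode, value in zip(array, values):
        triples.append((bnode, RDF + "first", value))
    # Step 3: chain the blank nodes with rdf:rest, ending in rdf:nil.
    for index, bnode in enumerate(array):
        is_last = index + 1 == len(array)
        next_node = RDF + "nil" if is_last else array[index + 1]
        triples.append((bnode, RDF + "rest", next_node))
    # Step 4: the first blank node identifies the collection.
    return array[0], triples
```

The returned head is what a caller would use as the object of the triple that attaches the collection to its subject and predicate.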

Markup Examples

The microdata example below expresses book information as an FRBR Work item.


Assuming that registry contains an entry for http://purl.org/vocab/frbr/core# with propertyURI set to vocabulary, this is equivalent to the following Turtle:


The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.


Assuming that registry contains an entry for http://microformats.org/profile/hcard with propertyURI set to vocabulary, it generates these triples expressed in Turtle:


The following snippet of HTML has microdata for a playlist, and illustrates overriding a property to place elements in an RDF Collection:


Assuming that registry contains an entry for http://schema.org/ with propertyURI set to vocabulary, multipleValues set to unordered with the properties track and byArtist having multipleValues set to list, it generates these triples expressed in Turtle:


Example registry

The following is an example registry in JSON format.

  

Acknowledgements

Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.