This document is an experimental work in progress.

Introduction

This document describes a means of transforming HTML containing Microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes transformation directly to RDF [[RDF-CONCEPTS]].

Attributes and Syntax

The Microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted. This section describes those attributes, with reference to their original definition.

itemid
An attribute containing a URI used to identify the subject of triples associated with this item. Available through the Microdata DOM API as element.itemId. (See Section 3.2 Items in [[!MICRODATA]]).
itemprop
An attribute used to associate one or more properties with one or more items. An itemprop contains a space-separated list of names, which may either be absolute URIs or terms associated with the type of the item as defined by the referencing item's itemtype. Available through the Microdata DOM API as element.itemProp. (See Section 3.3 Names: the itemprop attribute of [[!MICRODATA]]).
itemscope
A boolean attribute identifying an element as an item. (See Section 3.2 Items of [[!MICRODATA]]).
itemref
An additional attribute on an item that references additional elements containing property definitions to be applied to the referencing item. The attribute value is an unordered list of ID references to elements within the same document. Available through the Microdata DOM API as element.itemRef. (See Section 3.2 Items of [[!MICRODATA]]).
itemtype
An additional attribute on an item used to specify the type of an item. The specified type is also used to resolve non-URI names to absolute URIs. Available through the Microdata DOM API as element.itemType. (See Section 3.2 Items of [[!MICRODATA]]).
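The following non-normative sketch shows how these attributes might appear in markup and how a processor could collect them. It uses Python's standard html.parser module. The SAMPLE markup, the example.org URL, and the MicrodataAttrCollector class are hypothetical and chosen only for illustration; they are not defined by [[!MICRODATA]] or by this document.

```python
from html.parser import HTMLParser

MICRODATA_ATTRS = ("itemid", "itemprop", "itemscope", "itemref", "itemtype")


class MicrodataAttrCollector(HTMLParser):
    """Hypothetical helper: records each start tag carrying Microdata attributes."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # html.parser reports a valueless boolean attribute such as
        # itemscope with the value None.
        present = {name: value for name, value in attrs if name in MICRODATA_ATTRS}
        if present:
            self.found.append((tag, present))


# Illustrative markup only; the item and its URLs are made up.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Person" itemid="http://example.org/people/jane">
  <span itemprop="name">Jane Doe</span>
  <a itemprop="url http://xmlns.com/foaf/0.1/homepage" href="/jane">home</a>
</div>
"""

collector = MicrodataAttrCollector()
collector.feed(SAMPLE)
for tag, attributes in collector.found:
    print(tag, attributes)
```

Note that the itemprop on the a element carries two names, one a term and one an absolute URI; splitting the attribute value on whitespace yields the individual property names.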

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

item
An item is defined as an element containing an itemscope attribute. (See Section 3.2 Items of [[!MICRODATA]]).
top-level item
An item which does not contain an itemprop attribute. Available through the Microdata DOM API as document.getItems. (See Section 3.5 Associating names with items of [[!MICRODATA]]).
absolute URI
As defined in [[!RFC3986]], an absolute URI contains both scheme and scheme-specific-parts.
document base
The base address of the document being processed, as defined in Section 2.6.3 Resolving URLs of [[!HTML5]].
global identifier
The value of an item's itemid attribute, if it has one. (See Section 3.2 Items of [[!MICRODATA]]).
URI reference
URI references are suitable for use in the subject, predicate, or object position of an RDF triple, as opposed to a Literal value that may contain a string representation of a URI. (See [[RDF-CONCEPTS]]).
Blank Node
A blank node is a node in a graph that is neither a URI reference nor a Literal. Items without a global identifier have a blank node allocated to them. (See [[RDF-CONCEPTS]]).
Literal
Literals are values such as strings and dates, including typed literals and plain literals. (See [[RDF-CONCEPTS]]).
evaluation context
A data structure including the following elements:
memory
a mapping of items to subjects, initially empty
current type
an absolute URI for the current type, used when an item does not contain an explicit itemtype
item properties
The mechanism for finding the properties of an item is described in Section 3.5 Associating names with items of [[!MICRODATA]]. Available through the Microdata DOM API as element.properties.
property names
The tokens of an element's itemprop attribute. (See Section 3.3 Names: the itemprop attribute of [[!MICRODATA]]).
property value
The property value of a name-value pair added by an element with an itemprop attribute depends on the element. Available through the Microdata DOM API as element.itemValue. (Updated from Section 3.4 Values of [[!MICRODATA]]).
If we reference element.itemValue we should file issues against the Microdata spec to ensure that the values returned are consistent with this spec.
If the element also has an itemscope attribute
The value is the item created by the element as a URI reference or blank node
If the element is a meta element
The value is the plain literal created from the value of the element's content attribute, if any, or the empty string if there is no such attribute.
If the element is an audio, embed, iframe, img, source, track, or video element with a src attribute
The value is a URI reference that results from resolving the value of the element's src attribute relative to the element at the time the attribute is set.
If the element is an a, area, or link element with an href attribute
The value is a URI reference that results from resolving the value of the element's href attribute relative to the element at the time the attribute is set.
If the element is an object element with a data attribute
The value is a URI reference that results from resolving the value of the element's data attribute relative to the element at the time the attribute is set.
If the element is a time element with a datetime attribute
If the value has the lexical form of xsd:date [[!RDF-SCHEMA]]
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date
If the value has the lexical form of xsd:time [[!RDF-SCHEMA]]
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time
If the value has the lexical form of xsd:dateTime [[!RDF-SCHEMA]]
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime
Otherwise
The value is a plain literal created from the value.
If the element is a blockquote or q element with a cite attribute
The value is a URI reference that results from resolving the value of the element's cite attribute relative to the element at the time the attribute is set.
Was formerly document-level, now part of item value processing.
Otherwise
The value is a plain literal, with the language information set from the language of the element, if it is not unknown.
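The property value rules above can be sketched as follows. This is a non-normative illustration under stated assumptions: Element is a hypothetical stand-in for a DOM element, URI resolution "relative to the element" is approximated by resolving against a single base URI, the regular expressions are simplified approximations of the xsd:date, xsd:time, and xsd:dateTime lexical forms, and the tuple shapes returned are an invention of this sketch.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    lang: str = ""  # empty string means the language is unknown


# Simplified lexical forms (assumption); the XML Schema grammars are stricter.
_TZ = r"(Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(\.\d+)?"
_DATATYPES = [
    (re.compile(_DATE + _TZ), XSD + "date"),
    (re.compile(_TIME + _TZ), XSD + "time"),
    (re.compile(_DATE + "T" + _TIME + _TZ), XSD + "dateTime"),
]

# Element names mapped to the attribute that carries a URI value.
_URI_ATTR = {name: "src" for name in
             ("audio", "embed", "iframe", "img", "source", "track", "video")}
_URI_ATTR.update({name: "href" for name in ("a", "area", "link")})
_URI_ATTR.update({"object": "data", "blockquote": "cite", "q": "cite"})


def property_value(el, base):
    """Return ("item", el), ("uri", uri) or ("literal", text, datatype, lang)."""
    if "itemscope" in el.attrs:
        # The caller is expected to generate triples for the nested item.
        return ("item", el)
    if el.tag == "meta":
        return ("literal", el.attrs.get("content", ""), None, None)
    uri_attr = _URI_ATTR.get(el.tag)
    if uri_attr and uri_attr in el.attrs:
        return ("uri", urljoin(base, el.attrs[uri_attr]))
    if el.tag == "time" and "datetime" in el.attrs:
        value = el.attrs["datetime"]
        for pattern, datatype in _DATATYPES:
            if pattern.fullmatch(value):
                return ("literal", value, datatype, None)
        return ("literal", value, None, None)
    # Otherwise: plain literal, tagged with the element's language if known.
    return ("literal", el.text, None, el.lang or None)
```

For instance, under these assumptions property_value(Element("time", {"datetime": "2011-06-28"}), base) yields a literal typed as xsd:date, while a datetime value that matches none of the three lexical forms falls back to a plain literal.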

RDF Conversion Algorithm

An HTML document containing Microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

The algorithm below is designed for DOM-based implementations with CSS selector access to elements.

A conforming Microdata processor implementing RDF conversion MUST implement a processing algorithm that produces triples equivalent to those generated by the following algorithm:

Set item list to an empty list.

For each element that is also a top-level item run the following algorithm:

  1. Generate the triples for an item item, using the evaluation context. Let result be the (URI reference or blank node) subject returned.
  2. Append result to item list.
  3. If item list contains multiple values, generate an RDF Collection list from the ordered list of values. Set value to the value returned from generate an RDF Collection.
  4. Otherwise, if item list contains a single value set value to that value.
  5. Generate the following triple:
    subject
    Document base
    predicate
    http://www.w3.org/1999/xhtml/microdata#item
    object
    value

Generate the triples

When the user agent is to Generate triples for an item item, given an Evaluation Context, it must run the following steps:

This algorithm has undergone substantial change from the original Microdata specification [[!MICRODATA]].

  1. If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URI, let subject be that global identifier. Otherwise, let subject be a new blank node.
  2. Add a mapping from item to subject in memory
  3. If the item has an itemtype attribute, extract the value as type.
  4. If type is an absolute URI, generate the following triple:
    subject
    subject
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    object
    type (as a URI reference)
  5. If type is not an absolute URI, set it to current type from the Evaluation Context if not empty.
  6. Set property list to an empty mapping between properties and one or more ordered values as established below.
  7. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of the item, run the following substep:
    1. For each name in the element's property names, run the following substeps:
      1. If name is an absolute URI, set predicate to name as a URI reference.
      2. Otherwise, if type is not defined, set predicate to a URI reference that results from resolving name relative to the element at the time the attribute is set.
      3. Otherwise, construct predicate from type by removing everything following the last SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") in type and appending name.
      4. Let value be the property value of element.
      5. If value is an item, then generate the triples for value using a copy of evaluation context with current type set to type. Replace value by the subject returned from those steps.
      6. Add value to property list for predicate.
  8. For each predicate in property list:
    1. If entry for predicate in property list contains multiple values, generate an RDF Collection list from the ordered list of values. Set value to the value returned from generate an RDF Collection.
    2. Otherwise, if predicate in property list contains a single value set value to that value.
    3. Generate the following triple:
      subject
      subject
      predicate
      predicate
      object
      value
  9. Return subject
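Two parts of the steps above, choosing the subject (steps 1 and 2) and constructing the predicate (steps 7.1.1 to 7.1.3), can be sketched as follows. This is a non-normative illustration: the function names, the "_:b" blank node labels, and the use of a plain dict for memory are assumptions of the sketch. The absolute URI test only checks for a scheme, and resolution "relative to the element" is approximated by resolving against a single base URI.

```python
from itertools import count
from urllib.parse import urljoin, urlsplit

_bnode_ids = count()


def new_bnode():
    """Return a fresh blank node label (labelling scheme is an assumption)."""
    return f"_:b{next(_bnode_ids)}"


def is_absolute_uri(value):
    """Simplified test: treat any value with a scheme as an absolute URI."""
    return bool(urlsplit(value).scheme)


def choose_subject(item, itemid, memory):
    """Steps 1 and 2: reuse, derive, or allocate the subject for an item.

    item is any hashable key identifying the item; memory maps items to
    subjects and is updated in place.
    """
    if item in memory:
        return memory[item]
    if itemid and is_absolute_uri(itemid):
        subject = itemid
    else:
        subject = new_bnode()
    memory[item] = subject
    return subject


def make_predicate(name, item_type, base):
    """Steps 7.1.1 to 7.1.3: turn a property name into a predicate URI."""
    if is_absolute_uri(name):
        return name
    if not item_type:
        return urljoin(base, name)
    # Keep the type up to and including its last "/" or "#", then append name.
    # If the type contains neither character, nothing is kept (edge case
    # not covered by the text above).
    cut = max(item_type.rfind("/"), item_type.rfind("#"))
    return item_type[:cut + 1] + name
```

Under these assumptions, make_predicate("name", "http://schema.org/Person", base) gives http://schema.org/name, and a type ending in a fragment, such as http://purl.org/vocab/frbr/core#Work, yields predicates in the http://purl.org/vocab/frbr/core# namespace.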

Generate an RDF Collection

An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (See Section 5.2 RDF Collections in [[!RDF-SCHEMA]]). As the RDF data-model is that of an unordered graph, a linking method using properties rdf:first and rdf:rest is required to be able to specify a particular order.

In the Microdata to RDF mapping, RDF Collections are used when an item has more than one value associated with a given property to ensure that the original document order is maintained. The following procedure should be used to generate triples when an item property has more than one value (contained in list):

  1. Create a new array array containing a blank node for every value in list.
  2. For each pair of bnode and value from array and list the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#first
    object
    value
  3. For each bnode in array the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
    object
    next element in array or, if that does not exist, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
  4. Return the first blank node from array.
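The procedure above can be sketched as follows. This is a non-normative illustration: triples are modelled as plain tuples appended to a list, the "_:c" blank node labels are an assumption, and returning rdf:nil for an empty list is a choice made by this sketch, since the procedure above is only invoked with more than one value.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_bnode_ids = count()


def new_bnode():
    """Return a fresh blank node label (labelling scheme is an assumption)."""
    return f"_:c{next(_bnode_ids)}"


def generate_collection(values, triples):
    """Append rdf:first / rdf:rest triples for values and return the list head."""
    if not values:
        # Assumption of this sketch: an empty list is represented by rdf:nil.
        return RDF + "nil"
    # Step 1: one blank node per value.
    bnodes = [new_bnode() for _ in values]
    # Step 2: link each blank node to its value.
    for bnode, value in zip(bnodes, values):
        triples.append((bnode, RDF + "first", value))
    # Step 3: link each blank node to the next one, ending with rdf:nil.
    for index, bnode in enumerate(bnodes):
        following = bnodes[index + 1] if index + 1 < len(bnodes) else RDF + "nil"
        triples.append((bnode, RDF + "rest", following))
    # Step 4: the first blank node identifies the whole collection.
    return bnodes[0]
```

A list of n values produces 2n triples, and the returned blank node is what the calling algorithm uses as the object of the property triple.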

Markup Examples

The Microdata example below expresses book information as an FRBR Work item.


This is equivalent to the following Turtle:


The following snippet of HTML has microdata for two people with the same address:


It generates these triples expressed in Turtle:


Acknowledgements