This document is an experimental work in progress.

Introduction

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes transformation directly to RDF [[RDF-CONCEPTS]].

Background

Microdata is a way of expressing metadata in HTML documents using attributes. A previous version of microdata [[!MICRODATA]] included rules for generating RDF, but current Editor's Drafts have removed the explicit transformation procedure. Microdata is now used as an API to access data from within an HTML DOM and as a JSON serialization.

The original RDF transformation process created URIs for properties that are expressed as non-absolute URIs. The algorithm was designed to create URIs which were distinct based on the relationship between itemtype and itemprop contexts. This is required, as the microdata data model requires that properties maintain distinct semantic meanings in different contexts. However, this form of URI generation is typically different than that used within RDF vocabularies, where properties typically have a common meaning within a given vocabulary.

Microdata also specifies that the values of an item's properties are ordered, which is not typically the case for RDF vocabularies. In fact, unless a property has an rdfs:range of rdf:List, or its range is unspecified, it may not be appropriate to generate an RDF Collection.

This specification updates the original RDF transformation process and adds vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URI prefixes with specific rules: itemtype values are matched against the registered URI prefixes to determine a vocabulary and its vocabulary-specific processing rules.

The Microdata JSON serialization does not retain datatype or language information that might be derived from the HTML DOM. The RDF Transformation does retain language information when it is available.

Use Cases

During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:

Issues

Decisions or open issues in the specification are tracked on the Task Force Issue Tracker. These include the following:

ISSUE 1
Vocabulary specific parsing for Microdata
ISSUE 2
Should Microdata-RDF generate XMLLiteral values? This issue has been closed with no change, as this would violate microdata's data model.
ISSUE 3
Should the registry allow property datatype specification?

Goals

The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:

Attributes and Syntax

The microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted. This section describes those attributes, with reference to their original definition.

content
An attribute appropriate for use with the meta element for creating invisible properties.
data
An attribute appropriate for use with the object element for creating URI references.
datetime
An attribute appropriate for use with the time element for creating typed literals.
The time element will likely be replaced with something more general purpose.
href
An attribute appropriate for use with a, area or link elements for creating URI references.
itemid
An attribute containing a URI used to identify the subject of triples associated with this item. Available through the Microdata DOM API as element.itemId. (See Section 3.2 Items in [[!MICRODATA]]).
itemprop
An attribute used to add one or more properties to one or more items. An itemprop contains a space-separated list of names, which may either be absolute URIs or terms associated with the type of the item as defined by the referencing item's itemtype. Available through the Microdata DOM API as element.itemProp. (See Section 3.3 Names: the itemprop attribute of [[!MICRODATA]]).
itemref
An additional attribute on an item that references additional elements containing property definitions to be applied to the referencing item. The attribute value is an unordered list of ID references to elements within the same document. Available through the Microdata DOM API as element.itemRef. (See Section 3.2 Items of [[!MICRODATA]]).
itemscope
A boolean attribute identifying an element as an item. (See Section 3.2 Items of [[!MICRODATA]]).
itemtype
An additional attribute on an item used to specify the type of an item. The specified type is also used to resolve non-URI names to absolute URIs. Available through the Microdata DOM API as element.itemType. (See Section 3.2 Items of [[!MICRODATA]]).
src
An attribute appropriate for use with audio, embed, iframe, img, source, track, or video elements for creating URI references.
value
An attribute appropriate for use with the data element for creating untyped literals.

Vocabulary Registry

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:


This structure associates mappings for two URIs, http://schema.org/ and http://microformats.org/profile/hcard. Items having an itemtype with a URI prefix from this registry use the rules described for that prefix within the scope of that itemtype. This mapping currently defines two rules, propertyURI and multipleValues, with values to indicate specific behavior. The interpretation of these rules is defined in the following sections. If an item has no current type, or the registry contains no URI prefix matching current type, a conforming processor MUST use the default values defined for these rules.
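
As a rough sketch of how a processor might hold such a registry and find the matching prefix, the following Python uses a plain dictionary. The rule values shown, and the names REGISTRY, DEFAULT_RULES, find_vocabulary and rules_for, are illustrative assumptions; only the two prefixes, the two rule names and the default values come from this document.

```python
from typing import Dict, Optional

# Hypothetical registry contents: the rule values are illustrative
# assumptions, not values taken from any published registry.
REGISTRY: Dict[str, Dict[str, str]] = {
    "http://schema.org/": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
    },
    "http://microformats.org/profile/hcard": {
        "propertyURI": "type",
        "multipleValues": "list",
    },
}

# Defaults stated in this document for items with no matching prefix.
DEFAULT_RULES = {"propertyURI": "contextual", "multipleValues": "list"}


def find_vocabulary(item_type: Optional[str],
                    registry: Dict[str, Dict[str, str]] = REGISTRY) -> Optional[str]:
    """Return the registered URI prefix that item_type starts with, or None."""
    if not item_type:
        return None
    for prefix in registry:
        # Character-for-character match up to the length of the prefix.
        if item_type.startswith(prefix):
            return prefix
    return None


def rules_for(item_type: Optional[str],
              registry: Dict[str, Dict[str, str]] = REGISTRY) -> Dict[str, str]:
    """Return the processing rules for item_type, falling back to defaults."""
    rules = dict(DEFAULT_RULES)
    vocab = find_vocabulary(item_type, registry)
    if vocab is not None:
        rules.update(registry[vocab])
    return rules
```

Because no vocabulary URI may be a prefix of another, at most one registry entry can match a given itemtype, so returning the first match is sufficient in this sketch.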

Richard Cyganiak has pointed out that "Registry" may be the wrong term, as the proposed registry doesn't assign identifiers or manage a namespace; it simply provides a mapping between URI prefixes and processor behavior. He suggests the term "Whitelist". As more than two values are required, and the mapping describes more than binary behavior, this term isn't appropriate either.

Any discussion of maintaining such a database raises issues of update frequency, URL naming, and how updates are authorized. This remains an open issue. This specification only considers the semantic content of such a list and how it can be used to affect processing, without defining its representation or update policies.

The URL of the registry must be defined.

Property URI Generation

For property names which are not absolute URIs, the propertyURI rule defines the algorithm for generating an absolute URI given an evaluation context including a current type, current property and current vocabulary.

The procedure for generating property URIs is defined in Generate Predicate URI.

Possible values for propertyURI are the following:

base

The base URI generation scheme uses an itemtype URI to create a property URI by using the portion of the itemtype URI up to and including the final '#' or '/'.

For example, given a type of http://schema.org/Person and a property name of name, the resulting property URI would be http://schema.org/name.

contextual
The contextual URI generation scheme guarantees that generated property URIs are unique based on the value of current property. This is required as the microdata data model requires that property names are associated with specific items and do not have a global scope.

URI creation uses a base URI with query parameters to indicate the in-scope type and property name list. Consider the following example:


        

The first property name n generates the URI http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n. However, the property name given-name belongs to an untyped item. The inherited property URI is used to create a new property URI: http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n.given-name.

This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [[RDF-SYNTAX-GRAMMAR]], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:


        
type
The type URI generation scheme appends property names that are not absolute URIs to current type using a "#" separator.
vocabulary
The vocabulary URI generation scheme appends property names that are not absolute URIs to the URI prefix.

The default value of propertyURI is contextual.

Value Ordering

For items having multiple values for a property, the multipleValues rule defines the algorithm for serializing these values. This is required as the microdata data model requires that values be strictly ordered as defined in Microdata DOM API as element.itemValue. However, many RDF vocabularies expect multiple values to be generated as triples sharing a common subject and predicate.

Possible values for multipleValues are the following:

unordered
Values are serialized without ordering using a common subject and predicate.
list
Multi-valued itemprops are serialized using an RDF Collection.

The default value of multipleValues is list.

Value Typing

One possible use of a registry would allow vocabularies to be marked with datatype information, so that a dc:time value, for example, would be understood to represent a literal with datatype xsd:date. This could be done by adding information for each property in the vocabulary requiring special treatment.

Additionally, literal values which should be interpreted as URI references could be given special treatment.

These concepts are not explored further at this time, but could be developed further in a future revision of this document.

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

absolute URI
As defined in [[!RFC3986]], an absolute URI contains both scheme and scheme-specific-parts.
blank node
A blank node is a node in a graph that is neither a URI reference nor a literal. Items without a global identifier have a blank node allocated to them. (See [[RDF-CONCEPTS]]).
document base
The base address of the document being processed, as defined in Section 2.6.3 Resolving URLs of [[!HTML5]].
evaluation context
A data structure including the following elements:
memory
a mapping of items to subjects, initially empty
current property
an absolute URI for the current property, used for generating URIs for properties of items without an explicit itemtype.
current property is required for the contextual property URI generation scheme. Without this scheme, this evaluation context component would not be required.
current type
an absolute URI for the current type, used when an item does not contain an explicit itemtype
current vocabulary
an absolute URI for the current vocabulary, from the registry
item
An item is defined as an element containing an itemscope attribute. (See Section 3.2 Items of [[!MICRODATA]]).
item properties
The mechanism for finding the properties of an item are described in Section 3.5 Associating names with items of [[!MICRODATA]]. Available through the Microdata DOM API as element.properties.
global identifier
The value of an item's itemid attribute, if it has one. (See Section 3.2 Items of [[!MICRODATA]]).
literal
Literals are values such as strings and dates, including typed literals and plain literals. (See [[RDF-CONCEPTS]]).
property names
The tokens of an element's itemprop attribute. (See Section 3.3 Names: the itemprop attribute of [[!MICRODATA]]).
property value
The property value of a name-value pair added by an element with an itemprop attribute depends on the element. Available through the Microdata DOM API as element.itemValue. (Updated from Section 3.4 Values of [[!MICRODATA]]).
If the element also has an itemscope attribute
The value is the item created by the element as a URI reference or blank node
If the element is a meta element
The value is the plain literal created from the value of the element's content attribute, if any, or the empty string if there is no such attribute. If the language of the element is known it MUST be used when creating the plain literal.
If the element is an audio, embed, iframe, img, source, track, or video element with a src attribute
The value is a URI reference that results from resolving the value of the element's src attribute relative to the element at the time the attribute is set.
If the element is an a, area, or link element with an href attribute
The value is a URI reference that results from resolving the value of the element's href attribute relative to the element at the time the attribute is set.
If the element is an object element with a data attribute
The value is a URI reference that results from resolving the value of the element's data attribute relative to the element at the time the attribute is set.
If the element is a data element with a value attribute
The value is a plain literal created from the value.
If the element is a time element with a datetime attribute
If the value has the lexical form of xsd:date [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
If the value has the lexical form of xsd:time [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
If the value has the lexical form of xsd:dateTime [[!RDF-SCHEMA]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
Otherwise
The value is a plain literal created from the value.
Otherwise
The value is a plain literal, with the language information set from the language of the element, if it is known.
top-level item
An item which does not contain an itemprop attribute. Available through the Microdata DOM API as document.getItems. (See Section 3.5 Associating names with items of [[!MICRODATA]]).
URI reference
URI references are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [[RDF-CONCEPTS]]).
vocabulary
A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.
This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.
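
The datetime branch of the property value definition above can be sketched as a small classifier. This is a minimal illustration, not a full XML Schema validator: the regular expressions are simplified assumptions that check only the shape of the value (they do not reject a month of 13, for example), and the names XSD, PATTERNS and datetime_literal are hypothetical.

```python
import re
from typing import Optional, Tuple

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical shapes (assumptions for illustration only).
_TZ = r"(Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(\.\d+)?"

# The three shapes are mutually exclusive, so the order does not matter.
PATTERNS = [
    (re.compile(_DATE + _TZ), XSD + "date"),
    (re.compile(_TIME + _TZ), XSD + "time"),
    (re.compile(_DATE + "T" + _TIME + _TZ), XSD + "dateTime"),
]


def datetime_literal(value: str) -> Tuple[str, Optional[str]]:
    """Return (lexical value, datatype URI), or (value, None) for a plain literal."""
    for pattern, datatype in PATTERNS:
        if pattern.fullmatch(value):
            return (value, datatype)
    # "Otherwise": the value becomes a plain literal.
    return (value, None)
```

A value that matches none of the three shapes, such as free text in a datetime attribute, falls through to the plain literal case.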

RDF Conversion Algorithm

An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

The algorithm below is designed for DOM-based implementations with CSS selector access to elements.

A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples that the following algorithm generates:

Set item list to an empty list.

  1. For each element that is also a top-level item run the following algorithm:
    1. Generate the triples for an item item, using the evaluation context. Let result be the (URI reference or blank node) subject returned.
    2. Append result to item list.
  2. If item list contains multiple values, generate an RDF Collection list from the ordered list of values. Set value to the value returned from generate an RDF Collection.
  3. Otherwise, if item list contains a single value set value to that value.
  4. Generate the following triple:
    subject
    Document base
    predicate
    http://www.w3.org/ns/md#item
    object
    value

Generate the triples

When the user agent is to Generate triples for an item item, given an Evaluation Context, it must run the following steps:

This algorithm has undergone substantial change from the original microdata specification [[!MICRODATA]].

  1. If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URI, let subject be that global identifier. Otherwise, let subject be a new blank node.
  2. Add a mapping from item to subject in memory.
  3. If item has an itemtype attribute, extract the value as type.
  4. If type is an absolute URI, generate the following triple:
    subject
    subject
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    object
    type (as a URI reference)
  5. If type is not an absolute URI, set it to current type from the Evaluation Context if not empty.
  6. If the registry contains a URI prefix that is a character for character match of type up to the length of the URI prefix, set vocab as that URI prefix.
  7. Set property list to an empty mapping between properties and one or more ordered values as established below.
  8. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of the item, run the following substep:
    1. For each name in the element's property names, run the following substeps:
      1. Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
      2. Let predicate be the result of generate predicate URI using context and name. Update context by setting current property to predicate.
      3. Let value be the property value of element.
      4. If value is an item, then generate the triples for value using context. Replace value by the subject returned from those steps.
      5. Add value to property list for predicate.
  9. For each predicate in property list:
    1. Generate property values using a copy of evaluation context with current vocabulary set to vocab along with subject, predicate and the list of values associated with predicate from property list as values.
  10. Return subject
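
The steps above can be sketched over a simplified in-memory item model rather than a real DOM. Everything here is an assumption made for illustration: the Item and Context classes, the tuple representation of triples, and the injected predicate_uri and emit_values callables, which stand in for the Generate Predicate URI and Generate Property Values procedures.

```python
from dataclasses import dataclass, field, replace
from itertools import count
from typing import Callable, Dict, List, Optional, Tuple, Union
from urllib.parse import urlsplit

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
_ids = count()


@dataclass(eq=False)  # identity-based hashing, so items can key `memory`
class Item:
    itemid: Optional[str] = None
    itemtype: Optional[str] = None
    # (property names, value) pairs in document order; a value is a
    # literal string or a nested Item.
    properties: List[Tuple[List[str], Union[str, "Item"]]] = field(default_factory=list)


@dataclass
class Context:
    memory: Dict[Item, str] = field(default_factory=dict)
    current_type: Optional[str] = None
    current_property: Optional[str] = None
    current_vocabulary: Optional[str] = None


def is_absolute(uri: Optional[str]) -> bool:
    # Simplified check: treat anything with a scheme as absolute.
    return bool(uri) and bool(urlsplit(uri).scheme)


def generate_triples(item: Item, context: Context, registry: dict,
                     predicate_uri: Callable, emit_values: Callable,
                     triples: list) -> str:
    # Steps 1-2: reuse a known subject, else itemid, else a new blank node.
    if item in context.memory:
        subject = context.memory[item]
    elif is_absolute(item.itemid):
        subject = item.itemid
    else:
        subject = f"_:b{next(_ids)}"
    context.memory[item] = subject
    # Steps 3-5: emit rdf:type, or inherit the type from the context.
    item_type = item.itemtype
    if is_absolute(item_type):
        triples.append((subject, RDF_TYPE, item_type))
    else:
        item_type = context.current_type
    # Step 6: find the registered prefix matching the type, if any.
    vocab = next((p for p in registry if item_type and item_type.startswith(p)), None)
    # Steps 7-8: collect values per predicate, preserving document order.
    property_list: Dict[str, list] = {}
    for names, value in item.properties:
        for name in names:
            ctx = replace(context, current_type=item_type, current_vocabulary=vocab)
            predicate = predicate_uri(name, ctx)
            ctx.current_property = predicate
            if isinstance(value, Item):
                obj = generate_triples(value, ctx, registry,
                                       predicate_uri, emit_values, triples)
            else:
                obj = value
            property_list.setdefault(predicate, []).append(obj)
    # Step 9: serialize the collected values for each predicate.
    for predicate, values in property_list.items():
        emit_values(subject, predicate, values, vocab, triples)
    # Step 10.
    return subject
```

The copy made with replace shares the same memory mapping across nested calls, which is what lets an item reached more than once resolve to a single subject.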

Generate Predicate URI

Predicate URI generation makes use of current type, current property, and current vocabulary from an evaluation context context along with name.

  1. If name is an absolute URI, return name as a URI reference.
  2. If current type from context is null, there can be no current vocabulary. Return the URI reference constructed as follows:
    1. Let s be document base.
    2. If s does not contain a U+0023 NUMBER SIGN character (#), then append a U+0023 NUMBER SIGN character (#) to s.
    3. Return the concatenation of s and the fragment-escaped value of name as a URI reference.
    This rule is intended to allow for the case where no type is set, and therefore there is no vocabulary from which to extract rules. For example, if there is a document base of http://example.org/doc and an itemprop of 'title', a URI will be constructed to be http://example.org/doc#title.
  3. Otherwise, if current vocabulary from context is not null and registry has an entry for current vocabulary having a propertyURI entry that is not null, set that as scheme. Otherwise, set scheme to contextual.
  4. If scheme is base return the URI reference constructed by removing everything following the last SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from current type and append the fragment-escaped value of name.
  5. If scheme is vocabulary return the URI reference constructed by appending the fragment-escaped value of name to current vocabulary.
  6. If scheme is type, return the URI reference constructed as follows:
    1. Let s be current type from context.
    2. If s does not contain a U+0023 NUMBER SIGN character (#), then append a U+0023 NUMBER SIGN character (#) to s.
    3. Return the concatenation of s and the fragment-escaped value of name as a URI reference.
    The use of the NUMBER SIGN as a separator is somewhat arbitrary. A future edition of this document may include additional registry parameters that could identify a different separator character (such as SOLIDUS (/)).
  7. If scheme is contextual, return the URI reference constructed as follows:
    1. Let s be current property from context.
    2. If s is not null and http://www.w3.org/ns/md?type= is a prefix of s, return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
    3. Otherwise, return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of current type from context, the string &prop=, and the fragment-escaped value of name.
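
The following is a minimal sketch of predicate URI generation, under several stated assumptions. The Context class and function names are hypothetical; fragment_escape approximates HTML's fragment-escaping with urllib.parse.quote and a hand-picked safe set; is_absolute_uri only checks for a scheme. For the contextual scheme, the sketch assumes that nested property names are chained from current property, which is the role the definition of current property gives it.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import quote, urlsplit

MD_TYPE_PREFIX = "http://www.w3.org/ns/md?type="


@dataclass
class Context:
    current_type: Optional[str] = None
    current_property: Optional[str] = None
    current_vocabulary: Optional[str] = None


def fragment_escape(value: str) -> str:
    # Approximation (assumption): percent-encode everything outside
    # unreserved characters and this safe set.
    return quote(value, safe="!$&'()*+,;=:@/?")


def is_absolute_uri(value: str) -> bool:
    # Simplified check: a scheme is present.
    return bool(urlsplit(value).scheme)


def generate_predicate_uri(name: str, context: Context,
                           registry: Dict[str, Dict[str, str]],
                           document_base: str) -> str:
    # Step 1: absolute names are used as-is.
    if is_absolute_uri(name):
        return name
    # Step 2: no type, so build the URI from the document base.
    if context.current_type is None:
        s = document_base if "#" in document_base else document_base + "#"
        return s + fragment_escape(name)
    # Step 3: look up the scheme, defaulting to contextual.
    entry = registry.get(context.current_vocabulary) or {}
    scheme = entry.get("propertyURI") or "contextual"
    ctype = context.current_type
    if scheme == "base":
        # Keep everything up to and including the last "/" or "#".
        cut = max(ctype.rfind("/"), ctype.rfind("#")) + 1
        return ctype[:cut] + fragment_escape(name)
    if scheme == "vocabulary":
        return context.current_vocabulary + fragment_escape(name)
    if scheme == "type":
        s = ctype if "#" in ctype else ctype + "#"
        return s + fragment_escape(name)
    if scheme != "contextual":
        raise ValueError(f"unknown propertyURI scheme: {scheme}")
    # Contextual: extend an inherited property URI, or start a new one.
    s = context.current_property
    if s and s.startswith(MD_TYPE_PREFIX):
        return s + "." + fragment_escape(name)
    return MD_TYPE_PREFIX + fragment_escape(ctype) + "&prop=" + fragment_escape(name)
```

A caller would typically build one Context per property name, as the item-processing steps do, and pass the document base only for the untyped case.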

Generate Property Values

Property value serialization makes use of current vocabulary from an evaluation context context along with subject, predicate and values.

  1. If current vocabulary from context is not null and registry has an entry for current vocabulary having a multipleValues entry that is not null, set that as method. Otherwise, set method to list.
  2. If method is unordered, for each value in values, generate the following triple:
    subject
    subject
    predicate
    predicate
    object
    value
  3. Otherwise, if method is list:
    1. If values contains multiple values, generate an RDF Collection list from the ordered list of values. Set value to the value returned from generate an RDF Collection.
    2. Otherwise, if values contains a single value set value to that value.
    3. Generate the following triple:
      subject
      subject
      predicate
      predicate
      object
      value
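
As a sketch of the serialization choice above, the function below appends triples to a list as (subject, predicate, object) tuples. The function name, the tuple representation and the injected make_collection callable (standing in for the Generate RDF Collection procedure) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def generate_property_values(subject: str, predicate: str, values: List[str],
                             current_vocabulary: Optional[str],
                             registry: Dict[str, Dict[str, str]],
                             make_collection: Callable[[List[str], List[Triple]], str],
                             triples: List[Triple]) -> None:
    # Step 1: look up the method, defaulting to "list".
    entry = registry.get(current_vocabulary) or {}
    method = entry.get("multipleValues") or "list"
    if method == "unordered":
        # Step 2: one triple per value, sharing subject and predicate.
        for value in values:
            triples.append((subject, predicate, value))
    elif method == "list":
        # Step 3: several values become one RDF Collection; a single
        # value is used directly.
        if len(values) > 1:
            obj = make_collection(values, triples)
        else:
            obj = values[0]
        triples.append((subject, predicate, obj))
    else:
        raise ValueError(f"unknown multipleValues method: {method}")
```

The sketch assumes values is never empty, since a predicate only appears in the property list once at least one value has been added for it.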

Generate RDF Collection

An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (See Section 5.2 RDF Collections in [[!RDF-SCHEMA]]). As the RDF data model is that of an unordered graph, a linking method using the properties rdf:first and rdf:rest is required to be able to specify a particular order.

In the microdata to RDF mapping, RDF Collections are used when an item has more than one value associated with a given property to ensure that the original document order is maintained. The following procedure should be used to generate triples when an item property has more than one value (contained in list):

  1. Create a new array array containing a blank node for every value in list.
  2. For each pair of bnode and value from array and list the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#first
    object
    value
  3. For each bnode in array the following triple is generated:
    subject
    bnode
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
    object
    next element in array or, if that does not exist, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
  4. Return the first blank node from array.
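
The procedure above can be sketched directly. Blank node labels of the form _:bN, the tuple representation of triples and the function names are assumptions for illustration; the early return of rdf:nil for an empty list is standard RDF practice rather than part of the procedure, which is only invoked for more than one value.

```python
from itertools import count
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_ids = count()


def new_blank_node() -> str:
    """Return a fresh blank node label (hypothetical labelling scheme)."""
    return f"_:b{next(_ids)}"


def generate_rdf_collection(values: List[str],
                            triples: List[Tuple[str, str, str]]) -> str:
    if not values:
        return RDF + "nil"
    # Step 1: one blank node per value.
    bnodes = [new_blank_node() for _ in values]
    # Step 2: rdf:first links each node to its value.
    for bnode, value in zip(bnodes, values):
        triples.append((bnode, RDF + "first", value))
    # Step 3: rdf:rest links each node to the next, ending with rdf:nil.
    for i, bnode in enumerate(bnodes):
        rest = bnodes[i + 1] if i + 1 < len(bnodes) else RDF + "nil"
        triples.append((bnode, RDF + "rest", rest))
    # Step 4: the head of the list is the first blank node.
    return bnodes[0]
```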

Markup Examples

The microdata example below expresses book information as an FRBR Work item.


Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core# with propertyURI set to vocabulary, this is equivalent to the following Turtle:


The following snippet of HTML has microdata for two people with the same address:


Assuming that the registry contains an entry for http://microformats.org/profile/hcard with propertyURI set to type, it generates these triples expressed in Turtle:


Acknowledgements

Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.