Introduction
This document describes a means of transforming HTML containing Microdata into RDF. HTML Microdata [[!MICRODATA]]
is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes
transformation directly to RDF [[RDF-CONCEPTS]].
Background
Microdata is a way of expressing metadata in HTML documents using attributes. A previous version
of Microdata [[!MICRODATA]] included rules for generating RDF, but current Editor's Drafts have removed
the explicit transformation procedure. Microdata is now used as an API to access data from within
an HTML DOM and as a JSON serialization.
The original RDF transformation process created URIs for properties that are expressed as non-absolute
URIs. The algorithm was designed to create URIs which were distinct based on the relationship between
itemtype and itemprop contexts. This is required, as the Microdata data model
requires that properties maintain distinct semantic meanings in different contexts. However, this
form of URI generation differs from that typically used within RDF vocabularies, where
properties typically have a common meaning within a given vocabulary.
Microdata also specifies that items and values are ordered, which is not typically the case for RDF
vocabularies. In fact, unless a property has an rdfs:range of rdf:List, or its range is
unspecified, it may not be appropriate to generate an RDF Collection.
The Microdata JSON serialization does not retain datatype or language information that might be derived
from the HTML DOM. The RDF Transformation does retain both datatype and language information when it
is available.
This specification updates the original RDF transformation process and adds
vocabulary-specific rules that affect the generation of property URIs and value serializations.
This is facilitated by a registry that associates URI prefixes with specific rules: itemtype
values are matched against registered URI prefixes to determine a vocabulary and its
vocabulary-specific processing rules.
Use Cases
During the period of the task force, a number of use cases were put forth for the use of Microdata
in generating RDF:
- Semantic search engines such as Sindice use RDF as their backend data model.
They want to gather information expressed using microdata alongside information expressed in RDF-based formats
and make it available to others to use, as a service. In these cases, the ultimate consumer, who will need to
understand the vocabularies used within the microdata, is the program or person who pulls out data from Sindice.
Sindice needs to retain the distinctions in the original Microdata (e.g. ordering of items) and might not have
built-in knowledge about the vocabulary of interest to the ultimate consumer. In this case, the ultimate consumer
is likely to have to map/validate/handle errors in the data they get from Sindice.
- A consumer such as openelectiondata.org wants to support
Microdata-based markup of their vocabulary as well as RDFa-based markup, both going into an RDF-based data store.
They want to use an off-the-shelf tool to extract the microdata. They want to configure the tool to give them the
RDF that is appropriate for their known vocabulary.
- A browser plugin that captures data for the user uses an RDF model as its backend store.
Any time it encounters Microdata on a page, it wants to pull that Microdata into the store on the fly.
- GoodRelations requires properties to be generated
in a flat namespace, without placing multiple values within a container. Ideally, a processor would make use
of rdfs:range declarations at parse time so properly typed literals could be constructed. It also
requires that plain literals retain language information in scope on the HTML element, as it is common that
multiple values will be used to specify the same information in different languages.
- Schema.org has an
extension mechanism to allow authors to express information
that is more detailed than the pre-defined types, properties and enumerations. Property URIs are all in the same
flat namespace as types, but authors can add more detail by using a '/' after the type or property.
For example, schema.org defines a musicGroupMember property having a URI of
http://schema.org/musicGroupMember, and an author might express more detail through an ad-hoc
sub-property musicGroupMember/leadVocalist, having the URI
http://schema.org/musicGroupMember/leadVocalist.
Issues
Decisions or open issues in the specification are tracked on the
Task Force Issue Tracker. These include the
following:
- ISSUE 1
- Vocabulary specific parsing for Microdata
- ISSUE 2
- Should Microdata-RDF generate XMLLiteral values
Goals
The purpose of this specification is to provide input to a future working group that can make decisions
about the need for a registry and the details of processing. Among the options investigated by
the Task Force are the following:
- Property URI generation using the original Microdata specification with a base URI and fragment
made up of the in-scope itemtype and itemprop elements.
- Vocabulary-based URI generation, where the vocabulary is determined from the
in-scope itemtype, either through an algorithmic modification of the type URI or by matching the
URI against a registry. The vocabulary URI is then used to generate property URIs in a namespace
parallel to the type URI.
- Type-based URI generation, where the URI of the in-scope itemtype forms the base of property URI
by adding the property to the type URI as a fragment.
- When there are multiple top-level items in a document, place items in an RDF Collection.
Alternatively, simply list the items as multiple values, or do not generate an
http://www.w3.org/1999/xhtml/microdata#item
mapping at all.
- When there are multiple values for an itemprop, place items in an RDF Collection.
Alternatively, do not use collections, use an alternative such as
rdf:Seq
, or place all values,
whether or not multiple, into some form of collection.
Attributes and Syntax
The Microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those
attributes are to be interpreted. This section describes those attributes, with reference to their
original definition.
- content
-
An attribute appropriate for use with the
meta
element for creating invisible properties.
- data
-
An attribute appropriate for use with the
object
element for creating URI
references.
- datetime
-
An attribute appropriate for use with the
time
element for creating typed literals.
The time
element will likely be replaced with something more general purpose.
- href
-
An attribute appropriate for use with
a
, area
or link
elements for
creating URI references.
- itemid
-
An attribute containing a URI used to identify the subject of triples associated with this item.
Available through the Microdata DOM API as
element.itemId
.
(See Section 3.2 Items in [[!MICRODATA]]).
- itemprop
-
An attribute used to add one or more properties to one or more items. An itemprop
contains a space separated list of names which may either be absolute URIs or terms
associated with the type of the item as defined by the referencing item's
itemtype.
Available through the Microdata DOM API as
element.itemProp
.
(See Section 3.3
Names: the itemprop attribute of [[!MICRODATA]]).
- itemref
-
An additional attribute on an item that references additional elements containing property
definitions to be applied to the referencing item. The attribute value is an unordered
list of ID references to elements within the same document.
Available through the Microdata DOM API as
element.itemRef
.
(See Section 3.2 Items
of [[!MICRODATA]]).
- itemscope
-
A boolean attribute identifying an element as an item.
(See Section 3.2 Items
of [[!MICRODATA]]).
- itemtype
-
An additional attribute on an item used to specify the type of an item.
The specified type is also used to resolve non-URI names to absolute URIs.
Available through the Microdata DOM API as
element.itemType
.
(See Section 3.2 Items
of [[!MICRODATA]]).
- src
-
An attribute appropriate for use with
audio
, embed
, iframe
,
img
, source
, track
, or video
elements for creating invisible
properties.
Vocabulary Registry
In a perfect world, all processors would be able to generate the same output for a given input
without regard to the requirements of a particular vocabulary. However, Microdata doesn't
provide sufficient syntactic help in making these decisions. Different vocabularies have different
needs.
The registry associates a URI prefix with one or more key-value pairs denoting
processor behavior. A hypothetical JSON representation of such a registry might be the following:
This structure associates mappings for two URIs, http://schema.org/
and
http://microformats.org/profile/hcard
. Items having an itemtype with a URI
prefix from this registry use the rules described for that prefix within the scope of that
itemtype. This mapping currently defines two rules: propertyURI
and
multipleValues
with values to indicate specific behavior. The interpretation of these
rules is defined in the following sections. If an item has no current type or the
registry contains no URI prefix matching current type, a conforming
processor MUST use the default values defined for these rules.
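As a rough illustration of the lookup described above, the following Python sketch models the registry as a dictionary keyed by URI prefix and falls back to the defaults when no prefix matches. The rule values assigned to each prefix, and the names `REGISTRY`, `DEFAULT_RULES`, `find_vocabulary` and `rules_for`, are assumptions made for this example and are not defined by this specification.

```python
from typing import Dict, Optional

# Hypothetical registry contents: the rule values chosen for each prefix
# are illustrative assumptions, not values taken from this specification.
REGISTRY: Dict[str, Dict[str, str]] = {
    "http://schema.org/": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
    },
    "http://microformats.org/profile/hcard": {
        "propertyURI": "type",
        "multipleValues": "list",
    },
}

# Defaults used when an item has no type or no registered prefix matches.
DEFAULT_RULES: Dict[str, str] = {"propertyURI": "context", "multipleValues": "list"}


def find_vocabulary(item_type: Optional[str]) -> Optional[str]:
    """Return the registered URI prefix that item_type starts with, if any."""
    if not item_type:
        return None
    for prefix in REGISTRY:
        # Character-for-character match up to the length of the prefix.
        if item_type.startswith(prefix):
            return prefix
    return None


def rules_for(item_type: Optional[str]) -> Dict[str, str]:
    """Return the processing rules for item_type, falling back to defaults."""
    prefix = find_vocabulary(item_type)
    if prefix is None:
        return dict(DEFAULT_RULES)
    return {**DEFAULT_RULES, **REGISTRY[prefix]}
```

Because no vocabulary URI may be a prefix of another, at most one registered prefix can match a given type, so the iteration order of the dictionary does not affect the result.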
Richard Cyganiak has
pointed out that
"Registry" may be the wrong term, as the proposed registry doesn't assign identifiers or manage
namespaces; it simply provides a mapping between URI prefixes and processor behavior. He suggests the term
"Whitelist". As more than two values are required, and it describes more than binary behavior, this term
isn't appropriate either.
Any time we discuss maintaining such a database, there are issues surrounding update
frequency, URL naming, and how updates are authorized. This remains an open issue. This spec
just considers the semantic content of such a list and how it can be used to affect processing without
defining its representation or update policies.
The URL of the registry must be defined.
Property URI Generation
For property names which are not absolute URIs,
the propertyURI
rule defines the algorithm for generating an absolute URI
given an evaluation context including a current type and current
property.
The procedure for generating property URIs is defined in
Generate Predicate URI.
Possible values for propertyURI
are the following:
context
-
The
context
URI generation scheme guarantees that generated property URIs are
unique for each current type and current property combination. This is
required as the Microdata model requires that property names are associated with specific
items and do not have a global scope.
type
-
The
type
URI generation scheme appends property names that are not
absolute URIs to current type using a "#" separator.
vocabulary
-
The
vocabulary
URI generation scheme appends property names that are not
absolute URIs to the URI prefix.
The default value of propertyURI
is context
.
Value Ordering
For items having multiple values for a property,
the multipleValues
rule defines the algorithm for serializing these values.
This is required as the Microdata data model requires that values be strictly ordered as defined in
Microdata DOM API
as element.itemValue
. However, many RDF vocabularies expect multiple values to be generated
as triples sharing a common subject and predicate.
Possible values for multipleValues
are the following:
unordered
-
Values are serialized without ordering using a common subject and predicate.
list
-
Multi-valued itemprops are serialized using an RDF Collection.
The default value of multipleValues
is list
.
Algorithm
Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]]
for the treatment of items.
Algorithm Terms
- absolute URI
-
As defined in [[!RFC3986]], an absolute URI contains both scheme and scheme-specific-parts.
- blank node
-
A blank node is a node in a graph that is neither a URI reference nor a literal.
Items without a global identifier have a blank node allocated to them.
(See [[RDF-CONCEPTS]]).
- document base
-
The base address of the document being processed, as defined in Section 2.6.3 Resolving URLs of
[[!HTML5]].
- evaluation context
-
A data structure including the following elements:
- memory
-
a mapping of items to subjects, initially empty
- current property
-
an absolute URI for the current property, used for generating URIs
for properties of items without an explicit itemtype.
- current type
-
an absolute URI for the current type, used when an item does not
contain an explicit itemtype
- current vocabulary
-
an absolute URI for the current vocabulary, from the registry
- item
-
An item is defined as an element containing an itemscope attribute. (See Section 3.2 Items of
[[!MICRODATA]]).
- item properties
-
The mechanism for finding the properties of an item are described in
Section 3.5
Associating names with items of [[!MICRODATA]].
Available through the Microdata DOM API as
element.properties
.
- global identifier
-
The value of an item's itemid attribute, if it has one. (See Section 3.2 Items of
[[!MICRODATA]]).
- literal
-
Literals are values such as strings and dates, including typed literals and
plain literals.
(See [[RDF-CONCEPTS]]).
- property names
-
The tokens of an element's itemprop attribute.
(See Section 3.3 Names: the
itemprop attribute of [[!MICRODATA]]).
- property value
-
The property value of a name-value pair added by an element with an itemprop
attribute depends on the element.
Available through the Microdata DOM API as
element.itemValue
.
(Updated from Section
3.4 Values of [[!MICRODATA]]).
- If the element also has an itemscope attribute
-
The value is the item created by the element as a URI reference or
blank node
- If the element is a
meta
element
-
The value is the plain literal created from the value of the element's content
attribute, if any, or the empty string if there is no such attribute.
If the language of the element is known it MUST be used when creating the plain literal.
-
If the element is an
audio
, embed
, iframe
, img
,
source
, track
, or video
element with a src attribute
-
The value is a URI reference that results from resolving the value of the element's
src attribute relative to the element at the time the attribute is set.
-
If the element is an
a
, area
, or link
element with an
href attribute
-
The value is a URI reference that results from resolving the value of the element's
href attribute relative to the element at the time the attribute is set.
- If the element is an
object
element with a data attribute
-
The value is a URI reference that results from resolving the value of the element's
data attribute relative to the element at the time the attribute is set.
- If the element is a
time
element with a datetime attribute
The time
element will likely be replaced with something more general purpose.
-
-
If the value has the lexical form of xsd:date [[!RDF-SCHEMA]].
-
The value is a typed literal composed of the value and
http://www.w3.org/2001/XMLSchema#date
.
-
If the value has the lexical form of xsd:time [[!RDF-SCHEMA]].
-
The value is a typed literal composed of the value and
http://www.w3.org/2001/XMLSchema#time
.
-
If the value has the lexical form of xsd:dateTime [[!RDF-SCHEMA]].
-
The value is a typed literal composed of the value and
http://www.w3.org/2001/XMLSchema#dateTime
.
- Otherwise
- The value is a plain literal created from the value.
- Otherwise
-
The value is a plain literal, with the language information set from the language of the
element, if it is not unknown.
- top-level item
-
An item which does not contain an itemprop attribute.
Available through the Microdata DOM API as
document.getItems
.
(See Section 3.5
Associating names with items of [[!MICRODATA]]).
- URI reference
-
URI references are suitable to be used in subject, predicate or object positions
within an RDF triple, as opposed to a literal value that may contain a string representation of a
URI. (See [[RDF-CONCEPTS]]).
- vocabulary
-
A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop
value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not
allowed to be a prefix of another vocabulary URI.
This definition differs from the language in the HTML spec and is just for the purpose of this
document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one
specification defines ten
itemtypes, then these could be treated as one vocabulary or as ten
distinct vocabularies; it is entirely up to the vocabulary creator.
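The property value rules for a time element with a datetime attribute could be sketched as follows. This is a minimal Python illustration, assuming simplified regular expressions that check only the shape of the xsd:date, xsd:time and xsd:dateTime lexical forms (they do not validate ranges such as month or hour values). The function name `time_literal` and the tuple return shape are hypothetical.

```python
import re
from typing import Optional, Tuple

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified building blocks (assumption): shape only, no range checks.
_TZ = r"(?:Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"

# Checked in the same order as the definition above: date, time, dateTime.
_PATTERNS = [
    (re.compile(_DATE + _TZ), XSD + "date"),
    (re.compile(_TIME + _TZ), XSD + "time"),
    (re.compile(_DATE + "T" + _TIME + _TZ), XSD + "dateTime"),
]


def time_literal(value: str) -> Tuple[str, Optional[str]]:
    """Return (lexical value, datatype URI), or (value, None) for a plain literal."""
    for pattern, datatype in _PATTERNS:
        if pattern.fullmatch(value):
            return value, datatype
    # Otherwise: the value becomes a plain literal.
    return value, None
```

Since the three lexical forms are mutually exclusive, the order in which the patterns are tried does not change the outcome; it simply mirrors the order of the cases in the definition.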
RDF Conversion Algorithm
An HTML document containing Microdata MAY be converted to any other RDF-compatible document
format using the algorithm specified in this section.
The algorithm below is designed for DOM-based implementations with CSS selector access to elements.
A conforming Microdata processor implementing RDF conversion MUST implement a
processing algorithm that results in the equivalent triples that the following
algorithm generates:
Set item list to an empty list.
- For each element that is also a top-level item run the following algorithm:
-
Generate the triples for an item item, using the
evaluation context.
Let result be the (URI reference or blank node) subject returned.
-
Append result to item list.
-
If item list contains multiple values, generate an RDF Collection list from the ordered list of values.
Set value to the value returned from generate an RDF
Collection.
-
Otherwise, if item list contains a single value, set value to that value.
-
Generate the following triple:
- subject
- Document base
- predicate
http://www.w3.org/1999/xhtml/microdata#item
- object
- value
Generate the triples
When the user agent is to Generate triples for an item item, given an
Evaluation Context, it must run the following steps:
This algorithm has undergone substantial change from the original Microdata specification [[!MICRODATA]].
-
If there is an entry for item in memory, then let subject be the subject of
that entry. Otherwise, if item has a global identifier and that
global identifier is an absolute URI, let subject be that
global identifier. Otherwise, let subject be a new blank node.
- Add a mapping from item to subject in memory
-
If item has an itemtype attribute, extract the value as type.
- If type is an absolute URI, generate the following triple:
- subject
- subject
- predicate
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
- object
- type (as a URI reference)
-
If type is not an absolute URI, set it to current type from the
Evaluation Context if not empty.
- If the registry contains a URI prefix that is a character for character match of type
up to the length of the URI prefix, set vocab as that URI prefix.
-
Set property list to an empty mapping between properties and one or more ordered
values as established below.
-
For each element element that has one or more property names and is one of the
properties of the item item, in the order those elements
are given by the algorithm that returns the properties of the item,
run the following substep:
-
For each name in the element's property names, run the following substeps:
-
Let context be a copy of evaluation context with current type set
to type and current vocabulary set to vocab.
-
Let predicate be the result of generate predicate URI
using context and name. Update context by setting
current property to predicate.
-
Let value be the property value of element.
-
If value is an item, then generate the
triples for value using context. Replace value by the subject returned
from those steps.
-
Add value to property list for predicate.
-
For each predicate in property list:
- Generate property values using a copy of
evaluation context with current property set to predicate and
current vocabulary set to vocab along with subject and
the list of values associated with predicate from property list as values.
- Return subject
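The first two steps above, which choose a subject for an item and record it in memory, might look like the following Python sketch. It assumes that items can be used as dictionary keys and that blank nodes are represented as strings of the form `_:bN`; both choices, along with the name `subject_for`, are illustrative and not part of this specification.

```python
import re
from itertools import count
from typing import Dict, Hashable, Optional

# An absolute URI starts with a scheme followed by a colon (RFC 3986).
_ABSOLUTE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")
_bnode_ids = count()


def subject_for(item: Hashable, itemid: Optional[str], memory: Dict[Hashable, str]) -> str:
    """Return the subject for item, reusing the entry in memory if one exists."""
    if item in memory:
        return memory[item]
    if itemid and _ABSOLUTE.match(itemid):
        # The global identifier is an absolute URI: use it as the subject.
        subject = itemid
    else:
        # No usable global identifier: allocate a new blank node.
        subject = f"_:b{next(_bnode_ids)}"
    memory[item] = subject
    return subject
```

Recording the mapping before any properties are processed means that an item reached a second time, for example through itemref, is given the same subject rather than a fresh blank node.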
Generate Predicate URI
Predicate URI generation makes use of current type, current property
and current vocabulary from an evaluation context context
along with name.
- If name is an absolute URI, return name
as a URI reference.
- Otherwise, if current vocabulary from context is not null
and registry has an entry for current vocabulary having a
propertyURI entry that is not null, set that as method. Otherwise,
set method to
context
.
- If method is
vocabulary
return the URI reference constructed
by appending the fragment escaped value of name to current vocabulary.
- If method is
type
, return the URI reference constructed as follows:
- Let s be current type from context.
- If s does not contain a U+0023 NUMBER SIGN character (#),
then append a U+0023 NUMBER SIGN character (#) to s.
- Return the concatenation of s
and the fragment-escaped value of name as a URI reference.
- Otherwise, if current type from context is null, return the
URI reference constructed as follows:
- Let s be document base.
- If s does not contain a U+0023 NUMBER SIGN character (#),
then append a U+0023 NUMBER SIGN character (#) to s.
- Return the concatenation of s
and the fragment-escaped value of name as a URI reference.
- Otherwise, return the URI reference constructed as follows:
- Let s be current type from context.
- If the last character of s is not a U+003A COLON character (:),
append a U+0025 PERCENT SIGN character (%), a U+0032 DIGIT TWO character (2), and a U+0030 DIGIT ZERO
character (0) to s.
- Append the fragment-escaped value of name to s.
- Return the concatenation of
http://www.w3.org/1999/xhtml/microdata#
and the fragment-escaped value of s as a URI reference.
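The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not a normative implementation: `fragment_escape` only approximates the HTML fragment-escape operation using `urllib.parse.quote` with a chosen safe set, the registry is assumed to be a dictionary keyed by vocabulary URI, and the function and parameter names are hypothetical.

```python
import re
from typing import Dict, Optional
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"
_ABSOLUTE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def fragment_escape(value: str) -> str:
    # Approximation (assumption): percent-encode everything outside this safe set.
    return quote(value, safe="!$&'()*+,/:;=?@")


def _with_hash(uri: str) -> str:
    # Append '#' only if the URI does not already contain one.
    return uri if "#" in uri else uri + "#"


def generate_predicate_uri(
    name: str,
    current_type: Optional[str],
    current_vocabulary: Optional[str],
    document_base: str,
    registry: Dict[str, Dict[str, str]],
) -> str:
    if _ABSOLUTE.match(name):
        return name
    method = "context"  # default propertyURI rule
    if current_vocabulary is not None:
        method = registry.get(current_vocabulary, {}).get("propertyURI") or "context"
    if method == "vocabulary":
        return current_vocabulary + fragment_escape(name)
    # Guard (assumption): the type scheme needs a current type to append to.
    if method == "type" and current_type is not None:
        return _with_hash(current_type) + fragment_escape(name)
    if current_type is None:
        return _with_hash(document_base) + fragment_escape(name)
    # Contextual scheme: combine type and name, then escape the whole string.
    s = current_type
    if not s.endswith(":"):
        s += "%20"
    s += fragment_escape(name)
    return MICRODATA_NS + fragment_escape(s)
```

Note that in the contextual branch the combined string is escaped a second time, so the `%20` separator is itself percent-encoded in the resulting URI under this approximation.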
Generate Property Values
Property value serialization makes use of current vocabulary
from an evaluation context context along with subject
and values.
- Let predicate be current property from context.
- If current vocabulary from context is not null
and registry has an entry for current vocabulary having a
multipleValues
entry that is not null, set that as method. Otherwise,
set method to list
.
- If method is
unordered
, for each value in values, generate
the following triple:
- subject
- subject
- predicate
- predicate
- object
- value
- Otherwise, if method is
list
:
- If values contains multiple values, generate an RDF Collection list from the ordered list of values.
Set value to the value returned from generate an RDF
Collection.
-
Otherwise, if values contains a single value, set value to that value.
-
Generate the following triple:
- subject
- subject
- predicate
- predicate
- object
- value
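A minimal Python sketch of the two serialization methods follows. Triples are modelled as plain tuples of strings, and the `make_collection` parameter is a hypothetical stand-in for the Generate RDF Collection procedure: it is assumed to return the head node of the collection together with the triples that describe it.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
# Assumed shape of the collection builder: values -> (head node, triples).
CollectionBuilder = Callable[[List[str]], Tuple[str, List[Triple]]]


def generate_property_values(
    subject: str,
    predicate: str,
    values: List[str],
    method: str,
    make_collection: CollectionBuilder,
) -> List[Triple]:
    """Serialize values for one predicate using the given multipleValues method."""
    if not values:
        return []
    if method == "unordered":
        # One triple per value, sharing subject and predicate.
        return [(subject, predicate, value) for value in values]
    # method == "list": a single triple whose object is the value or a collection.
    triples: List[Triple] = []
    if len(values) > 1:
        head, collection_triples = make_collection(values)
        triples.extend(collection_triples)
        obj = head
    else:
        obj = values[0]
    triples.append((subject, predicate, obj))
    return triples
```

Passing the collection builder in as an argument keeps this sketch independent of how blank nodes are allocated, which is left to the caller.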
Generate RDF Collection
An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (See Section 5.2 RDF Collections in
[[!RDF-SCHEMA]]). As the RDF data-model is that of an unordered graph, a linking method using properties
rdf:first
and rdf:rest
is required to be able to specify a particular order.
In the Microdata to RDF mapping, RDF Collections are used when an item has more than one value
associated with a given property to ensure that the original document order is maintained. The following
procedure should be used to generate triples when an item property has more than one value
(contained in list):
-
Create a new array array containing a blank node for every value in list.
-
For each pair of bnode and value from array and list the following
triple is generated:
- subject
- bnode
- predicate
http://www.w3.org/1999/02/22-rdf-syntax-ns#first
- object
- value
-
For each bnode in array the following triple is generated:
- subject
- bnode
- predicate
http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
- object
-
next element in array or, if that does not exist,
http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
-
Return the first blank node from array.
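The procedure above could be written in Python roughly as follows. Blank nodes are represented here as strings of the form `_:cN`, and the function returns both the head node and the generated triples; these representation choices and the name `generate_rdf_collection` are assumptions made for the example.

```python
from itertools import count
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
Triple = Tuple[str, str, str]
_bnode_ids = count()


def generate_rdf_collection(values: List[str]) -> Tuple[str, List[Triple]]:
    """Return (first blank node, triples) linking values in their original order."""
    if not values:
        # The procedure is only defined for properties that have values.
        raise ValueError("cannot build a collection from an empty list")
    # One blank node per value.
    array = [f"_:c{next(_bnode_ids)}" for _ in values]
    triples: List[Triple] = []
    # rdf:first points each blank node at its value.
    for bnode, value in zip(array, values):
        triples.append((bnode, RDF + "first", value))
    # rdf:rest points each blank node at the next one, or rdf:nil at the end.
    for index, bnode in enumerate(array):
        rest = array[index + 1] if index + 1 < len(array) else RDF + "nil"
        triples.append((bnode, RDF + "rest", rest))
    return array[0], triples
```

For a list of two values this produces four triples: two rdf:first triples followed by two rdf:rest triples, the last of which points at rdf:nil.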
Markup Examples
The Microdata example below expresses book information as an FRBR Work item.
Assuming that registry contains an entry for http://purl.org/vocab/frbr/core#
with propertyURI
set to vocabulary
,
this is equivalent to the following Turtle:
The following snippet of HTML has microdata for two people with the same address:
Assuming that registry contains an entry for http://microformats.org/profile/hcard
with propertyURI
set to type
,
it generates these triples expressed in Turtle: