HTML microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [[!RDF-CONCEPTS]] from an HTML document containing microdata.
This document is an experimental work in progress. The concepts described herein are intended to provide guidance for a possible future Working Group chartered to provide a Recommendation for this transformation. As a consequence, implementers of this specification, either producers or consumers, should note that it may change prior to any possible publication as a Recommendation.
This Working Draft is an update of the W3C Interest Group Note, published in March 2012. This update adds the Vocabulary Expansion feature to the conversion algorithm, in response to the evolution of vocabularies discussed on the Web Schemas Task Force of the Semantic Web Interest Group at W3C. The intention is to publish this draft as a new version of the Interest Group Note after gathering and incorporating community input.
This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. This specification describes transformation directly to RDF [[RDF-CONCEPTS]].
There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis. However, the HTML Data TF recommends the adoption of a single method of mapping in which every vocabulary is treated as if:
propertyURI is set to vocabulary
multipleValues is set to unordered
For background on the trade-offs between these options, see http://www.w3.org/wiki/Mapping_Microdata_to_RDF.
Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The HTML DOM is extended to provide an API for accessing microdata information, and the microdata specification defines how to generate a JSON representation from microdata markup.
Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.
Microdata's data model does not align neatly with RDF.
http://example.org/Cat can have both the property color and the property http://example.org/color, and these properties are semantically distinct under microdata. In RDF, all properties have IRIs.
@lang attributes could be used to provide datatype and language information for RDF data, this would be contrary to the microdata specification.
Thus, in some places the needs of RDF consumers violate requirements of the microdata specification. This specification highlights where such violations occur and the reasons for them.
This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URI prefixes with specific rules: itemtype values are matched against the registered URI prefixes to determine a vocabulary and, potentially, vocabulary-specific processing rules.
This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.
During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:
rdfs:range of a GoodRelations property indicates the datatype of the expected value, and GoodRelations processors will expect values to be cast to that type. Language information from the HTML needs to be captured as it is common that multiple values will be used to specify the same information in different languages.
http://schema.org/musicGroupMember, and an author might express more detail through an ad-hoc sub-property musicGroupMember/leadVocalist, having the URI http://schema.org/musicGroupMember/leadVocalist.
Decisions or open issues in the specification are tracked on the Task Force Issue Tracker. These include the following:
Vocabulary specific parsing for Microdata. This specification attempts to create generic rules for processing microdata with typical RDF vocabularies. A registry allows for exceptions to the default processing rules for certain well-known vocabularies.
Should Microdata-RDF generate XMLLiteral values. This issue has been closed with no change as this would violate microdata's data model.
Should the registry allow property datatype specification. The consensus is that datatypes are only derived from HTML semantics, so that only <time> values have a datatype other than plain.
Should the registry allow a name or URL to be used as an alias for itemid.
The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:
http://www.w3.org/ns/md#item mapping at all.
rdf:Seq, or place all values, whether or not multiple, into some form of collection.
More examples and explanatory information are available in [[MICRODATA-RDF-SUPPLEMENT]], which may be updated from time to time.
The microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted. The microdata DOM API provides methods and attributes for retrieving microdata from the HTML DOM.
For reference, attributes used for specifying and retrieving HTML microdata are listed here:
element.itemType on the element.
The item type is also used to resolve non-URL names to absolute URLs. Available through the Microdata DOM API as element.itemType. (See Items in [[!MICRODATA]]).
In RDF, it is common for people to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. This document assumes throughout that the following vocabulary prefixes have been defined:
dc: | http://purl.org/dc/terms/ |
md: | http://www.w3.org/ns/md# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs: | http://www.w3.org/2000/01/rdf-schema# |
rdfa: | http://www.w3.org/ns/rdfa# |
xsd: | http://www.w3.org/2001/XMLSchema# |
In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata does not provide sufficient syntactic help in making these decisions, and different vocabularies have different needs.
The registry is located at the namespace defined for microdata, http://www.w3.org/ns/md, in a variety of formats.
The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:
This structure associates mappings for two URIs, http://schema.org/ and http://microformats.org/profile/hcard. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. This mapping currently defines two rules, propertyURI and multipleValues, with values to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior.
The interpretation of these rules is defined in the following sections. If an item has no current type or the registry contains no URI prefix matching current type, a conforming processor MUST use the default values defined for these rules.
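The lookup described above can be sketched in Python as follows. This is a minimal illustration only, not a normative implementation: the REGISTRY contents, the DEFAULT_RULES name, the lookup_rules function and the choice to prefer the longest matching prefix are all assumptions made for the example.

```python
# Hypothetical registry contents; only the key names (propertyURI,
# multipleValues, properties) come from the text above.
REGISTRY = {
    "http://schema.org/": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
        "properties": {"tracks": {"multipleValues": "list"}},
    },
    "http://microformats.org/profile/hcard": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
    },
}

# Default values for the two rules, as stated in this specification.
DEFAULT_RULES = {"propertyURI": "vocabulary", "multipleValues": "unordered"}


def lookup_rules(current_type, registry=REGISTRY):
    """Return (uri_prefix, rules) for an item type, or (None, defaults)."""
    if current_type:
        # Assumption: when several prefixes match, prefer the longest one.
        for prefix in sorted(registry, key=len, reverse=True):
            if current_type.startswith(prefix):
                rules = dict(DEFAULT_RULES)
                rules.update(registry[prefix])
                return prefix, rules
    # No current type, or no matching URI prefix: use the defaults.
    return None, dict(DEFAULT_RULES)
```

For instance, lookup_rules("http://schema.org/MusicPlaylist") selects the http://schema.org/ entry, while an item with no type falls back to the default rules.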
The concept of a registry, including a hypothetical format, location and updating rules is presented as an abstract concept useful for describing the function of a microdata processor. There are issues surrounding update frequency, URL naming, and how updates are authorized. This spec just considers the semantic content of such a registry and how it can be used to affect processing without defining its representation or update policies.
For names which are not absolute URLs, the propertyURI rule defines the algorithm for generating an absolute URL given an evaluation context including a current type, current name and current vocabulary.
The procedure for generating property URIs is defined in Generate Predicate URI.
Possible values for propertyURI are the following:
contextual
The contextual URI generation scheme guarantees that generated property URIs are unique based on the value of current name. This is required as the microdata data model requires that names are associated with specific items and do not have a global scope. (See Step 5 in Generate Predicate URI.)
URI creation uses a base URI with query parameters to indicate the in-scope type and name list. Consider the following example:
The first name n generates the URI http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n.
However, the included name given-name is included in an untyped item. The inherited property URI is used to create a new property URI: http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n.given-name.
This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [[RDF-SYNTAX-GRAMMAR]], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:
Looking at another example:
This would generate http://www.w3.org/ns/md?type=http://schema.org/Person&prop=name.
vocabulary
The vocabulary URI generation scheme appends names that are not absolute URLs to the URI prefix. When generating property URIs, if the URI prefix does not end with a '/' or '#', a '#' is appended to the URI prefix. (See Step 4 in Generate Predicate URI.)
URI creation appends the name to the URI prefix. Consider the following example:
Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.
Looking at another example:
Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.
If the registry contains no match for current type, implementations act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (see [[!RFC3986]]).
Deconstructing the itemtype URL to create or identify a vocabulary URI is a violation of the microdata specification which is necessary to support the use of existing vocabularies designed for use with RDF, and shared or inherited properties within all vocabularies.
The default value of propertyURI is vocabulary.
In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://schema.org/. As there is no explicit propertyURI, the default vocabulary is used, and the resulting property URI would be http://schema.org/title.
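As a rough illustration of the default behavior just described, the following Python sketch derives a URI prefix from an item type and then applies the vocabulary scheme. The function names, the item type http://schema.org/Book used in the usage note, and the use of urllib.parse.quote as a stand-in for fragment-escaping are assumptions of this sketch, not part of this specification.

```python
from urllib.parse import quote, urlsplit

# Characters left unescaped; an approximation of HTML fragment-escaping.
_FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def default_uri_prefix(item_type):
    """Derive a URI prefix when the registry has no entry for item_type."""
    if "#" in item_type:
        # Strip the fragment content but keep the '#'.
        return item_type[: item_type.index("#") + 1]
    # No fragment: strip the last path segment but keep the trailing '/'.
    # Assumes the item type has at least one path segment.
    return item_type[: item_type.rfind("/") + 1]


def vocabulary_property_uri(name, uri_prefix):
    """Expand name using the 'vocabulary' scheme."""
    if urlsplit(name).scheme:
        return name  # treated as already being an absolute URL
    if not uri_prefix.endswith(("/", "#")):
        uri_prefix += "#"  # '#' is added as a separator
    return uri_prefix + quote(name, safe=_FRAGMENT_SAFE)
```

With a hypothetical item type of http://schema.org/Book, default_uri_prefix yields http://schema.org/, and the name title expands to http://schema.org/title, matching the result stated above.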
For items having multiple values for a given property, the multipleValues rule defines the algorithm for serializing these values. Microdata uses document order when generating property values, as defined in Microdata DOM API as element.itemValue. However, many RDF vocabularies expect multiple values to be generated as triples sharing a common subject and predicate. In some cases, it may be useful to retain value ordering.
The procedure for generating property values is defined in Generate Property Values.
Possible values for multipleValues are the following:
unordered
list
An example of how this might be specified in a registry is the following:
Additionally, some vocabularies may wish to specify this on a per-property basis. For example, within http://schema.org/MusicPlaylist the tracks property might depend on the order of values to reproduce associated MusicRecording values.
The properties key takes a JSON Object as a value, which in turn has keys for each property that is to be given alternate semantics. Each name is implicitly expanded to its URI representation as defined in Generate Predicate URI, so that the behavior is the same whether or not the name is listed as an absolute URL.
The default value of multipleValues is unordered.
An alternative mechanism would output both unordered and ordered values, to allow an application to choose the most useful representation. For example, consider the following:
This might generate the following Turtle:
By providing both _:track1 and _:track2 as object values of the playlist along with an RDF Collection containing the ordered values, the data may be queried via a simple query using the playlist subject, or as an ordered collection.
In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.
In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:
time element provides dates and times
Using information about the content of the document where the microdata is marked up might be a violation of the spirit of the microdata specification, though it does not explicitly say in normative text that consumers cannot use other information from the HTML DOM to interpret microdata.
Additionally, one possible use of a registry would allow vocabularies to be marked with datatype information, so that a dc:time value, for example, would be understood to represent a literal with datatype xsd:date. This could be done by adding information for each property in the vocabulary requiring special treatment.
This might be represented using a syntax such as the following:
The datatype identifies a URI to be used in constructing a typed literal.
In most cases, the relevant datatype for a value can be derived from knowledge of what property the value is for and the syntax of the value itself. Thus, values can be given datatypes in a post-processing step after the mapping of microdata to RDF described by this specification. However, where there is information in the HTML markup, such as knowledge of what element was used to mark up the value, which can help with determining its datatype, that information is used by this specification.
This concept is not explored further at this time, but could be developed further in a future revision of this document.
If property URI generation were fixed to vocabulary, multiple values always generated both unordered and ordered representations, and there were datatype support, the registry could be reduced to a simple list of URLs without any further structure necessary.
Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.
Vocabulary expansion generates additional triples based on the rules and property relationships described in the registry.
Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either http://www.w3.org/2002/07/owl#equivalentProperty or http://www.w3.org/2000/01/rdf-schema#subPropertyOf predicates.
For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
The previous example indicates a registry rule which causes the processor to emit an extra triple when first seeing the additionalType itemprop:
After performing vocabulary expansion, an additional rdf:type triple is generated:
Formally, and for the purpose of vocabulary processing, microdata uses a very restricted subset of the OWL2 vocabulary and is based on the RDF-Based Semantics of OWL2 [[!OWL2-RDF-BASED-SEMANTICS]]. Vocabulary Entailment uses the following terms:
rdfs:subPropertyOf
owl:equivalentProperty
Vocabulary Entailment considers only the entailment on individuals (i.e., not on the relationships that can be deduced on the properties or the classes themselves.)
While the formal definition of the Entailment refers to the general OWL 2 Semantics, practical implementations may rely on a subset of the OWL 2 RL Profile’s entailment expressed in rules (section 4.3 of [[!OWL2-PROFILES]]). The relevant rules are, using the rule identifications in section 4.3 of [[!OWL2-PROFILES]]: prp-spo1, prp-eqp1, and prp-eqp2.
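The three rules named above can be applied to a set of triples with a simple fixed-point loop. The following Python sketch is illustrative only: triples are modeled as plain 3-tuples of strings, and the function name expand and the sample data in the usage note are assumptions rather than anything defined by this specification.

```python
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"


def expand(triples):
    """Apply prp-spo1, prp-eqp1 and prp-eqp2 until no new triples appear."""
    graph = set(triples)
    while True:
        # Map each property to the properties it entails for individuals.
        implied = {}
        for s, p, o in graph:
            if p == RDFS_SUBPROPERTY_OF:
                implied.setdefault(s, set()).add(o)  # prp-spo1
            elif p == OWL_EQUIVALENT_PROPERTY:
                implied.setdefault(s, set()).add(o)  # prp-eqp1
                implied.setdefault(o, set()).add(s)  # prp-eqp2
        new = {(s, q, o) for s, p, o in graph for q in implied.get(p, ())}
        new -= graph
        if not new:
            return graph
        graph |= new
```

Given a triple declaring http://schema.org/additionalType to be an rdfs:subPropertyOf rdf:type and an item using additionalType, the result also contains the corresponding rdf:type triple. Only statements about individuals are added; no relationships between the properties themselves are inferred.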
[[RDFA-CORE]] implements a more complete form of vocabulary entailment, including retrieving the vocabulary URI to find additional class and property expansion definitions, as described in RDFa Vocabulary Entailment.
Microdata implementations MAY use RDFa Vocabulary Entailment as an alternative to implementing a separate entailment algorithm. To allow [[RDFA-CORE]] processors to be used for microdata vocabulary expansion, microdata acts as if there is an implicit @vocab RDFa attribute set to a detected vocabulary by emitting a triple using the rdfa:usesVocabulary predicate.
The entailment described in this section is the minimum useful level for microdata. Processors may, of course, choose to follow more powerful entailment regimes, e.g., full RDFS [[!RDF-MT]] or OWL2 [[!OWL2-OVERVIEW]] entailments. Using those entailments, applications may perform datatype validation by checking rdfs:range of a property, or use the advanced facilities offered by, e.g., OWL2’s property chains to interlink vocabularies further.
Conforming processors MUST perform the basic vocabulary expansion.
If vocabulary expansion is performed by the microdata processor using [[RDFA-CORE]] vocabulary expansion, and the vocab_expansion option is passed to the microdata processor, the full [[RDFA-CORE]] expansion MUST also be performed.
Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.
contextual property URI generation scheme. Without this scheme, this evaluation context component would not be required.
document.getItems method.
element.properties attribute.
a, area, audio, embed, iframe, img, link, object, source, track or video)
element.itemValue. (See relevant attribute descriptions in [[!HTML5]]).
time element.
element.itemValue.
http://www.w3.org/2001/XMLSchema#date.
http://www.w3.org/2001/XMLSchema#time.
http://www.w3.org/2001/XMLSchema#dateTime.
http://www.w3.org/2001/XMLSchema#gYearMonth.
http://www.w3.org/2001/XMLSchema#gYear.
http://www.w3.org/2001/XMLSchema#duration.
The referenced version of [[!HTML5]] does not include a duration data type, but it is in the Editor's Draft and is expected to be included in a forthcoming update to the Working Draft.
The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.
See The time element in [[!HTML5]].
See The lang and xml:lang attributes in [[!HTML5]] for determining the language of a node.
document.getItems. (See Associating names with items in [[!MICRODATA]]).
The HTML5/microdata content model for @href, @src, @data, itemtype, itemprop and itemid is that of a URL, not a URI or IRI.
A proposed mechanism for specifying the range of property values to be URI reference or IRI could allow these to be specified as subject or object using a @content attribute.
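To illustrate how the value of a time element might be matched to one of the XML Schema datatypes listed earlier in this section, the following Python sketch tests the lexical form of a value. The regular expressions are simplified approximations of the XML Schema lexical forms (they do not check value ranges such as month or hour limits), and the function name time_datatype is hypothetical; a value matching none of the forms is left as a plain literal.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

_TZ = r"(Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(\.\d+)?"

# (datatype local name, approximate lexical pattern), in the order listed above.
_PATTERNS = [
    ("date", _DATE + _TZ),
    ("time", _TIME + _TZ),
    ("dateTime", _DATE + "T" + _TIME + _TZ),
    ("gYearMonth", r"-?\d{4,}-\d{2}" + _TZ),
    ("gYear", r"-?\d{4,}" + _TZ),
    ("duration",
     r"-?P(?=\d|T\d)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"),
]


def time_datatype(value):
    """Return the datatype URI whose lexical form matches value, else None."""
    for local_name, pattern in _PATTERNS:
        if re.fullmatch(pattern, value):
            return XSD + local_name
    return None  # no match: treat the value as a plain literal
```

Under these patterns a yearless date such as 03-18 matches nothing, which is consistent with the note above that such values are not converted.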
An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.
A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:
Set item list to an empty list.
http://www.w3.org/ns/md#item
When the user agent is to Generate triples for an item item, given evaluation context, it must run the following steps:
This algorithm has undergone substantial change from the original microdata specification [[!MICRODATA]].
element.itemType of the element defining the item.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
element.itemType of the element defining the item.
http://www.w3.org/ns/rdfa#usesVocabulary
Predicate URI generation makes use of current type, current name, and current vocabulary from an evaluation context context along with name.
http://example.org/doc and an itemprop of 'title', a URI will be constructed to be http://example.org/doc#title.
vocabulary.
vocabulary set expandedURI to the URI reference constructed by appending the fragment-escaped value of name to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
contextual, set expandedURI to the URI reference constructed as follows:
http://www.w3.org/ns/md?type= is a prefix of s, return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
http://www.w3.org/ns/md?type=, the fragment-escaped value of current type, the string &prop=, and the fragment-escaped value of name.
equivalentProperty key, generate the following triple using the value of that key:
http://www.w3.org/2002/07/owl#equivalentProperty
If the value is an array, generate a triple for each value of that array.
subPropertyOf key, generate the following triple using the value of that key:
http://www.w3.org/2000/01/rdf-schema#subPropertyOf
If the value is an array, generate a triple for each value of that array.
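The contextual branch of the steps above can be sketched in Python as follows. This is a non-normative illustration: it assumes that s refers to the current name from the evaluation context, and it approximates fragment-escaping with urllib.parse.quote. The names contextual_property_uri and fragment_escape are introduced only for this example.

```python
from urllib.parse import quote

MD_TYPE_BASE = "http://www.w3.org/ns/md?type="

# Characters left unescaped; an approximation of HTML fragment-escaping.
_FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def fragment_escape(value):
    return quote(value, safe=_FRAGMENT_SAFE)


def contextual_property_uri(name, current_type, current_name=None):
    """Build a property URI using the 'contextual' scheme."""
    # Assumption: s in the steps above is the current name.
    if current_name and current_name.startswith(MD_TYPE_BASE):
        # Nested, untyped item: extend the inherited property URI.
        return current_name + "." + fragment_escape(name)
    return (MD_TYPE_BASE + fragment_escape(current_type)
            + "&prop=" + fragment_escape(name))
```

Calling contextual_property_uri("n", "http://microformats.org/profile/hcard") and then passing the result as current_name for given-name reproduces the pattern of a base URI followed by a dot-separated name list.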
Property value serialization makes use of subject, predicate and values.
properties which has a JSON Object value, let properties be that value. Otherwise, set properties to null.
multipleValues, set that as method.
multipleValues, set that as method.
unordered.
unordered, for each value in values, generate the following triple:
list:
An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (See RDF Collections in [[!RDF-SCHEMA]]). As the RDF data-model is that of an unordered graph, a linking method using properties rdf:first and rdf:rest is required to be able to specify a particular order.
In the microdata to RDF mapping, RDF Collections are used when an item has more than one value associated with a given property to ensure that the original document order is maintained. The following procedure should be used to generate triples when an item property has more than one value (contained in list):
http://www.w3.org/1999/02/22-rdf-syntax-ns#first
http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
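The two serialization methods can be sketched in Python as shown below. This is an illustration under stated assumptions, not the normative procedure: triples are modeled as 3-tuples of strings, blank node labels are produced by a simple counter, and a property with a single value is emitted as a plain triple even when the method is list, following the statement above that collections are used when there is more than one value.

```python
from itertools import count

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF_NS + "first"
RDF_REST = RDF_NS + "rest"
RDF_NIL = RDF_NS + "nil"

_bnode_ids = count()  # hypothetical blank node label generator


def generate_property_values(subject, predicate, values, method="unordered"):
    """Return triples for values using the unordered or list method."""
    values = list(values)
    if method != "list" or len(values) < 2:
        # unordered: one triple per value, sharing subject and predicate.
        return [(subject, predicate, value) for value in values]
    # list: chain blank nodes with rdf:first / rdf:rest, ending in rdf:nil.
    nodes = ["_:b%d" % next(_bnode_ids) for _ in values]
    triples = [(subject, predicate, nodes[0])]
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, value))
        triples.append((nodes[i], RDF_REST, rest))
    return triples
```

With two values and the list method this produces five triples: one linking the subject to the head of the collection, plus an rdf:first and rdf:rest triple for each value.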
A test suite [[MICRODATA-RDF-TESTS]] is under development to help processor developers verify conformance to this specification.
The microdata example below expresses book information as an FRBR Work item.
Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core# with propertyURI set to vocabulary, this is equivalent to the following Turtle:
The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.
Assuming that the registry contains an entry for http://microformats.org/profile/hcard with propertyURI set to vocabulary, it generates these triples expressed in Turtle:
The following snippet of HTML has microdata for a playlist, and illustrates overriding a property to place elements in an RDF Collection. This also illustrates the use of the schema:additionalType property to relate recordings to the Music Ontology:
Assuming that the registry contains an entry for http://schema.org/ with propertyURI set to vocabulary, multipleValues set to unordered, and the properties track and byArtist having multipleValues set to list, it generates these triples expressed in Turtle:
The following is an example registry in JSON format.
{
  "http://schema.org/": {
    "propertyURI": "vocabulary",
    "multipleValues": "unordered",
    "properties": {
      "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"},
      "blogPosts": {"multipleValues": "list"},
      "breadcrumb": {"multipleValues": "list"},
      "byArtist": {"multipleValues": "list"},
      "creator": {"multipleValues": "list"},
      "episode": {"multipleValues": "list"},
      "episodes": {"multipleValues": "list"},
      "event": {"multipleValues": "list"},
      "events": {"multipleValues": "list"},
      "founder": {"multipleValues": "list"},
      "founders": {"multipleValues": "list"},
      "itemListElement": {"multipleValues": "list"},
      "musicGroupMember": {"multipleValues": "list"},
      "performerIn": {"multipleValues": "list"},
      "actor": {"multipleValues": "list"},
      "actors": {"multipleValues": "list"},
      "performer": {"multipleValues": "list"},
      "performers": {"multipleValues": "list"},
      "producer": {"multipleValues": "list"},
      "recipeInstructions": {"multipleValues": "list"},
      "season": {"multipleValues": "list"},
      "seasons": {"multipleValues": "list"},
      "subEvent": {"multipleValues": "list"},
      "subEvents": {"multipleValues": "list"},
      "track": {"multipleValues": "list"},
      "tracks": {"multipleValues": "list"}
    }
  },
  "http://microformats.org/profile/hcard": {
    "propertyURI": "vocabulary",
    "multipleValues": "unordered"
  },
  "http://microformats.org/profile/hcalendar#": {
    "propertyURI": "vocabulary",
    "multipleValues": "unordered",
    "properties": {
      "categories": {"multipleValues": "list"}
    }
  }
}
Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.