Microdata to RDF

Abstract

HTML microdata [MICRODATA] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [RDF-CONCEPTS] from an HTML document containing microdata.

1. Introduction

This section is non-normative.

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [MICRODATA] is an extension to HTML used to embed machine-readable data into HTML documents. This specification describes transformation directly to RDF [RDF-CONCEPTS].

Note

There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis. However, the HTML Data TF recommends the adoption of a single method of mapping in which every vocabulary is treated as if:

propertyURI is set to vocabulary
multipleValues is set to unordered

For background on the trade-offs between these options, see http://www.w3.org/wiki/Mapping_Microdata_to_RDF.

1.1 Background

This section is non-normative.

Microdata [MICRODATA] is a way of embedding data in HTML documents using attributes. The HTML DOM is extended to provide an API for accessing microdata information, and the microdata specification defines how to generate a JSON representation from microdata markup.

Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.

Microdata's data model does not align neatly with RDF.

Non-URL microdata properties are disambiguated based on microdata item type; an item with the type http://example.org/Cat can have both the property color and the property http://example.org/color, and these properties are semantically distinct under microdata. In RDF, all properties have IRIs.
When an item has multiple properties with the same name, the values are always ordered; in RDF, property values are unordered unless they are explicitly listed in an RDF Collection.
A value in microdata is always a simple string which is interpreted by the consuming application. In RDF, values can be tagged with a datatype or a language. According to the microdata specification, the HTML context of microdata markup should not change how microdata is interpreted, so although element names and HTML @lang attributes could be used to provide datatype and language information for RDF data, this would be contrary to the microdata specification.

Thus, in some places the needs of RDF consumers violate requirements of the microdata specification. This specification highlights where such violations occur and the reasons for them.

This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URIs with specific rules: itemtype values are matched against registered URI prefixes to determine a vocabulary and, potentially, vocabulary-specific processing rules.

This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.

1.2 Use Cases

This section is non-normative.

During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:

Semantic search engines such as Sindice use RDF as their backend data model. They want to gather information expressed using microdata alongside information expressed in RDF-based formats and make it available to others to use, as a service. In these cases, the ultimate consumer, who will need to understand the vocabularies used within the microdata, is the program or person who pulls out data from Sindice. Sindice needs to retain the distinctions in the original microdata (e.g. ordering of items) and might not have built-in knowledge about the vocabulary of interest to the ultimate consumer. In this case, the ultimate consumer is likely to have to map/validate/handle errors in the data they get from Sindice.
A consumer such as openelectiondata.org wants to support microdata-based markup of their vocabulary as well as RDFa-based markup, both going into an RDF-based data store. They want to use an off-the-shelf tool to extract the microdata. They want to configure the tool to give them the RDF that is appropriate for their known vocabulary.
A browser plugin that captures data for the user uses an RDF model as its backend store. Any time it encounters microdata on a page, it wants to pull that microdata into the store on the fly.
GoodRelations properties do not take rdf:List values; when they take multiple values they are unordered. The rdfs:range of a GoodRelations property indicates the datatype of the expected value, and GoodRelations processors will expect values to be cast to that type. Language information from the HTML needs to be captured as it is common that multiple values will be used to specify the same information in different languages.
Schema.org has an extension mechanism to allow authors to express information that is more detailed than the pre-defined types, properties and enumerations. Property URIs are all in the same flat namespace as types, but authors can add more detail by appending a '/' and a further segment to the type or property. For example, schema.org defines a musicGroupMember property having a URI of http://schema.org/musicGroupMember, and an author might express more detail through an ad-hoc sub-property musicGroupMember/leadVocalist, having the URI http://schema.org/musicGroupMember/leadVocalist.

1.3 Issues

This section is non-normative.

Decisions or open issues in the specification are tracked on the Task Force Issue Tracker. These include the following:

Issue 1

Vocabulary specific parsing for Microdata. This specification attempts to create generic rules for processing microdata with typical RDF vocabularies. A registry allows for exceptions to the default processing rules for certain well-known vocabularies.

Issue 2

Should Microdata-RDF generate XMLLiteral values. This issue has been closed with no change as this would violate microdata's data model.

Issue 3

Should the registry allow property datatype specification. The consensus is that datatypes are only derived from HTML semantics, so that only <time> values have a datatype other than plain.

Issue 4

Should the registry allow a name or URL to be used as an alias for itemid.

The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:

Property URI generation using the original microdata specification with a base URI and fragment made up of the in-scope item type and properties.
Vocabulary-based URI generation, where the vocabulary is determined from the in-scope item type, either through an algorithmic modification of the type URL or by matching the URL against a registry. The vocabulary URI is then used to generate property URIs in a namespace parallel to the type URI.
When there are multiple top-level items in a document, place items in an RDF Collection. Alternatively, simply list the items as multiple values, or do not generate an http://www.w3.org/ns/md#item mapping at all.
When an item has multiple values for a given property, place the values in an RDF Collection. Alternatively, do not use collections, use an alternative such as rdf:Seq, or place all values, whether or not multiple, into some form of collection.

More examples and explanatory information are available in [MICRODATA-RDF-SUPPLEMENT], which may be updated from time to time.

3. Vocabulary Registry

This section is non-normative.

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry is located at the namespace defined for microdata: http://www.w3.org/ns/md in a variety of formats.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:

Example 1

{
  "http://schema.org/": {
    "propertyURI":    "vocabulary",
    "multipleValues": "unordered",
    "properties": {
      "tracks": {"multipleValues": "list"}
    }
  },
  "http://microformats.org/profile/hcard": {
    "propertyURI":    "vocabulary",
    "multipleValues": "list",
    "properties": {
      "url": {"multipleValues": "unordered"}
    }
  }
}

This structure associates mappings for two URIs, http://schema.org/ and http://microformats.org/profile/hcard. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. This mapping currently defines two rules, propertyURI and multipleValues, with values to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no current type or the registry contains no URI prefix matching current type, a conforming processor must use the default values defined for these rules.
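The lookup described above can be sketched in Python. This is a minimal illustration, not a normative algorithm: the in-memory REGISTRY dictionary, the DEFAULTS table and the function names find_entry and rule are assumptions made for this example, and prefix matching is done with a plain string comparison.

```python
# Hypothetical in-memory registry mirroring the JSON structure of Example 1.
REGISTRY = {
    "http://schema.org/": {
        "propertyURI": "vocabulary",
        "multipleValues": "unordered",
        "properties": {"tracks": {"multipleValues": "list"}},
    },
    "http://microformats.org/profile/hcard": {
        "propertyURI": "vocabulary",
        "multipleValues": "list",
        "properties": {"url": {"multipleValues": "unordered"}},
    },
}

# Default rule values as stated in sections 3.1 and 3.2.
DEFAULTS = {"propertyURI": "vocabulary", "multipleValues": "unordered"}


def find_entry(item_type, registry=REGISTRY):
    """Return (prefix, entry) for the registered URI prefix matching item_type.

    Returns (None, {}) when the item has no type or nothing matches.
    """
    if item_type:
        for prefix, entry in registry.items():
            if item_type.startswith(prefix):
                return prefix, entry
    return None, {}


def rule(item_type, name, key, registry=REGISTRY):
    """Resolve a rule: per-property override, then vocabulary value, then default."""
    _, entry = find_entry(item_type, registry)
    override = entry.get("properties", {}).get(name, {})
    if key in override:
        return override[key]
    return entry.get(key, DEFAULTS[key])
```

With this sketch, rule("http://schema.org/MusicPlaylist", "tracks", "multipleValues") resolves to the per-property override, while an unregistered type falls back to the defaults.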

Note

The concept of a registry, including a hypothetical format, location and updating rules is presented as an abstract concept useful for describing the function of a microdata processor. There are issues surrounding update frequency, URL naming, and how updates are authorized. This spec just considers the semantic content of such a registry and how it can be used to affect processing without defining its representation or update policies.

3.1 Property URI Generation

This section is non-normative.

For names which are not absolute URLs, the propertyURI rule defines the algorithm for generating an absolute URL given an evaluation context including a current type, current name and current vocabulary.

The procedure for generating property URIs is defined in Generate Predicate URI.

Possible values for propertyURI are the following:

contextual

The contextual URI generation scheme guarantees that generated property URIs are unique based on the value of current name. This is required as the microdata data model requires that names are associated with specific items and do not have a global scope. (See Step 5 in Generate Predicate URI).

URI creation uses a base URI with query parameters to indicate the in-scope type and name list. Consider the following example:

Example 2

<span itemscope itemtype="http://microformats.org/profile/hcard">
  <span itemprop="n" itemscope>
    <span itemprop="given-name">
      Princeton
    </span>
  </span>
</span>

The first name n generates the URI http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n. However, the nested name given-name belongs to an untyped item. The inherited property URI is used to create a new property URI: http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&prop=n.given-name.

This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [RDF-SYNTAX-GRAMMAR], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:

Example 3

<rdf:Description xmlns:hcard="http://www.w3.org/ns/md?type=http://microformats.org/profile/hcard&amp;prop="
                 rdf:type="http://microformats.org/profile/hcard">
  <hcard:n>
    <rdf:Description>
      <hcard:n.given-name>
        Princeton
      </hcard:n.given-name>
    </rdf:Description>
  </hcard:n>
</rdf:Description>

Looking at another example:

Example 4

<div itemscope itemtype="http://schema.org/Person">
  <h2 itemprop="name">Jeni</h2>
</div>

This would generate http://www.w3.org/ns/md?type=http://schema.org/Person&prop=name.
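The contextual scheme can be sketched as follows. The normative steps live in Generate Predicate URI, which is not reproduced here, so this Python fragment is an assumption inferred from the examples above: it uses the "&prop=" separator shown in Example 4, joins nested names with a full stop, and omits fragment-escaping for brevity. The function name contextual_property_uri is hypothetical.

```python
MD_BASE = "http://www.w3.org/ns/md?type="


def contextual_property_uri(current_type, current_name, name):
    """Build a contextual property URI (sketch; no fragment-escaping applied).

    current_name is the property URI inherited from the enclosing item, or
    None when the item carrying the property has its own item type.
    """
    if current_name and current_name.startswith(MD_BASE):
        # Property of an untyped nested item: extend the inherited URI.
        return current_name + "." + name
    # Property of a typed item: start a new URI from the in-scope type.
    return MD_BASE + current_type + "&prop=" + name


hcard = "http://microformats.org/profile/hcard"
n_uri = contextual_property_uri(hcard, None, "n")
given_name_uri = contextual_property_uri(hcard, n_uri, "given-name")
```

Here n_uri identifies the n property of the hCard item, and given_name_uri extends it for the untyped nested item.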

vocabulary

The vocabulary URI generation scheme appends names that are not absolute URLs to the URI prefix. When generating property URIs, if the URI prefix does not end with a '/' or '#', a '#' is appended to the URI prefix. (See Step 4 in Generate Predicate URI.)

Consider the following example:

Example 5

<span itemscope itemtype="http://microformats.org/profile/hcard">
  <span itemprop="n" itemscope>
    <span itemprop="given-name">
      Princeton
    </span>
  </span>
</span>

Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.

Looking at another example:

Example 6

<div itemscope itemtype="http://schema.org/Person">
  <h2 itemprop="name">Jeni</h2>
</div>

Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.

If the registry contains no match for current type, implementations act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment. (See [RFC3986].)

Note

Deconstructing the itemtype URL to create or identify a vocabulary URI is a violation of the microdata specification which is necessary to support the use of existing vocabularies designed for use with RDF, and shared or inherited properties within all vocabularies.

The default value of propertyURI is vocabulary.

Example 7

<div itemscope itemtype="http://schema.org/Book">
  <h2 itemprop="title">Just a Geek</h2>
</div>

In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://schema.org/. As there is no explicit propertyURI, the default vocabulary is used, and the resulting property URI would be http://schema.org/title.
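The vocabulary scheme and the fallback prefix derivation can be illustrated with a short Python sketch. The function names derive_uri_prefix and vocabulary_property_uri are hypothetical, and treating any name with a URI scheme as an absolute URL is a simplifying assumption of this example.

```python
from urllib.parse import urlsplit, urlunsplit


def derive_uri_prefix(item_type):
    """Derive a URI prefix from an itemtype when the registry has no match.

    Strips the fragment content if there is a fragment, otherwise strips
    the last path segment.
    """
    parts = urlsplit(item_type)
    if parts.fragment:
        base = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
        return base + "#"
    path = parts.path[: parts.path.rfind("/") + 1]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def vocabulary_property_uri(uri_prefix, name):
    """Append a name to the URI prefix, adding '#' when no separator is present."""
    if urlsplit(name).scheme:
        # Assumed test for "absolute URL": such names are used unchanged.
        return name
    if not uri_prefix.endswith(("/", "#")):
        uri_prefix += "#"
    return uri_prefix + name


prefix = derive_uri_prefix("http://schema.org/Book")
title_uri = vocabulary_property_uri(prefix, "title")
```

Applied to Example 7, prefix is the schema.org namespace and title_uri is the property URI given in the text above.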

3.2 Value Ordering

This section is non-normative.

For items having multiple values for a given property, the multipleValues rule defines the algorithm for serializing these values. Microdata uses document order when generating property values, as defined in Microdata DOM API as element.itemValue. However, many RDF vocabularies expect multiple values to be generated as triples sharing a common subject and predicate. In some cases, it may be useful to retain value ordering.

The procedure for generating property values is defined in Generate Property Values.

Possible values for multipleValues are the following:

unordered: Values are serialized without ordering using a common subject and predicate. (See Step 7 in Generate Property Values).
list: Multi-valued itemprops are serialized using an RDF Collection. (See Step 8 in Generate Property Values).

An example of how this might be specified in a registry is the following:

Example 8

{
  "http://schema.org/": {
    "propertyURI":    "vocabulary",
    "multipleValues": "unordered"
  },
  "http://microformats.org/profile/hcard": {
    "propertyURI":    "vocabulary",
    "multipleValues": "list"
  }
}

Additionally, some vocabularies may wish to specify this on a per-property basis. For example, within http://schema.org/MusicPlaylist the tracks property might depend on the order of values to reproduce associated MusicRecording values.

Example 9

{
 "http://schema.org/": {
   "propertyURI": "vocabulary",
   "multipleValues": "unordered",
   "properties": {
     "tracks": {"multipleValues": "list"}
   }
 }
}

The properties key takes a JSON Object as a value, which in turn has keys for each property that is to be given alternate semantics. Each name is implicitly expanded to its URI representation as defined in Generate Predicate URI, so that the behavior is the same whether or not the name is listed as an absolute URL.

The default value of multipleValues is unordered.
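The difference between the two multipleValues settings can be sketched in Python. Triples are modelled as plain tuples, the function name serialize_values is hypothetical, and the blank node labels are generated from a caller-supplied label purely for illustration; a real processor would allocate fresh blank nodes. The sketch assumes that a single value is emitted as a plain triple even under the list rule, since that rule speaks of multi-valued itemprops.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def serialize_values(subject, predicate, values, multiple_values="unordered", label="l"):
    """Return triples for the values of one property of one item."""
    if multiple_values != "list" or len(values) < 2:
        # unordered: one triple per value, sharing subject and predicate.
        return [(subject, predicate, value) for value in values]

    # list: chain the values into an RDF Collection in document order.
    nodes = ["_:{}{}".format(label, i) for i in range(len(values))]
    triples = [(subject, predicate, nodes[0])]
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", value))
        triples.append((nodes[i], RDF + "rest", rest))
    return triples


unordered = serialize_values("_:playlist", "http://schema.org/byArtist", ["A", "B"])
ordered = serialize_values(
    "_:playlist", "http://schema.org/tracks", ["_:track1", "_:track2"], "list"
)
```

The unordered result contains two triples with a common subject and predicate; the ordered result contains one triple pointing at the head of the collection plus an rdf:first and rdf:rest pair per value.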

Note

An alternative mechanism would output both unordered and ordered values, to allow an application to choose the most useful representation. For example, consider the following:

Example 10

<div itemscope itemtype="http://schema.org/MusicPlaylist">
  <span itemprop="name">Classic Rock Playlist</span>
  <meta itemprop="numTracks" content="2"/>
  <p>Including works by
    <span itemprop="byArtist">Lynard Skynard</span> and
    <span itemprop="byArtist">AC/DC</span></p>.

  <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording">
    1.<span itemprop="name">Sweet Home Alabama</span> -
    <span itemprop="byArtist">Lynard Skynard</span>
    <link href="sweet-home-alabama" itemprop="url" />
   </div>

  <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording">
    2.<span itemprop="name">Shook you all Night Long</span> -
    <span itemprop="byArtist">AC/DC</span>
    <link href="shook-you-all-night-long" itemprop="url" />
  </div>
</div>

This might generate the following Turtle:

Example 11

@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item [
  a schema:MusicPlaylist;
    schema:name "Classic Rock Playlist";
    schema:byArtist ("Lynard Skynard" "AC/DC");
    schema:numTracks "2";
    schema:tracks _:track1, _:track2, (_:track1 _:track2)
  ];
  rdfa:usesVocabulary schema:
 .
_:track1 a schema:MusicRecording;
  schema:byArtist ("Lynard Skynard");
  schema:name "Sweet Home Alabama";
  schema:url <sweet-home-alabama> .
_:track2 a schema:MusicRecording;
  schema:byArtist ("AC/DC");
  schema:name "Shook you all Night Long";
  schema:url <shook-you-all-night-long> .

By providing both _:track1 and _:track2 as object values of the playlist along with an RDF Collection containing the ordered values, the data may be queried via a simple query using the playlist subject, or as an ordered collection.

3.3 Value Typing

This section is non-normative.

In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.

In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:

URL property elements provide URLs
time element provides dates and times

Issue

Using information about the content of the document where the microdata is marked up might be a violation of the spirit of the microdata specification, though it does not explicitly say in normative text that consumers cannot use other information from the HTML DOM to interpret microdata.

Additionally, one possible use of a registry would allow vocabularies to be marked with datatype information, so that a dc:time value, for example, would be understood to represent a literal with datatype xsd:date. This could be done by adding information for each property in the vocabulary requiring special treatment.

This might be represented using a syntax such as the following:

Example 12

{
 "http://schema.org/": {
   "propertyURI": "vocabulary",
   "multipleValues": "unordered",
   "properties": {
     "dateCreated": {"datatype": "http://www.w3.org/2001/XMLSchema#date"}
   }
 }
}

The datatype identifies a URI to be used in constructing a typed literal.

In most cases, the relevant datatype for a value can be derived from knowledge of what property the value is for and the syntax of the value itself. Thus, values can be given datatypes in a post-processing step after the mapping of microdata to RDF described by this specification. However, where there is information in the HTML markup, such as knowledge of what element was used to mark up the value, which can help with determining its datatype, that information is used by this specification.

This concept is not explored further at this time, but could be developed further in a future revision of this document.

Note

If property URI generation was fixed to vocabulary, multiple values always generated both unordered and ordered representations, and there were datatype support, the registry could be reduced to a simple list of URLs without any further structure necessary.

4. Vocabulary Expansion

Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.

Vocabulary expansion uses simple rules to generate additional triples based on rules and property relationships described in the registry. Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either http://www.w3.org/2000/01/rdf-schema#subPropertyOf or http://www.w3.org/2002/07/owl#equivalentProperty predicates.

For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

Example 13

{
  "http://schema.org/": {
    "properties": {
      "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
    }
  }
}

<div itemscope itemtype="http://schema.org/Product">
  <link itemprop="additionalType" href="http://www.productontology.org/id/Laser_printer" />
  <p itemprop="name">Laser Printer</p>
</div>

The previous example shows a registry rule that causes the processor to emit an extra triple when first seeing the additionalType itemprop:

Example 14

@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .

<> md:item (
  [ a schema:Product;
    schema:additionalType <http://www.productontology.org/id/Laser_printer> ;
    schema:name "Laser Printer"]
  );
  rdfa:usesVocabulary schema: .

schema:additionalType rdfs:subPropertyOf rdf:type .

After performing vocabulary expansion, an additional rdf:type triple is generated:

Example 15

@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .

<> md:item (
  [ a schema:Product, <http://www.productontology.org/id/Laser_printer>;
    schema:additionalType <http://www.productontology.org/id/Laser_printer> ;
    schema:name "Laser Printer"]
  );
  rdfa:usesVocabulary schema: .

schema:additionalType rdfs:subPropertyOf rdf:type .

4.1 Vocabulary Entailment

Formally, and for the purpose of vocabulary processing, microdata uses a very restricted subset of the OWL2 vocabulary and is based on the RDF-Based Semantics of OWL2 [OWL2-RDF-BASED-SEMANTICS]. Vocabulary Entailment uses the following terms:

rdfs:subPropertyOf
owl:equivalentProperty

Vocabulary Entailment considers only the entailment on individuals (i.e., not on the relationships that can be deduced on the properties or the classes themselves.)

Note

While the formal definition of the Entailment refers to the general OWL 2 Semantics, practical implementations may rely on a subset of the OWL 2 RL Profile’s entailment expressed in rules (section 4.3 of [OWL2-PROFILES]). The relevant rules, using the rule identifications in section 4.3 of [OWL2-PROFILES], are: prp-spo1, prp-eqp1, and prp-eqp2.
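A rule-based reading of this entailment can be sketched in Python. This is an illustrative fixed-point loop over a set of (subject, predicate, object) tuples, not a reference implementation; the function name expand is hypothetical, and the mapping of the three rule names to the loop body reflects the usual reading of those OWL 2 RL rules.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUB_PROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"


def expand(triples):
    """Add triples entailed by subPropertyOf and equivalentProperty statements."""
    graph = set(triples)
    while True:
        # (p1, p2) pairs meaning: every "x p1 y" entails "x p2 y".
        pairs = set()
        for s, p, o in graph:
            if p == SUB_PROPERTY_OF:        # prp-spo1
                pairs.add((s, o))
            elif p == EQUIVALENT_PROPERTY:  # prp-eqp1 and prp-eqp2
                pairs.add((s, o))
                pairs.add((o, s))
        new = {(s, p2, o) for s, p, o in graph for p1, p2 in pairs if p == p1}
        if new <= graph:
            return graph
        graph |= new


SCHEMA = "http://schema.org/"
PRINTER = "http://www.productontology.org/id/Laser_printer"
expanded = expand({
    ("_:item", SCHEMA + "additionalType", PRINTER),
    (SCHEMA + "additionalType", SUB_PROPERTY_OF, RDF_TYPE),
})
```

For the additionalType example, the expanded graph gains a triple typing the item with the product ontology class, in line with Example 15.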

[RDFA-CORE] implements a more complete form of vocabulary entailment, including retrieving the vocabulary URI to find additional class and property expansion definitions, as described in RDFa Vocabulary Entailment. Microdata implementations may use RDFa Vocabulary Entailment as an alternative to implementing a separate entailment algorithm. To allow [RDFA-CORE] processors to be used for microdata vocabulary expansion, microdata acts as if there is an implicit @vocab RDFa attribute set to a detected vocabulary by emitting a triple using the rdfa:usesVocabulary predicate.

Note

The entailment described in this section is the minimum useful level for microdata. Processors may, of course, choose to follow more powerful entailment regimes, e.g., include full RDFS [RDF-MT] or OWL2 [OWL2-OVERVIEW] entailments. Using those entailments applications may perform datatype validation by checking rdfs:range of a property, or use the advanced facilities offered by, e.g., OWL2’s property chains to interlink vocabularies further.

4.2 Vocabulary Expansion Control of Microdata Processors

Conforming processors must perform the basic vocabulary expansion.

If vocabulary expansion is performed by the microdata processor using [RDFA-CORE] vocabulary expansion, and the vocab_expansion option is passed to the microdata processor, the full [RDFA-CORE] expansion must also be performed.

5. Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [MICRODATA] for the treatment of items.

5.1 Algorithm Terms

absolute URL

The term absolute URL is defined in [HTML5].

blank node

A blank node is a node in a graph that is neither a URI reference nor a literal. Items without a global identifier have a blank node allocated to them. (See [RDF-CONCEPTS]).

document base

The base address of the document being processed, as defined in Resolving URLs in [HTML5].

evaluation context

A data structure including the following elements:

memory: a mapping of items to subjects, initially empty;
current name: an absolute URL for the in-scope name, used for generating URIs for properties of items without an item type;
Note
current name is required for the contextual property URI generation scheme. Without this scheme, this evaluation context component would not be required.
current type: an absolute URL for the current type, used when an item does not contain an item type;
current vocabulary: an absolute URL for the current vocabulary, from the registry.

item

An item is described by an element containing an itemscope attribute. The list of top-level microdata items may be retrieved using the Microdata DOM API document.getItems method.

item properties

The mechanism for finding the properties of an item. The list of item properties may be retrieved using the Microdata DOM API element.properties attribute.

fragment-escape

The term fragment-escape is defined in [HTML5]. This involves transforming elements added to URLs to ensure that the result remains a valid URL. The following characters are subject to percent escaping:

U+0022 QUOTATION MARK character (")
U+0023 NUMBER SIGN character (#)
U+0025 PERCENT SIGN character (%)
U+003C LESS-THAN SIGN character (<)
U+003E GREATER-THAN SIGN character (>)
U+005B LEFT SQUARE BRACKET character ([)
U+005C REVERSE SOLIDUS character (\)
U+005D RIGHT SQUARE BRACKET character (])
U+005E CIRCUMFLEX ACCENT character (^)
U+007B LEFT CURLY BRACKET character ({)
U+007C VERTICAL LINE character (|)
U+007D RIGHT CURLY BRACKET character (})
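As an illustration, the escaping of the characters listed above can be sketched in Python. The function name fragment_escape mirrors the term but is hypothetical, and the sketch assumes that only the listed characters are escaped; the authoritative definition is the one in [HTML5].

```python
# The twelve characters listed above: " # % < > [ \ ] ^ { | }
ESCAPED_CHARACTERS = '"#%<>[\\]^{|}'


def fragment_escape(text):
    """Percent-escape the listed characters, leaving all others untouched."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in ESCAPED_CHARACTERS else ch
        for ch in text
    )
```

For instance, a NUMBER SIGN inside a name is emitted as %23, so it cannot be mistaken for a fragment separator in the generated URI.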

global identifier

The value of an item's itemid attribute, if it has one. (See Items in [MICRODATA]).

literal

Literals are values such as strings and dates, including typed literals and plain literals. (See [RDF-CONCEPTS]).

property

Each name identifies a property of an item. An item may have multiple elements sharing the same name, creating a multi-valued property.

property names

The tokens of an element's itemprop attribute. Each token is a name. (See Names: the itemprop attribute in [MICRODATA]).

property value

The property value of a name-value pair added by an element with an itemprop attribute depends on the element.

If the element has no itemprop attribute

The value is null and no triple should be generated.

If the element creates an item (by having an itemscope attribute)

The value is the URI reference or blank node returned from generate the triples for that item.

If the element is a URL property element (a, area, audio, embed, iframe, img, link, object, source, track or video)

The value is a URI reference created from element.itemValue. (See relevant attribute descriptions in [HTML5]).

If the element is a time element

The value is a literal made from element.itemValue.

If the value is a valid date string having the lexical form of xsd:date [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
If the value is a valid time string having the lexical form of xsd:time [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
If the value is a valid local date and time string or valid global date and time string having the lexical form of xsd:dateTime [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
If the value is a valid month string having the lexical form of xsd:gYearMonth [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYearMonth.
If the value is a valid non-negative integer having the lexical form of xsd:gYear [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYear.
If the value has the lexical form of xsd:duration [RDF-SCHEMA]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#duration.
Note
The referenced version of [HTML5] does not include a duration data type, but it is in the Editor's Draft and is expected to be included in a forthcoming update to the Working Draft.
Otherwise: The value is a plain literal created from the value with language information set from the language of the property element.
Note
The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.

See The time element in [HTML5].

Otherwise

The value is a plain literal created from the value with language information set from the language of the property element.

See The lang and xml:lang attributes in [HTML5] for determining the language of a node.
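The datatype selection for time element values described above can be sketched in Python. The regular expressions are deliberately simplified assumptions: they check only the overall lexical shape, not value ranges such as month numbers, and do not cover every form permitted by XML Schema. The function name time_datatype is hypothetical.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
_TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Simplified lexical patterns, tried in the order the cases are listed above.
_PATTERNS = [
    ("date", r"\d{4,}-\d{2}-\d{2}"),
    ("time", r"\d{2}:\d{2}:\d{2}(\.\d+)?"),
    ("dateTime", r"\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ),
    ("gYearMonth", r"\d{4,}-\d{2}"),
    ("gYear", r"\d{4,}"),
    ("duration",
     r"P(?=\d|T\d)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"),
]


def time_datatype(value):
    """Return the XSD datatype URI for a time element value, or None.

    None means the value is emitted as a plain literal carrying the
    language of the property element.
    """
    for name, pattern in _PATTERNS:
        if re.fullmatch(pattern, value):
            return XSD + name
    return None
```

A value such as "2012-03-18" is typed as xsd:date, while free text inside a time element falls through to the plain literal case.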

top-level item

An item which does not contain an itemprop attribute. Available through the Microdata DOM API as document.getItems. (See Associating names with items in [MICRODATA]).

URI reference

URI references are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [RDF-CONCEPTS]).

Issue

The HTML5/microdata content model for @href, @src, @data, itemtype and itemprop and itemid is that of a URL, not a URI or IRI.

A proposed mechanism for specifying the range of property values to be URI reference or IRI could allow these to be specified as subject or object using a @content attribute.

vocabulary

A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.

Note

This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.

5.2 RDF Conversion Algorithm

An HTML document containing microdata may be converted to any other RDF-compatible document format using the algorithm specified in this section.

A conforming microdata processor implementing RDF conversion must implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:

Set item list to an empty list.

For each element that is also a top-level item run the following algorithm:
1. Generate the triples for an item item, using the evaluation context. Let result be the (URI reference or blank node) subject returned.
2. Append result to item list.
Generate an RDF Collection list from item list, the ordered list of subjects collected above. Set value to the value returned from generate an RDF Collection.
Generate the following triple:

subject

document base

predicate

http://www.w3.org/ns/md#item

object

value
Perform Vocabulary Entailment.

5.3 Generate the triples

When the user agent is to Generate triples for an item item, given evaluation context, it must run the following steps:

Note

This algorithm has undergone substantial change from the original microdata specification [MICRODATA].

If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URL, let subject be that global identifier. Otherwise, let subject be a new blank node.
Add a mapping from item to subject in memory.
For each type returned from element.itemType of the element defining the item.
1. If type is an absolute URL, generate the following triple:
  
  subject
  
  subject
  
  predicate
  
  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  
  object
  
  type (as a URI reference)
Set type to the first value returned from element.itemType of the element defining the item.
If type is an absolute URL, set current name in evaluation context to null.
Otherwise, set type to current type from evaluation context if not empty.
If the registry contains a URI prefix that is a character for character match of type up to the length of the URI prefix, set vocab as that URI prefix and generate the following triple:

subject

document base

predicate

http://www.w3.org/ns/rdfa#usesVocabulary

object

vocab (as a URI reference)
Otherwise, if type is not empty, construct vocab by removing everything following the last SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from the path component of type.
Update evaluation context setting current vocabulary to vocab.
Set property list to an empty array mapping properties to one or more values as established below.
For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of the item, run the following substep:
1. For each name in the element's property names, run the following substeps:
  1. Let context be a copy of evaluation context with current type set to type.
  2. Let predicate be the result of generate predicate URI using context and name. Update context by setting current name to predicate.
  3. Let value be the property value of element.
  4. If value is an item, then generate the triples for value using context. Replace value by the subject returned from those steps.
  5. Add value to property list for predicate.
For each predicate in property list:
1. Generate property values subject, predicate and the list of values associated with predicate from property list as values.
Return subject.
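
The vocabulary-selection steps above (registry prefix match, falling back to truncating type) might be sketched as follows. This is an illustration under stated assumptions rather than a normative implementation: determine_vocabulary is a hypothetical name, the registry is assumed to be a dict keyed by URI prefix, and the truncation is applied to the whole type string instead of isolating its path component.

```python
def determine_vocabulary(item_type, registry):
    """Pick the vocabulary URI for an item type, or None if type is empty."""
    if not item_type:
        return None
    # Registry lookup: a prefix that matches type character for character.
    for prefix in registry:
        if item_type.startswith(prefix):
            return prefix
    # Fallback: drop everything following the last "/" or "#".
    # Simplification (assumption): operates on the full string.
    cut = max(item_type.rfind("/"), item_type.rfind("#"))
    return item_type[:cut + 1] if cut >= 0 else item_type
```

For instance, with an empty registry the type http://purl.org/vocab/frbr/core#Work yields the vocabulary http://purl.org/vocab/frbr/core#, and http://schema.org/MusicPlaylist yields http://schema.org/.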

5.4 Generate Predicate URI

Predicate URI generation makes use of current type, current name, and current vocabulary from an evaluation context context along with name.

If name is an absolute URL, return name as a URI reference.
If current type from context is null, there can be no current vocabulary. Return the URI reference that is the document base with its fragment set to the fragment-escaped value of name.
Note
This rule is intended to allow for the case where no type is set, and therefore there is no vocabulary from which to extract rules. For example, if there is a document base of http://example.org/doc and an itemprop of 'title', a URI will be constructed to be http://example.org/doc#title.
Otherwise, if current vocabulary from context is not null and registry has an entry for current vocabulary having a propertyURI entry that is not null, set that as scheme. Otherwise, set scheme to vocabulary.
If scheme is vocabulary, set expandedURI to the URI reference constructed by appending the fragment-escaped value of name to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
Otherwise, if scheme is contextual, set expandedURI to the URI reference constructed as follows:
1. Let s be current name from context.
2. If http://www.w3.org/ns/md?type= is a prefix of s, return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
3. Otherwise, return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of current type, the string &prop=, and the fragment-escaped value of name.
If the registry entry for propertyURI has an equivalentProperty key, generate the following triple using the value of that key:

subject

expandedURI

predicate

http://www.w3.org/2002/07/owl#equivalentProperty

object

value

If the value is an array, generate a triple for each value of that array.
If the registry entry for propertyURI has a subPropertyOf key, generate the following triple using the value of that key:

subject

expandedURI

predicate

http://www.w3.org/2000/01/rdf-schema#subPropertyOf

object

value

If the value is an array, generate a triple for each value of that array.
Return expandedURI.
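
The URI construction rules in this section might be sketched as follows. This is a minimal illustration, not a conforming implementation: the context is assumed to be a dict with the hypothetical keys current_type, current_name and current_vocabulary, the registry is assumed to be a dict keyed by vocabulary URI, fragment_escape only approximates fragment-escaping using urllib.parse.quote, and the equivalentProperty and subPropertyOf triples are omitted.

```python
from urllib.parse import quote, urldefrag, urlparse

MD_TYPE_PREFIX = "http://www.w3.org/ns/md?type="


def fragment_escape(value):
    # Approximation (assumption): percent-encode characters outside a set
    # that is commonly allowed in URI fragments.
    return quote(value, safe="!$%&'()*+,-./:;=?@_~")


def is_absolute_url(value):
    # Simplification (assumption): any value with a scheme counts as absolute.
    return bool(urlparse(value).scheme)


def generate_predicate_uri(name, context, registry, document_base):
    if is_absolute_url(name):
        return name
    if context.get("current_type") is None:
        # No type, hence no vocabulary: use the document base plus a fragment.
        base, _fragment = urldefrag(document_base)
        return base + "#" + fragment_escape(name)
    # Assumes current_vocabulary is set whenever current_type is set.
    vocab = context["current_vocabulary"]
    entry = registry.get(vocab) or {}
    scheme = entry.get("propertyURI") or "vocabulary"
    if scheme == "vocabulary":
        separator = "" if vocab.endswith(("#", "/")) else "#"
        return vocab + separator + fragment_escape(name)
    if scheme == "contextual":
        s = context.get("current_name") or ""
        if s.startswith(MD_TYPE_PREFIX):
            return s + "." + fragment_escape(name)
        return (MD_TYPE_PREFIX + fragment_escape(context["current_type"])
                + "&prop=" + fragment_escape(name))
    raise ValueError("unknown propertyURI scheme: %r" % scheme)
```

With a document base of http://example.org/doc, no current type and a name of title, the sketch returns http://example.org/doc#title, matching the note above.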

5.5 Generate Property Values

Property value serialization makes use of subject, predicate and values.

If the registry contains a URI prefix that is a character for character match of predicate up to the length of the URI prefix, set vocab as that URI prefix. Otherwise set vocab to null.
If vocab is not null and registry has an entry for vocab that is a JSON Object, let registry object be that value. Otherwise set registry object to null.
If registry object is not null and registry object contains key properties which has a JSON Object value, let properties be that value. Otherwise, set properties to null.
If properties is not null, and properties contains a key that, after Generate Predicate URI expansion, matches predicate and has a value which is a JSON Object, let property override be that value. Otherwise, set property override to null.
If property override contains the key multipleValues, set that as method.
Otherwise, if registry object contains the key multipleValues, set that as method.
Otherwise, set method to unordered.
If method is unordered, for each value in values, generate the following triple:

subject

subject

predicate

predicate

object

value
Otherwise, if method is list:
1. Set value to the value returned from generate an RDF Collection.
2. Generate the following triple:
  
  subject
  
  subject
  
  predicate
  
  predicate
  
  object
  
  value
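
The method selection and triple generation above might be sketched as follows. This is a sketch under stated assumptions: the registry is assumed to be a dict keyed by vocabulary URI, triples are represented as plain tuples, registry property keys are expanded with the vocabulary scheme only (the hypothetical helper expand_key), and generate_collection is a caller-supplied callable taking the values and a triple list and returning the head of the collection.

```python
def expand_key(key, vocab):
    # Simplification (assumption): expand a registry property key with the
    # "vocabulary" scheme instead of running Generate Predicate URI.
    separator = "" if vocab.endswith(("#", "/")) else "#"
    return vocab + separator + key


def select_method(predicate, registry):
    """Return "unordered" or "list" for the given predicate."""
    vocab = next((p for p in registry if predicate.startswith(p)), None)
    registry_object = registry.get(vocab) if vocab is not None else None
    if not isinstance(registry_object, dict):
        return "unordered"
    properties = registry_object.get("properties")
    if isinstance(properties, dict):
        for key, override in properties.items():
            if expand_key(key, vocab) == predicate and isinstance(override, dict):
                # A property-level override wins over the vocabulary default.
                if "multipleValues" in override:
                    return override["multipleValues"]
    return registry_object.get("multipleValues", "unordered")


def generate_property_values(subject, predicate, values, registry,
                             generate_collection):
    triples = []
    method = select_method(predicate, registry)
    if method == "unordered":
        # One triple per value; document order is not preserved in RDF.
        for value in values:
            triples.append((subject, predicate, value))
    elif method == "list":
        # A single triple whose object is the head of an RDF Collection.
        head = generate_collection(values, triples)
        triples.append((subject, predicate, head))
    return triples
```

The lookup order mirrors the steps above: a property override is consulted first, then the vocabulary-wide multipleValues setting, and unordered is the final default.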

5.6 Generate RDF Collection

An RDF Collection is a mechanism for defining ordered sequences of objects in RDF (see RDF Collections in [RDF-SCHEMA]). As the RDF data-model is that of an unordered graph, a linking method using properties rdf:first and rdf:rest is required to be able to specify a particular order.

In the microdata to RDF mapping, RDF Collections are used when an item has more than one value associated with a given property to ensure that the original document order is maintained. The following procedure should be used to generate triples when an item property has more than one value (contained in list):

Create a new array array containing a blank node for every value in list.
For each pair of bnode from array and value from list the following triple is generated:

subject

bnode

predicate

http://www.w3.org/1999/02/22-rdf-syntax-ns#first

object

value
For each bnode in array the following triple is generated:

subject

bnode

predicate

http://www.w3.org/1999/02/22-rdf-syntax-ns#rest

object

next bnode in array or, if that does not exist, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil
Return the first blank node from array.
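
The procedure above might be sketched as follows. This is a minimal illustration, assuming triples are plain tuples appended to a caller-supplied list and blank nodes are strings produced by the hypothetical helper new_blank_node. Returning rdf:nil for an empty list is an assumption of this sketch; the procedure above only covers non-empty lists.

```python
import itertools

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF_NS + "first"
RDF_REST = RDF_NS + "rest"
RDF_NIL = RDF_NS + "nil"

_bnode_ids = itertools.count()


def new_blank_node():
    # Hypothetical blank node factory; labels are only locally unique.
    return "_:b%d" % next(_bnode_ids)


def generate_collection(values, triples):
    """Append rdf:first/rdf:rest triples for values and return the list head."""
    bnodes = [new_blank_node() for _ in values]
    # One rdf:first triple per (blank node, value) pair.
    for bnode, value in zip(bnodes, values):
        triples.append((bnode, RDF_FIRST, value))
    # Chain each blank node to the next, terminating with rdf:nil.
    for index, bnode in enumerate(bnodes):
        is_last = index + 1 == len(bnodes)
        triples.append((bnode, RDF_REST, RDF_NIL if is_last else bnodes[index + 1]))
    # Assumption: an empty list is represented by rdf:nil.
    return bnodes[0] if bnodes else RDF_NIL
```

A list of two values therefore produces four triples: two rdf:first triples carrying the values and two rdf:rest triples linking the first blank node to the second and the second to rdf:nil.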

B. Markup Examples

This section is non-normative.

The microdata example below expresses book information as an FRBR Work item.

Example 16

<dl itemscope
    itemtype="http://purl.org/vocab/frbr/core#Work"
    itemid="http://books.example.com/works/45U8QJGZSQKDH8N"
    lang="en">
 <dt>Title</dt>
 <dd><cite itemprop="http://purl.org/dc/terms/title">Just a Geek</cite></dd>
 <dt>By</dt>
 <dd><span itemprop="http://purl.org/dc/terms/creator">Wil Wheaton</span></dd>
 <dt>Format</dt>
 <dd itemprop="http://purl.org/vocab/frbr/core#realization"
     itemscope
     itemtype="http://purl.org/vocab/frbr/core#Expression"
     itemid="http://books.example.com/products/9780596007683.BOOK">
  <link itemprop="http://purl.org/dc/terms/type" href="http://books.example.com/product-types/BOOK">
  Print
 </dd>
 <dd itemprop="http://purl.org/vocab/frbr/core#realization"
     itemscope
     itemtype="http://purl.org/vocab/frbr/core#Expression"
     itemid="http://books.example.com/products/9780596802189.EBOOK">
  <link itemprop="http://purl.org/dc/terms/type" href="http://books.example.com/product-types/EBOOK">
  Ebook
 </dd>
</dl>

Assuming that registry contains an entry for http://purl.org/vocab/frbr/core# with propertyURI set to vocabulary, this is equivalent to the following Turtle:

Example 17

@prefix dc: <http://purl.org/dc/terms/> .
@prefix md: <http://www.w3.org/ns/md#> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .

<> md:item (<http://books.example.com/works/45U8QJGZSQKDH8N>) ;
  rdfa:usesVocabulary frbr: .

<http://books.example.com/works/45U8QJGZSQKDH8N> a frbr:Work ;
  dc:creator "Wil Wheaton"@en ;
  dc:title "Just a Geek"@en ;
  frbr:realization <http://books.example.com/products/9780596007683.BOOK>,
    <http://books.example.com/products/9780596802189.EBOOK> .

<http://books.example.com/products/9780596007683.BOOK> a frbr:Expression ;
  dc:type <http://books.example.com/product-types/BOOK> .

<http://books.example.com/products/9780596802189.EBOOK> a frbr:Expression ;
  dc:type <http://books.example.com/product-types/EBOOK> .

The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.

Example 18

<p>
 Both
 <span itemscope itemtype="http://microformats.org/profile/hcard" itemref="home">
   <span itemprop="fn"
       ><span itemprop="n" itemscope
       ><span itemprop="given-name">Princeton</span></span></span>
  </span>
 and
 <span itemscope itemtype="http://microformats.org/profile/hcard" itemref="home">
   <span itemprop="fn"
     ><span itemprop="n" itemscope
       ><span itemprop="given-name">Trekkie</span></span></span>
  </span>
 live at
 <span id="home" itemprop="adr" itemscope>
   <span itemprop="street-address">Avenue Q</span>.
 </span>
</p>

Assuming that registry contains an entry for http://microformats.org/profile/hcard with propertyURI set to vocabulary, it generates these triples expressed in Turtle:

Example 19

@prefix md: <http://www.w3.org/ns/md#> .
@prefix hcard: <http://microformats.org/profile/hcard#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .

<> md:item (
  [ a <http://microformats.org/profile/hcard>;
    hcard:fn "Princeton";
    hcard:n [ hcard:given-name "Princeton" ];
    hcard:adr _:a
  ]
  [ a <http://microformats.org/profile/hcard>;
    hcard:fn "Trekkie";
    hcard:n [ hcard:given-name "Trekkie" ];
    hcard:adr _:a
  ]) ;
  rdfa:usesVocabulary <http://microformats.org/profile/hcard> .

_:a hcard:street-address "Avenue Q" .

The following snippet of HTML has microdata for a playlist, and illustrates overriding a property to place elements in an RDF Collection. This also illustrates the use of the schema:additionalType property to relate recordings to the Music Ontology:

Example 20

<div itemscope itemtype="http://schema.org/MusicPlaylist">
  <span itemprop="name">Classic Rock Playlist</span>
  <meta itemprop="numTracks" content="2"/>
  <p>Including works by
    <span itemprop="byArtist">Lynard Skynard</span> and
    <span itemprop="byArtist">AC/DC</span></p>.

  <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording">
    <link itemprop="additionalType" href="http://purl.org/ontology/mo/MusicalManifestation"/>
    1.<span itemprop="name">Sweet Home Alabama</span> -
    <span itemprop="byArtist">Lynard Skynard</span>
    <link href="sweet-home-alabama" itemprop="url" />
   </div>

  <div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording">
    <link itemprop="additionalType" href="http://purl.org/ontology/mo/MusicalManifestation"/>
    2.<span itemprop="name">Shook you all Night Long</span> -
    <span itemprop="byArtist">AC/DC</span>
    <link href="shook-you-all-night-long" itemprop="url" />
  </div>
</div>

Assuming that registry contains an entry for http://schema.org/ with propertyURI set to vocabulary and multipleValues set to unordered, and with the properties tracks and byArtist having multipleValues set to list, it generates these triples expressed in Turtle:

Example 21

@prefix md: <http://www.w3.org/ns/md#> .
@prefix mo: <http://purl.org/ontology/mo/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item ([ a schema:MusicPlaylist;
  schema:name "Classic Rock Playlist";
  schema:byArtist ("Lynard Skynard" "AC/DC");
  schema:numTracks "2";
  schema:tracks (
    [ a schema:MusicRecording, mo:MusicalManifestation;
      schema:additionalType mo:MusicalManifestation;
      schema:byArtist ("Lynard Skynard");
      schema:name "Sweet Home Alabama";
      schema:url <sweet-home-alabama>]
    [ a schema:MusicRecording, mo:MusicalManifestation;
      schema:additionalType mo:MusicalManifestation;
      schema:byArtist ("AC/DC");
      schema:name "Shook you all Night Long";
      schema:url <shook-you-all-night-long>]
  )]);
  rdfa:usesVocabulary schema: .
  
schema:additionalType rdfs:subPropertyOf rdf:type .

Microdata to RDF

Transformation from HTML+Microdata to RDF

W3C Working Draft 19 September 2012

dc:	http://purl.org/dc/terms/
md:	http://www.w3.org/ns/md#
rdf:	http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfa:	http://www.w3.org/ns/rdfa#
xsd:	http://www.w3.org/2001/XMLSchema#