W3C

MicroXML

Latest version:
http://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html
Editors:
James Clark
John Cowan

Abstract

MicroXML is a subset of XML intended for use in contexts where full XML is, or is perceived to be, too large and complex. It has been designed to complement rather than replace XML, JSON and HTML. Like XML, it is a general format for making use of markup vocabularies rather than a specific markup vocabulary like HTML. This document provides a complete description of MicroXML.

Status of This Document

This specification was published by the MicroXML Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups. Please send comments to public-microxml@w3.org (public archives).

1 Introduction

MicroXML is a Unicode-based textual format for general-purpose structured information interchange. A sequence of characters or bytes in this format is called a MicroXML document. MicroXML is designed to be syntactically compatible with XML; more precisely, any MicroXML document is a well-formed XML document according to XML 1.0 Fifth Edition [XML].

MicroXML also specifies an abstract data model for MicroXML documents. This is substantially compatible with a subset of the information items and properties of the XML Information Set [INFOSET]. See Appendix B for details.

A MicroXML parser is a software module that accepts a sequence of bytes or characters as input, determines whether that sequence is a MicroXML document, and, if it is, makes a representation of its abstract data model available to other modules.

MicroXML is designed to be dramatically simpler than XML, not only in its syntax but also in its data model. Experience with XML has shown that for many applications much of the complexity of XML is unnecessary. Indeed, many specifications that use XML have invented their own ad-hoc subsets of XML (XMPP, SOAP, E4X, Scala). The complexity of XML does not affect just the developers of XML parsers and other tools, but has an ongoing cost to users of XML and developers of XML applications.

Although JSON has replaced XML in many applications where greater simplicity is desired, JSON is awkward for representing structured documents that include mixed content (content that mixes data characters and elements). HTML is very widely used for representing structured documents. However, MicroXML is a fundamentally different kind of format from HTML: MicroXML does not define the semantics of any element or attribute names, whereas HTML does. MicroXML with appropriately chosen element and attribute names can be trivially transformed into valid HTML. Like HTML and XML, MicroXML is designed to support the use of plain text editors for authoring; it therefore preserves some of the conveniences provided by XML for such usage.

MicroXML has a number of advantages over full XML as a format for network protocols. First, MicroXML does not constrain how parsers recover from errors; in particular, MicroXML does not adopt XML's Draconian error handling requirements. This allows protocols using MicroXML to follow the traditional policy of being liberal in what they accept. Second, the features of XML that are most problematic from a security perspective have been eliminated from MicroXML: most importantly, MicroXML completely eliminates document type declarations, including entity declarations; MicroXML documents are self-contained, in the sense that the parsing of a MicroXML document never requires access to any external resource.

This document, together with [RFC 2119] for requirement keywords and [Unicode] for characters, provides all the information necessary to understand MicroXML and construct computer programs to process it.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2 Data Model

The MicroXML abstract data model uses three primitive types: characters, lists, and maps.

A string is not a primitive type: it is just a list of zero or more characters.

The top-level construct in the MicroXML data model is an element item. An element item is a list with exactly three members:

  1. a name item;
  2. an attributes map: this is a possibly empty map, whose keys are name items and whose values are strings;
  3. a content list: this is a list with zero or more members; each member is either a character or an element item.

A name item is a non-empty string. The first character in the string MUST match the production nameStartChar, and any subsequent characters MUST match the production nameChar. In addition, a name item occurring as a key in an attributes map MUST NOT be xmlns.

Any character occurring in the value of an attributes map or as a member of a content list MUST match the production char.

2.1 JSON Syntax (informative)

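There are many possible ways of representing the data model in [JSON]. One possible way, consistent with the examples in this document, is to represent an element item as a JSON array with three members, an attributes map as a JSON object, and a content list as a JSON array in which each run of adjacent characters is combined into a single JSON string.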

This document will use this syntax to represent the data model in examples.
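
As an illustration only, the following Python sketch serializes an element item in the style of the JSON examples in this document. It assumes an in-memory representation that this specification does not define: an element item is a tuple (name, attributes, content), characters are one-character strings, and the function name to_json_value is hypothetical.

```python
import json


def to_json_value(element):
    """Convert an element item (name, attributes, content) to JSON-ready values.

    Content members are either one-character strings or nested element items.
    Runs of adjacent characters are merged into a single string.
    """
    name, attributes, content = element
    out = []
    run = []  # characters collected since the last element item
    for member in content:
        if isinstance(member, str):
            run.append(member)
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(to_json_value(member))
    if run:
        out.append("".join(run))
    return [name, dict(attributes), out]


em = ("em", {}, list("love"))
doc = ("p", {"lang": "en"}, ["I", " ", em, "!"])
print(json.dumps(to_json_value(doc)))
```

Merging runs of characters is purely a presentation choice; the underlying content list still consists of individual characters and element items.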

3 Syntax

This section specifies the syntax of MicroXML. It also specifies how the syntax is parsed into the abstract data model: for each syntactic form that contributes to the data model, it specifies how the parse result for that form is constructed from the parse results of syntactic subforms.

The abstract data model for a sequence of characters is constructed in two logical phases:

  1. line breaks in the sequence of characters are normalized by translating both the two-character sequence #xD followed by #xA, and any #xD that is not followed by #xA, to a single #xA character;
  2. the sequence of characters is then parsed as a document, yielding an element item as the parse result.
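
The first phase can be sketched in Python as follows. This is a minimal illustration, not a normative algorithm; the function name normalize_line_breaks is hypothetical.

```python
import re


def normalize_line_breaks(text: str) -> str:
    """Translate each #xD #xA pair, and each #xD not followed by #xA, to one #xA."""
    # "\r\n?" matches a carriage return optionally followed by a line feed,
    # so a pair is consumed as a unit and yields a single "\n".
    return re.sub(r"\r\n?", "\n", text)
```

Because the pattern consumes a #xD #xA pair as one match, a pair never produces two line feeds.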

3.1 Documents

[1] document ::= byteOrderMark? (comment | s)* element (comment | s)*
[2] byteOrderMark ::= #xFEFF

The top-level syntactic form in MicroXML is a document. The parse result of a document is the parse result of its single element.

This is an example of a small but complete MicroXML document exhibiting all syntactic features:

<comment lang="en" date="2012-09-11">
I <em>love</em> &#xB5;<!-- MICRO SIGN -->XML!<br/>
It's so clean &amp; simple.</comment>

The abstract data model of this document in the JSON syntax described in Section 2.1 is:

[ "comment",
  {  "date": "2012-09-11", "lang": "en" },
  [ "\nI ",
    ["em", {}, ["love"]],
    " \u00B5XML!",
    ["br", {}, []],
    "\nIt's so clean & simple."
  ]
]

3.2 Elements

[3] element ::= startTag content endTag
              | emptyElementTag
[4] startTag ::= '<' name attributeList s* '>'
[5] endTag ::= '</' name s* '>'
[6] content ::= (element | comment | dataChar | charRef)*
[7] dataChar ::= char - ('<'|'&'|'>')
[8] emptyElementTag ::= '<' name attributeList s* '/>'

The startTag and endTag of an element MUST have the same name. Note that the syntax prohibits overlapping elements.

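The parse result of an element is an element item. There are two alternative syntaxes for an element. For the general syntax, which uses a startTag and an endTag, the parse result is constructed as follows: the name item is the parse result of the name in the startTag, the attributes map is the parse result of the attributeList in the startTag, and the content list is the parse result of the content.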

The parse result of content is a content list, which is constructed by combining in order the parse results of each element, dataChar, and charRef in the content. The parse result of a dataChar is the character itself.

For example, this element has content that consists of two elements, each of whose content consists of characters:

<location><city>New York</city><country>US</country></location>

Its abstract data model in JSON syntax is:

["location", {}, [["city", {}, ["New York"]], ["country", {}, ["US"]]]]

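The other syntax, which consists of just an emptyElementTag, is equivalent to a startTag immediately followed by an endTag. The parse result is constructed as follows: the name item is the parse result of the name in the emptyElementTag, the attributes map is the parse result of its attributeList, and the content list is empty.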

This is a simple example of an emptyElementTag:

<page-break/>

Its abstract data model in JSON syntax is:

["page-break", {}, []]

3.3 Attributes

[9] attributeList ::= (s+ attribute)*
[10] attribute ::= attributeName s* '=' s* attributeValue
[11] attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                      | "'" ((attributeValueChar - "'") | charRef)* "'"
[12] attributeValueChar ::= char - ('<'|'>'|'&')
[13] attributeName ::= name - 'xmlns'

The parse result of an attributeList is an attributes map. For each attribute in the attributeList, there is a key and associated value in the attributes map: the key comes from the attributeName and the value comes from the attributeValue.

All the attributeNames in an attributeList MUST be distinct.

The parse results of each attributeValueChar and charRef in the attributeValue are combined in order to construct the parse result of the attributeValue. The parse result of an attributeValueChar is the character itself.

For example, this is an element with two attributes:

<location city="New York" country="US"/>

Its abstract data model in JSON syntax is:

["location", { "city": "New York", "country": "US" }, []]

3.4 Comments

[14] comment ::= '<!--' ((char - '-') | ('-' (char - '-')))* '-->'

Comments are not part of the MicroXML data model and so have no parse result.

The syntax prohibits the occurrence of -- except as part of the opening or closing delimiter of the comment.

For example, this is a comment:

<!-- declarations for <head> & <body> -->

Note that <head> and <body> are not recognized as start-tags.
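
The structure of the comment production can be approximated with a regular expression. The Python sketch below checks only the delimiter rules (no -- inside the comment body); it assumes that the restriction of comment characters to the char production is enforced separately. The names COMMENT and is_comment are hypothetical.

```python
import re

# '<!--' ((char - '-') | ('-' (char - '-')))* '-->'
# A negated character class also matches line feeds, so no flags are needed.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text: str) -> bool:
    """Return True if text has the shape of a single MicroXML comment."""
    return COMMENT.fullmatch(text) is not None
```

Under this pattern a comment body cannot end with a hyphen, so a sequence such as ---> is rejected, as it is by the production.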

3.5 Character References

[15] charRef ::= numericCharRef | namedCharRef
[16] numericCharRef ::= '&#x' charNumber ';'
[17] charNumber ::= [0-9a-fA-F]+ 
[18] namedCharRef ::= '&' charName ';'
[19] charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

MicroXML provides two kinds of character references: named character references provide an easy way to escape characters that MicroXML does not allow to be used literally as data characters; numeric character references provide a way to include arbitrary Unicode characters in MicroXML documents without needing a Unicode-aware text editor.

The parse result of a charRef is a single character. The code point of the parse result of a numericCharRef is equal to charNumber interpreted as a hexadecimal number; this MUST be the code point of a character that matches the char production. The parse result of a namedCharRef depends on the charName as follows: amp yields & (#x26), lt yields < (#x3C), gt yields > (#x3E), quot yields " (#x22), and apos yields ' (#x27).

For example, this is an element that contains two numeric character references:

<p>&#x3C;&#x3bb;</p>

It has the same data model as this element, which uses one named character reference:

<p>&lt;λ</p>

The data model in JSON syntax is:

["p", {}, ["<\u03BB"]]
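
A minimal Python sketch of resolving a single character reference is shown below. It is an illustration under stated assumptions rather than a definitive implementation: the names resolve_char_ref and _matches_char are hypothetical, and the helper restates the char production from Section 3.8.

```python
import re

NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
CHAR_REF = re.compile(r"&(?:#x([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));")


def _matches_char(cp: int) -> bool:
    """Check a code point against the char production."""
    if cp in (0x9, 0xA, 0x20):
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:  # control code points
        return False
    if 0xD800 <= cp <= 0xDFFF:  # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return cp <= 0x10FFFF


def resolve_char_ref(ref: str) -> str:
    """Return the character denoted by one numeric or named character reference."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError("not a character reference: %r" % ref)
    if m.group(2) is not None:
        return NAMED[m.group(2)]
    cp = int(m.group(1), 16)
    if not _matches_char(cp):
        raise ValueError("reference to a forbidden code point: %r" % ref)
    return chr(cp)
```

Only hexadecimal numeric references are accepted, matching the numericCharRef production; a decimal form such as &#60; is rejected.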

3.6 Names

[20] name ::= nameStartChar nameChar*
[21] nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                     | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                     | [#x3001-#xD7FF] | ([#xF900-#xEFFFF] - nonCharacterCodePoint)
[22] nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The parse result of a name is a string whose members are the characters in the name.

Names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization by the W3C.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are useful as delimiters in contexts where MicroXML names are used outside MicroXML documents. The character #x037E is excluded because Unicode normalization turns it into a semicolon.
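
The name productions translate directly into range checks. The following Python sketch is one possible rendering; the table and function names are hypothetical and not part of this specification.

```python
# Code point ranges copied from the nameStartChar production.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xEFFFF),
]
# Additional ranges allowed by nameChar: digits, "-", ".", #xB7, and two blocks.
NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def _is_noncharacter(cp):
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_name_start_char(ch):
    cp = ord(ch)
    # The grammar subtracts noncharacters only from [#xF900-#xEFFFF]; the other
    # ranges contain none, so applying the check to every range is equivalent.
    return _in_ranges(cp, NAME_START_RANGES) and not _is_noncharacter(cp)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(s):
    """Return True if s matches the name production."""
    return bool(s) and is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])
```

Note that the colon is not in any of these ranges, so a string such as a:b is not a name under this sketch.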

3.7 Whitespace

[24] s ::= #x9 | #xA | #x20

Whitespace is permitted in various places to increase readability. It does not affect the data model. Note that #xD is not included here, because #xD characters are translated to #xA characters during line-break normalization.

3.8 Characters

[23] char ::= s | ([#x0-#x10FFFF] - forbiddenCodePoint)

[25] forbiddenCodePoint ::= controlCodePoint | surrogateCodePoint | nonCharacterCodePoint
[26] controlCodePoint ::= [#x0-#x1F] | [#x7F-#x9F]
[27] surrogateCodePoint ::= [#xD800-#xDFFF]
[28] nonCharacterCodePoint ::= [#xFDD0-#xFDEF] | [#xFFFE-#xFFFF] | [#x1FFFE-#x1FFFF]
                             | [#x2FFFE-#x2FFFF] | [#x3FFFE-#x3FFFF] | [#x4FFFE-#x4FFFF]
                             | [#x5FFFE-#x5FFFF] | [#x6FFFE-#x6FFFF] | [#x7FFFE-#x7FFFF]
                             | [#x8FFFE-#x8FFFF] | [#x9FFFE-#x9FFFF] | [#xAFFFE-#xAFFFF]
                             | [#xBFFFE-#xBFFFF] | [#xCFFFE-#xCFFFF] | [#xDFFFE-#xDFFFF]
                             | [#xEFFFE-#xEFFFF] | [#xFFFFE-#xFFFFF] | [#x10FFFE-#x10FFFF]
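
The char production can be sketched in Python as a pair of predicates. This is an illustration only; the function names are hypothetical. The noncharacter test relies on the fact that the last two code points of every plane have the low sixteen bits #xFFFE or #xFFFF.

```python
def is_forbidden_code_point(cp: int) -> bool:
    """Return True for control, surrogate, and noncharacter code points."""
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:  # controlCodePoint
        return True
    if 0xD800 <= cp <= 0xDFFF:  # surrogateCodePoint
        return True
    # nonCharacterCodePoint: #xFDD0-#xFDEF plus the last two code points of each plane
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_char(ch: str) -> bool:
    """Return True if the single character ch matches the char production."""
    cp = ord(ch)
    return cp in (0x9, 0xA, 0x20) or not is_forbidden_code_point(cp)
```

The predicate is meant to be applied after line-break normalization, which is why #xD is treated as an ordinary control code point here.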

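MicroXML prohibits three kinds of code points from occurring literally in MicroXML documents: control code points (other than the whitespace and line-break characters #x9, #xA and #xD), surrogate code points, and noncharacter code points, as defined by the productions above.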

4 Conformance

4.1 Document Conformance

A sequence of characters is a conforming MicroXML document if, after line-break normalization, it matches the production document, and meets the further constraints found in the text of this document marked with the keywords MUST or REQUIRED.

A sequence of bytes is a conforming MicroXML document if it is the UTF-8 [Unicode] encoding of a sequence of characters that is a conforming MicroXML document.
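
For byte input, a parser therefore needs a strict UTF-8 decoding step before any grammar checks. A minimal Python sketch follows; the function name decode_document_bytes is hypothetical, and the sketch relies on the built-in strict UTF-8 codec, which rejects malformed sequences, including encoded surrogates.

```python
from typing import Optional


def decode_document_bytes(data: bytes) -> Optional[str]:
    """Decode bytes as strict UTF-8, or return None if they are not valid UTF-8.

    A leading byte order mark is kept as U+FEFF, since the document
    production accounts for it explicitly.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A result of None means the byte sequence cannot be a conforming MicroXML document; a decoded string still has to be checked against the document production.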

[Unicode] says that canonically equivalent sequences of characters ought to be treated as identical. However, documents that are canonically equivalent according to Unicode but that use distinct code point sequences are considered distinct by MicroXML parsers. This gives rise to the possibility that the user might unintentionally create sequences of characters that are canonically equivalent but are treated as distinct by MicroXML parsers. To avoid this possibility, all documents SHOULD be in Normalization Form C as described by [Unicode].

4.2 Parser Conformance

For any sequence of bytes, a conforming MicroXML parser MUST be able to report correctly whether it is a conforming MicroXML document. If it is a conforming MicroXML document, then a conforming MicroXML parser MUST be able to report the correct abstract data model for the document.

In some contexts, it may be appropriate for a conforming MicroXML parser to operate on sequences of characters rather than sequences of bytes. In this case, the conformance requirement of the preceding paragraph applies to sequences of characters rather than bytes.

A conforming MicroXML parser is free to use any data structure to represent the abstract data model, provided that the data structure provides the same information as the abstract data model.

A MicroXML parser MAY perform error correction, by providing an abstract data model even for sequences of bytes that are not conforming MicroXML documents. It MUST, however, still comply with the requirement of the first paragraph to report that the sequence of bytes is not a conforming MicroXML document.

5 Security Considerations

MicroXML does not provide any built-in service for integrity. Integrity services have been defined for XML using XML Canonicalization [RFC3076] and XML Signatures [RFC3275]. These can be applied to MicroXML, with a number of caveats. First, the XPath data model that is input into XML Canonicalization MUST be constructed using a conforming MicroXML parser rather than an XML parser. This is because an XML parser will normalize some attribute values in a way that is incompatible with MicroXML, as detailed in Appendix B.2. Second, canonicalization MUST NOT use the option to include comments, since the MicroXML data model does not preserve comments. Third, although the canonicalization of a MicroXML document will be a well-formed XML document, it will not always be a conforming MicroXML document, because the definition of XML canonicalization does not escape > characters in attribute values; this will not usually be a problem because the output of canonicalization is typically used only as input to a digest algorithm.

MicroXML documents are encoded in UTF-8; the security considerations described in [RFC3629] are therefore applicable to MicroXML.

6 Notation

The formal grammar of MicroXML is given in this document using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form symbol ::= expression.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

#xN
where N is a hexadecimal integer, the expression matches the character in Unicode whose code point has the value indicated.
[a-zA-Z], [#xN-#xN]
matches any character with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any character with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
"string"
matches a literal string matching the one given inside the double quotes.
'string'
matches a literal string matching the one given inside the single quotes.

These symbols can be combined to match more complex patterns as follows, where A and B represent expressions:

(A)
expression is treated as a unit and can be combined as described in this list.
A?
matches A or nothing; OPTIONAL A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. This operation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. This operation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).

Appendix A: References

While these references cite a particular edition of a specification, conforming implementations of MicroXML MAY support later editions either in addition or as replacements, thus allowing MicroXML users to benefit from corrections and extensions to the other specifications on which it depends.

A.1 Normative References

RFC 2119
IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
RFC3629
IETF (Internet Engineering Task Force). RFC3629: UTF-8, a transformation format of ISO 10646. F. Yergeau, 2003. (See http://www.ietf.org/rfc/rfc3629.txt.)
Unicode
The Unicode Consortium. The Unicode Standard, Version 6.0.0. Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-1-936213-01-6.

A.2 Informative References

JSON
IETF (Internet Engineering Task Force). RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON). D. Crockford, 2006. (See http://www.ietf.org/rfc/rfc4627.txt.)
INFOSET
W3C (World Wide Web Consortium). XML Information Set. John Cowan and Richard Tobin. (See http://www.w3.org/TR/xml-infoset.)
RFC3076
IETF (Internet Engineering Task Force). RFC 3076: Canonical XML Version 1.0. J. Boyer, 2001. (See http://www.ietf.org/rfc/rfc3076.txt.)
RFC3275
IETF (Internet Engineering Task Force). RFC 3275: (Extensible Markup Language) XML-Signature Syntax and Processing. D. Eastlake, J. Reagle, and D. Solo, 2002. (See http://www.ietf.org/rfc/rfc3275.txt.)
XML
W3C (World Wide Web Consortium). Extensible Markup Language (XML) 1.0 (Fifth Edition). Tim Bray et al., 2008. (See http://www.w3.org/TR/xml/.)

Appendix B: Relationship to XML (informative)

B.1 Syntax

Relative to XML 1.0 Fifth Edition, MicroXML prohibits:

MicroXML parsers are not required to use draconian error handling.

B.2 Data Model

The MicroXML data model corresponds to the following information items and properties from the XML information set:

MicroXML's data model is incompatible with XML in one respect: XML requires that literal newlines and tabs in attribute values are normalized into spaces, but MicroXML leaves them unchanged. For example, in XML

<doc att="hello
world"/>

and

<doc att="hello world"/>

have the same information set, but in MicroXML they do not. Note that this incompatibility cannot in general be fixed by postprocessing, since XML does not normalize newlines and tabs in attribute values that were entered as numeric character references, and the MicroXML data model does not provide information about which characters were entered as numeric character references.
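
The XML side of this difference can be observed with an ordinary XML parser. The sketch below uses xml.etree.ElementTree from the Python standard library, which is an XML parser and not a MicroXML parser; it is offered only to illustrate the normalization behaviour described above.

```python
import xml.etree.ElementTree as ET

# A literal line feed in an attribute value is normalized to a space by XML.
literal = ET.fromstring('<doc att="hello\nworld"/>')
spaced = ET.fromstring('<doc att="hello world"/>')

# A line feed entered as a numeric character reference is not normalized.
referenced = ET.fromstring('<doc att="hello&#xA;world"/>')

print(repr(literal.get("att")))
print(repr(spaced.get("att")))
print(repr(referenced.get("att")))
```

In the XML parser's output the first two values are indistinguishable, while a MicroXML parser would report a line feed for the first and a space for the second. The third value shows why postprocessing cannot recover the difference.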