XML-ER

4 May 2012

This Version:: http://dvcs.w3.org/hg/xml-er/raw-file/tip/Overview.html
Participate:: public-xml-er (archives); IRC: #whatwg on Freenode
Editor:: Anne van Kesteren (Opera Software ASA) <annevk@annevk.nl>

To the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work. In addition, as of 4 May 2012, the editors have made this specification available under the Open Web Foundation Agreement Version 1.0, which is available at http://www.openwebfoundation.org/legal/the-owf-1-0-agreements/owfa-1-0.

1 Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

2 Writing XML documents

...

3 Parsing XML documents

This section and its subsections define the XML parser.

This specification defines the parsing rules for XML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must terminate processing at the first error that they encounter for which they do not wish to apply the rules described below.

3.1 Overview

The input to the XML parsing process consists of a stream of octets which is converted to a stream of code points, which in turn are tokenized, and finally those tokens are used to construct a tree.

3.2 Input stream

The stream of Unicode characters that constitutes the input to the tokenization stage will be initially seen by the user agent as a stream of octets (typically coming over the network or from the local file system). The octets encode Unicode code points according to a particular encoding, which the user agent must use to decode the octets into code points.
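The decoding step can be sketched as follows. This is only an illustration under stated assumptions: the encoding is taken as already known (this specification does not yet define how to find it), UTF-8 is an assumed default, malformed octets are replaced with U+FFFD (also an assumption), and `decode_octets` is a hypothetical helper name. An incremental decoder is used so that a code point split across two chunks is still decoded correctly.

```python
import codecs
from typing import Iterable, Iterator


def decode_octets(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield code points decoded from a stream of octet chunks.

    Assumption: the encoding has already been determined elsewhere.
    Assumption: malformed octets are replaced rather than rejected.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        # The decoder buffers an incomplete trailing sequence until the next chunk.
        yield from decoder.decode(chunk)
    # Flush whatever is still buffered once the stream ends.
    yield from decoder.decode(b"", final=True)
```

For example, the two chunks `b"<a>\xc3"` and `b"\xa9</a>"` decode to the same code points as the single chunk `b"<a>\xc3\xa9</a>"`.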

Define how to find the encoding...

3.3 Tokenization

Implementations must act as if they used the following state machine to tokenize XML. The state machine must start in the data state. Most states consume a single character, which can have various side-effects, and then either switch the state machine to a new state to reconsume the same character, or switch it to a new state (to consume the next character), or repeat the same state (to consume the next character). Some states have more complicated behaviour and can consume several characters before switching to another state.

The output of the tokenization stage is a series of zero or more of the following tokens: start tag, empty tag, end tag, short end tag, comment, character, processing instruction and end-of-file. Start and empty tag tokens have a tag name and a list of attributes, each of which has a name and a value. End tags have a tag name. Comment and character tokens have data. Processing instructions have a name and data.
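One possible in-memory shape for these tokens is sketched below. The class and field names are illustrative assumptions, not part of this specification; only the fields each token carries are taken from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class StartTag:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class EmptyTag:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class EndTag:
    name: str


@dataclass
class ShortEndTag:
    pass  # carries no data


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str


@dataclass
class ProcessingInstruction:
    name: str
    data: str = ""


@dataclass
class EndOfFile:
    pass  # carries no data


# Hypothetical union type covering everything the tokenizer can emit.
Token = Union[StartTag, EmptyTag, EndTag, ShortEndTag, Comment,
              Character, ProcessingInstruction, EndOfFile]
```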

The tokenization stage also uses a list of entities and a list of parameter entities. Both lists are populated with tokens consisting of a name and value during the tokenization stage and are also used within this stage.

Whenever the steps below indicate that the user agent has to append an entity, the entity has to be appended to the list of entities, unless the entity flag has been set to "parameter", in which case it has to be appended to the list of parameter entities. The entity flag has two values: "normal" and "parameter". Its default value is "normal". It is set to "normal" after an entity has been appended.
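The effect of the entity flag can be sketched as below. `Entity` and `EntityTables` are hypothetical names introduced for illustration; the behaviour follows the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    value: str = ""


@dataclass
class EntityTables:
    entities: List[Entity] = field(default_factory=list)
    parameter_entities: List[Entity] = field(default_factory=list)
    entity_flag: str = "normal"  # either "normal" or "parameter"

    def append_entity(self, entity: Entity) -> None:
        """Append to the list selected by the entity flag, then reset the flag."""
        if self.entity_flag == "parameter":
            self.parameter_entities.append(entity)
        else:
            self.entities.append(entity)
        self.entity_flag = "normal"
```

Because the flag is reset on every append, a parameter entity declaration affects only the entity that follows it.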

The tokenization stage also has a list of attribute declarations each consisting of a tag name and a list of attributes which consist of an attribute name, type and default value.

Data state

Consume the next input character:

U+0026 (&): ...
U+003C (<): Switch to the tag state.
EOF: Emit an end-of-file token.
Anything else: Emit the input character as a character token. Stay in this state.

Tag state

Consume the next input character:

U+002F (/): Switch to the end tag state.
U+003F (?): Switch to the pi state.
U+0021 (!): Switch to the markup declaration state.
U+0009
U+000A
U+0020
U+003A (:)
U+003C (<)
U+003E (>)
EOF: Parse error. Emit a U+003C (<) character token. Reconsume the current input character in the data state.
Anything else: Create a new tag token and set its name to the input character, then switch to the tag name state.

End tag state

Consume the next input character:

U+003E (>): Emit a short end tag token and then switch to the data state.
U+0009
U+000A
U+0020
U+003C (<)
U+003A (:)
EOF: Parse error. Emit a U+003C (<) character token and a U+002F (/) character token. Reconsume the current input character in the data state.
Anything else: Create an end tag token and set its name to the input character, then switch to the end tag name state.

End tag name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the end tag name after state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
U+003E (>): Emit the current token and then switch to the data state.
Anything else: Append the current input character to the tag name and stay in the current state.

End tag name after state

Consume the next input character:

U+003E (>): Emit the current token and then switch to the data state.
U+0009
U+000A
U+0020: Stay in the current state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Parse error. Stay in the current state.
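The three end tag states above can be sketched as a single loop. This is an illustrative sketch, not a complete tokenizer: tokens are plain tuples, EOF is modelled as `None`, and `tokenize_end_tag` is a hypothetical helper that is entered just after `</` has been consumed.

```python
from typing import List, Optional, Tuple

WHITESPACE = {"\t", "\n", " "}


def tokenize_end_tag(text: str, pos: int) -> Tuple[List[tuple], int, int]:
    """Run the end tag states from text[pos], just after "</".

    Returns (tokens, next_pos, parse_errors); the data state resumes at next_pos.
    """
    tokens: List[tuple] = []
    errors = 0
    state = "end tag"
    name = ""
    while True:
        ch: Optional[str] = text[pos] if pos < len(text) else None  # None is EOF
        if state == "end tag":
            if ch == ">":
                tokens.append(("short end tag",))
                return tokens, pos + 1, errors
            if ch is None or ch in WHITESPACE or ch in {"<", ":"}:
                errors += 1
                tokens += [("character", "<"), ("character", "/")]
                return tokens, pos, errors  # ch is reconsumed in the data state
            name = ch
            state = "end tag name"
        elif state == "end tag name":
            if ch is None:
                errors += 1
                tokens.append(("end tag", name))
                return tokens, pos, errors
            if ch in WHITESPACE:
                state = "end tag name after"
            elif ch == ">":
                tokens.append(("end tag", name))
                return tokens, pos + 1, errors
            else:
                name += ch
        else:  # end tag name after
            if ch is None:
                errors += 1
                tokens.append(("end tag", name))
                return tokens, pos, errors
            if ch == ">":
                tokens.append(("end tag", name))
                return tokens, pos + 1, errors
            if ch not in WHITESPACE:
                errors += 1  # parse error, the character is skipped
        pos += 1
```

Returning the position instead of a new state keeps the sketch small; a full implementation would share one loop across all states.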

Pi state

Consume the next input character:

U+0009
U+000A
U+0020
EOF: Parse error. Reprocess the current input character in the bogus comment state.
Anything else: Create a new processing instruction token. Set target to the current input character and data to the empty string. Then switch to the pi target state.

Pi target state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the pi target after state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
U+003F (?): Switch to the pi after state.
Anything else: Append the current input character to the pi's target and stay in the current state.

Pi target after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
Anything else: Reprocess the current input character in the pi data state.

Pi data state

Consume the next input character:

U+003F (?): Switch to the pi after state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Append the current input character to the pi's data and stay in the current state.

Pi after state

Consume the next input character:

U+003E (>): Emit the current token and then switch to the data state.
U+003F (?): Append the current input character to the pi's data and stay in the current state.
Anything else: Reprocess the current input character in the pi data state.
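The five pi states above can be sketched as one loop. This is a sketch under assumptions: tokens are plain tuples, EOF is modelled as `None`, `tokenize_pi` is a hypothetical helper entered just after `<?`, and a result of `None` for the token means the input has to be handed to the bogus comment state.

```python
from typing import Optional, Tuple

WHITESPACE = {"\t", "\n", " "}


def tokenize_pi(text: str, pos: int) -> Tuple[Optional[tuple], int, int]:
    """Run the pi states from text[pos], just after "<?".

    Returns (token, next_pos, parse_errors).
    """
    state = "pi"
    target = ""
    data = ""
    errors = 0
    while True:
        ch: Optional[str] = text[pos] if pos < len(text) else None  # None is EOF
        if state == "pi":
            if ch is None or ch in WHITESPACE:
                # Parse error; ch is reprocessed in the bogus comment state.
                return None, pos, errors + 1
            target = ch
            state = "pi target"
        elif state == "pi target":
            if ch is None:
                return ("processing instruction", target, data), pos, errors + 1
            if ch in WHITESPACE:
                state = "pi target after"
            elif ch == "?":
                state = "pi after"
            else:
                target += ch
        elif state == "pi target after":
            if ch is None or ch not in WHITESPACE:
                state = "pi data"
                continue  # reprocess ch without advancing
        elif state == "pi data":
            if ch is None:
                return ("processing instruction", target, data), pos, errors + 1
            if ch == "?":
                state = "pi after"
            else:
                data += ch
        else:  # pi after
            if ch == ">":
                return ("processing instruction", target, data), pos + 1, errors
            if ch == "?":
                data += ch
            else:
                # The rule only reprocesses ch; the pending "?" is not appended.
                state = "pi data"
                continue
        pos += 1
```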

Markup declaration state

If the next two characters are both U+002D (-) characters, consume those two characters, create a comment token whose data is the empty string and then switch to the comment state.

Otherwise, if the next seven characters are an exact match for "[CDATA[", then consume those characters and switch to the CDATA state.

Otherwise, if the next seven characters are an exact match for "DOCTYPE", then this is a parse error. Consume those characters and switch to the DOCTYPE state.

Otherwise, this is a parse error. Switch to the bogus comment state.
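The lookahead in the markup declaration state can be sketched as follows. `markup_declaration` is a hypothetical helper entered just after `<!`; it returns the name of the next state, the position after any consumed characters, and whether a parse error occurred.

```python
from typing import Tuple


def markup_declaration(text: str, pos: int) -> Tuple[str, int, bool]:
    """Decide the next state from the characters at text[pos:].

    Returns (next_state, next_pos, parse_error). Matching is exact,
    so "doctype" in lower case falls through to the bogus comment state.
    """
    if text.startswith("--", pos):
        return "comment", pos + 2, False
    if text.startswith("[CDATA[", pos):
        return "CDATA", pos + 7, False
    if text.startswith("DOCTYPE", pos):
        return "DOCTYPE", pos + 7, True
    # Nothing is consumed here; the bogus comment state reads from pos.
    return "bogus comment", pos, True
```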

Comment state

Consume the next input character:

U+002D (-): Switch to the comment dash state.
EOF: Parse error. Emit the comment token and then reprocess the current input character in the data state.
Anything else: Append the current character to the comment data.

Comment dash state

Consume the next input character:

U+002D (-): Switch to the comment end state.
EOF: Parse error. Emit the comment token and then reprocess the current input character in the data state.
Anything else: Append a U+002D (-) and the current input character to the comment token's data. Switch to the comment state.

Comment end state

Consume the next input character:

U+003E (>): Emit the comment token. Switch to the data state.
U+002D (-): Append the current input character to the comment token's data. Stay in the current state.
EOF: Parse error. Emit the comment token and then reprocess the current input character in the data state.
Anything else: Append two U+002D (-) characters and the current input character to the comment token's data. Switch to the comment state.

CDATA state

Consume the next input character:

U+005D (]): Switch to the CDATA bracket state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Emit the current input character as a character token. Stay in the current state.

CDATA bracket state

Consume the next input character:

U+005D (]): Switch to the CDATA end state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Emit a U+005D (]) character as a character token and also emit the current input character as a character token. Switch to the CDATA state.

CDATA end state

Consume the next input character:

U+003E (>): Switch to the data state.
U+005D (]): Emit the current input character as a character token. Stay in the current state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Emit two U+005D (]) characters as character tokens and also emit the current input character as a character token. Switch to the CDATA state.

DOCTYPE state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE root name before state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Reprocess the current input character in the bogus comment state.

DOCTYPE root name before state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003E (>): Switch to the data state.
EOF: Parse error. Switch to the data state.
Anything else: Switch to the DOCTYPE root name state.

DOCTYPE root name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE root name after state.
U+003E (>): Switch to the data state.
U+005B ([): Switch to the DOCTYPE internal subset state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE root name after state

Consume the next input character:

U+003E (>): Switch to the data state.
U+0022 ("): Switch to the DOCTYPE identifier double quoted state.
U+0027 ('): Switch to the DOCTYPE identifier single quoted state.
U+005B ([): Switch to the DOCTYPE internal subset state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE identifier double quoted state

Consume the next input character:

U+0022 ("): Switch to the DOCTYPE root name after state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE identifier single quoted state

Consume the next input character:

U+0027 ('): Switch to the DOCTYPE root name after state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE internal subset state

Consume the next input character:

U+003C (<): Switch to the DOCTYPE tag state.
EOF: Parse error. Reprocess the current input character in the data state.
U+0025 (%): consume parameter entity
U+005D (]): Switch to the DOCTYPE internal subset after state.
Anything else: Stay in the current state.

DOCTYPE internal subset after state

Consume the next input character:

U+003E (>): Switch to the data state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE tag state

Consume the next input character:

U+0021 (!): Switch to the DOCTYPE markup declaration state.
U+003F (?): Switch to the DOCTYPE pi state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE markup declaration state

If the next two characters are both U+002D (-) characters, then consume those characters and switch to the DOCTYPE comment state.

Otherwise, if the next six characters are an exact match for "ENTITY", then consume those characters and switch to the DOCTYPE ENTITY state.

Otherwise, if the next seven characters are an exact match for "ATTLIST", then consume those characters and switch to the DOCTYPE ATTLIST state.

Otherwise, if the next eight characters are an exact match for "NOTATION", then consume those characters and switch to the DOCTYPE NOTATION state.

Otherwise, switch to the DOCTYPE bogus comment state.

DOCTYPE comment state

Consume the next input character:

U+002D (-): Switch to the DOCTYPE comment dash state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE comment dash state

Consume the next input character:

U+002D (-): Switch to the DOCTYPE comment end state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE comment state.

DOCTYPE comment end state

Consume the next input character:

U+003E (>): Switch to the DOCTYPE internal subset state.
U+002D (-): Switch to the DOCTYPE comment dash state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE comment state.

DOCTYPE ENTITY state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ENTITY type before state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE ENTITY type before state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+0025 (%): Switch to the DOCTYPE ENTITY parameter before state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Create an entity token with the name set to the current input character and the value set to the empty string. Then switch to the DOCTYPE ENTITY name state.

DOCTYPE ENTITY parameter before state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ENTITY parameter state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE ENTITY parameter state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Create an entity token with the name set to the current input character and the value set to the empty string. Set the entity flag to "parameter". Switch to the DOCTYPE ENTITY name state.

DOCTYPE ENTITY name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ENTITY name after state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Append the current input character to the name of the entity.

DOCTYPE ENTITY name after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+0022 ("): Switch to the DOCTYPE ENTITY value double quoted state.
U+0027 ('): Switch to the DOCTYPE ENTITY value single quoted state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Switch to the DOCTYPE ENTITY identifier state.

DOCTYPE ENTITY value double quoted state

Consume the next input character:

U+0022 ("): Switch to the DOCTYPE ENTITY value after state.
U+0026 (&): ... normalize numeric entities only
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Append the current input character to the current entity token's value.

DOCTYPE ENTITY value single quoted state

Consume the next input character:

U+0027 ('): Switch to the DOCTYPE ENTITY value after state.
U+0026 (&): ... normalize numeric entities only
EOF: Switch to the data state.
Anything else: Append the current input character to the current entity token's value.

DOCTYPE ENTITY value after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003E (>): Append an entity. Switch to the DOCTYPE internal subset state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE ENTITY identifier state

Consume the next input character:

U+003E (>): append entity ...
U+0022 ("): Switch to the DOCTYPE ENTITY identifier double quoted state.
U+0027 ('): Switch to the DOCTYPE ENTITY identifier single quoted state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE ENTITY identifier double quoted state

Consume the next input character:

U+0022 ("): Switch to the DOCTYPE ENTITY identifier state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE ENTITY identifier single quoted state

Consume the next input character:

U+0027 ('): Switch to the DOCTYPE ENTITY identifier state.
EOF: Parse error. Reconsume the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE ATTLIST state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ATTLIST name before state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE ATTLIST name before state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: ...

DOCTYPE ATTLIST name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ATTLIST name after state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST name after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003E (>): Switch to the DOCTYPE internal subset state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST attribute name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ATTLIST attribute name after state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST attribute name after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST attribute type state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ATTLIST attribute type after state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST attribute type after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+0023 (#): Switch to the DOCTYPE ATTLIST attribute declaration before state.
EOF: Switch to the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE ATTLIST attribute declaration before state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE bogus comment state.
EOF: Switch to the data state.
Anything else: Switch to the DOCTYPE ATTLIST attribute declaration state.

DOCTYPE ATTLIST attribute declaration state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE ATTLIST attribute declaration after state.
EOF: Switch to the data state.
Anything else: Stay in the current state.

DOCTYPE ATTLIST attribute declaration after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003E (>): Switch to the DOCTYPE internal subset state.
U+0022 ("): Switch to the DOCTYPE ATTLIST attribute value double quoted state.
U+0027 ('): Switch to the DOCTYPE ATTLIST attribute value single quoted state.
EOF: Switch to the data state.
Anything else: ...

DOCTYPE ATTLIST attribute value double quoted state

Consume the next input character:

U+0022 ("): Switch to the DOCTYPE ATTLIST name after state.
U+0026 (&): ...
Anything else: ...

DOCTYPE ATTLIST attribute value single quoted state

Consume the next input character:

U+0027 ('): Switch to the DOCTYPE ATTLIST name after state.
U+0026 (&): ...
Anything else: ...

DOCTYPE NOTATION state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the DOCTYPE NOTATION identifier state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE bogus comment state.

DOCTYPE NOTATION identifier state

Consume the next input character:

U+003E (>): Switch to the DOCTYPE internal subset state.
U+0022 ("): Switch to the DOCTYPE NOTATION identifier double quoted state.
U+0027 ('): Switch to the DOCTYPE NOTATION identifier single quoted state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE NOTATION identifier double quoted state

Consume the next input character:

U+0022 ("): Switch to the DOCTYPE NOTATION identifier state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE NOTATION identifier single quoted state

Consume the next input character:

U+0027 ('): Switch to the DOCTYPE NOTATION identifier state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE pi state

Consume the next input character:

U+003F (?): Switch to the DOCTYPE pi after state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Stay in the current state.

DOCTYPE pi after state

Consume the next input character:

U+003E (>): Switch to the DOCTYPE internal subset state.
U+003F (?): Stay in the current state.
EOF: Parse error. Reprocess the current input character in the data state.
Anything else: Switch to the DOCTYPE pi state.

DOCTYPE bogus comment state

Consume every character up to the first U+003E (>) or EOF, whichever comes first. Emit a comment token whose data is the concatenation of all those consumed characters. Then consume the next input character and switch to the DOCTYPE internal subset state, reprocessing the EOF character if that was the character consumed.

Tag name state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the tag attribute name before state.
U+003E (>): Emit the current token and then switch to the data state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
U+002F (/): Switch to the empty tag state.
Anything else: Append the current input character to the tag name and stay in the current state.

Empty tag state

Consume the next input character:

U+003E (>): Emit the current tag token as empty tag token and then switch to the data state.
Anything else: Parse error. Reprocess the current input character in the tag attribute name before state.

Tag attribute name before state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003E (>): Emit the current token and then switch to the data state.
U+002F (/): Switch to the empty tag state.
U+003A (:): Parse error. Stay in the current state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Start a new attribute in the current tag token. Set that attribute's name to the current input character and its value to the empty string and then switch to the tag attribute name state.

Tag attribute name state

Consume the next input character:

U+003D (=): Switch to the tag attribute value before state.
U+003E (>): Emit the current token as start tag token. Switch to the data state.
U+0009
U+000A
U+0020: Switch to the tag attribute name after state.
U+002F (/): Switch to the empty tag state.
EOF: Parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
Anything else: Append the current input character to the current attribute's name. Stay in the current state.

When the user agent leaves this state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
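The duplicate-attribute rule above can be sketched as follows. `Attr` and `TagToken` are hypothetical names for illustration. The sketch keeps the attribute being built separate from the committed list, so that a dropped attribute can still absorb, and thereby discard, the value characters that follow it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attr:
    name: str
    value: str = ""


@dataclass
class TagToken:
    name: str
    attributes: List[Attr] = field(default_factory=list)
    current: Optional[Attr] = None  # attribute being built
    parse_errors: int = 0

    def start_attribute(self, first_char: str) -> None:
        self.current = Attr(first_char)

    def leave_attribute_name(self) -> None:
        """Call when the tokenizer leaves the tag attribute name state."""
        assert self.current is not None
        if any(a.name == self.current.name for a in self.attributes):
            # Duplicate: parse error. The attribute stays detached, so any
            # value appended to it later never reaches the token.
            self.parse_errors += 1
        else:
            self.attributes.append(self.current)

    def append_to_value(self, ch: str) -> None:
        if self.current is not None:
            self.current.value += ch
```

With this shape, a tag such as `<a x="1" y="2" x="3">` keeps `x="1"` and `y="2"` and records one parse error.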

Tag attribute name after state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+003D (=): Switch to the tag attribute value before state.
U+003E (>): Emit the current token and then switch to the data state.
U+002F (/): Switch to the empty tag state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Start a new attribute in the current tag token. Set that attribute's name to the current input character and its value to the empty string and then switch to the tag attribute name state.

Tag attribute value before state

Consume the next input character:

U+0009
U+000A
U+0020: Stay in the current state.
U+0022 ("): Switch to the tag attribute value double quoted state.
U+0027 ('): Switch to the tag attribute value single quoted state.
U+0026 (&): Reprocess the input character in the tag attribute value unquoted state.
U+003E (>): Emit the current token and then switch to the data state.
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Append the current input character to the current attribute's value and then switch to the tag attribute value unquoted state.

Tag attribute value double quoted state

Consume the next input character:

U+0022 ("): Switch to the tag attribute name before state.
U+0026 (&): ...
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Append the input character to the current attribute's value. Stay in the current state.

Tag attribute value single quoted state

Consume the next input character:

U+0027 ('): Switch to the tag attribute name before state.
U+0026 (&): ...
EOF: Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else: Append the input character to the current attribute's value. Stay in the current state.

Tag attribute value unquoted state

Consume the next input character:

U+0009
U+000A
U+0020: Switch to the tag attribute name before state.
U+0026 (&): ...
U+003E (>): Emit the current token as start tag token and then switch to the data state.
EOF: Parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
Anything else: Append the input character to the current attribute's value. Stay in the current state.

Bogus comment state

Consume every character up to the first U+003E (>) or EOF, whichever comes first. Emit a comment token whose data is the concatenation of all those consumed characters. Then consume the next input character and switch to the data state, reprocessing the EOF character if that was the character consumed.
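Because the bogus comment state only scans for the first U+003E (>), it can be sketched without a loop. `bogus_comment` is a hypothetical helper; tokens are plain tuples and EOF is represented by a position equal to the length of the input.

```python
from typing import Tuple


def bogus_comment(text: str, pos: int) -> Tuple[tuple, int]:
    """Consume from text[pos] up to the first ">" or EOF.

    Returns (comment_token, next_pos); the data state resumes at next_pos.
    """
    end = text.find(">", pos)
    if end == -1:
        # EOF came first: next_pos points at EOF, which the data state reprocesses.
        return ("comment", text[pos:]), len(text)
    # The ">" itself is consumed and is not part of the comment data.
    return ("comment", text[pos:end]), end + 1
```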

3.4 Tree construction

The input to the tree construction stage is a sequence of tokens from the tokenization stage. The output of this stage is a tree model represented by a Document object.

The tree construction stage passes through several phases. The initial phase is the start phase.

The stack of open elements contains all elements of which the closing tag has not yet been encountered. Once the first start tag token in the start phase is encountered it will contain one open element. The rest of the elements are added during the main phase.

The current element is the bottommost node in the stack of open elements.

The stack of open elements is said to have an element in scope if the target element is in the stack of open elements.

When the steps below require the user agent to append a character to a node, the user agent must collect it and all subsequent consecutive characters that would be appended to that node and insert a single Text node whose data is the concatenation of all those characters.
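One way to get this behaviour without buffering is to extend the last child when it is already a Text node, which yields the same tree as collecting the consecutive characters first. The sketch below uses hypothetical `Text` and `Element` classes rather than a real DOM implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    data: str


@dataclass
class Element:
    name: str
    children: List[Union["Element", Text]] = field(default_factory=list)


def append_character(node: Element, ch: str) -> None:
    """Append ch to node, merging it into a trailing Text node if there is one."""
    last = node.children[-1] if node.children else None
    if isinstance(last, Text):
        last.data += ch
    else:
        node.children.append(Text(ch))
```

Characters separated by another node, such as a child element, end up in separate Text nodes, because the trailing child is then no longer a Text node.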

Need to define create an element for the token...

When the steps below require the user agent to insert an element for a token, the user agent must create an element for the token, append it to the current element, and push it onto the stack of open elements so that it becomes the new current element.

Start phase

Each token emitted from the tokenization stage must be processed as follows until the algorithm below switches to a different phase:

A start tag token: Create an element for the token and then append it to the Document node and push it into the stack of open elements. This element is the root element and the first current element. Then switch to the main phase.
An empty tag token: Create an element for the token and append it to the Document node. Then switch to the end phase.
A comment token: Append a Comment node to the Document node with the data attribute set to the data given in the token.
A processing instruction token: Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.
An end-of-file token: Parse error. Reprocess the token in the end phase.
Anything else: Parse error. Ignore the token.

Main phase

Once a start tag token has been encountered (as detailed in the previous phase), each token must be processed using the following steps until further notice:

A character token

Append a character to the current element.

A start tag token

Insert an element for the token.

An empty tag token

Create an element for the token and append it to the current element.

An end tag token

If the tag name of the current node does not match the tag name of the end tag token, this is a parse error.

If there is an element in scope with the same tag name as that of the token, pop nodes from the stack of open elements until the first such element has been popped from the stack.

If there are no more elements on the stack of open elements at this point, switch to the end phase.

A short end tag token

Pop an element from the stack of open elements. If there are no more elements on the stack of open elements switch to the end phase.

A comment token

Append a Comment node to the current element with the data attribute set to the data given in the token.

A processing instruction token

Append a ProcessingInstruction node to the current element with the target and data attributes set to the target and data given in the token.

An end-of-file token

Parse error. Reprocess the token in the end phase.
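The stack handling in the main phase can be sketched as below. This is a partial illustration: tokens are plain tuples, `Element` and `run_main_phase` are hypothetical names, attributes are ignored, and character, comment and processing instruction tokens are skipped because they do not affect the stack of open elements.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)


def run_main_phase(root: Element, tokens: Iterable[tuple]) -> Tuple[str, int]:
    """Process tokens with root as the only open element.

    Returns (phase, parse_errors), where phase is "end" once the main phase
    has been left and "main" if the tokens ran out first.
    """
    stack: List[Element] = [root]  # the last item is the current element
    errors = 0
    for tok in tokens:
        kind = tok[0]
        if kind == "start tag":
            el = Element(tok[1])
            stack[-1].children.append(el)
            stack.append(el)
        elif kind == "empty tag":
            stack[-1].children.append(Element(tok[1]))
        elif kind == "end tag":
            if stack[-1].name != tok[1]:
                errors += 1  # mismatch with the current node
            if any(el.name == tok[1] for el in stack):
                # Pop until the nearest matching element has been popped.
                while stack.pop().name != tok[1]:
                    pass
        elif kind == "short end tag":
            stack.pop()
        elif kind == "end-of-file":
            return "end", errors + 1  # parse error, then the end phase
        if not stack:
            return "end", errors
    return "main", errors
```

An end tag whose name is not in scope counts as a parse error and leaves the stack unchanged.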

End phase

Tokens in the end phase must be handled as follows:

A comment token: Append a Comment node to the Document node with the data attribute set to the data given in the token.
A processing instruction token: Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.
An end-of-file token: Stop parsing.
Anything else: Parse error. Ignore the token.

Once the user agent stops parsing the document, it must follow these steps:

References

[RFC2119]: Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. IETF.
