Table of contents
- 1 Conformance
- 2 Writing XML documents
- 3 Parsing XML documents
- 3.1 Overview
- 3.2 Input stream
- 3.3 Tokenization
- 3.4 Tree construction
- References
1 Conformance
All diagrams, examples, and notes in this specification are
non-normative, as are all sections explicitly marked non-normative.
Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in the normative parts of this document are to be
interpreted as described in RFC2119. For readability, these words do
not appear in all uppercase letters in this specification.
[RFC2119]
2 Writing XML documents
...
3 Parsing XML documents
This section and its subsections define the XML parser.
This specification defines the parsing rules for XML documents, whether
they are syntactically correct or not. Certain points in the parsing
algorithm are said to be parse errors. The
handling for parse errors is well-defined: user agents must either act as
described below when encountering such problems, or must terminate
processing at the first error that they encounter for which they do not wish
to apply the rules described below.
3.1 Overview
The input to the XML parsing process consists of a stream of octets which
is converted to a stream of code points, which in turn are tokenized, and
finally those tokens are used to construct a tree.
3.2 Input stream
The stream of Unicode characters that constitutes the input to the
tokenization stage will be initially seen by the user agent as a stream of
octets (typically coming over the network or from the local file system).
The octets encode Unicode code points according to a particular encoding,
which the user agent must use to decode the octets into code points.
Define how to find the encoding...
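As a minimal sketch of the decoding step, assuming the encoding has already been determined by some means not covered here (UTF-8 is used purely as a placeholder), octets can be decoded incrementally into code points. The function name `decode_octets` is hypothetical and not part of this specification.

```python
import codecs
from typing import Iterable, Iterator


def decode_octets(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield code points (as 1-character strings) decoded from octet chunks.

    The encoding is an assumption of this sketch; how to find it is not
    addressed here.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        # A multi-octet sequence may be split across chunks; the incremental
        # decoder buffers the incomplete tail until the next chunk arrives.
        yield from decoder.decode(chunk)
    # Flush any trailing partial sequence at the end of the input.
    yield from decoder.decode(b"", final=True)
```

An incremental decoder is used because octets arriving over the network may split a multi-octet sequence at an arbitrary point.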
3.3 Tokenization
Implementations must act as if they used the following
state machine to tokenize XML. The state machine must
start in the data state. Most states consume a single character,
which can have various side effects, and then either switch the state machine
to a new state to reconsume the same character, switch it to a new state
(to consume the next character), or repeat the same state (to consume the
next character). Some states have more complicated behaviour and can consume
several characters before switching to another state.
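The consume, reconsume, and stay transitions described above can be sketched as a small driver. This is a toy illustration rather than the full tokenizer: it covers only parts of the data, tag, and tag name states, and the class name `MiniTokenizer`, the tuple-shaped tokens, and the `EOF` sentinel are all assumptions of the sketch.

```python
from typing import List, Optional

EOF = None  # end-of-input sentinel (an assumption of this sketch)
TAG_STATE_ERRORS = {"\t", "\n", " ", ":", "<", ">"}


class MiniTokenizer:
    """Toy driver covering only the data, tag, and tag name states."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0
        self.state = self.data_state  # the machine starts in the data state
        self.tag_name = ""
        self.tokens: List[tuple] = []
        self.done = False

    def run(self) -> List[tuple]:
        while not self.done:
            char = self.text[self.pos] if self.pos < len(self.text) else EOF
            self.pos += 1  # consume the character
            self.state(char)
        return self.tokens

    def reconsume_in(self, state) -> None:
        self.pos -= 1  # un-consume so the new state sees the same character
        self.state = state

    def data_state(self, char: Optional[str]) -> None:
        if char == "<":
            self.state = self.tag_state
        elif char is EOF:
            self.tokens.append(("end-of-file",))
            self.done = True
        else:
            self.tokens.append(("character", char))  # stay in this state

    def tag_state(self, char: Optional[str]) -> None:
        if char is EOF or char in TAG_STATE_ERRORS:
            # Parse error: emit '<' and reconsume in the data state.
            self.tokens.append(("character", "<"))
            self.reconsume_in(self.data_state)
        else:
            # '/', '?' and '!' are not handled in this toy version.
            self.tag_name = char
            self.state = self.tag_name_state

    def tag_name_state(self, char: Optional[str]) -> None:
        if char == ">":
            self.tokens.append(("start tag", self.tag_name))
            self.state = self.data_state
        elif char is EOF:
            self.tokens.append(("start tag", self.tag_name))
            self.reconsume_in(self.data_state)
        else:
            # Whitespace and '/' handling are omitted from this toy version.
            self.tag_name += char
```

Storing the current state as a bound method keeps each state's rules in one place; "reconsume" is then simply a matter of stepping the read position back by one before the next state runs.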
The output of the tokenization stage is a series of zero or more of the
following tokens: start tag, empty tag, end tag, short end tag, comment,
character, processing instruction and end-of-file. Start and empty tag tokens
have a tag name and a list of attributes, each of which has a name and a
value. End tags have a tag name. Comment and character tokens have data.
Processing instruction tokens have a target and data.
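One possible way to model these tokens, shown only as a sketch: the class and field names below are assumptions, chosen to mirror the wording used in the tokenizer states (for example, a processing instruction's name is stored as `target`).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class StartTag:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class EmptyTag:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class EndTag:
    name: str


@dataclass
class ShortEndTag:
    pass  # carries no data


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str


@dataclass
class ProcessingInstruction:
    target: str
    data: str = ""


@dataclass
class EndOfFile:
    pass  # carries no data
```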
The tokenization stage also uses a list of entities and a
list of parameter entities. Both lists are populated with tokens
consisting of a name and value during the tokenization stage and are also used
within this stage.
Whenever the steps below indicate that the user agent has to
append an entity, the entity has to be appended to
the list of entities, unless the entity flag has been set to
"parameter", in which case it has to be appended to the list of parameter
entities. The entity flag has two values: "normal" and
"parameter". Its default value is "normal". It is set to "normal" after an
entity has been appended.
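The bookkeeping for appending an entity might look like the following sketch. The names `Entity`, `EntityTables`, and `append_entity` are hypothetical; only the two lists, the flag values, and the reset behaviour come from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    value: str = ""


@dataclass
class EntityTables:
    entities: List[Entity] = field(default_factory=list)
    parameter_entities: List[Entity] = field(default_factory=list)
    entity_flag: str = "normal"  # either "normal" or "parameter"

    def append_entity(self, entity: Entity) -> None:
        # The flag selects the destination list.
        if self.entity_flag == "parameter":
            self.parameter_entities.append(entity)
        else:
            self.entities.append(entity)
        # The flag is set back to "normal" after an entity has been appended.
        self.entity_flag = "normal"
```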
The tokenization stage also has a list of attribute declarations,
each consisting of a tag name and a list of attributes, each of which
consists of an attribute name, type and default value.
- Data state
-
Consume the next input character:
- U+0026 (
&
)
- ...
- U+003C (
<
)
- Switch to the tag state.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the input character as a character token. Stay in this state.
- Tag state
-
Consume the next input character:
- U+002F (
/
)
- Switch to the end tag state.
- U+003F (
?
)
- Switch to the pi state.
- U+0021 (
!
)
- Switch to the markup declaration state.
- U+0009
- U+000A
- U+0020
- U+003A (
:
)
- U+003C (
<
)
- U+003E (
>
)
- EOF
- Parse error. Emit a U+003C (
<
) character token.
Reconsume the current input character in the data state.
- Anything else
- Create a new tag token and set its name to the input character, then
switch to the tag name state.
- End tag state
-
Consume the next input character:
- U+003E (
>
)
- Emit a short end tag token and then switch to the data
state.
- U+0009
- U+000A
- U+0020
- U+003C (
<
)
- U+003A (
:
)
- EOF
- Parse error. Emit a U+003C (
<
) character
token and a U+002F (/
) character token. Reconsume the current
input character in the data state.
- Anything else
- Create an end tag token and set its name to the input character, then
switch to the end tag name state.
- End tag name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the end tag name after state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- U+003E (
>
)
- Emit the current token and then switch to the data
state.
- Anything else
- Append the current input character to the tag name and stay in the
current state.
- End tag name after state
-
Consume the next input character:
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Parse error. Stay in the current state.
- Pi state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- EOF
- Parse error. Reprocess the current input character in the
bogus comment state.
- Anything else
- Create a new processing instruction token. Set target to the current
input character and data to the empty string. Then switch to the pi
target state.
- Pi target state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the pi target after state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- U+003F (
?
)
- Switch to the pi after state.
- Anything else
- Append the current input character to the pi's target and stay in the
current state.
- Pi target after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- Anything else
- Reprocess the current input character in the pi data
state.
- Pi data state
-
Consume the next input character:
- U+003F (
?
)
- Switch to the pi after state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Append the current input character to the pi's data and stay in the
current state.
- Pi after state
-
Consume the next input character:
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- U+003F (
?
)
- Append the current input character to the pi's data and stay in the
current state.
- Anything else
- Reprocess the current input character in the pi data
state.
- Markup declaration state
-
If the next two characters are both U+002D (-
)
characters, consume those two characters, create a comment token whose
data is the empty string and then switch to the
comment state.
Otherwise, if the next seven characters are an exact match for
"[CDATA[
", then consume those characters and switch
to the CDATA state.
Otherwise, if the next seven characters are an exact match for
"DOCTYPE
", then this is a parse error.
Consume those characters and switch to the
DOCTYPE state.
Otherwise, this is a parse error. Switch to the
bogus comment state.
- Comment state
-
Consume the next input character:
- U+002D (
-
)
- Switch to the comment dash state.
- EOF
- Parse error. Emit the comment token and then reprocess the
current input character in the data state.
- Anything else
- Append the current input character to the comment token's data. Stay in
the current state.
- Comment dash state
-
Consume the next input character:
- U+002D (
-
)
- Switch to the comment end state.
- EOF
- Parse error. Emit the comment token and then reprocess the
current input character in the data state.
- Anything else
- Append a U+002D (
-
) and the current input character to the
comment token's data. Switch to the comment state.
- Comment end state
-
Consume the next input character:
- U+003E (
>
)
- Emit the comment token. Switch to the data state.
- U+002D (
-
)
- Append the current input character to the comment token's data. Stay in
the current state.
- EOF
- Parse error. Emit the comment token and then reprocess the
current input character in the data state.
- Anything else
- Append two U+002D (
-
) characters and the current input
character to the comment token's data. Switch to the comment
state.
- CDATA state
-
Consume the next input character:
- U+005D (
]
)
- Switch to the CDATA bracket state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Emit the current input character as a character token. Stay in the
current state.
- CDATA bracket state
-
Consume the next input character:
- U+005D (
]
)
- Switch to the CDATA end state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Emit a U+005D (
]
) character as a character token and also
emit the current input character as a character token. Switch to the CDATA
state.
- CDATA end state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the data state.
- U+005D (
]
)
- Emit the current input character as a character token. Stay in the
current state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Emit two U+005D (
]
) characters as character tokens and
also emit the current input character as a character token. Switch to the
CDATA state.
- DOCTYPE state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE root name before state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Reprocess the current input character in the bogus comment
state.
- DOCTYPE root name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>
)
- Switch to the data state.
- EOF
- Parse error. Switch to the data state.
- Anything else
- Switch to the DOCTYPE root name state.
- DOCTYPE root name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE root name after state.
- U+003E (
>
)
- Switch to the data state.
- U+005B (
[
)
- Switch to the DOCTYPE internal subset state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE root name after state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the data state.
- U+0022 (
"
)
- Switch to the DOCTYPE identifier double quoted state.
- U+0027 (
'
)
- Switch to the DOCTYPE identifier single quoted state.
- U+005B (
[
)
- Switch to the DOCTYPE internal subset state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE identifier double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the DOCTYPE root name after state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE identifier single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the DOCTYPE root name after state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE internal subset state
-
Consume the next input character:
- U+003C (
<
)
- Switch to the DOCTYPE tag state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- U+0025 (
%
)
- consume parameter entity
- U+005D (
]
)
- Switch to the DOCTYPE internal subset after state.
- Anything else
- Stay in the current state.
- DOCTYPE internal subset after state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the data state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE tag state
-
Consume the next input character:
- U+0021 (
!
)
- Switch to the DOCTYPE markup declaration state.
- U+003F (
?
)
- Switch to the DOCTYPE pi state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE markup declaration state
-
If the next two characters are both U+002D (-
) characters,
then consume those characters and switch to the DOCTYPE comment
state.
Otherwise, if the next six characters are an exact match for "ENTITY",
then consume those characters and switch to the DOCTYPE ENTITY
state.
Otherwise, if the next seven characters are an exact match for "ATTLIST",
then consume those characters and switch to the DOCTYPE ATTLIST
state.
Otherwise, if the next eight characters are an exact match for
"NOTATION", then consume those characters and switch to the DOCTYPE
NOTATION state.
Otherwise, switch to the DOCTYPE bogus comment state.
- DOCTYPE comment state
-
Consume the next input character:
- U+002D (
-
)
- Switch to the DOCTYPE comment dash state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE comment dash state
-
Consume the next input character:
- U+002D (
-
)
- Switch to the DOCTYPE comment end state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE comment state.
- DOCTYPE comment end state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the DOCTYPE internal subset state.
- U+002D (
-
)
- Switch to the DOCTYPE comment dash state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE comment state.
- DOCTYPE ENTITY state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY type before state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ENTITY type before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0025 (
%
)
- Switch to the DOCTYPE ENTITY parameter before state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Create an entity token with the name set to the current input character
and the value set to the empty string. Then switch to the DOCTYPE
ENTITY name state.
- DOCTYPE ENTITY parameter before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY parameter state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ENTITY parameter state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Create an entity token with the name set to the current input character
and the value set to the empty string. Set the entity flag to
"parameter". Switch to the DOCTYPE ENTITY name state.
- DOCTYPE ENTITY name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY name after state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Append the current input character to the name of the entity.
- DOCTYPE ENTITY name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0022 (
"
)
- Switch to the DOCTYPE ENTITY value double quoted state.
- U+0027 (
'
)
- Switch to the DOCTYPE ENTITY value single quoted state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE ENTITY identifier state.
- DOCTYPE ENTITY value double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the DOCTYPE ENTITY value after state.
- U+0026 (
&
):
- ... normalize numeric entities only
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Append the current input character to the current entity token's
value.
- DOCTYPE ENTITY value single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the DOCTYPE ENTITY value after state.
- U+0026 (
&
):
- ... normalize numeric entities only
- EOF
- Switch to the data state.
- Anything else
- Append the current input character to the current entity token's
value.
- DOCTYPE ENTITY value after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>
)
- Append an entity. Switch to the DOCTYPE internal
subset state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier state
-
Consume the next input character:
- U+003E (
>
)
- append entity ...
- U+0022 (
"
)
- Switch to the DOCTYPE ENTITY identifier double quoted state.
- U+0027 (
'
)
- Switch to the DOCTYPE ENTITY identifier single quoted state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the DOCTYPE ENTITY identifier state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the DOCTYPE ENTITY identifier state.
- EOF
- Parse error. Reconsume the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE ATTLIST state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST name before state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ATTLIST name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- ...
- DOCTYPE ATTLIST name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST name after state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>
)
- Switch to the DOCTYPE internal subset state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute name after state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute type state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute type after state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute type after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0023 (
#
)
- Switch to the DOCTYPE ATTLIST attribute declaration before state.
- EOF
- Switch to the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ATTLIST attribute declaration before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE bogus comment state.
- EOF
- Switch to the data state.
- Anything else
- Switch to the DOCTYPE ATTLIST attribute declaration
state.
- DOCTYPE ATTLIST attribute declaration state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute declaration after state.
- EOF
- Switch to the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ATTLIST attribute declaration after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>
)
- Switch to the DOCTYPE internal subset state.
- U+0022 (
"
)
- Switch to the DOCTYPE ATTLIST attribute value double quoted state.
- U+0027 (
'
)
- Switch to the DOCTYPE ATTLIST attribute value single quoted state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute value double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the DOCTYPE ATTLIST name after state.
- U+0026 (
&
):
- ...
- Anything else
- ...
- DOCTYPE ATTLIST attribute value single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the DOCTYPE ATTLIST name after state.
- U+0026 (
&
):
- ...
- Anything else
- ...
- DOCTYPE NOTATION state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE NOTATION identifier state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE NOTATION identifier state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the DOCTYPE internal subset state.
- U+0022 (
"
)
- Switch to the DOCTYPE NOTATION identifier double quoted state.
- U+0027 (
'
)
- Switch to the DOCTYPE NOTATION identifier single quoted state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE NOTATION identifier double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the DOCTYPE NOTATION identifier state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE NOTATION identifier single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the DOCTYPE NOTATION identifier state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE pi state
-
Consume the next input character:
- U+003F (
?
)
- Switch to the DOCTYPE pi after state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Stay in the current state.
- DOCTYPE pi after state
-
Consume the next input character:
- U+003E (
>
)
- Switch to the DOCTYPE internal subset state.
- U+003F (
?
)
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input character in the
data state.
- Anything else
- Switch to the DOCTYPE pi state.
- DOCTYPE bogus comment state
-
Consume every character up to the first U+003E (>
) or
EOF, whichever comes first. Emit a comment token whose data is the
concatenation of all those consumed characters. Then consume the next input
character and switch to the DOCTYPE internal subset state
reprocessing the EOF character if that was the character consumed.
- Tag name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name before state.
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- U+002F (
/
)
- Switch to the empty tag state.
- Anything else
- Append the current input character to the tag name and stay in the
current state.
- Empty tag state
-
Consume the next input character:
- U+003E (
>
)
- Emit the current tag token as an empty tag token and then switch to the
data state.
- Anything else
- Parse error. Reprocess the current input character in the
tag attribute name before state.
- Tag attribute name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- U+002F (
/
)
- Switch to the empty tag state.
- U+003A (
:
)
- Parse error. Stay in the current state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute's
name to the current input character and its value to the empty string and
then switch to the tag attribute name state.
- Tag attribute name state
-
Consume the next input character:
- U+003D (
=
)
- Switch to the tag attribute value before state.
- U+003E (
>
)
- Emit the current token as a start tag token. Switch to the data
state.
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name after state.
- U+002F (
/
)
- Switch to the empty tag state.
- EOF
- Parse error. Emit the current token as a start tag token and
then reprocess the current input character in the data
state.
- Anything else
- Append the current input character to the current attribute's name.
Stay in the current state.
When the user agent leaves this state (and before emitting the tag token,
if appropriate), the complete attribute's name must be
compared to the other attributes on the same token; if there is already an
attribute on the token with the exact same name, then this is a parse error
and the new attribute must be dropped, along with the
value that gets associated with it (if any).
- Tag attribute name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003D (
=
)
- Switch to the tag attribute value before state.
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- U+002F (
/
)
- Switch to the empty tag state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute's
name to the current input character and its value to the empty string and
then switch to the tag attribute name state.
- Tag attribute value before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0022 (
"
)
- Switch to the tag attribute value double quoted state.
- U+0027 (
'
)
- Switch to the tag attribute value single quoted state.
- U+0026 (
&
):
- Reprocess the input character in the tag attribute value unquoted
state.
- U+003E (
>
)
- Emit the current token and then switch to the data state.
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Append the current input character to the current attribute's value and
then switch to the tag attribute value unquoted state.
- Tag attribute value double quoted state
-
Consume the next input character:
- U+0022 (
"
)
- Switch to the tag attribute name before state.
- U+0026 (
&
)
- ...
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Tag attribute value single quoted state
-
Consume the next input character:
- U+0027 (
'
)
- Switch to the tag attribute name before state.
- U+0026 (
&
)
- ...
- EOF
- Parse error. Emit the current token and then reprocess the
current input character in the data state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Tag attribute value unquoted state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name before state.
- U+0026 (
&
):
- ...
- U+003E (
>
)
- Emit the current token as a start tag token and then switch to the
data state.
- EOF
- Parse error. Emit the current token as a start tag token and
then reprocess the current input character in the
data state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Bogus comment state
-
Consume every character up to the first U+003E (>
) or
EOF, whichever comes first. Emit a comment token whose data is the
concatenation of all those consumed characters. Then consume the next input
character and switch to the data state reprocessing the EOF
character if that was the character consumed.
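The bogus comment rule just above (consume everything up to the first `>` or EOF and emit it as one comment) can be sketched as a single scan. The function name `bogus_comment` and the convention that `pos` is the index of the first character to consume are assumptions of this sketch.

```python
from typing import Tuple


def bogus_comment(text: str, pos: int) -> Tuple[str, int]:
    """Return (comment_data, new_pos) for input starting at `pos`.

    `new_pos` points just past the '>' that ended the comment, or at
    len(text) when EOF was reached, so that the data state sees EOF next.
    """
    end = text.find(">", pos)
    if end == -1:
        # No '>' before EOF: everything left becomes comment data and the
        # EOF is left for the data state to reprocess.
        return text[pos:], len(text)
    # Consume up to the '>', then consume the '>' itself.
    return text[pos:end], end + 1
```

For instance, with the input `<!foo bar>rest` and scanning from index 2, the comment data is `foo bar` and tokenization resumes in the data state at `rest`.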
3.4 Tree construction
The input to the tree construction stage is a sequence of tokens from the
tokenization stage. The output of this stage is a tree model
represented by a Document
object.
The tree construction stage passes through several phases. The initial
phase is the start phase.
The stack of open elements contains all elements whose
closing tag has not yet been encountered. Once the first start tag token in
the start phase is encountered, it will contain one open
element. The rest of the elements are added during the
main phase.
The current element is the bottommost node in the
stack of open elements.
The stack of open elements is said to
have an element in scope if the target element is in the
stack of open elements.
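A minimal sketch of the stack of open elements follows, assuming that tag names stand in for element nodes and that the end of a Python list represents the bottommost node. The class name `OpenElements` and its method names are hypothetical.

```python
from typing import List, Optional


class OpenElements:
    """Stack of open elements; the last list item is the bottommost node."""

    def __init__(self) -> None:
        self._stack: List[str] = []  # tag names stand in for element nodes

    def push(self, name: str) -> None:
        self._stack.append(name)

    def pop(self) -> str:
        return self._stack.pop()

    @property
    def current_element(self) -> Optional[str]:
        # The current element is the bottommost node, if there is one.
        return self._stack[-1] if self._stack else None

    def has_in_scope(self, name: str) -> bool:
        # An element is in scope if it is anywhere in the stack.
        return name in self._stack
```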
When the steps below require the user agent to
append a character to a node, the user agent must collect it
and all subsequent consecutive characters that would be appended to that
node and insert a single Text
node whose data is the
concatenation of all those characters.
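The text-coalescing rule can be approximated as below, using Python's `xml.dom.minidom` as a stand-in DOM. Rather than buffering characters first, this sketch extends a trailing Text node when one exists, which should yield the same tree; the function name `append_character` is an assumption.

```python
from xml.dom import Node
from xml.dom.minidom import Document


def append_character(doc: Document, parent, char: str) -> None:
    """Append `char` to `parent`, merging it into a trailing Text node."""
    last = parent.lastChild
    if last is not None and last.nodeType == Node.TEXT_NODE:
        # Consecutive characters end up in a single Text node.
        last.data += char
    else:
        parent.appendChild(doc.createTextNode(char))
```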
Need to define
create an element for the token...
When the steps below require the user agent to
insert an element for a token, the user agent must
create an element for the token, append it to the
current element, and push it onto the
stack of open elements so that it becomes the new
current element.
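As an illustration of inserting an element for a token, the sketch below again uses `xml.dom.minidom`. Because creating an element for a token is not yet defined here, the use of `createElement` and `setAttribute`, the function name `insert_element`, and the list-based stack are all assumptions.

```python
from xml.dom.minidom import Document


def insert_element(doc: Document, stack: list, name: str, attributes=()):
    """Create an element, append it to the current element, and push it."""
    element = doc.createElement(name)  # assumed stand-in for element creation
    for attr_name, attr_value in attributes:
        element.setAttribute(attr_name, attr_value)
    stack[-1].appendChild(element)  # append to the current element
    stack.append(element)           # it becomes the new current element
    return element
```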
- Start phase
-
Each token emitted from the tokenization stage must be
processed as follows until the algorithm below switches to a different
phase:
- A start tag token
Create an element for the token and then append it to
the Document
node and push it into the
stack of open elements. This element is the root element and
the first current element. Then switch to the
main phase.
- An empty tag token
Create an element for the token and append it to the
Document
node. Then switch to the end phase.
- A comment token
Append a Comment
node to the Document
node
with the data
attribute set to the data given in the
token.
- A processing instruction token
Append a ProcessingInstruction
node to the
Document
node with the target
and data
attributes set to the target and data given in the token.
- An end-of-file token
Parse error. Reprocess the token in the
end phase.
- Anything else
Parse error. Ignore the token.
- Main phase
-
Once a start tag token has been encountered (as detailed in the
previous phase), each token must be processed using the following steps until
further notice:
- A character token
Append a character to the current
element.
- A start tag token
Insert an element for the token.
- An empty tag token
Create an element for the token and append it to the
current element.
- An end tag token
-
If the tag name of the current element does not match the tag
name of the end tag token, this is a parse error.
If there is an element in scope with the same tag name as
that of the token, pop nodes from the stack of open elements
until the first such element has been popped from the stack.
If there are no more elements on the
stack of open elements at this point, switch to the
end phase.
- A short end tag token
Pop an element from the stack of open elements. If
there are no more elements on the stack of open elements
switch to the end phase.
- A comment token
Append a Comment
node to the current element
with the data
attribute set to the data given in the
token.
- A processing instruction token
Append a ProcessingInstruction
node to the current
element with the target
and data
attributes
set to the target and data given in the token.
- An end-of-file token
Parse error. Reprocess the token in the
end phase.
- End phase
-
Tokens in the end phase must be handled as follows:
- A comment token
Append a Comment
node to the Document
node
with the data
attribute set to the data given in the
token.
- A processing instruction token
Append a ProcessingInstruction
node to the
Document
node with the target
and data
attributes set to the target and data given in the token.
- An end-of-file token
Stop parsing.
- Anything else
Parse error. Ignore the token.
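The three phases above can be put together in a compact sketch built on `xml.dom.minidom`. The tuple-shaped tokens and the function name `build_tree` are assumptions; attributes and parse-error reporting are left out for brevity.

```python
from xml.dom import Node
from xml.dom.minidom import Document


def build_tree(tokens):
    """Build a minidom Document from (kind, ...) tuples.

    Assumed token shapes: ("start", name), ("empty", name), ("end", name),
    ("short-end",), ("char", c), ("comment", data), ("pi", target, data),
    ("eof",).
    """
    doc = Document()
    stack = []  # stack of open elements; stack[-1] is the current element
    phase = "start"

    for token in tokens:
        kind = token[0]
        # Comments and PIs go to the current element in the main phase and
        # to the Document node in the start and end phases.
        parent = stack[-1] if phase == "main" else doc
        if kind == "comment":
            parent.appendChild(doc.createComment(token[1]))
        elif kind == "pi":
            parent.appendChild(doc.createProcessingInstruction(token[1], token[2]))
        elif kind == "eof":
            break  # a parse error unless already in the end phase
        elif phase == "start":
            if kind == "start":
                stack.append(doc.appendChild(doc.createElement(token[1])))
                phase = "main"
            elif kind == "empty":
                doc.appendChild(doc.createElement(token[1]))
                phase = "end"
            # Anything else: parse error, ignore the token.
        elif phase == "main":
            if kind == "char":
                last = parent.lastChild
                if last is not None and last.nodeType == Node.TEXT_NODE:
                    last.data += token[1]
                else:
                    parent.appendChild(doc.createTextNode(token[1]))
            elif kind == "start":
                stack.append(parent.appendChild(doc.createElement(token[1])))
            elif kind == "empty":
                parent.appendChild(doc.createElement(token[1]))
            elif kind == "end":
                if token[1] in [el.tagName for el in stack]:
                    # Pop until the first matching element has been popped.
                    while stack.pop().tagName != token[1]:
                        pass
            elif kind == "short-end":
                stack.pop()
            if not stack:
                phase = "end"
        # End phase: anything else is a parse error and is ignored.
    return doc
```

Feeding it the tokens for `<!--c--><a>hi<b><c/></></a><?t d?>` produces a document whose root element serializes as `<a>hi<b><c/></b></a>`, with the comment and processing instruction attached to the Document node.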
Once the user agent stops parsing the
document, it must follow these steps:
References