W3C

URL

W3C Editor's Draft 8 November 2012

This Version:
http://dvcs.w3.org/hg/url/raw-file/tip/Overview.html
Latest WHATWG Version:
http://url.spec.whatwg.org/
Previous Versions:
http://www.w3.org/TR/2012/ED-url-20120524/
Author:
Anne van Kesteren
Editor:
Web Applications Working Group <public-webapps@w3.org>
Former editors:
Adam Barth <w3c@adambarth.com>
Erik Arvidsson <arv@chromium.org>
Michael™ Smith <mike@w3.org>

Abstract

This specification defines the term URL, various algorithms for dealing with URLs, and an API for constructing, parsing, and resolving URLs.

The behavior specified in this document for how browsers process URLs might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein.

Status of this Document

This is the 8 November 2012 draft of the URL standard.

The Abstract and all of the contents starting with the Table of Contents were copied from the 4 November 2012 version of the WHATWG's URL Living Standard.

Please send comments to public-webapps@w3.org (archived) with [url] at the start of the subject line.

This document is maintained by the Web Applications (WebApps) Working Group. The WebApps Working Group is part of the Rich Web Clients Activity in the W3C Interaction Domain.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Table of Contents

  Goals
  1 Conformance
  2 Terminology
  3 Hosts and IP addresses
    3.1 Writing
    3.2 Parsing
    3.3 Serializing
  4 URLs
    4.1 Writing
    4.2 Parsing
    4.3 Serializing
  5 API
    5.1 Constructors
    5.2 Interface URLUtils
    5.3 Interface URLQuery
  References
  Acknowledgments

Goals

The URL standard sets out to make URLs throughout the web platform fully predictable and interoperable. This is the plan:

As the editor learns more about the subject matter the goals might increase in scope somewhat.

1 Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this specification are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

2 Terminology

Some terms used in this specification are defined in the Encoding Standard. [ENCODING]

The ASCII digits are code points in the range U+0030 to U+0039.

The ASCII hex digits are ASCII digits or are code points in the range U+0041 to U+0046 or in the range U+0061 to U+0066.

The ASCII alpha are code points in the range U+0041 to U+005A or in the range U+0061 to U+007A.

The ASCII alphanumeric are ASCII digits or ASCII alpha.

The domain label separators are the code points U+002E, U+3002, U+FF0E, and U+FF61.

A URL-encoded byte is "%", followed by two ASCII hex digits.

For maximum robustness, sequences of URL-encoded bytes, after conversion to bytes, ought not to cause utf-8 decode to emit any decoder errors.

To URL encode a byte into a URL-encoded byte, return a string consisting of "%", followed by a double-digit, uppercase, hexadecimal representation of byte.
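
The following non-normative sketch illustrates the two definitions above: recognizing a URL-encoded byte and URL encoding a byte. The function names url_encode_byte and is_url_encoded_byte are assumptions made for illustration and are not defined by this specification.

```python
# Illustrative only: helper names are hypothetical, not part of the standard.
HEX_DIGITS = "0123456789ABCDEFabcdef"  # the ASCII hex digits


def url_encode_byte(byte: int) -> str:
    """Return "%" followed by the two-digit, uppercase hex form of byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0x00 to 0xFF")
    return "%{:02X}".format(byte)


def is_url_encoded_byte(s: str) -> bool:
    """True if s is "%" followed by exactly two ASCII hex digits."""
    return (
        len(s) == 3
        and s[0] == "%"
        and s[1] in HEX_DIGITS
        and s[2] in HEX_DIGITS
    )
```

Note that the encoder always emits uppercase hex digits, while the recognizer accepts either case.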

3 Hosts and IP addresses

A host is a string that represents a network address, either in the form of a domain or an IPv6 address.

This is a slightly more generic definition of host than its traditional meaning for the sake of convenience.

A domain is an ordered list of one or more domain labels.

An IPv6 address is ...

3.1 Writing

A domain must be one or more domain labels separated from each other by a domain label separator, optionally followed by a domain label separator.

An IPv6 address must ...

3.2 Parsing

IDNA2003 vs IDNA2008 (and also UTS #46)

A parsed host consists either of a parsed domain or a parsed IPv6 address.

A parsed domain consists of a list of labels and a trailing dot flag.

A parsed IPv6 address consists of an ordered set of eight 16-bit integers.

To parse ...

3.3 Serializing

Normalize IPv6 addresses

4 URLs

A URL is a string that represents an identifier.

A URL is either a relative URL or an absolute URL. Either form can be followed by a fragment.

A relative URL is a URL that is relative to a parsed URL. Such a parsed URL is a base URL. If the base URL has no relative scheme, parsing the relative URL results in a fatal error.

An absolute URL stands on its own and is therefore a potential base URL.

Provided there are no fatal errors, parsing and serializing a URL will turn it into an absolute URL. The intermediate form is named a parsed URL. The components that a URL can consist of, and that a parsed URL consists of, are scheme, scheme data (not used if scheme is a relative scheme), userinfo, host, port, path, query, and fragment.

A relative scheme is a scheme listed in the first column of the following table. It typically has an associated default port listed in the second column on the same row.

scheme      port
"ftp"       21
"file"
"gopher"    70
"http"      80
"https"     443
"ws"        80
"wss"       443

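As a non-normative illustration, the table of relative schemes can be held in a simple mapping. The names DEFAULT_PORTS, is_relative_scheme, and default_port are assumptions of this sketch, as is the choice to represent the missing default port of "file" as None.

```python
from typing import Optional

# Relative schemes and their default ports, copied from the table.
# "file" has no default port; None is this sketch's way of saying so.
DEFAULT_PORTS = {
    "ftp": 21,
    "file": None,
    "gopher": 70,
    "http": 80,
    "https": 443,
    "ws": 80,
    "wss": 443,
}


def is_relative_scheme(scheme: str) -> bool:
    """True if scheme appears in the first column of the table."""
    return scheme in DEFAULT_PORTS


def default_port(scheme: str) -> Optional[int]:
    """Default port for scheme, or None if there is none."""
    return DEFAULT_PORTS.get(scheme)
```
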
4.1 Writing

A URL must be either a relative URL or an absolute URL, optionally followed by "#" and a fragment.

An absolute URL is a scheme, followed by ":", followed by scheme data.

A scheme is one ASCII alpha, followed by zero or more of ASCII alphanumeric, "+", "-", and ".". A scheme must be registered ....
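
The syntactic part of the scheme definition can be sketched, non-normatively, as a regular expression. The name is_valid_scheme_syntax is hypothetical, and the sketch deliberately does not check whether a scheme is registered.

```python
import re

# One ASCII alpha, then zero or more of ASCII alphanumeric, "+", "-", ".".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")


def is_valid_scheme_syntax(candidate: str) -> bool:
    """Check scheme syntax only; registration is out of scope here."""
    return SCHEME_RE.fullmatch(candidate) is not None
```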

The syntax of scheme data depends on the scheme and is typically defined alongside it. For a relative scheme, scheme data is a scheme-relative URL. For other schemes, specifications or standards must define scheme data either literally as or as a subset of one or more URL units that do not start with "?".

A relative URL is either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":". A relative URL must be relative to a base URL with a relative scheme.

A scheme-relative URL is "//", optionally followed by userinfo and "@", followed by a host, optionally followed by ":" and a port, optionally followed by either an absolute-path-relative URL or a "?" and a query.

A userinfo is zero or more URL units, excluding "/", "?", and "@".

A port is zero or more ASCII digits.

An absolute-path-relative URL is "/", followed by a path-relative URL that does not start with "/".

A path-relative URL is zero or more path segments separated from each other by a "/", optionally followed by a "?" and a query.

A path segment is zero or more URL units, excluding "/" and "?".

A query is zero or more URL units.

A fragment is zero or more URL units.

The URL units are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFEF, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD, and URL-encoded bytes.

Code points higher than U+009F will be converted to URL-encoded bytes by the parser.
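
A non-normative sketch of a check that a string consists only of URL units follows. The names is_url_code_point and consists_of_url_units are assumptions of this sketch; the ranges are transcribed from the definition above.

```python
import string

# ASCII code points that are URL units on their own.
URL_UNIT_ASCII = set(string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~")

# Non-ASCII ranges from the definition of URL units.
URL_UNIT_RANGES = [(0xA0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 16, each ending at xFFFD; plane 14 starts at U+E1000.
for plane in range(0x1, 0x11):
    start = 0xE1000 if plane == 0xE else plane << 16
    URL_UNIT_RANGES.append((start, (plane << 16) | 0xFFFD))


def is_url_code_point(ch: str) -> bool:
    """True if ch is a URL unit other than a URL-encoded byte."""
    if ch in URL_UNIT_ASCII:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in URL_UNIT_RANGES)


def consists_of_url_units(s: str) -> bool:
    """True if s is zero or more URL units, including URL-encoded bytes."""
    i = 0
    while i < len(s):
        if s[i] == "%":
            # "%" is only allowed as the start of a URL-encoded byte.
            pair = s[i + 1:i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                return False
            i += 3
        elif is_url_code_point(s[i]):
            i += 1
        else:
            return False
    return True
```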

4.2 Parsing

Aside from the components mentioned earlier, a parsed URL also has an associated fatal error flag and relative flag.

To clear a parsed URL, set its scheme, scheme data, userinfo, host, and port to the empty string, query and fragment to null, and its path to the empty list.

To neuter a parsed URL, clear it, and then set its fatal error flag.
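
One possible in-memory shape for a parsed URL, together with the clear and neuter operations, is sketched below. This is non-normative; the class name ParsedURL and its attribute names are assumptions of the sketch. The flags are modelled as booleans, and clear leaves them untouched because the definition above does not mention them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedURL:
    # Components named in section 4.
    scheme: str = ""
    scheme_data: str = ""
    userinfo: str = ""
    host: str = ""
    port: str = ""
    path: List[str] = field(default_factory=list)
    query: Optional[str] = None
    fragment: Optional[str] = None
    # Associated flags.
    fatal_error: bool = False
    relative: bool = False

    def clear(self) -> None:
        """Reset the components; the two flags are left as they are."""
        self.scheme = ""
        self.scheme_data = ""
        self.userinfo = ""
        self.host = ""
        self.port = ""
        self.query = None
        self.fragment = None
        self.path = []

    def neuter(self) -> None:
        """Clear the parsed URL, then set its fatal error flag."""
        self.clear()
        self.fatal_error = True
```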

The URL simple escape set are all code points less than U+0020 (i.e. excluding U+0020) and all code points greater than U+007E.

The URL default escape set are all code points less than U+0021, all code points greater than U+007E, '"', "#", "<", ">", "?", and "`".

The URL query escape set are all code points less than U+0021, all code points greater than U+007E, '"', "#", "<", ">", and "`".

To URL escape a code point, using an escape set, and an optional encoding override, run these steps:

  1. Let encoding be encoding override, if given, and utf-8 otherwise.

  2. If code point is not in escape set, return code point.

  3. Let bytes be the result of running encode on code point using encoding as encoding.

    Whenever the encoder algorithm emits an encoder error, emit a 0x3F byte instead and do not terminate the algorithm.

    Using utf-8 ensures you never hit an encoder error here, but unfortunately legacy content uses other encodings.

  4. URL encode each byte in bytes, and then return them concatenated, in the same order.
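
The three escape sets and the URL escape steps above can be sketched, non-normatively, as follows. The function names are hypothetical. The sketch uses Python's built-in codecs in place of the Encoding Standard's encode algorithm, which is an assumption: codec names and coverage might differ from the labels that standard defines. Python's "replace" error handler substitutes "?" (0x3F) for unencodable code points, which matches the encoder error behavior described in step 3.

```python
from typing import Callable, Optional


def in_simple_escape_set(cp: int) -> bool:
    return cp < 0x20 or cp > 0x7E


def in_default_escape_set(cp: int) -> bool:
    return cp < 0x21 or cp > 0x7E or chr(cp) in '"#<>?`'


def in_query_escape_set(cp: int) -> bool:
    return cp < 0x21 or cp > 0x7E or chr(cp) in '"#<>`'


def url_escape(
    code_point: str,
    in_escape_set: Callable[[int], bool],
    encoding_override: Optional[str] = None,
) -> str:
    # Step 1: pick the encoding.
    encoding = encoding_override if encoding_override is not None else "utf-8"
    # Step 2: code points outside the escape set pass through unchanged.
    if not in_escape_set(ord(code_point)):
        return code_point
    # Step 3: encode; an encoder error yields a 0x3F byte instead.
    data = code_point.encode(encoding, errors="replace")
    # Step 4: URL encode each byte and concatenate in order.
    return "".join("%{:02X}".format(b) for b in data)
```

For example, U+0020 is left alone by the URL simple escape set but becomes "%20" under the URL default escape set, and "?" is escaped in paths but not in queries.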

Add the ability to halt on the first conformance error.

To parse a string input, optionally with a base URL base, optionally with an encoding encoding override, optionally with a parsed URL url, and if url is given, optionally with a state override state override, run these steps:

The url and state override arguments can be used for API manipulation of a parsed URL.

  1. If url is not given:

    1. Set url to a new parsed URL.

    2. Clear url.

    3. Remove any leading and trailing ASCII whitespace from input.

  2. Let state be state override if given, or scheme start state otherwise.

  3. If base is not given, set it to null.

  4. If encoding override is not given, set it to utf-8.

  5. Let buffer be the empty string.

  6. Let the @ flag and the [] flag be unset.

  7. Append a conceptual EOF code point (to signify end-of-input) to input.

  8. Let pointer be a pointer to the first code point in input.

  9. Keep running the following state machine by switching on state, increasing pointer by one after each time it is run, as long as url's fatal error flag is not set and as long as pointer does not point past the end of input.

    Let c be the code point to which pointer points.

    Let remaining be the substring starting after pointer in input.

    If input is "mailto:example@example" (omitting the EOF code point here) and pointer points to "@", remaining is "example" (again omitting the EOF code point).

    scheme start state
    1. If c is not the EOF code point and is in the range U+0041 through to U+005A or U+0061 through to U+007A, append c, lowercased, to buffer, and set state to scheme state.

    2. Otherwise, if state override is not given, set buffer to the empty string, set state to no scheme state, and decrease pointer by one.

    3. Otherwise, terminate this algorithm.

    scheme state
    1. If c is not the EOF code point and is in the range U+0030 through to U+0039, U+0041 through to U+005A, U+0061 through to U+007A, or c is "+", "-", or ".", append c, lowercased, to buffer.

    2. Otherwise, if c is ":", set url's scheme to buffer, buffer to the empty string, and then run these substeps:

      1. If state override is given, terminate this algorithm.

      2. If url's scheme is a relative scheme, set url's relative flag.

      3. If url's scheme is "file", set state to relative state.

      4. Otherwise, if url's relative flag is set, base is not null and base's scheme is equal to url's scheme, set state to relative state.

      5. Otherwise, if url's relative flag is set, set state to authority start state.

      6. Otherwise, set state to scheme data state.

    3. Otherwise, set state to no scheme state, buffer to the empty string, and start over (from the first code point in input).

    scheme data state
    1. If state override is not given and c is "#", set url's fragment to the empty string and state to fragment state.

    2. Otherwise, if c is none of EOF code point, U+0009, U+000A, and U+000D, URL escape c using the URL simple escape set, and append the result to url's scheme data.

      Path could be reused here to keep parsed URL simple.

    no scheme state

    If base is null, or base's scheme is not a relative scheme, neuter url.

    You do not want to check base's relative flag here, as the scheme itself can have been changed to something nonsensical through the protocol attribute.

    Otherwise, set state to relative state, and decrease pointer by one.

    relative state

    Set url's relative flag, set url's scheme to base's scheme, and then, based on c:

    EOF code point

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to base's query, and then terminate this algorithm.

    "/"
    "\"

    If remaining starts with either "/" or "\", increase pointer by one, and run these steps:

    1. If url's scheme is "file" set state to file host state.

    2. Otherwise set state to authority start state.

    Otherwise, set url's host to base's host, url's port to base's port, state to relative path start state, and decrease pointer by one.

    "?"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to the empty string, and state to query state.

    "#"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to base's query, url's fragment to the empty string, and state to fragment state.

    Otherwise

    Set url's host to base's host, url's port to base's port, url's path to base's path, then remove url's path's last string, set state to relative path start state, and decrease pointer by one.

    authority start state

    If c is neither "/" nor "\", set state to authority state, and decrease pointer by one.

    authority state
    1. If c is "@", run these substeps:

      1. If the @ flag is set, append "%40" to url's userinfo.

      2. Set the @ flag.

      3. For each code point in buffer, URL escape it using the URL default escape set, and append the result to url's userinfo.

      4. Set buffer to the empty string.

    2. If c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by the number of code points in buffer, set buffer to the empty string, and state to host state.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, append c to buffer.

    file host state
    1. If c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by one, and run these substeps:

      1. If buffer consists of two code points, of which the first is an ASCII alpha and the second is either ":" or "|", set state to relative path state.

        This is a quirk for parsing Windows drive letters and therefore buffer is not reset here.

      2. Otherwise, set url's host to buffer, buffer to the empty string, and state to relative path start state.

    2. Otherwise, if c is none of U+0009, U+000A, and U+000D, append c to buffer.

    host state
    hostname state
    1. If c is ":" and the [] flag is unset, run these substeps:

      1. Set url's host to buffer, buffer to the empty string, and state to port state.

      2. If state override is hostname state, terminate this algorithm.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by one, and run these substeps:

      1. Set url's host to buffer, buffer to the empty string, and state to relative path start state.

      2. If state override is given, terminate this algorithm.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, run these substeps:

      1. If c is "[", set the [] flag.

      2. If c is "]", unset the [] flag.

      3. Append c to buffer.

    port state
    1. If c is in the range U+0030 through to U+0039, append c to buffer.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", or state override is given, run these substeps:

      1. parse buffer and set port

      2. If state override is given, terminate this algorithm.

      3. Set buffer to the empty string, state to relative path start state, and decrease pointer by one.

    3. Otherwise, neuter url.

    relative path start state

    Set state to relative path state and if c is neither "/" nor "\", decrease pointer by one.

    relative path state
    1. If either c is one of EOF code point, "/", and "\", or state override is not given and c is one of "?" and "#", run these substeps:

      1. If buffer is ".." and c is one of EOF code point, "/", and "\", set the last string in url's path to the empty string.

      2. Otherwise, if buffer is "..", remove the last string from url's path.

      3. Otherwise, if buffer is "." and c is one of EOF code point, "/", and "\", append an empty string to url's path.

      4. Otherwise, if buffer is not ".", run these inner substeps:

        1. If scheme is "file", path is the empty list, buffer consists of two code points, of which the first is an ASCII alpha, and the second is either ":" or "|", replace the second code point in buffer with ":".

          Windows drive letters are beautiful, no?

        2. Append buffer to url's path.

      5. Set buffer to the empty string.

      6. If c is "?", set url's query to the empty string, and state to query state.

      7. If c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is "%" and remaining starts with either "2E" or "2e", increase pointer by two, and append "." to buffer.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, URL escape c using the URL default escape set, and append the result to buffer.

    query state
    1. If state override is not given and c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is none of EOF code point, U+0009, U+000A, and U+000D, URL escape c using the URL query escape set and encoding override, and append the result to url's query.

      Only use encoding override for http/https/file rather than for all relative schemes?

    fragment state

    If c is none of EOF code point, U+0009, U+000A, and U+000D, URL escape c using the URL simple escape set, and append the result to url's fragment.

  10. Return url.
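
As a non-normative illustration of the first two states of the state machine, the sketch below scans a scheme from the start of an input string and picks the state that follows it. It assumes no state override is given and that leading and trailing whitespace has already been removed. The names scan_scheme and next_state_after_scheme are hypothetical, and returning None stands in for switching to no scheme state.

```python
import string
from typing import Optional, Tuple

ASCII_ALPHA = set(string.ascii_letters)
SCHEME_CHARS = ASCII_ALPHA | set(string.digits) | set("+-.")
RELATIVE_SCHEMES = {"ftp", "file", "gopher", "http", "https", "ws", "wss"}


def scan_scheme(input_str: str) -> Optional[Tuple[str, int]]:
    """Return (scheme, index just past ":"), or None for no scheme state."""
    # scheme start state: the first code point must be an ASCII alpha.
    if not input_str or input_str[0] not in ASCII_ALPHA:
        return None
    buffer = input_str[0].lower()
    # scheme state: collect scheme code points until ":".
    for pointer in range(1, len(input_str)):
        c = input_str[pointer]
        if c in SCHEME_CHARS:
            buffer += c.lower()
        elif c == ":":
            return buffer, pointer + 1
        else:
            return None  # start over in no scheme state
    return None  # EOF reached before ":"


def next_state_after_scheme(scheme: str, base_scheme: Optional[str] = None) -> str:
    """Mirror the substeps that run once ":" has been seen."""
    if scheme == "file":
        return "relative state"
    if scheme in RELATIVE_SCHEMES:
        if base_scheme is not None and base_scheme == scheme:
            return "relative state"
        return "authority start state"
    return "scheme data state"
```

With the input "mailto:example@example" this yields the scheme "mailto" and continues in scheme data state, whereas "/path" has no scheme and is handled relative to the base URL.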

4.3 Serializing

The serializing algorithms assume a parsed URL's fatal error flag is not set.

To serialize a parsed URL, run these steps:

  1. Let output be the result of serializing without fragment.

  2. If fragment is non-null, append "#" concatenated with fragment to output.

  3. Return output.

To serialize without fragment a parsed URL, run these steps:

  1. Let output be scheme and ":" concatenated.

  2. If the relative flag is set:

    1. Append "//" to output.

    2. If userinfo is not the empty string, append userinfo concatenated with "@" to output.

    3. Append host to output.

    4. If port is not the empty string, append ":" concatenated with port to output.

    5. Append "/" concatenated with the strings in path (including empty strings), separated from each other by "/" to output.

    6. If query is non-null, append "?" concatenated with query to output.

  3. Otherwise, if the relative flag is unset, append scheme data to output.

  4. Return output.
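
The two serializing algorithms above can be sketched, non-normatively, as follows. The ParsedURL class is a hypothetical stand-in for a parsed URL whose fatal error flag is not set; its attribute names are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedURL:
    # Hypothetical container for the components of a parsed URL.
    scheme: str = ""
    scheme_data: str = ""
    userinfo: str = ""
    host: str = ""
    port: str = ""
    path: List[str] = field(default_factory=list)
    query: Optional[str] = None
    fragment: Optional[str] = None
    relative: bool = False


def serialize_without_fragment(url: ParsedURL) -> str:
    output = url.scheme + ":"
    if url.relative:
        output += "//"
        if url.userinfo != "":
            output += url.userinfo + "@"
        output += url.host
        if url.port != "":
            output += ":" + url.port
        # Empty strings in path are kept, so they show up as "//".
        output += "/" + "/".join(url.path)
        if url.query is not None:
            output += "?" + url.query
    else:
        output += url.scheme_data
    return output


def serialize(url: ParsedURL) -> str:
    output = serialize_without_fragment(url)
    if url.fragment is not None:
        output += "#" + url.fragment
    return output
```

Note the difference between a null and an empty query or fragment: an empty string still produces the "?" or "#" delimiter, while null produces nothing.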

5 API

[Constructor(DOMString url, optional (URL or DOMString) base)]
interface URL {
};
URL implements URLUtils;

[NoInterfaceObject]
interface URLUtils {
  stringifier attribute DOMString href;
  readonly attribute DOMString origin;

           attribute DOMString protocol;
           attribute DOMString host;
           attribute DOMString hostname;
           attribute DOMString port;
           attribute DOMString pathname;
           attribute DOMString search;
  readonly attribute URLQuery query;
           attribute DOMString hash;
};

Any object implementing URLUtils has an associated URLQuery object, parsed URL url, base URL base, input, and query encoding. Unless stated otherwise, query encoding is utf-8. The others must be set on creation by the specification using URLUtils.

The associated query encoding is a legacy concept only relevant for HTML. [HTML]

5.1 Constructors

The URL(url, base) constructor must run these steps:

  1. If base is not given, set it to "about:blank".

  2. If base is a string, parse base and set base to the result of that algorithm.

  3. If base's fatal error flag is set, throw an "InvalidStateError" exception.

  4. Parse url with base URL base, and associate the result with a new URL object as its url, associate base with the new object as its base, associate url with the new object as its input, and then return the new object.

5.2 Interface URLUtils

The URLUtils interface is not exposed on the global object. It augments other interfaces, such as URL.

The href attribute must run these steps:

  1. If the fatal error flag is set, return the associated input.

  2. Return the serialization.

Setting the href attribute must run these steps:

  1. Clear.

  2. Set the associated input to the given value.

  3. Parse input, with any leading and trailing ASCII whitespace removed, with the associated url as url, base as base, and query encoding as encoding override.

The origin attribute must return the Unicode serialization of the parsed URL's origin. [ORIGIN]

It returns the Unicode rather than the ASCII serialization for compatibility with HTML's MessageEvent feature. [HTML]

The protocol attribute must return scheme and ":" concatenated.

In case the fatal error flag is set this will cause it to return ":".

Setting the protocol attribute must run these steps:

  1. If the fatal error flag is set, terminate these steps.

  2. Parse the given value and ":" concatenated with the associated url as url and scheme start state as state override.

The host attribute must run these steps:

  1. If the fatal error flag is set, return the empty string.

  2. If port is the empty string, return host.

  3. Return host, ":", and port concatenated.
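
A non-normative sketch of the host getter steps follows. The function name host_getter and its parameters are assumptions of this sketch; they stand for the host, port, and fatal error flag of the associated parsed URL.

```python
def host_getter(host: str, port: str, fatal_error: bool = False) -> str:
    # Step 1: a neutered URL exposes no host.
    if fatal_error:
        return ""
    # Step 2: no port means the host is returned on its own.
    if port == "":
        return host
    # Step 3: otherwise host, ":", and port concatenated.
    return host + ":" + port
```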

Setting the host attribute must run these steps:

  1. If the fatal error flag is set, or the relative flag is unset, terminate these steps.

  2. Parse the given value with the associated url as url, and host state as state override.

The hostname attribute must return host.

Setting the hostname attribute must run these steps:

  1. If the fatal error flag is set, or the relative flag is unset, terminate these steps.

  2. Parse the given value with the associated url as url, and hostname state as state override.

The port attribute must return port.

Setting the port attribute must run these steps:

  1. If the fatal error flag is set, the relative flag is unset, or scheme is "file", terminate these steps.

  2. Parse the given value with the associated url as url, and port state as state override.

The pathname attribute must run these steps:

  1. If the fatal error flag is set, return the empty string.

  2. If the relative flag is unset, return scheme data.

  3. Return "/" concatenated with the strings in path (including empty strings), separated from each other by "/".

Setting the pathname attribute must run these steps:

  1. If the fatal error flag is set, or the relative flag is unset, terminate these steps.

  2. Set path to the empty list.

  3. Parse the given value with the associated url as url, and relative path start state as state override.

The search attribute must run these steps:

  1. If the fatal error flag is set, or query is either null or the empty string, return the empty string.

  2. Return "?" concatenated with query.

Setting the search attribute must run these steps:

  1. If the fatal error flag is set, or the relative flag is unset, terminate these steps.

  2. If the given value is the empty string, set query to null, and terminate these steps.

  3. Let input be the given value with a single leading "?" removed, if any.

  4. Set query to the empty string.

  5. Parse input with the associated url as url, query state as state override, and the associated query encoding as encoding override.

The query attribute must return the associated URLQuery object.

The hash attribute must run these steps:

  1. If the fatal error flag is set, or fragment is either null or the empty string, return the empty string.

  2. Return "#" concatenated with fragment.
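
The search and hash getters share the same shape: the delimiter is only emitted when the component is neither null nor the empty string. A non-normative sketch, with hypothetical function names and with parameters standing for the components of the associated parsed URL:

```python
from typing import Optional


def _prefixed_or_empty(prefix: str, component: Optional[str], fatal_error: bool) -> str:
    # Null and empty components are both reported as the empty string.
    if fatal_error or component is None or component == "":
        return ""
    return prefix + component


def search_getter(query: Optional[str], fatal_error: bool = False) -> str:
    return _prefixed_or_empty("?", query, fatal_error)


def hash_getter(fragment: Optional[str], fatal_error: bool = False) -> str:
    return _prefixed_or_empty("#", fragment, fatal_error)
```

One consequence is that these getters cannot distinguish a URL ending in a bare "?" or "#" from one without that delimiter, although the serialization exposed through href can.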

Setting the hash attribute must run these steps:

  1. If the fatal error flag is set, or the scheme is "javascript", terminate these steps.

  2. If the given value is the empty string, set fragment to null, and terminate these steps.

  3. Let input be the given value with a single leading "#" removed, if any.

  4. Set fragment to the empty string.

  5. Parse input with the associated url as url, and fragment state as state override.

5.3 Interface URLQuery

interface URLQuery {
  DOMString? get(DOMString name);
  sequence<DOMString> getAll(DOMString name);
  void set(DOMString name, (sequence<DOMString> or DOMString) value);
  void delete(DOMString name);
};

...

References

[ENCODING]
Encoding Standard, Anne van Kesteren. WHATWG.
[HTML]
(Non-normative) HTML, Ian Hickson. WHATWG.
[ORIGIN]
The Web Origin Concept, Adam Barth. IETF.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. IETF.
[RFC3986]
(Non-normative) Uniform Resource Identifier (URI): Generic Syntax, Tim Berners-Lee, Roy Fielding and Larry Masinter. IETF.
[RFC3987]
(Non-normative) Internationalized Resource Identifiers (IRIs), Martin Dürst and Michel Suignard. IETF.

Acknowledgments

Thanks to Adam Barth, Alexandre Morgaut, Boris Zbarsky, David Sheets, Erik Arvidsson, Gavin Carothers, Glenn Maynard, Henri Sivonen, Ian Hickson, James Graham, James Manger, Mathias Bynens, Michael™ Smith, Rodney Rehm, Simon Pieters, Tab Atkins, and Tantek Çelik for being awesome!

While this standard has been written from scratch, special thanks should be extended to the editors of the various specifications that previously defined what we now call URLs: Larry Masinter, Martin Dürst, Michel Suignard, Roy Fielding, and Tim Berners-Lee.