W3C

Web Speech API Specification

19 October 2012

Editors:
Glen Shires, Google Inc.
Hans Wennborg, Google Inc.

Please refer to the errata for this document, which may include some normative corrections.

Copyright © 2012 the Contributors to the Web Speech API Specification, published by the Speech API Community Group under the W3C Community Final Specification Agreement (FSA). A human-readable summary is available.


Abstract

This specification defines a JavaScript API to enable web developers to incorporate speech recognition and synthesis into their web pages. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

Status of This Document

This specification was published by the Speech API Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Final Specification Agreement (FSA) other conditions apply. Learn more about W3C Community and Business Groups.

All feedback is welcome.

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

This section is non-normative.

The Web Speech API aims to enable web developers to provide, in a web browser, speech-input and text-to-speech output features that are typically not available when using standard speech-recognition or screen-reader software. The API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis. The API is designed to enable both brief (one-shot) speech input and continuous speech input. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.

This specification is a subset of the API defined in the HTML Speech Incubator Group Final Report [1]. That report is entirely informative since it is not a standards track document. All portions of that report may be considered informative with regards to this document, and provide an informative background to this document. This specification is a fully-functional subset of that report. Specifically, this subset excludes the underlying transport protocol, the proposed additions to HTML markup, and it defines a simplified subset of the JavaScript API. This subset supports the majority of use-cases and sample code in the Incubator Group Final Report. This subset does not preclude future standardization of additions to the markup, API or underlying transport protocols, and indeed the Incubator Report defines a potential roadmap for such future work.

3 Use Cases

This section is non-normative.

This specification supports the use cases defined in Section 4 of the Incubator Report.

To keep the API to a minimum, this specification does not directly support the Rerecognition use case from the Incubator Report. This does not preclude adding support for it as a future API enhancement, and indeed the Incubator report provides a roadmap for doing so.

Note that for many usages and implementations, it is possible to avoid the need for Rerecognition by using a larger grammar, or by combining multiple grammars — both of these techniques are supported in this specification.

4 Security and privacy considerations

  1. User agents must only start speech input sessions with explicit, informed user consent. Such consent can include, for example, a user click on a visible speech input element with an obvious graphical representation showing that it will start speech input, acceptance of a permission prompt shown as a result of calling start, or consent previously granted to always allow speech input for this web page.
  2. User agents must give the user an obvious indication when audio is being recorded.
  3. The user agent may also give the user a longer explanation the first time speech input is used, to let the user know what it is and how they can tune their privacy settings to disable speech recording if required.
  4. To minimize the chance of users unwittingly allowing web pages to record speech without their knowledge, implementations must abort an active speech input session if the web page loses input focus to another window or to another tab within the same user agent.

Implementation considerations

This section is non-normative.

  1. Spoken password inputs can be problematic from a security perspective, but it is up to the user to decide if they want to speak their password.
  2. Speech input could potentially be used to eavesdrop on users. Malicious webpages could use tricks such as hiding the input element or otherwise making the user believe that it has stopped recording speech while continuing to do so. They could also potentially style the input element to appear as something else and trick the user into clicking it. An example of styling the file input element can be seen at http://www.quirksmode.org/dom/inputfile.html. The above recommendations are intended to reduce the risk of such attacks.

5 API Description

This section is normative.

5.1 The SpeechRecognition Interface

The speech recognition interface is the scripted web API for controlling a given recognition.

The term "final result" indicates a SpeechRecognitionResult in which the final attribute is true. The term "interim result" indicates a SpeechRecognitionResult in which the final attribute is false.
IDL
          
    [Constructor]
    interface SpeechRecognition : EventTarget {
        // recognition parameters
        attribute SpeechGrammarList grammars;
        attribute DOMString lang;
        attribute boolean continuous;
        attribute boolean interimResults;
        attribute unsigned long maxAlternatives;
        attribute DOMString serviceURI;

        // methods to drive the speech interaction
        void start();
        void stop();
        void abort();

        // event methods
        attribute EventHandler onaudiostart;
        attribute EventHandler onsoundstart;
        attribute EventHandler onspeechstart;
        attribute EventHandler onspeechend;
        attribute EventHandler onsoundend;
        attribute EventHandler onaudioend;
        attribute EventHandler onresult;
        attribute EventHandler onnomatch;
        attribute EventHandler onerror;
        attribute EventHandler onstart;
        attribute EventHandler onend;
    };

    interface SpeechRecognitionError : Event {
        enum ErrorCode {
          "no-speech",
          "aborted",
          "audio-capture",
          "network",
          "not-allowed",
          "service-not-allowed",
          "bad-grammar",
          "language-not-supported"
        };

        readonly attribute ErrorCode error;
        readonly attribute DOMString message;
    };

    // Item in N-best list
    interface SpeechRecognitionAlternative {
        readonly attribute DOMString transcript;
        readonly attribute float confidence;
    };

    // A complete one-shot simple response
    interface SpeechRecognitionResult {
        readonly attribute unsigned long length;
        getter SpeechRecognitionAlternative item(in unsigned long index);
        readonly attribute boolean final;
    };

    // A collection of responses (used in continuous mode)
    interface SpeechRecognitionResultList {
        readonly attribute unsigned long length;
        getter SpeechRecognitionResult item(in unsigned long index);
    };

    // A full response, which could be interim or final, part of a continuous response or not
    interface SpeechRecognitionEvent : Event {
        readonly attribute unsigned long resultIndex;
        readonly attribute SpeechRecognitionResultList results;
        readonly attribute any interpretation;
        readonly attribute Document emma;
    };

    // The object representing a speech grammar
    [Constructor]
    interface SpeechGrammar {
        attribute DOMString src;
        attribute float weight;
    };

    // The object representing a speech grammar collection
    [Constructor]
    interface SpeechGrammarList {
        readonly attribute unsigned long length;
        getter SpeechGrammar item(in unsigned long index);
        void addFromURI(in DOMString src,
                        optional float weight);
        void addFromString(in DOMString string,
                        optional float weight);
    };

          
        

5.1.1 SpeechRecognition Attributes

grammars attribute
The grammars attribute stores the collection of SpeechGrammar objects which represent the grammars that are active for this recognition.
lang attribute
This attribute will set the language of the recognition for the request, using a valid BCP 47 language tag. [BCP47] If unset, it remains unset for getting in script, but will default to use the lang of the HTML document root element and associated hierarchy. This default value is computed and used when the input request opens a connection to the recognition service.
continuous attribute
When the continuous attribute is set to false, the user agent must return no more than one final result in response to starting recognition, for example a single turn pattern of interaction. When the continuous attribute is set to true, the user agent must return zero or more final results representing multiple consecutive recognitions in response to starting recognition, for example a dictation. The default value must be false. Note, this attribute setting does not affect interim results.
interimResults attribute
Controls whether interim results are returned. When set to true, interim results should be returned. When set to false, interim results must not be returned. The default value must be false. Note, this attribute setting does not affect final results.
maxAlternatives attribute
This attribute will set the maximum number of SpeechRecognitionAlternatives per result. The default value is 1.
serviceURI attribute
The serviceURI attribute specifies the location of the speech recognition service that the web application wishes to use. If this attribute is unset at the time of the start method call, then the user agent must use the user agent default speech service. Note that the serviceURI is a generic URI and can thus point to local services either through use of a URN with meaning to the user agent or by specifying a URL that the user agent recognizes as a local service. Additionally, the user agent default can be local or remote and can incorporate end user choices via interfaces provided by the user agent such as browser configuration parameters. [Editor note: The group is currently discussing whether WebRTC might be used to specify selection of audio sources and remote recognizers.] [5]
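
The following non-normative example shows a web application configuring these attributes before starting a recognition; the serviceURI value shown is illustrative only.

  <script type="text/javascript">
    // Configure recognition parameters before calling start() (non-normative sketch).
    var recognition = new SpeechRecognition();
    recognition.lang = 'en-US';               // BCP 47 language tag
    recognition.continuous = false;           // one-shot recognition
    recognition.interimResults = true;        // also deliver interim results
    recognition.maxAlternatives = 5;          // up to 5 n-best alternatives per result
    // Illustrative value only; if unset, the user agent default speech service is used.
    recognition.serviceURI = 'https://recognizer.example.com/speech';
    recognition.start();
  </script>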

5.1.2 SpeechRecognition Methods

start method
When the start method is called it represents the moment in time the web application wishes to begin recognition. When the speech input is streaming live through the input media stream, then this start call represents the moment in time that the service must begin to listen and try to match the grammars associated with this request. Once the system is successfully listening to the recognition the user agent must raise a start event. If the start method is called on an already started object (that is, start has previously been called, and no error or end event has fired on the object), the user agent must throw an InvalidStateError exception and ignore the call.
stop method
The stop method represents an instruction to the recognition service to stop listening to more audio, and to try and return a result using just the audio that it has already received for this recognition. A typical use of the stop method might be for a web application where the end user is doing the end pointing, similar to a walkie-talkie. The end user might press and hold the space bar to talk to the system and on the space down press the start call would have occurred and when the space bar is released the stop method is called to ensure that the system is no longer listening to the user. Once the stop method is called the speech service must not collect additional audio and must not continue to listen to the user. The speech service must attempt to return a recognition result (or a nomatch) based on the audio that it has already collected for this recognition. If the stop method is called on an object which is already stopped or being stopped (that is, start was never called on it, the end or error event has fired on it, or stop was previously called on it), the user agent must ignore the call.
abort method
The abort method is a request to immediately stop listening and stop recognizing, and to return no information other than an indication that the system is done. When the abort method is called, the speech service must stop recognizing. The user agent must raise an end event once the speech service is no longer connected. If the abort method is called on an object which is already stopped or aborting (that is, start was never called on it, the end or error event has fired on it, or abort was previously called on it), the user agent must ignore the call.
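
The following non-normative sketch illustrates the press-and-hold (walkie-talkie) pattern described for the stop method; the key handling shown is illustrative only.

  <script type="text/javascript">
    // Press and hold the space bar to talk (non-normative sketch).
    var recognition = new SpeechRecognition();
    var listening = false;

    document.addEventListener('keydown', function(e) {
      if (e.keyCode == 32 && !listening) {  // space bar pressed: begin listening
        listening = true;
        recognition.start();
      }
    });

    document.addEventListener('keyup', function(e) {
      if (e.keyCode == 32 && listening) {   // space bar released: return a result from audio so far
        listening = false;
        recognition.stop();
      }
    });
  </script>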

5.1.3 SpeechRecognition Events

The DOM Level 2 Event Model is used for speech recognition events. The methods in the EventTarget interface should be used for registering event listeners. The SpeechRecognition interface also contains convenience attributes for registering a single event handler for each event type. The events do not bubble and are not cancelable.

For all these events, the timeStamp attribute defined in the DOM Level 2 Event interface must be set to the best possible estimate of when the real-world event which the event object represents occurred. This timestamp must be represented in the user agent's view of time, even for events where the timestamps in question could be raised on a different machine like a remote recognition service (e.g., in a speechend event with a remote speech endpointer).

Unless specified below, the ordering of the different events is undefined. For example, some implementations may fire audioend before speechstart or speechend if the audio detector is client-side and the speech detector is server-side.

audiostart event
Fired when the user agent has started to capture audio.
soundstart event
Fired when some sound, possibly speech, has been detected. This must be fired with low latency, e.g. by using a client-side energy detector.
speechstart event
Fired when the speech that will be used for speech recognition has started.
speechend event
Fired when the speech that will be used for speech recognition has ended. The speechstart event must always have been fired before speechend.
soundend event
Fired when some sound is no longer detected. This must be fired with low latency, e.g. by using a client-side energy detector. The soundstart event must always have been fired before soundend.
audioend event
Fired when the user agent has finished capturing audio. The audiostart event must always have been fired before audioend.
result event
Fired when the speech recognizer returns a result. The event must use the SpeechRecognitionEvent interface.
nomatch event
Fired when the speech recognizer returns a final result with no recognition hypothesis that meets or exceeds the confidence threshold. The event must use the SpeechRecognitionEvent interface. The results attribute in the event may contain speech recognition results that are below the confidence threshold or may be null.
error event
Fired when a speech recognition error occurs. The event must use the SpeechRecognitionError interface.
start event
Fired when the recognition service has begun to listen to the audio with the intention of recognizing.
end event
Fired when the service has disconnected. The event must always be generated when the session ends no matter the reason for the end.
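
Handlers for these events may be registered either through the EventTarget interface or through the corresponding convenience attributes, as in the following non-normative example.

  <script type="text/javascript">
    var recognition = new SpeechRecognition();

    // Convenience attribute.
    recognition.onstart = function(event) {
      console.log('Recognition service has begun listening.');
    };

    // EventTarget interface.
    recognition.addEventListener('end', function(event) {
      console.log('Session ended at ' + event.timeStamp);
    });
  </script>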

5.1.4 SpeechRecognitionError

The SpeechRecognitionError event is the interface used for the error event.

error attribute
The error attribute is an ErrorCode enumeration value indicating what has gone wrong. The values are:
"no-speech"
No speech was detected.
"aborted"
Speech input was aborted somehow, maybe by some user-agent-specific behavior such as UI that lets the user cancel speech input.
"audio-capture"
Audio capture failed.
"network"
Some network communication that was required to complete the recognition failed.
"not-allowed"
The user agent is not allowing any speech input to occur for reasons of security, privacy or user preference.
"service-not-allowed"
The user agent is not allowing the speech service that the web application requested to be used (but would allow some other speech service), either because the user agent doesn't support the selected one or for reasons of security, privacy or user preference.
"bad-grammar"
There was an error in the speech recognition grammar or semantic tags, or the grammar format or semantic tag format is unsupported.
"language-not-supported"
The language was not supported.
message attribute
The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
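
The following non-normative example distinguishes between error codes in an error handler; the messages logged are illustrative only.

  <script type="text/javascript">
    var recognition = new SpeechRecognition();
    recognition.onerror = function(event) {
      switch (event.error) {
        case 'no-speech':
          console.log('No speech was detected.');
          break;
        case 'not-allowed':
        case 'service-not-allowed':
          console.log('Speech input is not permitted.');
          break;
        default:
          // The message attribute is implementation specific and intended for debugging only.
          console.log('Recognition error: ' + event.error + ' (' + event.message + ')');
      }
    };
  </script>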

5.1.5 SpeechRecognitionAlternative

The SpeechRecognitionAlternative represents a simple view of the response that gets used in an n-best list.

transcript attribute
The transcript string represents the raw words that the user spoke. For continuous recognition, leading or trailing whitespace must be included where necessary such that concatenation of consecutive SpeechRecognitionResults produces a proper transcript of the session.
confidence attribute
The confidence represents a numeric estimate between 0 and 1 of how confident the recognition system is that the recognition is correct. A higher number means the system is more confident. [Editor note: The group is currently discussing whether confidence can be specified in a speech-recognition-engine-independent manner and whether confidence threshold and nomatch should be included, because this is not a dialog API.] [4]

5.1.6 SpeechRecognitionResult

The SpeechRecognitionResult object represents a single one-shot recognition match, either as one small part of a continuous recognition or as the complete return result of a non-continuous recognition.

length attribute
The length attribute represents how many n-best alternatives are represented in the item array.
item getter
The item getter returns a SpeechRecognitionAlternative from the index into an array of n-best values. If index is greater than or equal to length, this returns null. The user agent must ensure that the length attribute is set to the number of elements in the array. The user agent must ensure that the n-best list is sorted in non-increasing confidence order (the confidence of each element must be less than or equal to that of the elements preceding it).
final attribute
The final boolean must be set to true if this is the final time the speech service will return this particular index value. If the value is false, then this represents an interim result that could still be changed.

5.1.7 SpeechRecognitionResultList

The SpeechRecognitionResultList object holds a sequence of recognition results representing the complete return result of a continuous recognition. For a non-continuous recognition it will hold only a single value.

length attribute
The length attribute indicates how many results are represented in the item array.
item getter
The item getter returns a SpeechRecognitionResult from the index into an array of result values. If index is greater than or equal to length, this returns null. The user agent must ensure that the length attribute is set to the number of elements in the array.
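
The following non-normative example walks the result list, the results it contains, and the n-best alternatives within each result.

  <script type="text/javascript">
    var recognition = new SpeechRecognition();
    recognition.maxAlternatives = 3;
    recognition.onresult = function(event) {
      for (var i = 0; i < event.results.length; ++i) {
        var result = event.results.item(i);        // SpeechRecognitionResult
        for (var j = 0; j < result.length; ++j) {
          var alternative = result.item(j);        // SpeechRecognitionAlternative
          console.log((result.final ? 'final: ' : 'interim: ') +
                      alternative.transcript + ' (confidence ' + alternative.confidence + ')');
        }
      }
    };
  </script>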

5.1.8 SpeechRecognitionEvent

The SpeechRecognitionEvent is the event that is raised each time there are any changes to interim or final results.

resultIndex attribute
The resultIndex must be set to the lowest index in the "results" array that has changed.
results attribute
The array of all current recognition results for this session. Specifically, all final results that have been returned, followed by the current best hypothesis for all interim results. It must consist of zero or more final results followed by zero or more interim results. On subsequent SpeechRecognitionEvent events, interim results may be overwritten by a newer interim result or by a final result, or may be removed (when at the end of the "results" array and the array length decreases). Final results must not be overwritten or removed. All entries for indexes less than resultIndex must be identical to the array that was present when the last SpeechRecognitionEvent was raised. All array entries (if any) for indexes equal to or greater than resultIndex that were present in the array when the last SpeechRecognitionEvent was raised are removed and overwritten with new results. The length of the "results" array may increase or decrease, but must not be less than resultIndex. Note that when resultIndex equals results.length, no new results are returned; this may occur when the array length decreases to remove one or more interim results.
interpretation attribute
The interpretation represents the semantic meaning from what the user said. This might be determined, for instance, through the SISR specification of semantics in a grammar. [Editor note: The group is currently discussing options for the value of the interpretation attribute when no interpretation has been returned by the recognizer. Current options are 'null' or a copy of the transcript.] [2]
emma attribute
EMMA 1.0 representation of this result. [EMMA] The contents of this result could vary across user agents and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. User agent implementations for recognizers that supply EMMA must contain all annotations and content generated by the recognition resources utilized for recognition, except where infeasible due to conflicting attributes. The user agent may add additional annotations to provide a richer result for the developer.
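
The following non-normative example uses resultIndex to process only the portion of the results array that has changed, accumulating final results as they arrive.

  <script type="text/javascript">
    var recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;

    var finalTranscript = '';
    recognition.onresult = function(event) {
      var interimTranscript = '';
      // Entries below resultIndex are unchanged, so only the tail needs processing.
      for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].final) {
          finalTranscript += event.results[i][0].transcript;
        } else {
          interimTranscript += event.results[i][0].transcript;
        }
      }
      console.log(finalTranscript + ' [' + interimTranscript + ']');
    };
  </script>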

5.1.9 SpeechGrammar

The SpeechGrammar object represents a container for a grammar. [Editor note: The group is currently discussing options for which grammar formats should be supported, how builtin grammar types are specified, and default grammars when not specified.] [2] [3] This structure has the following attributes:

src attribute
The required src attribute is the URI for the grammar. Note that some services may support builtin grammars that can be specified using a builtin URI scheme.
weight attribute
The optional weight attribute controls the weight that the speech recognition service should give to this grammar. By default, a grammar has a weight of 1. Larger weight values weight the grammar more strongly, while smaller weight values weight it less strongly.

5.1.10 SpeechGrammarList

The SpeechGrammarList object represents a collection of SpeechGrammar objects. This structure has the following attributes:

length attribute
The length attribute represents how many grammars are currently in the array.
item getter
The item getter returns a SpeechGrammar from the index into an array of grammars. The user agent must ensure that the length attribute is set to the number of elements in the array. The user agent must ensure that the index order from smallest to largest matches the order in which grammars were added to the array.
addFromURI method
This method appends a grammar to the SpeechGrammarList based on a URI. The URI for the grammar is specified by the src parameter. Note, some services may support builtin grammars that can be specified by URI. If the weight parameter is present it represents this grammar's weight relative to the other grammars. If the weight parameter is not present, the default value of 1.0 is used.
addFromString method
This method appends a grammar to the SpeechGrammarList based on text. The content of the grammar is specified by the string parameter. This content should be encoded into a data: URI when the SpeechGrammar object is created. If the weight parameter is present it represents this grammar's weight relative to the other grammars. If the weight parameter is not present, the default value of 1.0 is used.
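
The following non-normative example builds a grammar collection with weights; the grammar URI and inline grammar content are illustrative only, and supported grammar formats are service dependent.

  <script type="text/javascript">
    var recognition = new SpeechRecognition();
    var grammars = new SpeechGrammarList();
    // Illustrative URI; some services may also support builtin grammar URIs.
    grammars.addFromURI('https://www.example.com/commands.grxml', 2.0);
    // Illustrative inline grammar content.
    grammars.addFromString('#JSGF V1.0; grammar colors; public <color> = red | green | blue;', 0.5);
    recognition.grammars = grammars;
    console.log(grammars.length + ' grammars, first src: ' + grammars.item(0).src);
  </script>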

5.2 The SpeechSynthesis Interface

The SpeechSynthesis interface is the scripted web API for controlling a text-to-speech output.

IDL
          
    interface SpeechSynthesis {
      readonly attribute boolean pending;
      readonly attribute boolean speaking;
      readonly attribute boolean paused;

      void speak(SpeechSynthesisUtterance utterance);
      void cancel();
      void pause();
      void resume();
      SpeechSynthesisVoiceList getVoices();
    };

    [NoInterfaceObject]
    interface SpeechSynthesisGetter
    {
      readonly attribute SpeechSynthesis speechSynthesis;
    };

    Window implements SpeechSynthesisGetter;

    [Constructor,
     Constructor(DOMString text)]
    interface SpeechSynthesisUtterance : EventTarget {
      attribute DOMString text;
      attribute DOMString lang;
      attribute DOMString voiceURI;
      attribute float volume;
      attribute float rate;
      attribute float pitch;

      attribute EventHandler onstart;
      attribute EventHandler onend;
      attribute EventHandler onerror;
      attribute EventHandler onpause;
      attribute EventHandler onresume;
      attribute EventHandler onmark;
      attribute EventHandler onboundary;
    };

    interface SpeechSynthesisEvent : Event {
        readonly attribute unsigned long charIndex;
        readonly attribute float elapsedTime;
        readonly attribute DOMString name;
    };

    interface SpeechSynthesisVoice {
      readonly attribute DOMString voiceURI;
      readonly attribute DOMString name;
      readonly attribute DOMString lang;
      readonly attribute boolean localService;
      readonly attribute boolean default;
    };

    interface SpeechSynthesisVoiceList {
      readonly attribute unsigned long length;
      getter SpeechSynthesisVoice item(in unsigned long index);
    };

          
        

5.2.1 SpeechSynthesis Attributes

pending attribute
This attribute is true if the queue for the global SpeechSynthesis instance contains any utterances which have not started speaking.
speaking attribute
This attribute is true if an utterance is being spoken. Specifically if an utterance has begun being spoken and has not completed being spoken. This is independent of whether the global SpeechSynthesis instance is in the paused state.
paused attribute
This attribute is true when the global SpeechSynthesis instance is in the paused state. This state is independent of whether anything is in the queue. The default state of the global SpeechSynthesis instance for a new window is the non-paused state.
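
The following non-normative example inspects the state of the global SpeechSynthesis instance after queueing an utterance.

  <script type="text/javascript">
    speechSynthesis.speak(new SpeechSynthesisUtterance('Hello World'));
    console.log('pending: ' + speechSynthesis.pending +      // queued but not yet started
                ', speaking: ' + speechSynthesis.speaking +  // begun and not completed
                ', paused: ' + speechSynthesis.paused);      // global paused state
  </script>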

5.2.2 SpeechSynthesis Methods

speak method
This method appends the SpeechSynthesisUtterance object to the end of the queue for the global SpeechSynthesis instance. It does not change the paused state of the SpeechSynthesis instance. If the SpeechSynthesis instance is paused, it remains paused. If it is not paused and no other utterances are in the queue, then this utterance is spoken immediately, else this utterance is queued to begin speaking after the other utterances in the queue have been spoken. If changes are made to the SpeechSynthesisUtterance object after calling this method and prior to the corresponding end or error event, it is not defined whether those changes will affect what is spoken, and those changes may cause an error to be returned.
cancel method
This method removes all utterances from the queue. If an utterance is being spoken, speaking ceases immediately. This method does not change the paused state of the global SpeechSynthesis instance.
pause method
This method puts the global SpeechSynthesis instance into the paused state. If an utterance was being spoken, it pauses mid-utterance. (If called when the SpeechSynthesis instance was already in the paused state, it does nothing.)
resume method
This method puts the global SpeechSynthesis instance into the non-paused state. If an utterance was speaking, it continues speaking the utterance at the point at which it was paused, else it begins speaking the next utterance in the queue (if any). (If called when the SpeechSynthesis instance was already in the non-paused state, it does nothing.)
getVoices method
This method returns the available voices. It is user agent dependent which voices are available.
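
The following non-normative example queues two utterances and toggles the paused state of the global SpeechSynthesis instance.

  <script type="text/javascript">
    speechSynthesis.speak(new SpeechSynthesisUtterance('First sentence.'));
    speechSynthesis.speak(new SpeechSynthesisUtterance('Second sentence.'));

    function togglePause() {
      if (speechSynthesis.paused) {
        speechSynthesis.resume();  // continue mid-utterance, or begin the next queued utterance
      } else {
        speechSynthesis.pause();   // pauses mid-utterance if an utterance is being spoken
      }
    }
  </script>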

5.2.3 SpeechSynthesisUtterance Attributes

text attribute
This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. [SSML] For speech synthesis engines that do not support SSML, or only support certain tags, the user agent or speech engine must strip away the tags they do not support and speak the text. There may be a maximum length of the text; it may be limited to 32,767 characters.
lang attribute
This attribute specifies the language of the speech synthesis for the utterance, using a valid BCP 47 language tag. [BCP47] If unset, it remains unset for getting in script, but will default to use the lang of the HTML document root element and associated hierarchy. This default value is computed and used when the request to speak this utterance is made to the speech synthesis service.
voiceURI attribute
The voiceURI attribute specifies the speech synthesis voice and the location of the speech synthesis service that the web application wishes to use. If this attribute is unset at the time of the speak method call, then the user agent must use the user agent default speech service. Note that the voiceURI is a generic URI and can thus point to local services either through use of a URN with meaning to the user agent or by specifying a URL that the user agent recognizes as a local service. Additionally, the user agent default can be local or remote and can incorporate end user choices via interfaces provided by the user agent such as browser configuration parameters.
volume attribute
This attribute specifies the speaking volume for the utterance. It ranges between 0 and 1 inclusive, with 0 being the lowest volume and 1 the highest volume, with a default of 1. If SSML is used, this value will be overridden by prosody tags in the markup.
rate attribute
This attribute specifies the speaking rate for the utterance. It is relative to the default rate for this voice. 1 is the default rate supported by the speech synthesis engine or specific voice (which should correspond to a normal speaking rate). 2 is twice as fast, and 0.5 is half as fast. Values below 0.1 or above 10 are strictly disallowed, but speech synthesis engines or specific voices may constrain the minimum and maximum rates further, for example, a particular voice may not actually speak faster than 3 times normal even if you specify a value larger than 3. If SSML is used, this value will be overridden by prosody tags in the markup.
pitch attribute
This attribute specifies the speaking pitch for the utterance. It ranges between 0 and 2 inclusive, with 0 being the lowest pitch and 2 the highest pitch. 1 corresponds to the default pitch of the speech synthesis engine or specific voice. Speech synthesis engines or specific voices may constrain the minimum and maximum pitch further. If SSML is used, this value will be overridden by prosody tags in the markup.
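
The following non-normative example configures utterance attributes before speaking; the voiceURI value shown is illustrative only.

  <script type="text/javascript">
    var u = new SpeechSynthesisUtterance('Hello World');
    u.lang = 'en-US';      // BCP 47 language tag
    u.volume = 0.8;        // 0 to 1, default 1
    u.rate = 1.5;          // relative to the default rate for the voice
    u.pitch = 1.2;         // 0 to 2, default 1
    u.voiceURI = 'urn:example:voice:female-1';  // illustrative value only
    speechSynthesis.speak(u);
  </script>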

5.2.4 SpeechSynthesisUtterance Events

Each of these events must use the SpeechSynthesisEvent interface.
start event
Fired when this utterance has begun to be spoken.
end event
Fired when this utterance has completed being spoken. If this event fires, the error event must not be fired for this utterance.
error event
Fired if there was an error that prevented successful speaking of this utterance. If this event fires, the end event must not be fired for this utterance.
pause event
Fired when and if this utterance is paused mid-utterance.
resume event
Fired when and if this utterance is resumed after being paused mid-utterance. Adding the utterance to the queue while the global SpeechSynthesis instance is in the paused state, and then calling the resume method, does not cause the resume event to be fired; in this case the utterance's start event will be fired when the utterance starts.
mark event
Fired when the spoken utterance reaches a named "mark" tag in SSML. [SSML] The user agent must fire this event if the speech synthesis engine provides the event.
boundary event
Fired when the spoken utterance reaches a word or sentence boundary. The user agent must fire this event if the speech synthesis engine provides the event.
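
The following non-normative example registers handlers for utterance events; the boundary event fires only if the synthesis engine provides it.

  <script type="text/javascript">
    var u = new SpeechSynthesisUtterance('Hello World');
    u.onboundary = function(event) {
      // name is "word" or "sentence"; charIndex approximates the current speaking position.
      console.log(event.name + ' boundary at character ' + event.charIndex);
    };
    u.onend = function(event) {
      console.log('Finished in ' + event.elapsedTime + ' seconds.');
    };
    speechSynthesis.speak(u);
  </script>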

5.2.5 SpeechSynthesisEvent Attributes

charIndex attribute
This attribute indicates the zero-based character index into the original utterance string that most closely approximates the current speaking position of the speech engine. No guarantee is given as to where charIndex will be with respect to word boundaries (such as at the end of the previous word or the beginning of the next word), only that all text before charIndex has already been spoken, and all text after charIndex has not yet been spoken. The user agent must return this value if the speech synthesis engine supports it; otherwise the user agent must return undefined.
elapsedTime attribute
This attribute indicates the time, in seconds, at which this event was triggered, relative to when this utterance began to be spoken. The user agent must return this value if the speech synthesis engine supports it or the user agent can otherwise determine it; otherwise the user agent must return undefined.
name attribute
For mark events, this attribute indicates the name of the marker, as defined in SSML as the name attribute of a mark element. [SSML] For boundary events, this attribute indicates the type of boundary that caused the event: "word" or "sentence". For all other events, this value should return undefined.

5.2.6 SpeechSynthesisVoice

voiceURI attribute
The voiceURI attribute specifies the speech synthesis voice and the location of the speech synthesis service for this voice. Note that the voiceURI is a generic URI and can thus point to local or remote services, as described in the SpeechSynthesisUtterance voiceURI attribute.
name attribute
This attribute is a human-readable name that represents the voice. There is no guarantee that all names returned are unique.
lang attribute
This attribute is a BCP 47 language tag indicating the language of the voice. [BCP47]
localService attribute
This attribute is true for voices supplied by a local speech synthesizer, and is false for voices supplied by a remote speech synthesizer service. (This may be useful because remote services may imply additional latency, bandwidth or cost, whereas local voices may imply lower quality, however there is no guarantee that any of these implications are true.)
default attribute
This attribute is true for at most one voice per language. There may be a different default for each language. It is user agent dependent how default voices are determined.

5.2.7 SpeechSynthesisVoiceList

The SpeechSynthesisVoiceList object holds a collection of SpeechSynthesisVoice objects. This structure has the following attributes.

length attribute
The length attribute indicates how many voices are represented in the item array.
item getter
The item getter returns a SpeechSynthesisVoice from the index into an array of voices. If index is greater than or equal to length, this returns null. The user agent must ensure that the length attribute is set to the number of elements in the array.
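
The following non-normative example selects a voice from getVoices() by language and default flag, then uses its voiceURI for an utterance.

  <script type="text/javascript">
    var voices = speechSynthesis.getVoices();
    var u = new SpeechSynthesisUtterance('Hello World');
    for (var i = 0; i < voices.length; ++i) {
      var voice = voices.item(i);
      if (voice.lang == 'en-US' && voice.default) {
        u.voiceURI = voice.voiceURI;
        break;
      }
    }
    speechSynthesis.speak(u);
  </script>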

6 Examples

This section is non-normative.

6.1 Speech Recognition Examples

Using speech recognition to fill an input-field and perform a web search.

Example 1
            
  <script type="text/javascript">
    var recognition = new SpeechRecognition();
    recognition.onresult = function(event) {
      if (event.results.length > 0) {
        q.value = event.results[0][0].transcript;
        q.form.submit();
      }
    }
  </script>

  <form action="http://www.example.com/search">
    <input type="search" id="q" name="q" size=60>
    <input type="button" value="Click to Speak" onclick="recognition.start()">
  </form>
            
          

Using speech recognition to fill an options list with alternative speech results.

Example 2
            
    <script type="text/javascript">
      var recognition = new SpeechRecognition();
      recognition.maxAlternatives = 10;
      recognition.onresult = function(event) {
        if (event.results.length > 0) {
          var result = event.results[0];
          for (var i = 0; i < result.length; ++i) {
            var text = result[i].transcript;
            select.options[i] = new Option(text, text);
          }
        }
      }

      function start() {
        select.options.length = 0;
        recognition.start();
      }
    </script>

    <select id="select"></select>
    <button onclick="start()">Click to Speak</button>
            
          

Using continuous speech recognition to fill a textarea.

Example 3
            
  <textarea id="textarea" rows=10 cols=80></textarea>
  <button id="button" onclick="toggleStartStop()"></button>

  <script type="text/javascript">
    var recognizing;
    var recognition = new SpeechRecognition();
    recognition.continuous = true;
    reset();
    recognition.onend = reset;

    recognition.onresult = function (event) {
      for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].final) {
          textarea.value += event.results[i][0].transcript;
        }
      }
    }

    function reset() {
      recognizing = false;
      button.innerHTML = "Click to Speak";
    }

    function toggleStartStop() {
      if (recognizing) {
        recognition.stop();
        reset();
      } else {
        recognition.start();
        recognizing = true;
        button.innerHTML = "Click to Stop";
      }
    }
  </script>
            
          

Using continuous speech recognition, showing final results in black and interim results in grey.

Example 4
            
  <button id="button" onclick="toggleStartStop()"></button>
  <div style="border:dotted;padding:10px">
    <span id="final_span"></span>
    <span id="interim_span" style="color:grey"></span>
  </div>

  <script type="text/javascript">
    var recognizing;
    var recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;
    reset();
    recognition.onend = reset;

    recognition.onresult = function (event) {
      var final = "";
      var interim = "";
      for (var i = 0; i < event.results.length; ++i) {
        if (event.results[i].final) {
          final += event.results[i][0].transcript;
        } else {
          interim += event.results[i][0].transcript;
        }
      }
      final_span.innerHTML = final;
      interim_span.innerHTML = interim;
    }

    function reset() {
      recognizing = false;
      button.innerHTML = "Click to Speak";
    }

    function toggleStartStop() {
      if (recognizing) {
        recognition.stop();
        reset();
      } else {
        recognition.start();
        recognizing = true;
        button.innerHTML = "Click to Stop";
        final_span.innerHTML = "";
        interim_span.innerHTML = "";
      }
    }
  </script>
            
          

6.2 Speech Synthesis Examples

Spoken text.

Example 1
            
  <script type="text/javascript">
     speechSynthesis.speak(new SpeechSynthesisUtterance('Hello World'));
  </script>
            
          

Spoken text with attributes and events.

Example 2
            
  <script type="text/javascript">
     var u = new SpeechSynthesisUtterance();
     u.text = 'Hello World';
     u.lang = 'en-US';
     u.rate = 1.2;
     u.onend = function(event) { alert('Finished in ' + event.elapsedTime + ' seconds.'); }
     speechSynthesis.speak(u);
  </script>
            
          

Acknowledgments

Peter Beverloo, Google, Inc.
Bjorn Bringert, Google, Inc.
Gerardo Capiel, Benetech
Jerry Carter
Nagesh Kharidi, Openstream, Inc.
Dominic Mazzoni, Google, Inc.
Olli Pettay, Mozilla Foundation
Charles Pritchard
Satish Sampath, Google, Inc.
Adam Sobieski, Phoster, Inc.
Raj Tumuluri, Openstream, Inc.
Also, the members of the HTML Speech Incubator Group, and the corresponding Final Report [1], created the basis for this specification.

References

[BCP47]
Tags for Identifying Languages, A. Phillips, et al. September 2009. Internet BCP 47. URL: http://www.ietf.org/rfc/bcp/bcp47.txt
[EMMA]
EMMA: Extensible MultiModal Annotation markup language, Michael Johnston, Editor. World Wide Web Consortium, 10 February 2009. URL: http://www.w3.org/TR/emma/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt
[SSML]
Speech Synthesis Markup Language (SSML), Daniel C. Burnett, et al., Editors. World Wide Web Consortium, 7 September 2004. URL: http://www.w3.org/TR/speech-synthesis/
[WEBIDL]
Web IDL, Cameron McCormack, Editor. World Wide Web Consortium, 19 April 2012. URL: http://www.w3.org/TR/WebIDL/
[1]
HTML Speech Incubator Group Final Report, Michael Bodell, et al., Editors. World Wide Web Consortium, 6 December 2011. URL: http://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech/
[2]
SpeechRecognitionAlternative.interpretation when interpretation can't be provided thread on public-speech-api@w3.org, World Wide Web Consortium Speech API Community Group mailing list. URL: http://lists.w3.org/Archives/Public/public-speech-api/2012Sep/0044.html
[3]
Default value of SpeechRecognition.grammars thread on public-speech-api@w3.org, World Wide Web Consortium Speech API Community Group mailing list. URL: http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0179.html
[4]
Confidence property thread on public-speech-api@w3.org, World Wide Web Consortium Speech API Community Group mailing list. URL: http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0143.html
[5]
Interacting with WebRTC, the Web Audio API and other external sources thread on public-speech-api@w3.org, World Wide Web Consortium Speech API Community Group mailing list. URL: http://lists.w3.org/Archives/Public/public-speech-api/2012Sep/0072.html