--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/media-stream-capture/scenarios.html Mon Dec 05 17:32:19 2011 -0800
@@ -0,0 +1,863 @@
+<!DOCTYPE html>
+<html>
+ <head>
+ <title>MediaStream Capture Scenarios</title>
+ <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>
+ <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/respec.js' class='remove'></script>
+ <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/sh_main.min.js' class='remove'></script>
+ <script type="text/javascript" class='remove'>
+ var respecConfig = {
+ specStatus: "CG-NOTE",
+ editors: [{
+ name: "Travis Leithead",
+ company: "Microsoft Corp.",
+ url: "mailto:travis.leithead@microsoft.com?subject=MediaStream Capture Scenarios Feedback",
+ companyURL: "http://www.microsoft.com"}],
+ previousPublishDate: null,
+ noIDLIn: true,
+ };
+ </script>
+ <script type="text/javascript" src='http://dev.w3.org/2009/dap/common/config.js' class='remove'></script>
+ <style type="text/css">
+ /* ReSpec.js CSS optimizations (Richard Tibbett) - cut-n-paste :) */
+ div.example {
+ border-top: 1px solid #ff4500;
+ border-bottom: 1px solid #ff4500;
+ background: #fff;
+ padding: 1em;
+ font-size: 0.9em;
+ margin-top: 1em;
+ }
+ div.example::before {
+ content: "Example";
+ display: block;
+ width: 150px;
+ background: #ff4500;
+ color: #fff;
+ font-family: initial;
+ padding: 3px;
+ padding-left: 5px;
+ font-weight: bold;
+ margin: -1em 0 1em -1em;
+ }
+
+ /* Clean up pre.idl */
+ pre.idl::before {
+ font-size:0.9em;
+ }
+
+ /* Add better spacing to sections */
+ section, .section {
+ margin-bottom: 2em;
+ }
+
+ /* Reduce note & issue render size */
+ .note, .issue {
+ font-size:0.8em;
+ }
+
+ /* Add addition spacing to <ol> and <ul> for rule definition */
+ ol.rule li, ul.rule li {
+ padding:0.2em;
+ }
+ </style>
+ </head>
+
+ <body>
+ <section id='abstract'>
+ <p>
+ This document collates the target scenarios for the Media Capture task force. The scenarios represent
+ the set of expected functionality that may be achieved through use of the MediaStream Capture API. A set of
+ unsupported scenarios may also be documented here.
+ </p>
+ <p>This document builds on the assumption that the mechanism for obtaining fundamental access to local media
+ capture device(s) is <code>navigator.getUserMedia</code> (name/behavior subject to change by this task force), and that
+ the vehicle for delivery of the content from the local media capture device(s) is a <code>MediaStream</code>.
+ Hence the title of this note.
+ </p>
+ </section>
+
+ <section id='sotd'>
+ <p>
+ This document will eventually represent the consensus of the media capture task force on the set of scenarios
+ supported by the MediaStream Capture API. If you wish to make comments regarding this document, please
+ send them to <a href="mailto:public-media-capture@w3.org">public-media-capture@w3.org</a> (
+ <a href="mailto:public-media-capture-request@w3.org?subject=subscribe">subscribe</a>,
+ <a href="http://lists.w3.org/Archives/Public/public-media-capture/">archives</a>).
+ </p>
+ </section>
+
+ <section class="informative">
+ <h2>Introduction</h2>
+ <p>
+ One of the goals of the joint task force between the Device APIs and Policy (DAP) and Web Real-Time
+ Communications (WebRTC) working groups is to bring media capture scenarios from both groups together into one unified
+ API that can address all relevant use cases.
+ </p>
+ <p>
+ The capture scenarios from WebRTC are primarily driven by real-time-communication scenarios, such as
+ the recording of live chats, teleconferences, and other media streamed over the network from potentially
+ multiple sources.
+ </p>
+ <p>
+ The capture scenarios from DAP are primarily driven by "local" capture scenarios related to providing access
+ to a user agent's camera and related experiences.
+ </p>
+ <p>
+ Both groups include overlapping chartered deliverables in this space. Namely in DAP,
+ <a href="http://www.w3.org/2009/05/DeviceAPICharter">the charter specifies a recommendation-track deliverable</a>:
+ <dl>
+ <dt>Camera API</dt>
+ <dd>an API to manage a device's camera e.g. to take a picture</dd>
+ </dl>
+ </p>
+ <p>
+ And <a href="http://www.w3.org/2011/04/webrtc-charter.html">WebRTC's charter scope</a> describes enabling
+ real-time communications between web browsers that will require specific client-side technologies:
+ <ul>
+ <li>API functions to explore device capabilities, e.g. camera, microphone, speakers (currently in scope
+ for the <a href="http://www.w3.org/2009/dap/">Device APIs &amp; Policy Working Group</a>)</li>
+ <li>API functions to capture media from local devices (camera and microphone) (currently in scope for the
+ <a href="http://www.w3.org/2009/dap/">Device APIs &amp; Policy Working Group</a>)</li>
+ <li>API functions for encoding and other processing of those media streams,</li>
+ <li>API functions for decoding and processing (including echo cancelling, stream synchronization and a
+ number of other functions) of those streams at the incoming end,</li>
+ <li>Delivery to the user of those media streams via local screens and audio output devices (partially
+ covered with HTML5)</li>
+ </ul>
+ </p>
+ <p>
+ Note that the scenarios described in this document specifically exclude peer-to-peer and networking scenarios
+ that do not overlap with local capture scenarios, as these are not considered in-scope for this task force.
+ </p>
+ <p>
+ Also excluded are declarative capture scenarios, such as those where media capture can be
+ obtained and submitted to a server entirely without the use of script. Such scenarios generally involve the use
+ of a UA-specific app or mode for interacting with the recording device, altering settings, and completing the
+ capture. They are currently covered by the DAP working group's <a href="http://dev.w3.org/2009/dap/camera/">HTML Media Capture</a>
+ specification.
+ </p>
+ <p>
+ The scenarios contained in this document are those in which web applications require direct access
+ to the capture device, its settings, and the recording mechanism and output. Such scenarios have been deemed
+ crucial to building applications that can give a site-specific look-and-feel to the user's interaction with the
+ capture device, as well as utilize advanced functionality that may not be available in a declarative model.
+ </p>
+ </section>
+
+ <!-- Travis: No conformance section necessary?
+
+ <section id='conformance'>
+ <p>
+ This specification defines conformance criteria that apply to a single product: the
+ <dfn id="ua">user agent</dfn> that implements the interfaces that it contains.
+ </p>
+ <p>
+ Implementations that use ECMAScript to implement the APIs defined in this specification must implement
+ them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification
+ [[!WEBIDL]], as this specification uses that specification and terminology.
+ </p>
+ <p>
+ A conforming implementation is required to implement all fields defined in this specification.
+ </p>
+
+ <section>
+ <h2>Terminology</h2>
+ <p>
+ The terms <dfn>document base URL</dfn>, <dfn>browsing context</dfn>, <dfn>event handler attribute</dfn>,
+ <dfn>event handler event type</dfn>, <dfn>task</dfn>, <dfn>task source</dfn> and <dfn>task queues</dfn>
+ are defined by the HTML5 specification [[!HTML5]].
+ </p>
+ <p>
+ The <a>task source</a> used by this specification is the <dfn>device task source</dfn>.
+ </p>
+ <p>
+ To <dfn>dispatch a <code>success</code> event</dfn> means that an event with the name
+ <code>success</code>, which does not bubble and is not cancellable, and which uses the
+ <code>Event</code> interface, is to be dispatched at the <a>ContactFindCB</a> object.
+ </p>
+ <p>
+ To <dfn>dispatch an <code>error</code> event</dfn> means that an event with the name
+ <code>error</code>, which does not bubble and is not cancellable, and which uses the <code>Event</code>
+ interface, is to be dispatched at the <a>ContactErrorCB</a> object.
+ </p>
+ </section>
+ </section>
+ -->
+
+ <section>
+ <h2>Concepts and Definitions</h2>
+ <p>
+ This section describes some terminology and concepts that frame an understanding of the scenarios that
+ follow. It is helpful to have a common understanding of some core concepts to ensure that the scenarios
+ are interpreted uniformly.
+ </p>
+ <dl>
+ <dt>Stream</dt>
+ <dd>A stream, including the implied derivative
+ <code><a href="http://dev.w3.org/2011/webrtc/editor/webrtc.html#introduction">MediaStream</a></code>,
+ can be conceptually understood as a tube or conduit between a source (the stream's generator) and a
+ destination (the sink). Streams don't generally include any significant buffer; that is, content
+ pushed into the stream from a source does not collect into any buffer for later retrieval. Rather, content
+ is simply dropped on the floor if the stream is not connected to a sink. This document assumes the
+ non-buffered view of streams as just described.
+ </dd>
+ <dt><code>MediaStream</code> vs "media stream"</dt>
+ <dd>In some cases, I use these two terms interchangeably; my usage of the term "media stream" is intended as
+ a generalization of the more specific <code>MediaStream</code> interface as currently defined in the
+ WebRTC spec.</dd>
+ <dt><code>MediaStream</code> format</dt>
+ <dd>As stated in the WebRTC specification, the content flowing through a <code>MediaStream</code> is not in
+ any particular underlying format:</dd>
+ <dd><blockquote>[The data from a <code>MediaStream</code> object does not necessarily have a canonical binary form; for
+ example, it could just be "the video currently coming from the user's video camera". This allows user agents
+ to manipulate media streams in whatever fashion is most suitable on the user's platform.]</blockquote></dd>
+ <dd>This document reinforces that view, especially when dealing with recording of the <code>MediaStream</code>'s content
+ and the potential interaction with the <a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Streams API</a>.
+ </dd>
+ <dt>Virtualized device</dt>
+ <dd>Device virtualization (in my simplistic view) is the process of abstracting the settings for a device such
+ that code interacts with the virtualized layer rather than with the actual device itself. Audio devices are
+ commonly virtualized. This allows many applications to use the audio device at the same time and apply
+ different audio settings, like volume, independently of each other. It also allows audio from multiple
+ applications to be mixed together in the final output to the device. In some operating systems, such as Windows, a webcam's
+ video source is not virtualized, meaning that only one application can have control over the device at any
+ one time. In order for an app to use the webcam, either another app already using the webcam must yield it up,
+ or the new app must "steal" the camera from the previous app. An API could expose a setting that
+ changes the device configuration in such a way that it prevents that device from being virtualized; for example,
+ a "zoom" setting applied to a webcam device. Changing the zoom level on the device itself would affect
+ all potential virtualized versions of the device, and therefore defeat the virtualization.</dd>
+ </dl>
+ </section>
+
+ <section>
+ <h2>Media Capture Scenarios</h2>
+
+ <section>
+ <h3>Stream initialization</h3>
+ <p>A web application must be able to initiate a request for access to the user's webcam(s) and/or microphone(s).
+ Additionally, the web application should be able to "hint" at specific device characteristics that are desired by
+ the particular usage scenario of the application. User consent is required before obtaining access to the requested
+ stream.</p>
+ <p>When the media capture devices have been obtained (after user consent), the associated stream should be active
+ and populated with the appropriate devices (likely in the form of tracks, to re-use an existing
+ <code>MediaStream</code> concept). The active capture devices will be configured according to user preference; the
+ user may have an opportunity to configure the initial state of the devices, select specific devices, and/or elect
+ to enable/disable a subset of the requested devices at the point of consent or beyond (the user remains in control).
+ </p>
+ <section>
+ <h4>Privacy</h4>
+ <p>Specific information about a given webcam and/or microphone must not be available until after the user has
+ granted consent. Otherwise a "drive-by" fingerprint of a UA's devices and their characteristics can be obtained without
+ the user's knowledge, which is a privacy issue.</p>
+ </section>
+
+ <p>The <code>navigator.getUserMedia</code> API fulfills these scenarios today.</p>
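+ <p>The following sketch illustrates the basic request flow. It assumes an object form for the
+ <code>getUserMedia</code> options and the success/error callback arrangement of the current editor's draft;
+ both the name and the exact signature remain subject to change by this task force.
+ </p>
+ <pre class="sh_javascript">
+// Request both audio and video capture; the user must consent before the
+// success callback receives a MediaStream populated with the granted tracks.
+navigator.getUserMedia({ video: true, audio: true },
+  function (stream) {
+    // Capture devices are active; 'stream' can now be previewed or recorded.
+    window.capturedStream = stream;
+  },
+  function (error) {
+    // Either no matching device exists or the user declined the request.
+    console.log('getUserMedia request failed', error);
+  });</pre>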
+
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>What are the privacy/fingerprinting implications of the current "error" callback? Is it sufficiently "scary"
+ to warrant a change? Consider the following:
+ <ul>
+ <li>If the user doesn’t have a webcam/mic, and the developer requests it, a UA would be expected to invoke
+ the error callback immediately.</li>
+ <li>If the user does have a webcam/mic, and the developer requests it, a UA would be expected to prompt for
+ access. If the user denies access, then the error callback is invoked.</li>
+ <li>Depending on the timing of the invocation of the error callback, scripts can still profile whether the
+ UA does or does not have a given device capability.</li>
+ </ul>
+ </li>
+ <li>In the case of a user with multiple video and/or audio capture devices, what specific permission is expected to
+ be granted for the "video" and "audio" options presented to <code>getUserMedia</code>? For example, does "video"
+ permission mean that the user grants permission to any and all video capture devices? Similarly with "audio"? Is
+ it a specific device only, and if so, which one? Given the privacy point above, my recommendation is that "video"
+ permission represents permission to all possible video capture devices present on the user's device, therefore
+ enabling switching scenarios (among video devices) to be possible without re-acquiring user consent. Same for
+ "audio" and combinations of the two.
+ </li>
+ <li>When a user has only one of two requested device capabilities (for example only "audio" but not "video", and both
+ "audio" and "video" are requested), should access be granted without the video or should the request fail?
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Stream re-initialization</h3>
+
+ <p>After requesting (and presumably being granted) access to media capture devices, it is entirely possible for one or more of
+ the requested devices to stop or fail (for example, if a video device is claimed by another application, or if the user
+ unplugs a capture device or physically turns it off, or if the UA shuts down the device arbitrarily to conserve battery
+ power). In such a scenario it should be reasonably simple for the application to be notified of the situation, and for
+ the application to re-request access to the stream.
+ </p>
+ <p>Today, the <code>MediaStream</code> offers a single <code>ended</code> event. This could be sufficient for this
+ scenario.
+ </p>
+ <p>Additional information might also be useful, either in the form of <code>MediaStream</code> state such as an error object,
+ or an additional event like an <code>error</code> event (or both).
+ </p>
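+ <p>A minimal sketch of how an application might detect that its local stream has ended and attempt to
+ re-acquire it. It relies only on the <code>ended</code> event mentioned above (via an assumed
+ <code>onended</code> handler); the <code>reacquire</code> function is application code making a fresh
+ <code>getUserMedia</code> request, not a platform API.
+ </p>
+ <pre class="sh_javascript">
+function reacquire() {
+  // Re-request the same devices; depending on the UA this may prompt the user again.
+  navigator.getUserMedia({ video: true, audio: true }, attach, function () {
+    // The devices are gone or consent was withdrawn; surface this to the user.
+  });
+}
+
+function attach(stream) {
+  stream.onended = function () {
+    // The source device failed, was unplugged, or was claimed by another application.
+    reacquire();
+  };
+}</pre>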
+
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>How shall the stream be re-acquired efficiently? Is it merely a matter of re-requesting the entire
+ <code>MediaStream</code>, or can an "ended" media stream be quickly revived? Reviving a local media stream makes
+ more sense in the context of the stream representing a set of device states than it does when the stream
+ represents a network source.
+ </li>
+ <li>What's the expected interaction model with regard to user-consent? For example, if the re-initialization
+ request is for the same device(s), will the user be prompted for consent again?
+ </li>
+ <li>How can tug-of-war scenarios be avoided between two web applications both attempting to gain access to a
+ non-virtualized device at the same time?
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Preview a stream</h3>
+ <p>The application should be able to connect a media stream (representing active media capture device(s)) to a sink
+ in order to "see" the content flowing through the stream. In nearly all digital capture scenarios, "previewing"
+ the stream before initiating the capture is essential to the user in order to "compose" the shot (for example,
+ digital cameras have a preview screen before a picture or video is captured; even in non-digital photography, the
+ viewfinder acts as the "preview"). This is particularly important for visual media, but applies to non-visual media
+ like audio as well.
+ </p>
+ <p>Note that media streams connected to a preview output sink are not in a "recording" state as the media stream has
+ no default buffer (see the <a>Stream</a> definition in section 2). Content conceptually "within" the media stream
+ is streaming from the capture source device to the preview sink after which point the content is dropped (not
+ saved).
+ </p>
+ <p>The application should be able to effect changes to the media capture device(s) settings via the media stream
+ and see those changes happen in the preview.
+ </p>
+ <p>Today, the <code>MediaStream</code> object can be connected to several "preview" sinks in HTML5, including the
+ <code>video</code> and <code>audio</code> elements. (This support should also extend to the <code>source</code>
+ elements of each.) The connection is accomplished via <code>URL.createObjectURL</code>.
+ </p>
+ <p>These concepts are fully supported by the current WebRTC specification.</p>
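+ <p>A short sketch of wiring a stream to a <code>video</code> element for preview via
+ <code>URL.createObjectURL</code> as described above; the <code>stream</code> variable is assumed to have
+ been obtained earlier from <code>getUserMedia</code>.
+ </p>
+ <pre class="sh_javascript">
+var video = document.getElementById('preview'); // an HTML5 video element in the page
+var url = URL.createObjectURL(stream);          // connect the media stream to a URL
+video.autoplay = true;
+video.src = url;                                // the video element now acts as the preview sink
+
+// When the preview is no longer needed, release the URL-to-stream connection.
+function stopPreview() {
+  video.src = '';
+  URL.revokeObjectURL(url);
+}</pre>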
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>Audio tag preview is somewhat problematic because of the acoustic feedback problem (interference that can
+ result from a loop in which a microphone input picks up the output of a nearby speaker). There are
+ software solutions that attempt to automatically compensate for these types of feedback problems. However, it
+ may not be appropriate to require all implementations to support such an acoustic feedback prevention
+ algorithm. Therefore, audio preview could be turned off by default and only enabled by specific opt-in.
+ Could implementations without acoustic feedback prevention simply fail to enable the opt-in?
+ </li>
+ <li>A 1:1 association between the source and sink of a media stream makes a lot of sense; for example,
+ one media stream to one video element in HTML5. It is less clear what the value might be of supporting 1:many
+ media stream sinks—for example, it could be a significant performance load on the system to preview a media
+ stream in multiple video elements at once. Implementation feedback here would be valuable. It would also be
+ important to understand the scenarios that require a 1:many viewing of a single media stream.
+ </li>
+ <li>Are there any use cases for stopping or re-starting the preview (exclusively) that are sufficiently different
+ from the following scenarios?
+ <ul>
+ <li>Stopping/re-starting the device(s)—at the source of the media stream.</li>
+ <li>Assigning/clearing the URL from media stream sinks.</li>
+ <li>createObjectURL/revokeObjectURL – for controlling the [subsequent] connections to the media stream sink
+ via a URL.
+ </li>
+ </ul>
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Stopping local devices</h3>
+ <p>End-users need to feel in control of their devices. Likewise, it is expected that developers using a media stream
+ capture API will want to provide a mechanism for users to stop their in-use device(s) via the software (rather than
+ using hardware on/off buttons which may not always be available).
+ </p>
+ <p>Stopping or ending a media stream's source device(s) in this context implies that those source device(s)
+ cannot be re-started. This is a distinct scenario from simply "muting" the video/audio tracks of a given media stream.
+ </p>
+ <p>The current WebRTC draft describes a <code>stop</code> API on a <code>LocalMediaStream</code> interface, whose
+ purpose is to stop the media stream at its source.
+ </p>
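+ <p>A brief sketch of such a software stop control, assuming the <code>stop()</code> method on
+ <code>LocalMediaStream</code> from the current draft, where <code>localStream</code> is the
+ <code>LocalMediaStream</code> previously obtained from <code>getUserMedia</code>. After <code>stop()</code>
+ the devices cannot be re-started from this stream; a new <code>getUserMedia</code> request would be required.
+ </p>
+ <pre class="sh_javascript">
+var stopButton = document.getElementById('stop-capture');
+stopButton.onclick = function () {
+  // Ends the stream at its source and releases the underlying capture device(s).
+  localStream.stop();
+};</pre>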
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>Is there a scenario where end-users will want to stop just a single device, rather than all devices participating
+ in the current media stream?
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Pre-processing</h3>
+ <p>Pre-processing scenarios are a group of scenarios that perform processing on the "raw" or "internal" characteristics
+ of the media stream for the purpose of reporting information that would otherwise require processing of a known
+ format (i.e., at a media stream sink—like Canvas, or via recording and post-processing), significant
+ computationally expensive scripting, etc.
+ </p>
+ <p>Pre-processing scenarios will require the UAs to provide an implementation (which may be non-trivial). This is
+ required because the media stream has no internal format from which a script-based implementation could be derived
+ (and I believe advocating for the specification of such a format is unwise).
+ </p>
+ <p>Pre-processing scenarios provide information that is generally needed <i>before</i> a stream is connected to a
+ sink or recorded.
+ </p>
+ <p>Pre-processing scenarios apply to both real-time-communication and local capture scenarios. Therefore, the
+ specification of various pre-processing requirements may likely fall outside the scope of this task force. However,
+ they are included here for scenario-completeness and to help ensure that a media capture API design takes them into
+ account.
+ </p>
+ <section>
+ <h4>Examples</h4>
+ <ul>
+ <li>Audio end-pointing. As described in <a href="http://lists.w3.org/Archives/Public/www-archive/2011Mar/att-0001/microsoft-api-draft-final.html">a
+ speech API proposal</a>, audio end-pointing allows for the detection of noise, speech, or silence and raises events
+ when these audio states change. End-pointing is necessary for scenarios that programmatically determine when to
+ start and stop recording an audio stream for purposes of hands-free speech commands, dictation, and a variety of
+ other speech and accessibility-related scenarios. The proposal linked above describes these scenarios in better
+ detail. Audio end-pointing would be required as a pre-processing scenario because it is a prerequisite to
+ starting/stopping a recorder of the media stream itself.
+ </li>
+ <li>Volume leveling/automatic gain control. The ability to automatically detect changes in audio loudness and adjust
+ the input volume such that the output volume remains constant. These scenarios are useful in a variety of
+ heterogeneous audio environments such as teleconferences, live broadcasting involving commercials, etc.
+ Configuration options for volume/gain control of a media stream source device are also useful, and are explored
+ later on.
+ </li>
+ <li>Video face-recognition and gesture detection. These scenarios are the visual analog to the previously described
+ audio end-pointing scenarios. Face-recognition is useful in a variety of contexts from identifying faces in family
+ photographs, to serving as part of an identity management system for system access. Likewise, gesture recognition
+ can act as an input mechanism for a computer.
+ </li>
+ </ul>
+ </section>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>In general the set of audio pre-processing scenarios is much more constrained than the set of possible visual
+ pre-processing scenarios. Due to the large set of visual pre-processing scenarios (which could also be implemented
+ by scenario-specific post-processing in most cases), we may recommend that visual-related pre-processing
+ scenarios be excluded from the scope of our task force.
+ </li>
+ <li>The challenge of specifying pre-processing scenarios will be identifying what specific information should be
+ conveyed by the platform, at a level which serves the widest variety of scenarios. For example,
+ audio end-pointing could be specified in high-level terms of firing events when specific words of a given language
+ are identified, or could be as low-level as reporting when there is silence/background noise and when there's not.
+ No single API design will be able to serve all scenarios; therefore this group might choose to
+ evaluate which scenarios (if any) are worth including in the first version of the API.
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Post-processing</h3>
+ <p>Post-processing scenarios are the group of all scenarios that can be completed after either:</p>
+ <ol>
+ <li>Connecting the media stream to a sink (such as the <code>video</code> or <code>audio</code> elements)</li>
+ <li>Recording the media stream to a known format (MIME type)</li>
+ </ol>
+ <p>Post-processing scenarios will continue to expand and grow as the web platform matures and gains capabilities.
+ The key to understanding the available post-processing scenarios is to understand the other facets of the web
+ platform that are available for use.
+ </p>
+ <section>
+ <h4>Web platform post-processing toolbox</h4>
+ <p>The common post-processing capabilities for media stream scenarios are built on a relatively small set of web
+ platform capabilities:
+ </p>
+ <ul>
+ <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-video-element"><code>video</code></a> and
+ <a href="http://dev.w3.org/html5/spec/Overview.html#the-audio-element"><code>audio</code></a> tags. These elements are natural
+ candidates for media stream output sinks. Additionally, they provide an API (see
+ <a href="http://dev.w3.org/html5/spec/Overview.html#htmlmediaelement">HTMLMediaElement</a>) for interacting with
+ the source content. Note: in some cases, these elements are not well-specified for stream-type sources—this task
+ force may need to drive some stream-source requirements into HTML5.
+ </li>
+ <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-canvas-element"><code>canvas</code></a> element
+ and the <a href="http://dev.w3.org/html5/2dcontext/">Canvas 2D context</a>. The <code>canvas</code> element employs
+ a fairly extensive 2D drawing API and will soon be extended with audio capabilities as well (<b>RichT, can you
+ provide a link?</b>). Canvas' drawing API allows for drawing frames from a <code>video</code> element, which is
+ the link between the media capture sink and the effects made possible via Canvas.
+ </li>
+ <li><a href="http://dev.w3.org/2006/webapi/FileAPI/">File API</a> and
+ <a href="http://www.w3.org/TR/file-writer-api/">File API Writer</a>. The File API provides various methods for
+ reading and writing to binary formats. The fundamental container for these binary files is the <code>Blob</code>
+ which, put simply, is a read-only structure with a MIME type and a length. The File API integrates with many other
+ web APIs such that the <code>Blob</code> can be used uniformly across the entire web platform. For example,
+ <code>XMLHttpRequest</code>, form submission in HTML, message passing between documents and web workers
+ (<code>postMessage</code>), and Indexed DB all support <code>Blob</code> use.
+ </li>
+ <li><a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Stream API</a>. A new addition to
+ the WebApps WG, the <code>Stream</code> is another general-purpose binary container. The primary differences
+ between a <code>Stream</code> and a <code>Blob</code> is that the <code>Stream</code> is read-once, and has no
+ length. The Stream API includes a mechanism to buffer from a <code>Stream</code> into a <code>Blob</code>, and
+ thus all <code>Stream</code> scenarios are a super-set of <code>Blob</code> scenarios.
+ </li>
+ <li>JavaScript <a href="http://wiki.ecmascript.org/doku.php?id=strawman:typed_arrays">TypedArrays</a>. Especially
+ useful for post-processing scenarios, TypedArrays allow JavaScript code to crack-open a binary file
+ (<code>Blob</code>) and read/write its contents using the numerical data types already provided by JavaScript.
+ There's a cool explanation and example of TypedArrays
+ <a href="http://blogs.msdn.com/b/ie/archive/2011/12/01/working-with-binary-data-using-typed-arrays.aspx">here</a>;
+ a brief sketch follows this list.
+ </li>
+ </ul>
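+ <p>A small sketch of the <code>Blob</code>-plus-TypedArray combination mentioned in the last item above:
+ the File API's <code>FileReader</code> produces an <code>ArrayBuffer</code> from a recorded
+ <code>Blob</code>, which a <code>Uint8Array</code> can then read byte-by-byte.
+ </p>
+ <pre class="sh_javascript">
+function inspectRecording(blob) {
+  var reader = new FileReader();
+  reader.onload = function () {
+    // View the raw bytes of the recorded file through a typed array.
+    var bytes = new Uint8Array(reader.result);
+    console.log('Recording is ' + bytes.length + ' bytes; first byte: ' + bytes[0]);
+  };
+  reader.readAsArrayBuffer(blob);
+}</pre>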
+ </section>
+ <p>Of course, post-processing scenarios made possible after sending a media stream or recorded media stream to a
+ server are unlimited.
+ </p>
+ <section>
+ <h4>Time sensitivity and performance</h4>
+ <p>Some post-processing scenarios are time-sensitive—especially those scenarios that involve processing large
+ amounts of data while the user waits. Other post-processing scenarios are long-running and can have a performance
+ benefit if started before the end of the media stream segment is known; a low-pass filter on a video, for example.
+ </p>
+ <p>These scenarios generally take two approaches:</p>
+ <ol>
+ <li>Extract samples (video frames/audio clips) from a media stream sink and process each sample. Note that this
+ approach is vulnerable to sample loss (gaps between samples) if post-processing is too slow.
+ </li>
+ <li>Record the media stream and extract samples from the recorded native format. Note that this approach requires
+ significant understanding of the recorded native format.
+ </li>
+ </ol>
+ <p>Both approaches are valid for different types of scenarios.</p>
+ <p>The first approach is the technique described in the current WebRTC specification for the "take a picture"
+ scenario.
+ </p>
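+ <p>A sketch of this first approach: a frame is copied from a <code>video</code> element (acting as the
+ media stream's preview sink) into a <code>canvas</code>, from which an encoded image can then be obtained.
+ </p>
+ <pre class="sh_javascript">
+function grabFrame(video) {
+  var canvas = document.createElement('canvas');
+  canvas.width = video.videoWidth;
+  canvas.height = video.videoHeight;
+  // Copy the current frame of the previewed stream onto the canvas.
+  canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);
+  // Encode the frame (quality and format options are shown in the Examples below).
+  return canvas.toDataURL('image/png');
+}</pre>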
+ <p>The second approach is somewhat problematic from a time-sensitivity/performance perspective given that the
+ recorded content is only provided via a <code>Blob</code> today. A more natural fit for post-processing scenarios
+ that are time- or performance-sensitive is to supply a <code>Stream</code> as output from a recorder.
+ Thus time- or performance-sensitive post-processing applications can immediately start processing the [unfinished]
+ recording, and non-sensitive applications can use the Stream API's <code>StreamReader</code> to eventually pack
+ the full <code>Stream</code> into a <code>Blob</code>.
+ </p>
+ </section>
+ <section>
+ <h4>Examples</h4>
+ <ul>
+ <li>Image quality manipulation. If you copy the image data to a canvas element, you can then get a data URI or
+ blob for which you can specify the desired encoding and quality, e.g.
+ <pre class="sh_javascript">
+canvas.toDataURL('image/jpeg', 0.6);
+// or
+canvas.toBlob(function(blob) {}, 'image/jpeg', 0.2);</pre>
+ </li>
+ <li>Image rotation. If you copy the image data to a canvas element and then obtain its 2D context, you can
+ call rotate() on that context object to rotate the drawn 'image' (see the sketch after this list). You can then obtain the manipulated image
+ back via toDataURL or toBlob as above if you want to generate a file-like object to pass around as
+ required.
+ </li>
+ <li>Image scaling. Thumbnails or web image formatting can be produced by scaling down the captured image to a common
+ width/height and reducing the output quality.
+ </li>
+ <li>Speech-to-text. Post-processing on a recorded audio format can be done to perform client-side speech
+ recognition and conversion to text. Note that speech recognition is generally done on the server for
+ time-sensitivity or performance reasons.
+ </li>
+ </ul>
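+ <p>A sketch of the image rotation example above, assuming the frame has been drawn from a
+ <code>video</code> element sink as in the other examples.
+ </p>
+ <pre class="sh_javascript">
+function rotateFrame(video) {
+  var canvas = document.createElement('canvas');
+  // Swap dimensions for a 90-degree rotation.
+  canvas.width = video.videoHeight;
+  canvas.height = video.videoWidth;
+  var ctx = canvas.getContext('2d');
+  // Rotate about the canvas center, then draw the frame centered on that point.
+  ctx.translate(canvas.width / 2, canvas.height / 2);
+  ctx.rotate(Math.PI / 2);
+  ctx.drawImage(video, -video.videoWidth / 2, -video.videoHeight / 2);
+  return canvas.toDataURL('image/jpeg', 0.8);
+}</pre>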
+ </section>
+ <p>This task force should evaluate whether some extremely common post-processing scenarios should be included as
+ pre-processing features.
+ </p>
+ </section>
+
+ <section>
+ <h3>Device Selection</h3>
+ <p>A particular user agent may have zero or more devices that provide the capability of audio or video capture. In
+ consumer scenarios, this is typically a webcam with a microphone (which may or may not be combined), and a "line-in"
+ and/or microphone audio jack. Enthusiast users (e.g., recording enthusiasts) may have many more available
+ devices.
+ </p>
+ <p>Device selection in this section is not about the selection of audio vs. video capabilities, but about selection
+ of multiple devices within a given "audio" or "video" category (i.e., "kind"). The terms "device" and "available
+ devices" used in this section refer to one device or a collection of devices of a kind (i.e., that provide a common
+ capability, such as a set of devices that all provide "video").
+ </p>
+ <p>Providing a mechanism for code to reliably enumerate the set of available devices enables programmatic control
+ over device selection. Device selection is important in a number of scenarios. For example, the user initially selected the
+ wrong camera and wants to change the media stream over to another camera. In another example, the
+ developer wants to select the device with the highest resolution for recording.
+ </p>
+ <p>Depending on how stream initialization is managed in the consent user experience, device selection may or may not
+ be a part of the UX. If not, then it becomes even more important to be able to change device selection after media
+ stream initialization. The requirements of the user-consent experience will likely be out of scope for this task force.
+ </p>
+ <section>
+ <h4>Privacy</h4>
+ <ul>
+ <li>As mentioned in the "Stream initialization" section, exposing the set of available devices before media stream
+ consent is given leads to privacy issues. Therefore, the device selection API should only be available after consent.
+ </li>
+ <li>Device selection should not be available for the set of devices within a given category/kind (e.g., "audio"
+ devices) for which user consent was not granted.
+ </li>
+ </ul>
+ </section>
+ <p>A selected device should provide some state information that identifies itself as "selected" (so that the set of
+ current device(s) in use can be programmatically determined). This is important because some relevant device information
+ cannot be surfaced via an API, and correct device selection can only be made by selecting a device, connecting a sink,
+ and providing the user a method for changing the device. For example, with multiple USB-attached webcams, there's no
+ reliable mechanism to describe how each device is oriented (front/back/left/right) with respect to the user.
+ </p>
+ <p>Device selection should be a mechanism for exposing device capabilities, which in turn inform the developer of which device to
+ select. In order for the developer to make an informed decision about which device to select, the developer's code would
+ need to make some sort of comparison between devices—such a comparison should be based on device capabilities rather
+ than a guess, hint, or special identifier (see the related issue below).
+ </p>
+ <p>Recording capabilities are an important decision-making point for media capture scenarios. However, recording capabilities
+ are not directly correlated with individual devices, and as such should not be mixed with the device capabilities. For
+ example, the capability of recording audio in AAC vs. MP3 is not correlated with a given audio device, and therefore not a
+ decision making factor for device selection.
+ </p>
+ <p>The current WebRTC spec does not provide an API for discovering the available devices nor a mechanism for selection.
+ </p>
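+ <p>Purely for illustration, the following sketch shows one hypothetical shape such an API could take,
+ consistent with the capability-based selection argued for above. The <code>getDeviceList()</code> and
+ <code>selectDevice()</code> methods and the <code>capabilities</code> attributes shown here do not exist in
+ any current draft; they are placeholders for whatever this task force defines.
+ </p>
+ <pre class="sh_javascript">
+// Hypothetical: enumerate the consented-to video devices and pick the one
+// with the largest advertised capture resolution.
+var best = null;
+stream.getDeviceList('video').forEach(function (device) {
+  if (!best || device.capabilities.maxWidth * device.capabilities.maxHeight >
+               best.capabilities.maxWidth * best.capabilities.maxHeight) {
+    best = device;
+  }
+});
+stream.selectDevice(best); // hypothetical selection call</pre>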
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>The specification should provide guidance on what set of devices are to be made available—should it be the set of
+ potential devices, or the set of "currently available" devices (which I recommend, since non-available devices can't
+ be utilized by the developer's code, and thus it doesn't make much sense to include them).
+ </li>
+ <li>A device selection API should expose devices by capability rather than by identity. Relying on device identity is a poor practice
+ because it leads to device-dependent testing code (for example, if "Name Brand Device", then…) similar to the problems that
+ exist today on the web as a result of user-agent detection. A better model is to enable selection based on capabilities.
+ Additionally, knowing the GUID or hardware name is not helpful to web developers as part of a scenario other than device
+ identification (perhaps for purposes of providing device-specific help/troubleshooting, for example).
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Change user-selected device capabilities</h3>
+ <p>In addition to selecting a device based on its capabilities, individual media capture devices may support multiple modes of
+ operation. For example, a webcam often supports a variety of resolutions which may be suitable for various scenarios (previewing
+ or recording a sample whose destination is a web server over a slow network connection, recording archival HD video for storing
+ locally). An audio device may have a gain control, allowing a developer to build a UI for an audio blender (varying the gain on
+ multiple audio source devices until the desired blend is achieved).
+ </p>
+ <p>A media capture API should support a mechanism to configure a particular device dynamically to suit the expected scenario.
+ Changes to the device should be reflected in the related media stream(s) themselves.
+ </p>
+ <p>Changes to device capabilities should be made in such a way that the changes are virtualized to the window that is
+ consuming the API (see the definition of "virtualized device"). For example, if two applications are using a device, changes to the
+ device's configuration in one window should not affect the other window.
+ </p>
+ <p>Changes to a device capability should be made in the form of requests (async operations rather than synchronous commands).
+ Change requests allow a device time to make the necessary internal changes, which may take a relatively long time, without
+ blocking other script. Additionally, script code can be written to change device characteristics without careful error-detection
+ (because devices without the ability to change the given characteristic would not need to throw an exception synchronously).
+ Finally, a request model makes sense even in RTC scenarios; if one party in a teleconference wants to request that
+ another party mute their device (for example), the device change request can be propagated over the <code>PeerConnection</code>
+ to the sender asynchronously.
+ </p>
+ <p>In parallel, changes to a device's configuration should provide a notification when the change is made. This allows web
+ developer code to monitor the status of a media stream's devices and report statistics and state information without polling the
+ device (especially when the monitoring code is separate from the author's device-control code). This is also essential when the
+ change requests are asynchronous, to allow the developer to know at which point the requested change has been made in the media
+ stream (in order to perform synchronization, or start/stop a recording, for example).
+ </p>
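+ <p>Again purely as an illustration of the request-plus-notification pattern described above: the
+ <code>requestCapability</code> and <code>oncapabilitychange</code> names below are hypothetical
+ placeholders, not part of any current draft; <code>videoTrack</code> stands for the track object
+ associated with the video device.
+ </p>
+ <pre class="sh_javascript">
+// Hypothetical: ask the video track's device to switch to a lower resolution.
+videoTrack.requestCapability({ width: 320, height: 240 });
+
+// Hypothetical: observe when the change has actually taken effect in the stream,
+// e.g., to start a recording only once the new resolution is live.
+videoTrack.oncapabilitychange = function () {
+  startRecordingIfReady(); // application code
+};</pre>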
+ <p>The current WebRTC spec only provides the "enabled" (on/off) capability for devices (where a device may be equated to a particular
+ track object).
+ </p>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>If changing a particular device capability cannot be virtualized, this media capture task force should consider whether that
+ dynamic capability should be exposed to the web platform, and if so, what the usage policy around multiple access to that
+ capability should be.
+ </li>
+ <li>The specifics of what happens to a recording-in-progress when device behavior is changed must be described in the spec.
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Multiple active devices</h3>
+ <p>In some scenarios, users may want to initiate capture from multiple devices at one time in multiple media streams. For example,
+ in a home-security monitoring scenario, an application may want to capture 10 unique video streams representing various locations being
+ monitored. The user may want to capture all 10 of these videos into one recording, or record all 10 individually (or some
+ combination thereof).
+ </p>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>Given that device selection should be restricted to only the "kind" of devices for which the user has granted consent, detection
+ of multiple capture devices could only be done after a media stream was obtained. An API would therefore want to have a way of
+ exposing the set of <i>all devices</i> available for use. That API could facilitate either switching to a given device in the
+ current media stream, or some mechanism for creating a new media stream by activating a set of devices. By associating a track
+ object with a device, the latter can be accomplished via <code>new MediaStream(tracks)</code>, providing the desired tracks/devices used
+ to create the new media stream (see the sketch after this list). The constructor algorithm is modified to activate a track/device that is not "enabled".
+ </li>
+ <li>For many user agents (including mobile devices) preview of more than one media stream at a time can lead to performance problems.
+ In many user agents, recording of more than one media stream can also lead to performance problems (dedicated encoding hardware
+ generally supports the media stream recording scenario, and the hardware can only handle one stream at a time). Especially for
+ recordings, an API should be designed such that it is not easy to accidentally start multiple recordings at once.
+ </li>
+ </ul>
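+ <p>A sketch of the constructor-based approach mentioned in the first issue above. It assumes the
+ <code>new MediaStream(tracks)</code> constructor and a per-stream track list; the exact shape of that list
+ (and how tracks map to devices) is still under discussion.
+ </p>
+ <pre class="sh_javascript">
+// Assumes a track list on the stream (exact shape under discussion) whose entries
+// correspond to devices; the constructor activates tracks that are not yet "enabled".
+var tracks = existingStream.tracks;                    // assumed track list
+var singleDeviceStream = new MediaStream([tracks[0]]); // new stream using just one track/device</pre>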
+ </section>
+ </section>
+
+ <section>
+ <h3>Recording a media stream</h3>
+ <p>In its most basic form, recording a media stream is simply the process of converting the media stream into a known format. There's
+ also an expectation that the recording will end within a reasonable time-frame (since local buffer space is not unlimited).
+ </p>
+ <p>Local media stream recordings are common in a variety of sharing scenarios such as:
+ </p>
+ <ul>
+ <li>record a video and upload to a video sharing site</li>
+ <li>record a picture for my user profile in a given web app</li>
+ <li>record audio for a translation site</li>
+ <li>record a video chat/conference</li>
+ </ul>
+ <p>There are other offline scenarios that are equally compelling, such as usage in native-camera-style apps, or web-based recording
+ studios (where tracks are recorded and later mixed).
+ </p>
+ <p>The core functionality that supports most recording scenarios is a simple start/stop recording pair.
+ </p>
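+ <p>A minimal start/stop sketch of this core functionality. It assumes the <code>record()</code> method
+ from the current WebRTC draft; the recorder object and its <code>getRecordedData()</code> callback are
+ shown as currently drafted and may well change (for example, to address the format-selection and
+ <code>Stream</code>-output points discussed elsewhere in this document).
+ </p>
+ <pre class="sh_javascript">
+// Start recording the stream into a UA-chosen format.
+var recorder = stream.record();
+
+// Later, "stop" by asking for the data recorded so far and discarding the recorder.
+function stopRecording() {
+  recorder.getRecordedData(function (blob) {
+    // 'blob' holds the recording in a UA-determined MIME type.
+    uploadRecording(blob); // application code: upload or post-process the result
+  });
+  recorder = null;
+}</pre>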
+ <p>Ongoing recordings should report progress to enable developers to build UIs that pass this progress notification along to users.
+ </p>
+ <p>A recording API should be designed to gracefully handle changes to the media stream, and should also report (and perhaps even
+ attempt to recover from) failures at the media stream source during recording.
+ </p>
+ <p>Uses of the recorded information are covered in the post-processing scenarios described previously. An additional consideration is the
+ possibility of default save locations. For example, by default a UA may store temporary recordings (those recordings that are
+ in-progress) in a temp (hidden) folder. It may be desirable to be able to specify (or hint at) an alternate default recording
+ location, such as the user's common file location for videos or pictures.
+ </p>
+ <section>
+ <h4>DVR Scenarios</h4>
+ <p>Increasingly in the digital age, the ability to pause, rewind, and "go live" for streamed content is an expected scenario.
+ While this scenario applies mostly to real-time communication scenarios (and not to local capture scenarios), it is worth
+ mentioning for completeness.
+ </p>
+ <p>The ability to quickly "rewind" can be useful, especially in video conference scenarios, when you may want to quickly go
+ back and hear something you just missed. In these scenarios, you either started a recording from the beginning of the conference
+ and you want to seek back to a specific time, or you were only streaming it (not saving it) but you allowed yourself some amount
+ of buffer in order to review the last X minutes of video.
+ </p>
+ <p>To support these scenarios, buffers must be introduced (because the media stream is not implicitly buffered for this scenario).
+ In the pre-recorded case, a full recording is in progress, and as long as the UA can access previous parts of the recording
+ (without terminating the recording), this scenario is possible.
+ </p>
+ <p>In the streaming case, the only way to support this scenario is to add a [configurable] buffer directly into the media stream
+ itself. Given the complexities of this approach and the relatively limited scenarios, adding a buffer capability to a media stream
+ object is not recommended.
+ </p>
+ <p>Note that most streaming scenarios (where DVR is supported) are made possible exclusively on the server to avoid accumulating
+ large amounts of data (i.e., the buffer) on the client. Content protection also tends to require this limitation.
+ </p>
+ </section>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>There are few (if any) scenarios that require support for overlapping recordings of a single media stream. Note that the
+ current <code>record</code> API supports overlapping recordings by simply calling <code>record()</code> twice. In the case of
+ separate media streams (see the previous section) overlapping recording makes sense. In either case, initiating multiple recordings
+ should not be so easy as to be accidental.
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Selection of recording method</h3>
+ <p>All post-processing scenarios for recorded data require a known [standard] format. It is therefore crucial that the media capture
+ API provide a mechanism to specify the recording format. It is also important to be able to discover if a given format is supported.
+ </p>
+ <p>Most scenarios in which the recorded data is sent to the server for upload also have restrictions on the type of data that the server
+ expects (one size doesn't fit all).
+ </p>
+ <p>It should not be possible to change the recording format on-the-fly without consequences (i.e., a stop and/or re-start or failure). It is
+ recommended that the mechanism for specifying a recording format not make it too easy to change the format (e.g., setting the format
+ as a property may not be the best design).
+ </p>
+ <section>
+ <h4>Format detection</h4>
+ <ul>
+ <li>If we wish to re-use existing web platform concepts for format capability detection, the HTML5 <code>HTMLMediaElement</code>
+ supports an API called <code>canPlayType</code> which allows developers to probe the given UA for support of specific MIME types that
+ can be played by <code>audio</code> and <code>video</code> elements. A recording format checker could use this same approach (see the sketch after this list).
+ </li>
+ </ul>
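+ <p>For reference, the playback-side check looks like this today; a recording-side analog could mirror it.
+ The <code>canRecordType</code> name below is purely hypothetical.
+ </p>
+ <pre class="sh_javascript">
+// Existing HTML5 playback capability check: returns "", "maybe", or "probably".
+var v = document.createElement('video');
+var playable = v.canPlayType('video/webm; codecs="vp8, vorbis"');
+
+// Hypothetical recording-side analog, mirroring canPlayType:
+// var recordable = recorder.canRecordType('video/webm; codecs="vp8, vorbis"');</pre>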
+ </section>
+ </section>
+
+ <section>
+ <h3>Programmatic activation of camera app</h3>
+ <p>As mentioned in the introduction, declarative use of a capture device is out-of-scope. However, there are some potentially interesting
+ uses of a hybrid programmatic/declarative model, where the configuration of a particular media stream is done exclusively via the user
+ (as provided by some UA-specific settings UX), but the fine-grained control over the stream as well as the recording of the stream is
+ handled programmatically.
+ </p>
+ <p>In particular, if the developer doesn't want to guess the user's preferred settings, or if there are specific settings that may not be
+ available via the media capture API standard, such settings could be exposed in this manner.
+ </p>
+ </section>
+
+ <section>
+ <h3>Take a picture</h3>
+ <p>A common usage scenario of local device capture is to simply "take a picture". The hardware and optics of many camera devices often
+ support video in addition to photos, but can be set into a specific "photo mode" where the possible recording resolutions are
+ significantly larger than their maximum video resolution.
+ </p>
+ <p>The advantage of having a photo mode is being able to capture these very high-resolution images (versus the post-processing scenarios
+ that are possible with still-frames from a video source).
+ </p>
+ <p>Recording a picture is strongly tied to the "video" capability because a video preview is often an important component to setting up
+ the scene and getting the right shot.
+ </p>
+ <p>Because photo capabilities are somewhat different from regular video capabilities, devices that support a specific "photo"
+ mode should likely provide their "photo" capabilities separately from their "video" capabilities.
+ </p>
+ <p>Many of the considerations that apply to recording also apply to taking a picture.
+ </p>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>What are the implications of the device mode switch on video recordings that are in progress? Will there be a pause? Can this
+ problem be avoided?
+ </li>
+ <li>Should a "photo mode" be a type of user media that can be requested via <code>getUserMedia</code>?
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Picture tracks</h3>
+ <p>Another common scenario for media streams is to share photos via a video stream. For example, a user may want to select a photo and
+ attach the photo to an active media stream in order to share that photo via the stream. In another example, the photo can be used as a
+ type of "video mute" where the photo can be sent in place of the active video stream when a video track is "disabled".
+ </p>
+ <section>
+ <h4>Issues</h4>
+ <ul>
+ <li>It may be desirable to specify a photo/static image as a track type in order to allow it to be toggled on/off with a video track.
+ On the other hand, the sharing scenario could be fulfilled by simply providing an API to supply a photo for the video track "mute"
+ option (assuming that there's not a scenario that involves creating a parallel media stream that has both the photo track and the current
+ live video track active at once; such a use case could be satisfied by using two media streams instead).
+ </li>
+ </ul>
+ </section>
+ </section>
+
+ <section>
+ <h3>Caption Tracks</h3>
+ <p>The HTML5 <code>HTMLMediaElement</code> now has the ability to display captions and other "text tracks". While not directly applicable to
+ local media stream scenarios (caption support is generally done out-of-band from the original capture), it could be something worth adding in
+ order to integrate with HTML5 videos when the source is a PeerConnection where real-time captioning is being performed and needs to be displayed.
+ </p>
+ </section>
+
+ </section>
+ </body>
+</html>