What is EME?

It was suggested at the Mozilla Summit that there isn’t good information around about what Encrypted Media Extensions (EME) actually is. Since I’m on the HTML working group and have been reading the email threads about EME there, I thought that I could provide an introduction that explains things that may not be apparent from the specification itself.

TL;DR

EME is a JavaScript API that is part of a larger system for playing DRMed content in HTML <video>/<audio>. EME doesn’t define the whole system. It specifies only the JS API, which in turn implies some things about the overall system. A DRM component called a Content Decryption Module (CDM) decrypts, likely decodes and perhaps also displays the video. A JavaScript program coordinating the process uses the EME API to pass messages between the CDM and a server that provides decryption keys. EME assumes the existence of one or more CDMs on the client system but it doesn’t define any or even their exact nature (e.g. software vs. hardware). That is, the interesting part is left undefined.

Context

Major Hollywood studios require that companies that license movies from them for streaming use DRM between the streaming company and the end user. Traditionally, in the Web context, this has been done by using the Microsoft PlayReady DRM component inside the Silverlight plug-in or the Adobe Access DRM component inside Flash Player. As the HTML/CSS/JS platform gains more and more capabilities, the general need to use Silverlight or Flash becomes smaller and smaller, such that soon the video DRM capability will be the only thing that Silverlight and Flash have but the HTML/CSS/JS platform doesn’t.

Proposals have been written to augment <video> with features that enable the Netflix player to be ported from Silverlight to <video> without a loss of features. The additions are split across two specifications: Media Source Extensions (MSE) and Encrypted Media Extensions (EME). The noncontroversial parts (giving JS precise control over media-related networking) are in MSE and the controversial parts (DRM interface) are in EME. I will not cover MSE further.

So What’s EME?

EME is a JavaScript API for the HTML <video> and <audio> elements for dealing with media files that contain encrypted tracks.

EME requires the presence of one or more components called Content Decryption Modules (CDM) which are integrated in some way with the browser. For the purpose of this introduction, the CDM is not considered to be part of the browser. The browser (which, as noted, excludes the CDM) is considered untrusted by copyright holders who require DRM to be used. (The browser is assumed to be trusted by the user as before.) The CDM is trusted by the copyright holders to hide certain pieces of data from the user (and to prevent the user from manipulating that data).

A CDM could be bundled with the browser, downloaded separately, bundled with the operating system, embedded in hardware as firmware running in a second domain of computing (such as ARM TrustZone) or wired into hardware. EME leaves this aspect implementation-dependent.

A CDM implements what is colloquially referred to as a DRM scheme but what EME calls a Key System. At minimum, a CDM implements a Key System-specific format for messages (byte buffers from the EME point of view) to request and receive keys, plus the capability to decrypt content with the keys acquired via these messages. The inputs of a CDM are Key System-specific initialization data, Key System-specific messages and encrypted media stream data.

EME specifies a toy Key System called Clear Key, which could be used to demonstrate interoperability of two EME implementations to the point of satisfying the requirements of the W3C Process. So far, there has been no indication that anyone would be interested in deploying Clear Key for non-test purposes.
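To make Clear Key a bit more concrete, here is a minimal sketch of building a Clear Key license. It assumes the JSON format from the version of EME that was later published as a W3C Recommendation (a JSON Web Key Set with unpadded base64url-encoded key IDs and keys), which may differ from the draft discussed here. The function names toBase64Url and buildClearKeyLicense are made up for illustration.

```javascript
// Encode bytes as unpadded base64url, the encoding assumed here for
// Clear Key key IDs and keys.
function toBase64Url(bytes) {
  const binary = String.fromCharCode(...bytes);
  return btoa(binary)
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '');
}

// Build a Clear Key license (a JWK Set serialized as JSON) from an array of
// { kid, key } pairs, each given as a Uint8Array.
function buildClearKeyLicense(entries) {
  return JSON.stringify({
    keys: entries.map(({ kid, key }) => ({
      kty: 'oct', // symmetric key
      kid: toBase64Url(kid),
      k: toBase64Url(key),
    })),
    type: 'temporary',
  });
}
```

Note that the keys travel in the clear, which is the whole reason Clear Key is only a toy as far as copyright holders are concerned.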

EME does not specify the sort of Key System that one could expect to be deployed for the purpose of streaming Hollywood movies. The non-toy Key System supported by IE 11 on Windows 8.1 is PlayReady (proprietary to Microsoft and bundled with Windows 8.1) and the non-toy Key System supported by Chrome on Chrome OS is Widevine (proprietary to Google and bundled with Chrome OS). Therefore, a Web site that wishes to be cross-browser-compatible needs to support multiple Key Systems.
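Supporting multiple Key Systems on the client could look roughly like the sketch below, which tries candidates in preference order. It assumes an API shaped like navigator.requestMediaKeySystemAccess from the later standardized version of EME (a promise that rejects when the Key System is unsupported). The helper name pickKeySystem is hypothetical, and the access function is passed in as a parameter so that the sketch does not depend on a browser.

```javascript
// Try each Key System in preference order and return the first one accepted.
// `requestAccess` is assumed to behave like
// navigator.requestMediaKeySystemAccess(keySystem, configs): it returns a
// promise that rejects when the Key System or configuration is unsupported.
async function pickKeySystem(candidates, configs, requestAccess) {
  for (const keySystem of candidates) {
    try {
      const access = await requestAccess(keySystem, configs);
      return { keySystem, access };
    } catch (err) {
      // Not supported in this browser; fall through to the next candidate.
    }
  }
  return null; // None of the candidate Key Systems is available.
}

// Possible use in a browser (not executed here):
//
// const configs = [{
//   initDataTypes: ['cenc'],
//   videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
// }];
// pickKeySystem(
//   ['com.widevine.alpha', 'com.microsoft.playready', 'org.w3.clearkey'],
//   configs,
//   (keySystem, c) => navigator.requestMediaKeySystemAccess(keySystem, c)
// );
```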

EME does not specify the output abstraction for CDMs. It leaves open several options. The CDM could:

Merely perform decryption and hand back the encoded media (e.g. H.264) to the browser.
Perform both decryption and video decoding and hand back the raw frames to the browser for painting.
Perform decryption and decoding and transfer decoded pixels directly to an operating system compositor in a way that bypasses the browser.
Perform decryption and decoding and then work together with the GPU so that not even the operating system gets the opportunity to read the pixels back from the GPU.

The more the CDM does to conceal the decryption keys, the elementary stream data or the decoded data from software that the user can control, the more likely the CDM is to be approved by the copyright holders for use with content that they hold copyright to. Also, the requirements placed by the copyright holders on CDMs permitted to play HD content may be stricter than the requirements placed on CDMs permitted to play SD content.

If you compare this system to NPAPI plug-ins, the EME JavaScript API is analogous to the <object> element. However, in the EME case there is no standardized analog to NPAPI, since as noted above, even the level of output abstraction isn’t specified.

The media that requires a CDM to play comes in one of the usual container formats. Since the W3C has avoided specifying mandatory formats for <video>, EME doesn’t normatively require support for a specific container format. The EME specification contains guidance for the MP4 (typically used with H.264) and WebM (typically used with VP8) containers. EME does not have normative requirements on whether encryption happens inside or outside the container, but in practice, encryption happens inside the container. Compared to ordinary use of MP4 or WebM, one or more of the elementary streams (“tracks”) inside the container format is encrypted with a key that is not included in the media file. The EME specification does not require a particular encryption scheme, and there are multiple possible ways to have encrypted tracks in MP4. However, when a scheme called Common Encryption is used, one MP4 file can be used with multiple Key Systems.
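To illustrate how one file can serve several Key Systems: with Common Encryption, the MP4 file can carry one Protection System Specific Header ('pssh') box per Key System, each tagged with a 16-byte SystemID. The following is a minimal sketch that lists the SystemIDs found in a buffer of concatenated boxes. The function name listPsshSystemIds is made up, and the sketch assumes plain 32-bit box sizes (it stops at anything else rather than handling 64-bit sizes).

```javascript
// List the SystemIDs (as lowercase hex strings) of all 'pssh' boxes in a
// Uint8Array of concatenated ISO BMFF boxes.
function listPsshSystemIds(initData) {
  const view = new DataView(initData.buffer, initData.byteOffset, initData.byteLength);
  const ids = [];
  let offset = 0;
  while (offset + 8 <= initData.byteLength) {
    const size = view.getUint32(offset); // big-endian box size
    const type = String.fromCharCode(
      initData[offset + 4], initData[offset + 5],
      initData[offset + 6], initData[offset + 7]
    );
    // Stop on truncated boxes and on the special sizes 0 and 1.
    if (size < 8 || offset + size > initData.byteLength) break;
    // Layout: size(4) type(4) version+flags(4) SystemID(16) ... DataSize(4)
    if (type === 'pssh' && size >= 32) {
      const idBytes = initData.subarray(offset + 12, offset + 28);
      ids.push(Array.from(idBytes, (b) => b.toString(16).padStart(2, '0')).join(''));
    }
    offset += size;
  }
  return ids;
}
```

A JavaScript program could compare the returned IDs against the Key Systems it knows how to handle, although the API does not require it to look inside the initialization data at all.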

When a browser that supports EME encounters an encrypted track, it fires an event to a JavaScript program running in the context of the page that contains the <video> element to indicate that the JavaScript program needs to supply a key in order to decrypt the track. The event comes with initialization data that the browser has extracted from the media file. At this point, the JavaScript program creates a session using a Key System that’s available and pushes the initialization data to the CDM that implements the Key System. If the media file works with multiple Key Systems and there are multiple applicable Key System implementations (multiple CDMs or a multi-Key System CDM) available to the browser, the JavaScript program gets to choose which Key System to use.
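The step just described can be sketched as follows. The method and property names (createSession, generateRequest, initDataType, initData, the 'encrypted' and 'message' events) are taken from the later standardized version of EME and may not match the draft described here. The helper name startSession is hypothetical, and the objects are passed in as parameters so the sketch runs outside a browser too.

```javascript
// React to one "encrypted track" notification: create a session on the
// chosen Key System and push the initialization data to its CDM.
// `mediaKeys` is assumed to behave like a MediaKeys object and `event` like
// the event fired on the media element for an encrypted track.
function startSession(mediaKeys, event, onMessage) {
  const session = mediaKeys.createSession();
  // The CDM answers later via 'message' events that carry opaque bytes.
  session.addEventListener('message', onMessage);
  // generateRequest() is assumed to return a promise, as in the standardized API.
  return session
    .generateRequest(event.initDataType, event.initData)
    .then(() => session);
}

// Possible use in a browser (not executed here), assuming `video` already
// has the MediaKeys object attached via video.setMediaKeys(mediaKeys):
//
// video.addEventListener('encrypted', (event) => {
//   startSession(mediaKeys, event, handleKeyMessage);
// });
```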

Subsequently, events fired to JavaScript include byte buffers that contain messages in a format specific to the Key System that’s in use for the session. Neither the browser nor the JavaScript program understands the bytes. However, the bytes are expected to form a message that includes an identifier for a key that the CDM needs to decrypt content and likely some cryptographic evidence of the provenance of the CDM to demonstrate to the server that the CDM was provided by an entity that the content copyright holders trust to provide tamper-resistant CDMs. From the perspective of EME, these messages are simply opaque.

The JavaScript program then uses whatever method the Web application developer has chosen to pass the Key System-specific message to the server. Most likely, the JavaScript program does an HTTP POST using XHR. At this point, XHR automatically picks up the session cookie of the user, so user identification can work as in Web apps in general and the CDM does not need to be able to deal with handling user identity. This is in contrast to other plausible architectures where a component analogous to a CDM performs networking directly.
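As a rough sketch of that POST: the code below uses fetch instead of XHR (a substitution on my part, not something EME prescribes), with credentials enabled so that the user's session cookie goes along. The function name postKeyMessage and the URL in the usage comment are hypothetical.

```javascript
// POST an opaque Key System message to the license server and resolve with
// the opaque response bytes. `fetchImpl` defaults to the global fetch and
// can be replaced, e.g. for testing.
async function postKeyMessage(url, message, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: message,
    credentials: 'include', // send the session cookie, also cross-origin
  });
  if (!response.ok) {
    throw new Error(`License server responded with HTTP ${response.status}`);
  }
  return response.arrayBuffer();
}

// Possible use (hypothetical URL, not executed here):
// postKeyMessage('https://example.com/license', event.message);
```

The JavaScript program never inspects the bytes in either direction. It only moves them.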

The server side of the Web app needs to have Key System-specific software for each Key System that the Web app supports in order to be able to make sense of the messages received and in order to be able to construct responses.

When the JavaScript program receives a message back from the server, it uses the EME API to push the message to the browser, which in turn pushes it to the CDM. The message is assumed to be encrypted in such a way that neither the JavaScript program nor the browser can make sense of it but the CDM can. The message contains at least the key that the CDM needs to decrypt the encrypted media track(s).
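Put together, the relay role of the JavaScript program can be sketched like this. It assumes a session object shaped like MediaKeySession in the later standardized API ('message' events with a message property, and an update() method). The names relayKeyMessages and sendToServer are made up for illustration.

```javascript
// Relay every message the CDM emits to the server and feed each reply back.
// `session` is assumed to behave like a MediaKeySession. `sendToServer`
// takes the opaque request bytes and returns a promise for the opaque
// response bytes.
function relayKeyMessages(session, sendToServer, onError = console.error) {
  session.addEventListener('message', (event) => {
    sendToServer(event.message)
      .then((reply) => session.update(reply)) // browser passes it on to the CDM
      .catch(onError);
  });
}
```

Because the listener stays registered for the lifetime of the session, the same code handles however many round trips the Key System happens to need.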

EME doesn’t define how many messages are emitted by the CDM and how many by the server. There might be any number, including new ones during playback, depending on the Key System.

EME does not specify any policy about the conditions under which the CDM may be allowed to decrypt media or a vocabulary for expressing such a policy. EME doesn’t even specify whether a policy is enforced on the server based on information that got sent there or whether a policy is enforced on the CDM based on information received back. That’s all Key System-specific.