To avoid having to deal with escapes (other than for <, >, &, and "), to avoid data loss in form submission, to avoid XSS when serving user-provided content, and to comply with the HTML Standard, always encode your HTML as UTF-8. Furthermore, in order to let browsers know that the document is UTF-8-encoded, always label it as such. To label your document, you need to do at least one of the following:
Put <meta charset="utf-8"> as the first thing after the <head> start tag (i.e. as the first child of the head element). The meta tag, including its ending > character, needs to be within the first 1024 bytes of the file. Putting it right after the <head> start tag is the easiest way to get this right. Do not put comments before the <head> start tag or between it and <meta charset="utf-8">, to avoid accidentally pushing the latter past the first 1024 bytes.
Configure your server to send the header Content-Type: text/html; charset=utf-8 on the HTTP layer.
Start the document with the UTF-8 BOM, i.e. the bytes 0xEF, 0xBB, and 0xBF.
Doing more than one of these is OK.
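The 1024-byte rule for the meta tag is easy to check mechanically. The following is a minimal sketch, not the HTML Standard's actual prescan algorithm: it uses a simplified regular expression, and the function name meta_charset_within_limit is made up for illustration.

```python
import re

PRESCAN_LIMIT = 1024  # bytes; the limit described in the text above

# Simplified pattern (an assumption, not the real browser prescan):
# a <meta ...> tag that mentions charset=utf-8, including its closing ">".
_META_UTF8 = re.compile(
    rb"<meta\b[^>]*\bcharset\s*=\s*[\"']?utf-8[\"']?[^>]*>",
    re.IGNORECASE,
)


def meta_charset_within_limit(document: bytes) -> bool:
    """Return True if a UTF-8 meta label, including its ">", fits in the first 1024 bytes."""
    # Only the prefix is searched, so a tag cut off at the limit does not match.
    return _META_UTF8.search(document[:PRESCAN_LIMIT]) is not None
```

A check like this could run in a build step or a test suite to catch the case where comments or other markup creep in ahead of the label.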
The above says the important bit. Here are answers to further questions:
Because HTML didn’t support UTF-8 in the very beginning and legacy content can’t be expected to opt out, you need to opt into UTF-8 just like you need to opt into the standards mode (via <!DOCTYPE html>) and into mobile-friendly layout (via <meta name="viewport" content="width=device-width, initial-scale=1">).
<meta charset="utf-8"> has the benefit of keeping the label within your document even if you move it around. The main risk is that someone forgets that it needs to be within the first 1024 bytes and puts comments, Facebook metadata, rel=preloads, stylesheets or scripts before it. Always put that other stuff after it.
The HTTP header has the benefit that if you are setting up a new server that doesn’t have any old non-UTF-8 documents on it, you can configure the header once, and it works for all HTML documents on the server thereafter.
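How to configure the header once depends on the server. As one illustration only, here is a sketch using Python's standard-library http.server, which is meant for development rather than production. The class name Utf8LabelingHandler, the serve function and the port are assumptions made for this example.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Content-Type values that carry the UTF-8 label for each file extension.
UTF8_TYPES = {
    ".html": "text/html; charset=utf-8",
    ".htm": "text/html; charset=utf-8",
}


class Utf8LabelingHandler(SimpleHTTPRequestHandler):
    """Static file handler (hypothetical name) that labels every HTML file as UTF-8."""

    # extensions_map is consulted when the handler picks the Content-Type header.
    extensions_map = {**SimpleHTTPRequestHandler.extensions_map, **UTF8_TYPES}


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Utf8LabelingHandler).serve_forever()
```

Calling serve() from a directory of UTF-8 HTML files sends them with the labeled header. Production servers have their own configuration mechanisms for the same setting.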
The BOM method has the problem that it’s too easy to edit the file in a text editor that removes the BOM and not notice that this has happened. However, if you are writing a serializer library and you are neither in control of the HTTP header nor can inject a tag without interfering with what your users are doing, you can make the serializer always start with the UTF-8 BOM and know that things will be OK.
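For the serializer case, the BOM approach comes down to prepending three bytes. A minimal sketch, with function names that are made up for illustration:

```python
import codecs


def serialize_with_bom(markup: str) -> bytes:
    """Encode markup as UTF-8 and prepend the UTF-8 BOM (bytes EF BB BF)."""
    return codecs.BOM_UTF8 + markup.encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data still starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)
```

A check like has_utf8_bom can also be run over saved files to notice when a text editor has silently removed the BOM.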
Don’t use UTF-16. If you serve user-provided content as UTF-16, it is possible to smuggle content that becomes executable when interpreted as other encodings. This is a cross-site scripting vulnerability if the user uses a browser that allows the user to manually override UTF-16 with another encoding.
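To see why reinterpretation is dangerous, consider how pairs of ASCII bytes look when read as UTF-16 code units. The sketch below illustrates the general mechanism under simple assumptions (little-endian UTF-16, an ASCII payload, a hypothetical helper name). It is not a description of any particular real-world attack.

```python
def disguise_as_utf16(ascii_markup: str) -> str:
    """Reinterpret ASCII bytes pairwise as UTF-16LE code units (hypothetical helper)."""
    raw = ascii_markup.encode("ascii")
    if len(raw) % 2:
        raw += b" "  # UTF-16 code units are two bytes, so pad to an even length
    return raw.decode("utf-16-le")


payload = "<script>alert(1)</script>"

# As UTF-16 text, this contains no "<" character at all; it looks like
# unrelated, mostly CJK characters that an HTML sanitizer would let through.
innocuous = disguise_as_utf16(payload)

# The bytes that would be served when the text is encoded as UTF-16LE...
served_bytes = innocuous.encode("utf-16-le")

# ...turn back into markup if they are decoded as an ASCII-compatible encoding.
reinterpreted = served_bytes.decode("windows-1252")
```

Here innocuous has no angle brackets when treated as UTF-16, yet reinterpreted begins with a script tag. Serving the content as UTF-8 avoids this class of problem.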
UTF-16 cannot be labeled via a <meta> tag.
The <meta charset="utf-8"> method is not available for plain text, but the other two are. In the case of plain text, the HTTP header is Content-Type: text/plain; charset=utf-8 instead.
XMLHttpRequest and Fetch post-date UTF-8, so they had a chance to introduce new rules. While the rules being inconsistent with navigating to HTML or plain text is not great, defaulting to UTF-8 is a simple rule that avoids issues related to reloading content in ways that would be consistent with navigation.
If you’ve labeled your HTML as UTF-8, you don’t need to label your UTF-8-encoded CSS files, since by default they inherit the encoding from the document that includes them. However, to make your CSS robust when referenced from non-UTF-8 HTML, you can use the UTF-8 BOM or the HTTP header, which is Content-Type: text/css; charset=utf-8 in the CSS case, or you can put @charset "utf-8"; as the very first thing in the CSS file.
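The two in-file CSS labels (the BOM and the @charset rule) can be produced and checked with a few lines of code. This is a rough sketch with made-up function names. It assumes the rule must appear as exact bytes at the very start of the file, and it does not cover every encoding label spelling a browser accepts.

```python
import codecs

_CHARSET_PREFIX = b'@charset "'


def label_css(css_text: str) -> bytes:
    """Encode a stylesheet as UTF-8 with @charset as the very first thing."""
    return ('@charset "utf-8";\n' + css_text).encode("utf-8")


def css_is_self_labeled_utf8(data: bytes) -> bool:
    """Return True if the file starts with a UTF-8 BOM or an @charset "utf-8"; rule."""
    if data.startswith(codecs.BOM_UTF8):
        return True
    if not data.startswith(_CHARSET_PREFIX):
        return False  # anything before the rule, even a comment, disables it
    end = data.find(b'";', len(_CHARSET_PREFIX))
    if end == -1:
        return False
    # Simplification: accept only the "utf-8" spelling, case-insensitively.
    return data[len(_CHARSET_PREFIX):end].lower() == b"utf-8"
```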
Unlabeled XML defaults to UTF-8, so you don’t need to label it.
JSON must be UTF-8 and is processed as UTF-8, so there’s no labeling.
WebVTT is always UTF-8, so there’s no labeling.