Question
My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries one of the following:
- data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code-points
- data in the lower 5 bits (0-4), with high bits 7-5 = 110, to indicate a 2-byte char
- data in the lower 4 bits (0-3), with high bits 7-4 = 1110, to indicate a 3-byte char
- data in the lower 3 bits (0-2), with high bits 7-3 = 11110, to indicate a 4-byte char
noting that in the multi-byte cases bit 7 is always set, which tells UTF-8 parsers that this is a multi-byte char.
This means that any Unicode code-point in the range 128-255 has to be encoded in 2 or more bytes, because the high bit that would be required to encode it in one byte is reserved in UTF-8 as the multi-byte indicator bit. So e.g. the character é (e-acute, Unicode code-point U+00E9, 233 decimal) is encoded in UTF-8 as the two-byte sequence 0xC3 0xA9.
The encoding table I was reading shows how the code-point U+00E9 maps to the UTF-8 bytes 0xC3 0xA9.
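As a quick sanity check, the lead-byte rules above can be sketched in JavaScript (a minimal sketch; utf8Bytes is a hypothetical helper, and for brevity it only handles code points up to U+FFFF):

```js
// Hypothetical helper implementing the lead-byte rules described above.
// Continuation bytes are always 10xxxxxx and carry 6 data bits each.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];                      // 0xxxxxxx: 7 data bits
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6),         // 110xxxxx: 5 data bits
            0x80 | (codePoint & 0x3F)];      // 10xxxxxx: 6 data bits
  }
  return [0xE0 | (codePoint >> 12),          // 1110xxxx: 4 data bits
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

utf8Bytes(0xE9).map(b => b.toString(16));    // ['c3', 'a9']
new TextEncoder().encode('\u00E9');          // Uint8Array [195, 169] = 0xC3 0xA9
```

The built-in TextEncoder agrees with the hand-rolled version, so the two-byte claim for é holds.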
However, this does not seem to be how it works in a web page. I have recently run into some contradictory behavior when rendering Unicode chars, and in my exploratory reading came across this:
- "UTF-8 is identical to both ANSI and 8859-1 for the values from 160 to 255." (w3schools)
which clearly contradicts the above.
And if I render these various values in jsfiddle, &#233; and &#xE9; both come out as é. So HTML is rendering the Unicode code-point as é, not the 2-byte UTF-8 encoding of that code-point. In fact HTML renders &#xC3A9; as the Hangul syllable that has the code-point 0xC3A9 (쎩).
W3schools has a table that explicitly defines the UTF-8 encoding of é as decimal 233 (0xE9).
So HTML is rendering code-points, not UTF-8 chars.
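This behavior is easy to reproduce in a browser console: numeric character references map straight to code points, not to UTF-8 bytes.

```js
String.fromCodePoint(0x00E9);  // 'é'  — what &#233; and &#xE9; render as
String.fromCodePoint(0xC3A9);  // '쎩' — what &#xC3A9; renders as, a Hangul syllable
```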
Am I missing something here? Can anyone explain why, in a supposedly UTF-8 HTML document, there seems to be no UTF-8 parsing going on at all?
Answer 1:
Your understanding of the encoding of UTF-8 bytes is correct.
Your jsfiddle example is using UTF-8 only as a byte encoding for the HTML file (hence the use of the <meta charset="UTF-8"> HTML tag), but not as an encoding of the HTML itself. HTML only uses ASCII characters for its markup, but that markup can represent Unicode characters.
UTF-8 is a byte encoding for Unicode codepoints. It is commonly used for transmission of Unicode data, such as an HTML file over HTTP. But HTML itself is defined in terms of Unicode codepoints only, not UTF-8 specifically. A web browser receives the raw UTF-8 bytes over the wire and decodes them to Unicode codepoints before processing them in the context of the HTML.
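A minimal sketch of those two separate steps, using the browser's standard TextDecoder (the byte values are the UTF-8 encoding of é from the question):

```js
const wireBytes = new Uint8Array([0xC3, 0xA9]);      // raw bytes received over HTTP
const text = new TextDecoder('utf-8').decode(wireBytes);
console.log(text);                 // 'é' — decoding happens before any HTML parsing
console.log(text.codePointAt(0));  // 233, i.e. U+00E9
```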
HTML entities deal in Unicode codepoints only, not in code units such as those used in UTF-8.
HTML entities in &#<xxx>; format represent Unicode codepoints by their numeric values directly. &#233; (é) and &#xE9; (é) represent integer 233 in decimal and hex formats, respectively. 233 is the numeric value of Unicode codepoint U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is encoded in UTF-8 as bytes 0xC3 0xA9.
&#xC3A9; (쎩) represents integer 50089 in hex format (0xC3A9). 50089 is the numeric value of Unicode codepoint U+C3A9 HANGUL SYLLABLE SSYEOLG, which is encoded in UTF-8 as bytes 0xEC 0x8E 0xA9.
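That round trip can be checked directly (a quick console sketch):

```js
String.fromCodePoint(0xC3A9);                 // '쎩' (U+C3A9)
[...new TextEncoder().encode('\uC3A9')]
  .map(b => b.toString(16));                  // ['ec', '8e', 'a9']
```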
HTML entities in &<name>; format represent Unicode codepoints by a human-readable name defined by HTML.
&eacute; (é) represents Unicode codepoint U+00E9, same as &#233; and &#xE9; do.
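All three forms therefore decode to the same codepoint, which can be verified with the browser's DOMParser (browser-only; not available in plain Node.js):

```js
const decode = s =>
  new DOMParser().parseFromString(s, 'text/html').body.textContent;
decode('&eacute;');  // 'é'
decode('&#233;');    // 'é'
decode('&#xE9;');    // 'é'
```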
Source: https://stackoverflow.com/questions/62796670/utf-8-contradictory-definitions