UTF8, codepoints, and their representation in Erlang and Elixir


Question


going through Elixir's handling of unicode:

iex> String.codepoints("abc§")
["a", "b", "c", "§"]

Very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes; I get that.
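
Checking both counts side by side makes the difference concrete; this is roughly what iex shows:

iex> String.length("abc§")
4
iex> byte_size("abc§")
5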

The ? operator (or is it a macro? I can't find the answer) tells me that

iex(69)> ?§
167

Great; so then I look into the UTF-8 encoding table, and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards and evaluate the binary, I get what I want:

iex(72)> <<0xc2, 0xa7>>
"§"

And to make me go completely bananas, this is what I get in Erlang shell:

24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.    
<<"§">>

...while Elixir is only happy with the two-byte form above. What is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists the char takes 2 bytes - and the Unicode table seems to agree?


Answer 1:


The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.

The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
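
Both views of the same character can be checked from iex; roughly:

iex> ?§                       # the codepoint, as a plain integer
167
iex> <<?§::utf8>>             # that codepoint encoded as UTF-8 bytes
"§"
iex> <<?§::utf8>> == <<0xC2, 0xA7>>
true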

When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8, but I am not sure if it has made it to the shell - someone please correct me).

In latin1, the codepoint of § (0xA7) is also represented by the single byte 0xA7. So, explaining your results directly:

24> <<167>>.
<<"§">> %% this is encoded in latin1

25> <<"\x{a7}">>.
<<"§">> %% still latin1

26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says

27> <<"\x{c2a7}">>.
<<"§">>  %% this is latin1

The last one is quite interesting and potentially confusing. When constructing an Erlang binary, if you pass an integer larger than 255 into a default 8-bit segment, it is truncated to its lowest 8 bits. So the last example is effectively doing <<49831>>, and 49831 rem 256 is 167, so it becomes <<167>>, which is again equivalent to <<"§">> in latin1.
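
The arithmetic is easy to check in iex (a rough sketch; Erlang's shell behaves the same way here):

iex> 0xC2A7
49831
iex> rem(0xC2A7, 256)         # only the lowest 8 bits survive
167
iex> <<0xC2A7>> == <<167>>
true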




Answer 2:


The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.

In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.

Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
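
For example, converting the character from the question to UTF-32 with Erlang's :unicode module shows the codepoint sitting directly in one 32-bit unit (a sketch; big-endian chosen so the bytes read naturally):

iex> :unicode.characters_to_binary("§", :utf8, {:utf32, :big})
<<0, 0, 0, 167>>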

So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
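
Encoding a BMP character next to a higher-plane one makes the surrogate mechanism visible (a sketch using the 🙂 emoji, U+1F642, which becomes the surrogate pair D83D DE42):

iex> :unicode.characters_to_binary("§", :utf8, {:utf16, :big})
<<0, 167>>
iex> :unicode.characters_to_binary("🙂", :utf8, {:utf16, :big})
<<216, 61, 222, 66>>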

Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
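
The difference shows up directly if you re-encode a character from that range (again just a sketch with the :unicode module):

iex> :unicode.characters_to_binary("§", :utf8, :latin1)
<<167>>
iex> "§" == <<194, 167>>
true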

So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)

The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
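
You can take those bit fields apart with a bitstring pattern match; roughly like this, where the names hi and lo are just illustrative:

iex> <<0b110::3, hi::5, 0b10::2, lo::6>> = <<0xC2, 0xA7>>
"§"
iex> {hi, lo}
{2, 39}
iex> import Bitwise
Bitwise
iex> bsl(hi, 6) ||| lo        # reassemble the codepoint
167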

To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
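
The same arithmetic can be double-checked in iex, using Integer.to_string/2 to show the bits (a quick sketch):

iex> Integer.to_string(0xA7, 2)
"10100111"
iex> <<0xA7::utf8>> |> :binary.bin_to_list() |> Enum.map(&Integer.to_string(&1, 2))
["11000010", "10100111"]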

You'll notice that the second byte happens to have the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
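
You can also see where the coincidence stops: codepoints up to U+00BF keep their own value as the second byte, while U+00C0 and above do not (an illustrative check):

iex> <<0xBF::utf8>> |> :binary.bin_to_list()
[194, 191]
iex> <<0xC0::utf8>> |> :binary.bin_to_list()
[195, 128]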



Source: https://stackoverflow.com/questions/48409482/utf8-codepoints-and-their-representation-in-erlang-and-elixir
