How do locales work in Linux / POSIX and what transformations are applied?

心在旅途 2020-12-16 01:15

I'm working with huge files of (I hope) UTF-8 text. I can reproduce it using Ubuntu 13.10 (3.11.0-14-generic) and 12.04.

While investigating a bug I've encountere

3 answers
  •  情歌与酒
    2020-12-16 01:38

    It could be due to Unicode normalization. There are sequences of code points in Unicode which are distinct and yet are considered equivalent.

    One simple example of that is combining characters. Many accented characters like "é" can be represented either as a single code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE) or as an unaccented character followed by a combining character, e.g. the two-code-point sequence (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT).

    The UTF-8 encodings of those two forms are obviously different byte sequences, so in the C locale, which compares raw bytes, they compare as different. But in a UTF-8 locale, they can be treated as identical due to Unicode normalization.

    Here's a simple two-line file with this example:

    $ echo -e '\xc3\xa9\ne\xcc\x81' > test.txt
    $ cat test.txt
    é
    é
    $ hexdump -C test.txt
    00000000  c3 a9 0a 65 cc 81 0a                              |...e...|
    00000007
    $ LC_ALL=C uniq -d test.txt  # No output
    $ LC_ALL=en_US.UTF-8 uniq -d test.txt
    é
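
    The equivalence itself can be checked independently of any locale. The following is a minimal sketch in Python (chosen only because its standard library exposes normalization through `unicodedata.normalize`); the helper name `canonically_equal` is hypothetical and not part of any tool discussed here. It illustrates canonical equivalence, not how `uniq` or the C library compares lines internally.

```python
import unicodedata

# The two representations of "é" from the example file.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Their UTF-8 encodings are the byte sequences seen in the hexdump.
print(precomposed.encode("utf-8").hex(" "))  # c3 a9
print(decomposed.encode("utf-8").hex(" "))   # 65 cc 81

# A plain comparison works on code points, so the strings differ.
print(precomposed == decomposed)                    # False

# After normalizing both to the same form, they compare equal.
print(canonically_equal(precomposed, decomposed))   # True
```

    Normalizing both sides to NFD instead of NFC would give the same answer; what matters is that both strings are brought to the same form before comparing.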
    

    Edit by n.m.: Not all Linux systems perform Unicode normalization.
