Why does the C runtime on Mac OS allow both precomposed and decomposed UTF-8?

六月ゝ 毕业季﹏ submitted on 2021-02-07 08:38:52

Question


So we all know that the filesystem on Mac OS has this wacky feature of using fully decomposed UTF-8. If you call POSIX APIs like realpath(), for example, you'll get such a fully decomposed UTF-8 string back from Mac OS. When using APIs like fopen(), however, passing precomposed UTF-8 seems to work as well.

Here is a little demo program which attempts to open a file named ä. The first call to fopen() passes a precomposed UTF-8 string, the second call passes a decomposed UTF-8 string and to my surprise both work. I'd expect only the second one to work but precomposed UTF-8 works as well.

#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp, *fp2;

    fp = fopen("\xc3\xa4", "rb");       // ä as precomposed UTF-8
    fp2 = fopen("\x61\xcc\x88", "rb");  // ä as decomposed UTF-8

    printf("CHECK: %p %p\n", fp, fp2);

    if(fp) fclose(fp);
    if(fp2) fclose(fp2);

    return 0;
}

Now to my questions:

  1. Is this defined behaviour? i.e. is it allowed to pass precomposed UTF-8 to POSIX APIs or should I always pass decomposed UTF-8?

  2. How can functions like fopen() even know whether the file passed contains precomposed or decomposed UTF-8? Couldn't this even lead to all sorts of issues, e.g. wrong files being opened because the passed string can be interpreted in two different ways and thus potentially point to two different files? This is somewhat confusing me.

EDIT

To make the confusion complete, this weird behaviour doesn't even seem to be limited to file I/O. Take a look at this code:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("\xc3\xa4\n");
    printf("\x61\xcc\x88\n");

    return 0;
}

Both printf calls do exactly the same, i.e. they both print the character ä, the first call using precomposed UTF-8 and the second one using decomposed UTF-8. It's really weird.


Answer 1:


There are two different kinds of equivalence between Unicode strings: canonical equivalence and compatibility equivalence. Since your question is about strings that the software seems to treat as identical, let's focus on canonical equivalence (compatibility equivalence, on the other hand, allows for semantic differences, so it's off-topic for this question).

Citing from Unicode equivalence in Wikipedia:

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

In other words, if two strings are canonically equivalent, software should treat them as representing exactly the same thing. So macOS is doing the correct thing here: you have two different UTF-8 strings (one decomposed, the other precomposed), but they are canonically equivalent, so they map to the same object (the same file name in your example). That's correct (remember the "should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other" line in the quote above).

I don't really understand your second example about printf(). Yes, both a decomposed character and a precomposed character render as the same output. That's precisely the point of the dual representation of characters supported by Unicode: you can choose whether to represent a combined character as a single precomposed code point or as a decomposed sequence of code points. They produce the same visual result, but their byte representation is different. If both representations are canonically equivalent (in some cases they are, in some cases they are not), then the system should treat them as two representations of the same object.

In order to manage all of this more comfortably in your software, you should normalize your Unicode strings before working with them.



Source: https://stackoverflow.com/questions/38484369/why-does-the-c-runtime-on-mac-os-allow-both-precomposed-and-decomposed-utf-8
