C programming: How to program for Unicode?

予麋鹿 2020-11-28 18:26

What prerequisites are needed to do strict Unicode programming?

Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?

8 answers
  •  [愿得一人]
    2020-11-28 19:27

    The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.

    Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:

    typedef char UTF8; // for code units of UTF-8 strings
    typedef unsigned char BYTE; // for binary data
    

    You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll just need to use preprocessor #ifs. Some useful macros in C and C++0x are:

    • __STDC_UTF_16__: If defined, values of type char16_t (declared in <uchar.h> in C11) are UTF-16.
    • __STDC_UTF_32__: If defined, values of type char32_t (declared in <uchar.h> in C11) are UTF-32.
    • __STDC_ISO_10646__: If defined, then wchar_t holds Unicode code points (in practice, UTF-32).
    • _WIN32: On Windows, wchar_t is UTF-16, even though this breaks the standard.
    • WCHAR_MAX: Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
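
    As a rough sketch of how the macros listed above might be combined, the header-style fragment below defines code unit types and classifies wchar_t at compile time. The names UTF16, UTF32, WCHAR_IS_UTF16, WCHAR_IS_UTF32 and wchar_is_unicode are made up for illustration, and the "unknown" fallback is an assumption about how you would want to treat unrecognized platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical names, in the spirit of the UTF8 and BYTE typedefs. */
typedef uint_least16_t UTF16;  /* one UTF-16 code unit */
typedef uint_least32_t UTF32;  /* one UTF-32 code unit, i.e. a code point */

/* Classify wchar_t at compile time using the macros from the list. */
#if defined(_WIN32)
#  define WCHAR_IS_UTF16 1
#  define WCHAR_IS_UTF32 0
#elif defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
#  define WCHAR_IS_UTF16 0
#  define WCHAR_IS_UTF32 1
#else
   /* Assumption: treat anything else as "encoding unknown". */
#  define WCHAR_IS_UTF16 0
#  define WCHAR_IS_UTF32 0
#endif

/* Returns 1 if wchar_t strings can be treated as Unicode on this platform. */
int wchar_is_unicode(void) {
    assert(!(WCHAR_IS_UTF16 && WCHAR_IS_UTF32));
    return WCHAR_IS_UTF16 || WCHAR_IS_UTF32;
}
```

    Code that has to exchange wchar_t strings with the OS can then branch on WCHAR_IS_UTF16 or WCHAR_IS_UTF32, and refuse to compile or fall back to a conversion library when both are 0.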

    Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?

    See also:

    • UTF-8 or UTF-16 or UTF-32 or UCS-2
    • Is wchar_t needed for Unicode support?

    No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
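
    To illustrate the line ending converter mentioned above, here is a minimal sketch (the function name crlf_to_lf is hypothetical). It only ever compares bytes against \r and \n. Because those two byte values never occur inside a multi-byte UTF-8 sequence, every non-ASCII byte is copied through untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Converts CRLF to LF in place and returns the new length.
 * Only the bytes '\r' and '\n' are inspected; all other bytes,
 * including UTF-8 lead and continuation bytes (0x80..0xFF),
 * are copied unchanged. */
size_t crlf_to_lf(char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
            continue;  /* drop the CR; the LF is copied on the next pass */
        buf[out++] = buf[i];
    }
    return out;
}
```

    The same function works unchanged on ASCII, Latin-1, and UTF-8 input, which is the sense in which such a program is "transparent" to non-ASCII bytes.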

    If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
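
    A small sketch of why char is not a character in UTF-8: the hypothetical helper below counts code points by skipping continuation bytes (those of the form 10xxxxxx). It assumes the input is already valid UTF-8 and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>  /* strlen, for comparison in the usage note */

/* Counts code points in a NUL-terminated string that is assumed to be
 * valid UTF-8: every byte that is not a continuation byte (10xxxxxx)
 * starts a new code point. */
size_t utf8_count_code_points(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

    For the string "na\xC3\xAFve" (the word "naïve"), strlen reports 6 bytes while utf8_count_code_points reports 5 code points. Note that even the code point count is not a screen column count: combining marks and wide East Asian characters break that assumption too.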

    If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but you will need to change the type of all of your strings.
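
    The gap between code points and graphemes can be shown with a deliberately simplified sketch. The function names are hypothetical, and the combining-mark test only covers the Combining Diacritical Marks block (U+0300..U+036F). Real grapheme segmentation follows the Unicode text segmentation rules and needs full character property tables, so treat this as an illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SIMPLIFICATION: only U+0300..U+036F is treated as combining. */
static int is_combining_mark(uint_least32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Approximates the number of user-perceived characters in a UTF-32
 * array by not counting combining marks that follow another code point. */
size_t approx_grapheme_count(const uint_least32_t *cps, size_t len) {
    assert(cps != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (i == 0 || !is_combining_mark(cps[i]))
            n++;
    }
    return n;
}
```

    The sequence U+0065 U+0301 ("e" followed by a combining acute accent) occupies two UTF-32 code units but displays as one character, so indexing a UTF-32 array by position still does not give you "the n-th character" in the user's sense.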

    If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
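
    To make the variable-width nature of UTF-16 concrete, here is a minimal decoding sketch. The function name utf16_next is hypothetical, and returning U+FFFD for an unpaired surrogate is one possible error policy, not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point starting at units[*i] and advances *i past it.
 * A high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF)
 * forms one code point above U+FFFF. An unpaired surrogate yields U+FFFD. */
uint_least32_t utf16_next(const uint_least16_t *units, size_t len, size_t *i) {
    assert(units != NULL && i != NULL && *i < len);
    uint_least32_t hi = units[(*i)++];
    if (hi >= 0xD800 && hi <= 0xDBFF && *i < len) {
        uint_least32_t lo = units[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*i)++;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF)
        return 0xFFFD;  /* unpaired surrogate */
    return hi;
}
```

    For example, the code units 0xD83D 0xDE00 decode to the single code point U+1F600, so neither "one unit = one character" nor "one byte = one unit" holds.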

    I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs instead.

    Even more importantly, avoid TCHAR.
