What prerequisites are needed to do strict Unicode programming?
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to use preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__: If defined, the type char16_t exists and is UTF-16.
__STDC_UTF_32__: If defined, the type char32_t exists and is UTF-32.
__STDC_ISO_10646__: If defined, then wchar_t is UTF-32.
_WIN32: On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX: Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
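As a rough illustration of what "transparent to non-ASCII bytes" can mean, here is a minimal sketch of such a line ending converter. The function name crlf_to_lf is made up for this example. It works on UTF-8 without modification because bytes in the range 0x00..0x7F never occur inside a multi-byte UTF-8 sequence, so comparing against '\r' and '\n' cannot match part of another character.

```c
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch): converts CRLF to LF
   in place and returns the new length. Only the ASCII bytes '\r' and '\n'
   are inspected; every other byte, including UTF-8 lead and continuation
   bytes (0x80..0xFF), is copied through unchanged. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
            continue;           /* drop the '\r' of a "\r\n" pair */
        buf[out++] = buf[i];    /* everything else passes through */
    }
    return out;
}
```

Nothing here needs to know which encoding the text is in, as long as that encoding is ASCII-compatible in the way UTF-8 is.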
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
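To see why char = character breaks down, here is a small sketch that counts code points instead of bytes. The name utf8_code_points is invented for this example, and the function assumes its input is already valid UTF-8. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch): counts Unicode code
   points in a NUL-terminated UTF-8 string by skipping continuation bytes
   (bit pattern 10xxxxxx). Assumes the input is valid UTF-8; no validation
   is performed. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* The cast avoids sign problems on platforms where char is signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "na\xC3\xAFve" (naïve), strlen reports 6 bytes while this function reports 5 code points. Note that even the code point count is not a screen column count: combining marks and wide East Asian characters still need separate handling.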
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but you will need to change the type of all of your strings.
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
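The variable-width part of UTF-16 comes from surrogate pairs. A minimal sketch of decoding one code point might look like the following. The function name utf16_next and the choice to return U+FFFD for unpaired surrogates are assumptions made for this example, not something prescribed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name invented for this sketch): decodes one code
   point from the UTF-16 code units s[0..len), starting at s[*i], and
   advances *i past the units consumed. The caller must ensure *i < len.
   Unpaired surrogates are reported as U+FFFD (a design choice of this
   sketch). */
uint32_t utf16_next(const uint16_t *s, size_t len, size_t *i)
{
    uint16_t hi = s[(*i)++];
    if (hi >= 0xD800 && hi <= 0xDBFF && *i < len) {
        uint16_t lo = s[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*i)++;  /* consume the low surrogate as well */
            return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                              | (uint32_t)(lo - 0xDC00));
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF)
        return 0xFFFD;  /* lone surrogate */
    return hi;          /* ordinary BMP code point: one code unit */
}
```

A string such as { 0x0041, 0xD83D, 0xDE00 } is three code units but only two code points (U+0041 and U+1F600), so indexing by code unit no longer lines up with indexing by character.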
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs instead.
Even more importantly, avoid TCHAR.
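Pulling the typedef advice together, one possible header might look like the sketch below. The names UTF16, UTF32 and the WCHAR_ENCODING_* macros are invented for this example, and the choice of uint_least16_t and uint_least32_t for the code unit types is an assumption (char16_t and char32_t from <uchar.h> would be an alternative where that header is available).

```c
#include <stdint.h>
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

typedef char UTF8;             /* code units of UTF-8 strings */
typedef unsigned char BYTE;    /* binary data */
typedef uint_least16_t UTF16;  /* code units of UTF-16 strings (assumed type) */
typedef uint_least32_t UTF32;  /* code units of UTF-32 strings (assumed type) */

/* Hypothetical macros describing what wchar_t holds on this platform,
   for the places where a wchar_t API cannot be avoided. */
#define WCHAR_ENCODING_UNKNOWN 0
#define WCHAR_ENCODING_UTF16   1
#define WCHAR_ENCODING_UTF32   2

#if defined(_WIN32)
#  define WCHAR_ENCODING WCHAR_ENCODING_UTF16
#elif defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
#  define WCHAR_ENCODING WCHAR_ENCODING_UTF32
#else
#  define WCHAR_ENCODING WCHAR_ENCODING_UNKNOWN  /* do not assume Unicode */
#endif
```

With something like this in place, the rest of the code can use UTF8, UTF16 and UTF32 directly and only consult WCHAR_ENCODING at the boundary where it has to talk to wchar_t-based functions.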