I'd like to have a canonical place to pool information about Unicode support in various languages. Is it part of the core language? Is it provided in libraries? Is it not available at all?
Python 3k (or 3.0 or 3000) has a new approach for handling text (Unicode) and data: Text Vs. Data Instead Of Unicode Vs. 8-bit. See also the Unicode HOWTO.
C before C99 has no built-in Unicode support. It uses zero-terminated character arrays (`char*` or `char[]`) as strings. A `char` is specified to be a byte (8 bits).
C99 specifies `wcs` functions in addition to the old `str` functions (e.g. `strlen` -> `wcslen`). These functions take `wchar_t*` instead of `char*`. `wchar_t` stands for wide character type. The size of `wchar_t` is compiler-specific and can be as small as 8 bits. While different compilers indeed use different sizes, it's usually 16 bits (UTF-16) or 32 bits (UTF-32).
Most C library functions are transparent to UTF-8. For example, if your operating system supports UTF-8 (and UTF-8 is configured as your system's charset), then passing a UTF-8 encoded string to `fopen` will create a properly named file.
The situation in C++ is very similar (`std::string` -> `std::wstring`), but there are at least efforts to get some sort of Unicode support into the standard library.
Lua 5.3 has a built-in `utf8` library, which handles the UTF-8 encoding. It allows you to convert a series of code points to the corresponding byte sequence and the other way around, get the length (the number of code points in a string), iterate over the code points in a string, and get the byte position of the nth code point. It also provides a pattern, to be used by the pattern-matching functions in the `string` library, that will match one UTF-8 byte sequence.
Lua 5.3 also has Unicode code point escape sequences that can be used in string literals (for instance, `"\u{61}"` for `"a"`). They translate to UTF-8 byte sequences.
Lua source code can be encoded in UTF-8 or any encoding in which ASCII characters take up one byte. UTF-16 and UTF-32 are not understood by the vanilla Lua interpreter. However, strings can contain text in any encoding, as well as arbitrary binary data.
Google's Go programming language supports Unicode and works with UTF-8: source files are defined to be UTF-8, strings are byte sequences that conventionally hold UTF-8 text, and the built-in `rune` type represents a single Unicode code point.
Perl has built-in Unicode support, mostly. Sort of. See perldoc perlunicode for the details.
Rust's strings (`String` and `&str`) are always valid UTF-8 and do not use null terminators; as a result, they cannot be indexed as arrays the way C/C++ strings can. Since Rust 1.20 they can be sliced somewhat like in Go using `.get`, with the caveat that the slice fails if you try to cut through the middle of a code point.
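For example, a minimal sketch of that slicing behaviour (the indices are byte offsets, and the multi-byte `é` forces a boundary error):

```rust
fn main() {
    let s = "héllo"; // 'é' occupies two bytes (0xC3 0xA9) in UTF-8
    assert_eq!(s.len(), 6);           // length in bytes...
    assert_eq!(s.chars().count(), 5); // ...not in code points

    // Ranges are byte offsets; .get succeeds only on char boundaries.
    assert_eq!(s.get(0..1), Some("h"));
    assert_eq!(s.get(0..3), Some("hé"));

    // Byte 2 is the middle of 'é', so the slice is refused (None)
    // instead of producing invalid UTF-8. Plain &s[0..2] would panic.
    assert_eq!(s.get(0..2), None);
}
```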
Rust also has `OsStr`/`OsString` for interacting with the host OS. On Unix it is a byte array (containing any sequence of bytes); on Windows it is WTF-8 (a superset of UTF-8 that handles the ill-formed Unicode strings that are allowed in Windows and JavaScript). `&str` and `String` can be freely converted to `OsStr` or `OsString`, but converting the other way requires a check: either failing on invalid Unicode, or replacing it with the Unicode replacement character. (There are also `Path`/`PathBuf`, which are just wrappers around `OsStr`/`OsString`.)
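A short sketch of those conversions, assuming only the standard library (`into_string` is the failing direction, `to_string_lossy` the replacing one):

```rust
use std::ffi::{OsStr, OsString};
use std::path::PathBuf;

fn main() {
    // String -> OsString is infallible: valid UTF-8 is always a valid
    // OS string on every platform.
    let os: OsString = OsString::from(String::from("café.txt"));

    // OsString -> String must be checked; into_string() fails by
    // handing the OsString back...
    let s: String = os.clone().into_string().expect("was valid Unicode");

    // ...while to_string_lossy() replaces invalid sequences with U+FFFD.
    let lossy = os.to_string_lossy().into_owned();
    assert_eq!(s, lossy);

    // Path/PathBuf are thin wrappers around OsStr/OsString.
    let path = PathBuf::from(os);
    assert_eq!(path.extension().and_then(OsStr::to_str), Some("txt"));
}
```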
There are also the `CStr` and `CString` types, which represent null-terminated C strings. Like `OsStr` on Unix, they can contain arbitrary bytes (anything except an interior null).
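A minimal sketch of that behaviour, assuming only the standard library; `CString::new` performs exactly one check, the interior-null scan:

```rust
use std::ffi::CString;

fn main() {
    // CString appends the terminating null itself and rejects interior
    // null bytes, which would silently truncate the string on the C side.
    let c = CString::new("hello").expect("no interior null");
    assert!(CString::new("he\0llo").is_err());

    // Beyond that there is no validation: any bytes, UTF-8 or not, are fine.
    let raw = CString::new(vec![0xFFu8, 0xFE]).unwrap();
    assert_eq!(raw.as_bytes(), &[0xFF, 0xFE]);

    // as_ptr() yields the *const c_char that a C API expects.
    let _ptr = c.as_ptr();
}
```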
Rust doesn't directly support UTF-16 in its string types, but it can convert to and from it; on Windows an `OsStr` can be converted to the OS's native wide (16-bit) encoding.
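As a minimal sketch of that round trip, using the standard library's `str::encode_utf16` and `String::from_utf16` (on Windows, the `std::os::windows::ffi::OsStrExt::encode_wide` method provides the analogous, potentially ill-formed, conversion for `OsStr`):

```rust
fn main() {
    // String/str are always UTF-8, but the standard library can
    // round-trip through UTF-16 code units.
    let utf16: Vec<u16> = "héllo".encode_utf16().collect();
    assert_eq!(utf16.len(), 5); // one u16 per code point here (all BMP)

    // The reverse direction is checked: unpaired surrogates are an error.
    let back = String::from_utf16(&utf16).expect("valid UTF-16");
    assert_eq!(back, "héllo");
}
```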