Unicode Support in Various Programming Languages

醉话见心 2020-12-13 13:31

I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not

20 Answers
  • 2020-12-13 14:05

    Python 3k

    Python 3k (also known as 3.0 or 3000) takes a new approach to handling text (Unicode) and data:
    "Text Vs. Data Instead Of Unicode Vs. 8-bit". See also the Unicode HOWTO.

  • 2020-12-13 14:05

    C/C++

    C

    C before C99 has no built-in Unicode support. It uses zero-terminated character arrays (char* or char[]) as strings. A char is defined to be one byte, which is at least 8 bits wide (exactly 8 on most platforms).

    C99 specifies wcs-functions in addition to the old str-functions (e.g. strlen -> wcslen). These functions take wchar_t* instead of char*. wchar_t stands for wide character type. The size of wchar_t is compiler-specific and can be as small as 8 bits. Different compilers do use different sizes, but in practice it is usually 16 bits (UTF-16, as on Windows) or 32 bits (UTF-32, as on most Unix-like systems).

    Most C library functions are transparent to UTF-8. E.g. if your operating system supports UTF-8 (and UTF-8 is configured as your system's charset), then creating a file by passing a UTF-8-encoded string to fopen will create a properly named file.

    C++

    The situation in C++ is very similar (std::string -> std::wstring), but there are at least efforts to add some form of Unicode support to the standard library.

  • 2020-12-13 14:05

    Lua

    Lua 5.3 has a built-in utf8 library, which handles the UTF-8 encoding. It allows you to convert a series of codepoints to the corresponding byte sequence and back, get the length of a string (the number of codepoints in it), iterate over the codepoints in a string, and get the byte position of the nth codepoint. It also provides a pattern, for use with the pattern-matching functions in the string library, that matches exactly one UTF-8 byte sequence.

    Lua 5.3 has Unicode code point escape sequences that can be used in string literals (for instance, "\u{61}" for "a"). They translate to UTF-8 byte sequences.

    Lua source code can be encoded in UTF-8 or any encoding in which ASCII characters take up one byte. UTF-16 and UTF-32 are not understood by the vanilla Lua interpreter. But strings can contain any encoding, or arbitrary binary data.

  • 2020-12-13 14:06

    Go

    Google's Go programming language supports Unicode and works with UTF-8.

  • 2020-12-13 14:08

    Perl

    Perl has built-in Unicode support, mostly. Sort of. From perldoc:

    • perlunitut - Tutorial on using Unicode in Perl. Largely teaches in absolute terms what you should and should not do as far as Unicode is concerned. Covers the basics.
    • perlunifaq - Frequently asked questions about Unicode in Perl.
    • perluniintro - Introduction to Unicode in Perl. Less "preachy" than perlunitut.
    • perlunicode - For when you absolutely have to know everything there is to know about Unicode and Perl.
  • 2020-12-13 14:08

    Rust

    Rust's strings (std::string::String and &str) are always valid UTF-8 and do not use null terminators. Because a character may occupy more than one byte, a string cannot be indexed like an array, as it can be in C/C++. Strings can be sliced by byte range, somewhat like in Go; since Rust 1.20 the .get method does this safely, with the caveat that it fails (returns None) if a range boundary falls in the middle of a code point.
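
    As a rough illustration of the slicing behaviour described above, the sketch below wraps str::get in a small helper. The helper name slice_bytes and the sample string are made up for this example; only str::get, len, chars and char_indices come from the standard library.

```rust
/// Hypothetical helper: returns the substring covering bytes `start..end`,
/// or `None` if the range is out of bounds or either end falls inside a
/// multi-byte character.
fn slice_bytes(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    // "héllo": the 'é' (U+00E9) takes two bytes in UTF-8.
    let s = "h\u{e9}llo";

    // Lengths are measured in bytes, not characters.
    assert_eq!(s.len(), 6);
    assert_eq!(s.chars().count(), 5);

    // 0..1 ends on a character boundary, so the slice succeeds.
    assert_eq!(slice_bytes(s, 0, 1), Some("h"));

    // 0..2 ends in the middle of 'é', so the slice is refused.
    assert_eq!(slice_bytes(s, 0, 2), None);

    // 0..3 covers 'h' plus both bytes of 'é'.
    assert_eq!(slice_bytes(s, 0, 3), Some("h\u{e9}"));

    // To walk a string character by character, iterate instead of indexing.
    for (byte_offset, ch) in s.char_indices() {
        println!("byte {byte_offset}: {ch}");
    }
}
```

    Plain range indexing (&s[0..2]) performs the same boundary check but panics instead of returning None, so .get is the safer choice when the offsets come from untrusted input.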

    Rust also has OsStr/OsString for interacting with the host OS. On Unix it is a byte array (containing any sequence of bytes). On Windows it is WTF-8 (a superset of UTF-8 that can represent the ill-formed Unicode strings that Windows and JavaScript allow). &str and String can be freely converted to OsStr or OsString, but converting the other way requires a check: the conversion either fails on invalid Unicode or replaces the invalid parts with the Unicode replacement character. (There is also Path/PathBuf, which are just wrappers around OsStr/OsString.)
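
    The two directions of conversion can be sketched roughly as follows. The helper names os_to_strict and os_to_lossy and the file name are invented for the example; the part that builds an OsStr from invalid bytes assumes a Unix target and is compiled only there.

```rust
use std::ffi::{OsStr, OsString};
use std::path::Path;

/// Hypothetical helper: strict conversion, `None` if the OS string is not valid Unicode.
fn os_to_strict(os: &OsStr) -> Option<&str> {
    os.to_str()
}

/// Hypothetical helper: lossy conversion, invalid parts become U+FFFD.
fn os_to_lossy(os: &OsStr) -> String {
    os.to_string_lossy().into_owned()
}

fn main() {
    // &str/String -> OsString never fails.
    let os: OsString = OsString::from("na\u{ef}ve.txt");
    assert_eq!(os_to_strict(&os), Some("na\u{ef}ve.txt"));

    // Path is a thin wrapper around OsStr.
    let path = Path::new("na\u{ef}ve.txt");
    assert_eq!(path.as_os_str(), os.as_os_str());

    // On Unix an OsStr may hold bytes that are not valid UTF-8.
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        let bytes = [0x66u8, 0x6f, 0xff]; // "fo" followed by an invalid byte
        let raw = OsStr::from_bytes(&bytes);
        assert_eq!(os_to_strict(raw), None);
        assert_eq!(os_to_lossy(raw), "fo\u{FFFD}");
    }
}
```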

    There are also the CStr and CString types, which represent null-terminated C strings. Like OsStr on Unix, they can contain arbitrary bytes (other than an interior null byte).
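
    A minimal sketch of how CString behaves, assuming nothing beyond the standard library; the helper names to_c_string and c_to_lossy are made up for the example.

```rust
use std::ffi::{CStr, CString};

/// Hypothetical helper: builds a C string, `None` if the input has an interior NUL.
fn to_c_string(bytes: &[u8]) -> Option<CString> {
    CString::new(bytes).ok()
}

/// Hypothetical helper: lossy conversion back to a Rust String.
fn c_to_lossy(c: &CStr) -> String {
    c.to_string_lossy().into_owned()
}

fn main() {
    // The trailing NUL is added automatically and is not part of as_bytes().
    let c = to_c_string("h\u{e9}llo".as_bytes()).unwrap();
    assert_eq!(c.as_bytes().len(), 6);
    assert_eq!(c.as_bytes_with_nul().len(), 7);

    // An interior NUL would truncate the string on the C side, so it is rejected.
    assert!(to_c_string(b"a\0b").is_none());

    // Arbitrary non-UTF-8 bytes are accepted; converting to &str is what can fail.
    let raw = to_c_string(&[0x66, 0xff]).unwrap();
    assert!(raw.to_str().is_err());
    assert_eq!(c_to_lossy(&raw), "f\u{FFFD}");
}
```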

    Rust has no native UTF-16 string type, but on Windows an OsStr can be converted to 16-bit wide characters (UCS-2).
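
    Separately from the OsStr route, the standard library can transcode between its UTF-8 strings and UTF-16 code units on any platform, even though it offers no UTF-16 string type. The sketch below uses str::encode_utf16, String::from_utf16 and String::from_utf16_lossy; the helper names to_utf16 and from_utf16_strict are made up for the example.

```rust
/// Hypothetical helper: encodes a UTF-8 string as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: strict decoding, `None` on unpaired surrogates.
fn from_utf16_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // U+1F600 lies outside the BMP, so it becomes a surrogate pair.
    let units = to_utf16("a\u{1F600}");
    assert_eq!(units, vec![0x0061, 0xD83D, 0xDE00]);

    // Well-formed UTF-16 round-trips back to the original string.
    assert_eq!(from_utf16_strict(&units).as_deref(), Some("a\u{1F600}"));

    // A lone surrogate is ill-formed: strict decoding fails,
    // lossy decoding substitutes the replacement character.
    let lone = [0xD800u16];
    assert_eq!(from_utf16_strict(&lone), None);
    assert_eq!(String::from_utf16_lossy(&lone), "\u{FFFD}");
}
```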
