I'd like to have a canonical place to pool information about Unicode support in various languages. Is it part of the core language? Is it provided in libraries? Is it not available at all?
Python 3k (or 3.0 or 3000) has a new approach for handling text (Unicode) and data: Text Vs. Data Instead Of Unicode Vs. 8-bit. See also the Unicode HOWTO.
C before C99 has no built-in Unicode support. It uses zero-terminated character arrays (`char*` or `char[]`) as strings. A `char` is specified to be a byte (8 bits).
C99 specifies `wcs` functions in addition to the old `str` functions (e.g. `strlen` -> `wcslen`). These functions take `wchar_t*` instead of `char*`. `wchar_t` stands for wide character type. The size of `wchar_t` is compiler-specific and can be as small as 8 bits. While different compilers indeed use different sizes, it's usually 16 bits (UTF-16) or 32 bits (UTF-32).
Most C library functions are transparent to UTF-8. For example, if your operating system supports UTF-8 (and UTF-8 is configured as your system's charset), then passing a UTF-8 encoded string to `fopen` will create a properly named file.
The situation in C++ is very similar (`std::string` -> `std::wstring`), but there are at least efforts to get some sort of Unicode support into the standard library.
Lua 5.3 has a built-in `utf8` library, which handles the UTF-8 encoding. It allows you to convert a series of code points to the corresponding byte sequence and the other way around, get the length (the number of code points in a string), iterate over the code points in a string, and get the byte position of the nth code point. It also provides a pattern, to be used by the pattern-matching functions in the `string` library, that will match one UTF-8 byte sequence.
Lua 5.3 also has Unicode code point escape sequences that can be used in string literals (for instance, `"\u{61}"` for `"a"`). They translate to UTF-8 byte sequences.
Lua source code can be encoded in UTF-8 or any encoding in which ASCII characters take up one byte. UTF-16 and UTF-32 are not understood by the vanilla Lua interpreter. However, strings can contain text in any encoding, as well as arbitrary binary data.
Google's Go programming language supports Unicode and works with UTF-8: source files are defined to be UTF-8, strings are byte sequences that conventionally hold UTF-8 text, and the built-in `rune` type represents a single Unicode code point.
Perl has built-in Unicode support, mostly. Sort of. See perldoc perlunicode for the details.
Rust's strings (`String` and `&str`) are always valid UTF-8 and do not use null terminators; as a result, they cannot be indexed as arrays the way C/C++ strings can. Since Rust 1.20 they can be sliced somewhat like in Go using `.get`, with the caveat that the slice fails if you try to cut through the middle of a code point.
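For example, a minimal sketch of that slicing behaviour (the indices are byte offsets, and the multi-byte `é` forces a boundary error):

```rust
fn main() {
    let s = "héllo"; // 'é' occupies two bytes (0xC3 0xA9) in UTF-8
    assert_eq!(s.len(), 6);           // length in bytes...
    assert_eq!(s.chars().count(), 5); // ...not in code points

    // Ranges are byte offsets; .get succeeds only on char boundaries.
    assert_eq!(s.get(0..1), Some("h"));
    assert_eq!(s.get(0..3), Some("hé"));

    // Byte 2 is the middle of 'é', so the slice is refused (None)
    // instead of producing invalid UTF-8. Plain &s[0..2] would panic.
    assert_eq!(s.get(0..2), None);
}
```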
Rust also has `OsStr`/`OsString` for interacting with the host OS. On Unix it is a byte array (containing any sequence of bytes); on Windows it is WTF-8 (a superset of UTF-8 that handles the ill-formed Unicode strings that are allowed in Windows and JavaScript). `&str` and `String` can be freely converted to `OsStr` or `OsString`, but converting the other way requires a check: either failing on invalid Unicode, or replacing it with the Unicode replacement character. (There are also `Path`/`PathBuf`, which are just wrappers around `OsStr`/`OsString`.)
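A short sketch of those conversions, assuming only the standard library (`into_string` is the failing direction, `to_string_lossy` the replacing one):

```rust
use std::ffi::{OsStr, OsString};
use std::path::PathBuf;

fn main() {
    // String -> OsString is infallible: valid UTF-8 is always a valid
    // OS string on every platform.
    let os: OsString = OsString::from(String::from("café.txt"));

    // OsString -> String must be checked; into_string() fails by
    // handing the OsString back...
    let s: String = os.clone().into_string().expect("was valid Unicode");

    // ...while to_string_lossy() replaces invalid sequences with U+FFFD.
    let lossy = os.to_string_lossy().into_owned();
    assert_eq!(s, lossy);

    // Path/PathBuf are thin wrappers around OsStr/OsString.
    let path = PathBuf::from(os);
    assert_eq!(path.extension().and_then(OsStr::to_str), Some("txt"));
}
```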
There are also the `CStr` and `CString` types, which represent null-terminated C strings. Like `OsStr` on Unix, they can contain arbitrary bytes (anything except an interior null).
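A minimal sketch of that behaviour, assuming only the standard library; `CString::new` performs exactly one check, the interior-null scan:

```rust
use std::ffi::CString;

fn main() {
    // CString appends the terminating null itself and rejects interior
    // null bytes, which would silently truncate the string on the C side.
    let c = CString::new("hello").expect("no interior null");
    assert!(CString::new("he\0llo").is_err());

    // Beyond that there is no validation: any bytes, UTF-8 or not, are fine.
    let raw = CString::new(vec![0xFFu8, 0xFE]).unwrap();
    assert_eq!(raw.as_bytes(), &[0xFF, 0xFE]);

    // as_ptr() yields the *const c_char that a C API expects.
    let _ptr = c.as_ptr();
}
```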
Rust doesn't directly support UTF-16 in its string types, but it can convert to and from it; on Windows an `OsStr` can be converted to the OS's native wide (16-bit) encoding.
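As a minimal sketch of that round trip, using the standard library's `str::encode_utf16` and `String::from_utf16` (on Windows, the `std::os::windows::ffi::OsStrExt::encode_wide` method provides the analogous, potentially ill-formed, conversion for `OsStr`):

```rust
fn main() {
    // String/str are always UTF-8, but the standard library can
    // round-trip through UTF-16 code units.
    let utf16: Vec<u16> = "héllo".encode_utf16().collect();
    assert_eq!(utf16.len(), 5); // one u16 per code point here (all BMP)

    // The reverse direction is checked: unpaired surrogates are an error.
    let back = String::from_utf16(&utf16).expect("valid UTF-16");
    assert_eq!(back, "héllo");
}
```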