I\'d like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not
There is already an entire thread on this on SO!
The Q command has complete Unicode support in most implementations.
None built-in, aside from whatever happens to be available as part of the C string library.
However, once you add frameworks…
NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.
For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.
NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.
Both NSString and CFString provide a wide variety of comparison/collation options. Here are Foundation's comparison-option flags and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation makes non-literal comparison (allowing characters with accents to compare equal) the default.
Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.
Arc doesn't have any unicode support. Yet.
Same as with .NET, Java uses UTF-16 internally: java.lang.String
A
Stringrepresents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in theCharacterclass for more information). Index values refer tocharcode units, so a supplementary character uses two positions in aString.
Python 2 has the classes str and unicode. str objects store bytes, unicode objects store UTF-16 characters. Most library functions support both (e.g. os.listdir('.') returns a list of str, os.listdir(u'.') returns a list of unicode objects). Both have encode and decode methods.
Python 3 basically renamed unicode to str. The Python 3 equivalent to str would be the type bytes. bytes has a decode and str an encode method. Since Python 3.3 str objects internally use one of several encodings in order to save memory. For a Python programmer it still looks like an abstract unicode sequence.
Python supports:
Python does not support/has limited support for:
See also: The Truth about Unicode in Python