I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not
There is already an entire thread on this on SO!
The Q command has complete Unicode support in most implementations.
None built-in, aside from whatever happens to be available as part of the C string library.
However, once you add frameworks…
NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.
For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.
NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.
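To make the four normalization forms concrete, here is a minimal sketch in Python rather than Objective-C. It uses the standard library's unicodedata.normalize, not the NSString methods described above, and the variable names are only illustrative. Like the NSString methods, each call returns a new string.

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# A literal, code-point-by-code-point comparison treats them as different strings.
literal_equal = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# The "K" (compatibility) forms additionally replace characters such as the
# "fi" ligature (U+FB01) with their plain equivalents; forms D and C keep it.
ligature = "\ufb01"
forms = {
    form: unicodedata.normalize(form, ligature)
    for form in ("NFD", "NFKD", "NFC", "NFKC")
}

print(literal_equal, nfc_equal)
print(forms)
```

The same distinction is what separates a literal comparison from one that treats equivalent sequences as equal.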
Both NSString and CFString provide a wide variety of comparison/collation options. Here are Foundation's comparison-option flags and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation makes non-literal comparison (allowing characters with accents to compare equal) the default.
Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.
Arc doesn't have any unicode support. Yet.
Same as with .NET, Java uses UTF-16 internally: java.lang.String
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
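The "two positions" point can be illustrated without a JVM. The sketch below is Python, not Java, and utf16_code_units is a hypothetical helper written for this example. It counts UTF-16 code units by encoding the text, which is the unit that Java string indices refer to according to the quote above.

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed for text (hypothetical helper)."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids adding a BOM.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"                 # inside the Basic Multilingual Plane
supplementary = "\U0001F600"   # a supplementary character (outside the BMP)

print(utf16_code_units(bmp_char))       # one code unit
print(utf16_code_units(supplementary))  # a surrogate pair: two code units
print(len(supplementary))               # Python 3.3+ counts code points instead
```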
Python 2 has the classes str and unicode. str objects store bytes; unicode objects store Unicode text, held internally as UTF-16 or UCS-4 code units depending on how the interpreter was built. Most library functions support both (e.g. os.listdir('.') returns a list of str, os.listdir(u'.') returns a list of unicode objects). Both have encode and decode methods.
Python 3 basically renamed unicode to str. The Python 3 equivalent to str would be the type bytes. bytes has a decode and str an encode method. Since Python 3.3 str objects internally use one of several encodings in order to save memory. For a Python programmer it still looks like an abstract Unicode sequence.
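A short Python 3 sketch of the str/bytes split described above. The sample text and variable names are only illustrative, and UTF-8 is an assumed choice of encoding.

```python
text = "naïve café"          # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: the encoded form (UTF-8 chosen here)

# Decoding with the same encoding gives back the original text.
round_trip = data.decode("utf-8")

# The accented letters take two bytes each in UTF-8, so the lengths differ.
print(len(text), len(data))

# str and bytes do not mix implicitly in Python 3.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(round_trip == text, mixed_ok)
```

Unlike Python 2, where both types had both methods, str here has only encode and bytes has only decode.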
Python supports:
Python does not support/has limited support for:
See also: The Truth about Unicode in Python