The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.
The problem with comparing strings: two strings whose content is equivalent for the purposes of most applications may contain differing character sequences.
See Unicode's canonical equivalence: if the comparison algorithm is simple (or must be fast), Unicode equivalence is not checked. This problem occurs, for instance, in XML canonical comparison; see http://www.w3.org/TR/xml-c14n
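As a small illustration of the problem (in Python rather than PHP, purely as a sketch; the variable names are made up for this example), a plain equality test compares code points, so two renderings of the same "ç" do not match:

```python
# Two ways to write the same visible character "ç".
composed = "\u00e7"       # one code point: LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"    # two code points: "c" + COMBINING CEDILLA

# A plain comparison works on code points and does not apply Unicode equivalence.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# The UTF-8 byte sequences differ too, so byte-level comparison also fails.
print(composed.encode("utf-8"))          # b'\xc3\xa7'
print(decomposed.encode("utf-8"))        # b'c\xcc\xa7'
```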
To avoid this problem, which standard should you use: "expanded UTF-8" or "compact UTF-8"?
Use "ç" or "c" + "◌̧"?
W3C and others (e.g. for file names) suggest using the composed form as the canonical one (keep in mind the C of "most compact", i.e. shorter strings). So,
For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).
PS: "FORM_C" is the default form in most libraries, e.g. in PHP's normalizer.isnormalized().
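A minimal sketch of "canonize on input", written in Python with the standard `unicodedata` module as a stand-in for the ICU/PHP normalizer. The helper names `canonize`, `store` and `lookup`, and the in-memory `index`, are hypothetical and exist only for this illustration:

```python
import unicodedata

def canonize(text: str) -> str:
    """Return the NFC form of an external string (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)

# Toy in-memory store: every key is normalized to NFC before use.
index = {}

def store(key: str, value: str) -> None:
    index[canonize(key)] = value

def lookup(key: str):
    return index.get(canonize(key))

# The key arrives in decomposed form ("c" + COMBINING CEDILLA) ...
store("fac\u0327ade", "stored from decomposed input")

# ... and is found again when searched with the precomposed "ç".
print(lookup("fa\u00e7ade"))   # stored from decomposed input
```

Normalizing once at the boundary keeps the later comparisons simple and fast, because stored values can then be compared code point by code point.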
The term "composition form" (FORM_C) is used both to say that "a string is in the C-canonical form" (the result of an NFC transformation) and to say that a transforming algorithm is used. See http://www.macchiato.com/unicode/nfc-faq
(...) each of the following sequences (the first two being single-character sequences) represent the same character:
- U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
- U+212B ( Å ) ANGSTROM SIGN
- U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. (...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
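The toNFC(S) and isNFC(S) abbreviations from the quoted FAQ could be sketched in Python as follows. The names `to_nfc` and `is_nfc` are just illustrative spellings of those abbreviations, not an existing API; the only library call assumed is the standard `unicodedata.normalize`:

```python
import unicodedata

def to_nfc(s: str) -> str:
    """toNFC(S): transform S into Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def is_nfc(s: str) -> bool:
    """isNFC(S): S is in NFC if normalizing it changes nothing."""
    return to_nfc(s) == s

# The three canonically equivalent sequences listed in the quote.
candidates = {
    "U+00C5": "\u00c5",            # LATIN CAPITAL LETTER A WITH RING ABOVE
    "U+212B": "\u212b",            # ANGSTROM SIGN
    "U+0041 U+030A": "A\u030a",    # A + COMBINING RING ABOVE
}

for label, s in candidates.items():
    code_points = [f"U+{ord(ch):04X}" for ch in to_nfc(s)]
    print(label, is_nfc(s), code_points)
```

Only the first sequence is already in NFC; the other two both normalize to U+00C5.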
Note: to test the normalization of short strings (pure UTF-8 or XML entity references), you can use this test/normalize online converter.