I fall into a surprising issue.
I loaded a text file in my application and I have some logic which compares the value having µ.
And I realized that even if
EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); //This always return true.
EDIT After reading the comments, yes it is not good to use the above method because it may provide wrong results for some other type of inputs, for this we should use normalize using full compatibility decomposition as mentioned in wiki. (Thanks to the answer posted by BoltClock)
static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });
public static void Main()
{
string Mus = "µμ";
string NormalizedString = null;
int i = 0;
do
{
string OriginalUnicodeString = Mus[i].ToString();
if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
Console.WriteLine(" INFORMATIO ABOUT GREEK_SMALL_LETTER_MU");
else if (OriginalUnicodeString.Equals(MICRO_SIGN))
Console.WriteLine(" INFORMATIO ABOUT MICRO_SIGN");
Console.WriteLine();
ShowHexaDecimal(OriginalUnicodeString);
Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
Console.Write("Form C Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
Console.Write("Form D Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
Console.Write("Form KC Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
Console.Write("Form KD Normalized: ");
ShowHexaDecimal(NormalizedString);
Console.WriteLine("_______________________________________________________________");
i++;
} while (i < 2);
Console.ReadLine();
}
private static void ShowHexaDecimal(string UnicodeString)
{
Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
foreach (short x in UnicodeString.ToCharArray())
{
Console.Write("{0:X4} ", x);
}
Console.WriteLine();
}
Output
INFORMATIO ABOUT MICRO_SIGN
Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
INFORMATIO ABOUT GREEK_SMALL_LETTER_MU
Hexa-Decimal Characters of µ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 03BC
Form D Normalized: Hexa-Decimal Characters of µ are 03BC
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
While reading information in Unicode_equivalence I found
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare equivalence we should normally use FormKC i.e. NFKC normalization or FormKD i.e NFKD normalization.
I was little curious to know more about all the Unicode characters so I made sample which would iterate over all the Unicode character in UTF-16 and I got some results I want to discuss
FormC and FormD normalized values were not equivalentTotal: 12,118Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203FormKC and FormKD normalized values were not equivalentTotal: 12,245Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274FormC and FormD normalized value were not equivalent, there FormKC and FormKD normalized values were also not equivalent except these characters901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞', 8159 '῟', 8173 '῭', 8174 '΅'FormKC and FormKD normalized value were not equivalent, but there FormC and FormD normalized values were equivalentTotal: 119452 'DŽ' 453 'Dž' 454 'dž' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'ArgumentException if triedTotal:2081
Characters(int value): 55296-57343, 64976-65007, 65534This links can be really helpful to understand what rules govern for Unicode equivalence