How to compare Unicode characters that “look alike”?

情歌与酒 2020-11-27 10:42

I ran into a surprising issue.

I loaded a text file in my application, and I have some logic that compares a value containing µ.

And I realized that even if

10 answers
  •  天涯浪人
    2020-11-27 11:15

    EDIT: This question was merged with "How to compare 'μ' and 'µ' in C#".
    The original answer I posted was:

     "μ".ToUpper().Equals("µ".ToUpper()); // This always returns true.
    

    EDIT: After reading the comments, I agree that the method above is not a good idea, because it can give wrong results for other kinds of input. Instead, we should normalize using full compatibility decomposition, as described in the wiki article. (Thanks to the answer posted by BoltClock.)

        using System;
        using System.Globalization;
        using System.Text;

        class Program
        {
            // Two distinct code points that render almost identically.
            static readonly string GREEK_SMALL_LETTER_MU = "\u03BC";
            static readonly string MICRO_SIGN = "\u00B5";

            public static void Main()
            {
                string Mus = "\u00B5\u03BC"; // MICRO SIGN followed by GREEK SMALL LETTER MU
                foreach (char c in Mus)
                {
                    string OriginalUnicodeString = c.ToString();
                    if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
                        Console.WriteLine(" INFORMATION ABOUT GREEK_SMALL_LETTER_MU");
                    else if (OriginalUnicodeString.Equals(MICRO_SIGN))
                        Console.WriteLine(" INFORMATION ABOUT MICRO_SIGN");

                    Console.WriteLine();
                    ShowHexaDecimal(OriginalUnicodeString);
                    Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(c));

                    // Canonical forms (C and D) keep the two characters distinct.
                    Console.Write("Form C Normalized: ");
                    ShowHexaDecimal(OriginalUnicodeString.Normalize(NormalizationForm.FormC));

                    Console.Write("Form D Normalized: ");
                    ShowHexaDecimal(OriginalUnicodeString.Normalize(NormalizationForm.FormD));

                    // Compatibility forms (KC and KD) map MICRO SIGN to GREEK SMALL LETTER MU.
                    Console.Write("Form KC Normalized: ");
                    ShowHexaDecimal(OriginalUnicodeString.Normalize(NormalizationForm.FormKC));

                    Console.Write("Form KD Normalized: ");
                    ShowHexaDecimal(OriginalUnicodeString.Normalize(NormalizationForm.FormKD));
                    Console.WriteLine("_______________________________________________________________");
                }
                Console.ReadLine();
            }

            // Prints each UTF-16 code unit of the string as four hexadecimal digits.
            private static void ShowHexaDecimal(string UnicodeString)
            {
                Console.Write("Hexa-Decimal Characters of " + UnicodeString + "  are ");
                foreach (char c in UnicodeString)
                {
                    Console.Write("{0:X4} ", (int)c);
                }
                Console.WriteLine();
            }
        }
    

    Output

    INFORMATION ABOUT MICRO_SIGN    
    Hexa-Decimal Characters of µ  are 00B5
    Unicode character category LowercaseLetter
    Form C Normalized: Hexa-Decimal Characters of µ  are 00B5
    Form D Normalized: Hexa-Decimal Characters of µ  are 00B5
    Form KC Normalized: Hexa-Decimal Characters of µ  are 03BC
    Form KD Normalized: Hexa-Decimal Characters of µ  are 03BC
     ________________________________________________________________
     INFORMATION ABOUT GREEK_SMALL_LETTER_MU    
    Hexa-Decimal Characters of µ  are 03BC
    Unicode character category LowercaseLetter
    Form C Normalized: Hexa-Decimal Characters of µ  are 03BC
    Form D Normalized: Hexa-Decimal Characters of µ  are 03BC
    Form KC Normalized: Hexa-Decimal Characters of µ  are 03BC
    Form KD Normalized: Hexa-Decimal Characters of µ  are 03BC
     ________________________________________________________________
    

    While reading about Unicode_equivalence, I found the following:

    The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
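
    The ligature case in the quote can be checked directly. This is only a minimal sketch: the class name LigatureSearchDemo and the helper ContainsAfterNormalizing are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text;

public static class LigatureSearchDemo
{
    // Hypothetical helper: true if `needle` occurs in `text` after `text`
    // has been normalized to `form`. String.Contains(string) compares ordinally.
    public static bool ContainsAfterNormalizing(string text, string needle, NormalizationForm form)
    {
        return text.Normalize(form).Contains(needle);
    }

    public static void Main()
    {
        string ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI

        // NFC keeps the ligature as a single code point, so "f" is not found.
        Console.WriteLine(ContainsAfterNormalizing(ligature, "f", NormalizationForm.FormC));  // False

        // NFKC expands the ligature to the letters "ffi", so "f" is found.
        Console.WriteLine(ContainsAfterNormalizing(ligature, "f", NormalizationForm.FormKC)); // True
    }
}
```

    Only the text being searched is normalized here; in a real search the needle would probably need the same normalization.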

    So, to compare look-alike characters for equivalence, we should normally use FormKC (NFKC normalization) or FormKD (NFKD normalization).
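
    As a rough sketch of that advice, a comparison helper could normalize both sides to FormKC and then compare ordinally. The names CompatibilityCompare and CompatibilityEquals are assumptions chosen for this example, not an existing .NET API.

```csharp
using System;
using System.Text;

public static class CompatibilityCompare
{
    // Hypothetical helper: true if both strings are identical after NFKC normalization.
    public static bool CompatibilityEquals(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        string microSign = "\u00B5"; // MICRO SIGN
        string greekMu = "\u03BC";   // GREEK SMALL LETTER MU

        Console.WriteLine(microSign == greekMu);                    // False: different code points
        Console.WriteLine(CompatibilityEquals(microSign, greekMu)); // True: both normalize to U+03BC
        Console.WriteLine(CompatibilityEquals("m", "M"));           // False: case is preserved
    }
}
```

    Unlike the ToUpper() approach, this keeps "m" and "M" distinct, so it does not quietly turn the comparison into a case-insensitive one.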
    I was curious to know more about how the rest of Unicode behaves, so I wrote a sample that iterates over every 16-bit UTF-16 char value, and I got some results I want to discuss:

    • Information about characters whose FormC and FormD normalized values were not equivalent
      Total: 12,118
      Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
    • Information about characters whose FormKC and FormKD normalized values were not equivalent
      Total: 12,245
      Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
    • Every character whose FormC and FormD normalized values were not equivalent also had non-equivalent FormKC and FormKD normalized values, except these characters
      Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞', 8159 '῟', 8173 '῭', 8174 '΅'
    • Additional characters whose FormKC and FormKD normalized values were not equivalent, but whose FormC and FormD normalized values were equivalent
      Total: 119
      Characters: 452 'DŽ' 453 'Dž' 454 'dž' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒' 12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚' 12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱' 12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
    • Some characters cannot be normalized; Normalize throws an ArgumentException for them
      Total: 2,081
      Character (int value): 55296-57343, 64976-65007, 65534
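
    The sample used for this survey is not shown above. The following is only a guess at how such a scan could look; NormalizationScan and Scan are hypothetical names, and the exact counts it prints will likely vary with the runtime and the Unicode version it ships with.

```csharp
using System;
using System.Text;

public static class NormalizationScan
{
    // Hypothetical reconstruction: for every 16-bit char value, counts how many
    // normalize differently under formA and formB, and how many make
    // Normalize throw ArgumentException (for example, lone surrogates).
    public static (int differing, int invalid) Scan(NormalizationForm formA, NormalizationForm formB)
    {
        int differing = 0;
        int invalid = 0;
        for (int code = 0; code <= char.MaxValue; code++)
        {
            string s = ((char)code).ToString();
            try
            {
                if (!string.Equals(s.Normalize(formA), s.Normalize(formB), StringComparison.Ordinal))
                    differing++;
            }
            catch (ArgumentException)
            {
                invalid++;
            }
        }
        return (differing, invalid);
    }

    public static void Main()
    {
        var (cVsD, invalid) = Scan(NormalizationForm.FormC, NormalizationForm.FormD);
        var (kcVsKd, _) = Scan(NormalizationForm.FormKC, NormalizationForm.FormKD);

        Console.WriteLine("FormC != FormD:   " + cVsD);
        Console.WriteLine("FormKC != FormKD: " + kcVsKd);
        Console.WriteLine("Not normalizable: " + invalid);
    }
}
```

    Because the loop walks single char values, it covers only the Basic Multilingual Plane; characters outside it are encoded as surrogate pairs and would need a loop over code points instead.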

    These links are helpful for understanding the rules that govern Unicode equivalence:

    1. Unicode_equivalence
    2. Unicode_compatibility_characters
