Why doesn't Đ get flattened to D when Removing Accents/Diacritics

你的背包 2021-01-04 01:37

I'm using this method to remove accents from my strings:

static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationFo         


        
5 answers
  •  佛祖请我去吃肉
    2021-01-04 02:14

    It doesn't work because the premise that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).

    "đ".Normalize(NormalizationForm.FormD) simply returns "đ", which is not stripped out by the loop because it is not a non-spacing mark.
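
To see the difference for yourself, here is a minimal sketch that prints what FormD produces for a given string. The class name DecompositionCheck and the method Describe are made up for illustration; only String.Normalize and CharUnicodeInfo.GetUnicodeCategory come from the framework.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (not from the question): lists each UTF-16 code unit
// of the FormD-normalized string together with its Unicode category.
static class DecompositionCheck
{
    public static string Describe(string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in normalized)
        {
            if (sb.Length > 0) sb.Append(", ");
            // Code point in hex, then the category name from the UnicodeCategory enum.
            sb.Append($"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)}");
        }
        return sb.ToString();
    }
}
```

Describe("\u00E9") (é) should report a base letter followed by a NonSpacingMark, which is what the stripping loop removes. Describe("\u0111") (đ) should report a single LowercaseLetter, so there is nothing for the loop to strip.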

    A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you want to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
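
If you only care about a handful of Latin letters, one possible workaround is to add an explicit fallback table on top of the usual "normalize, then drop non-spacing marks" loop. The sketch below is an assumption about what such a method could look like, not the asker's original code (which is cut off above). The names AccentStripper, Fallbacks and RemoveAccentsWithFallback are hypothetical, and the table is deliberately incomplete.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AccentStripper
{
    // Hypothetical, hand-picked table for letters that have no Unicode
    // decomposition mapping. Extend it with whatever your data needs.
    static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        ['\u0110'] = "D",  // Đ
        ['\u0111'] = "d",  // đ
        ['\u00D8'] = "O",  // Ø
        ['\u00F8'] = "o",  // ø
        ['\u0141'] = "L",  // Ł
        ['\u0142'] = "l",  // ł
        ['\u00DF'] = "ss", // ß
    };

    public static string RemoveAccentsWithFallback(string input)
    {
        // Step 1: split precomposed letters into base letter + combining marks.
        string normalized = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            // Step 2: drop the combining marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            // Step 3: substitute letters FormD leaves untouched, if we know them.
            if (Fallbacks.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this table, "Đà Nẵng" would come out as "Da Nang" and "Straße" as "Strasse". Anything outside the table (Cyrillic, Greek, Chinese, and so on) passes through unchanged, so a transliteration library is still the better choice for general input.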
