Why doesn't Đ get flattened to D when Removing Accents/Diacritics

半世苍凉 提交于 2019-11-30 17:36:07

The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).

"đ".Normalize(NormalizationForm.FormD) simply returns "đ", which is not stripped out by the loop because it is not a non-spacing mark.

A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)

I have to admit that I'm not sure why this works but it sure seems to

var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); 

=> "aoaaaaalccceeeeiiddnnooooruuuuyt"

"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.

this should work

    private static String RemoveDiacritics(string text)
    {
        String normalized = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < normalized.Length; i++)
        {
            Char c = normalized[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        return sb.ToString();
    }
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!