问题
I'm using this method to remove accents from my strings:
static string RemoveAccents(string input)
{
string normalized = input.Normalize(NormalizationForm.FormKD);
StringBuilder builder = new StringBuilder();
foreach (char c in normalized)
{
if (char.GetUnicodeCategory(c) !=
UnicodeCategory.NonSpacingMark)
{
builder.Append(c);
}
}
return builder.ToString();
}
but this method leaves đ as đ and doesn't change it to d, even though d is its base char. you can try it with this input string "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ"
What's so special in letter đ?
回答1:
The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).
"đ".Normalize(NormalizationForm.FormD)
simply returns "đ"
, which is not stripped out by the loop because it is not a non-spacing mark.
A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
回答2:
I have to admit that I'm not sure why this works but it sure seems to
var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
=> "aoaaaaalccceeeeiiddnnooooruuuuyt"
回答3:
"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.
回答4:
this should work
private static String RemoveDiacritics(string text)
{
String normalized = text.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < normalized.Length; i++)
{
Char c = normalized[i];
if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
sb.Append(c);
}
return sb.ToString();
}
来源:https://stackoverflow.com/questions/2362810/why-doesnt-%c4%90-get-flattened-to-d-when-removing-accents-diacritics