How to change diacritic characters to non-diacritic ones [duplicate]

Submitted by 半世苍凉 on 2019-11-27 01:10:01
CesarB

Copying from my own answer to another question:

Instead of creating your own table, you could instead convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
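
To make the difference between the two forms concrete, here is a minimal sketch. The class and method names (`NormalizationDemo`, `Hex`, `Nfd`, `Nfkd`) are made up for illustration; only `string.Normalize` and `NormalizationForm` come from .NET.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Formats each UTF-16 code unit as "U+XXXX" so the decomposition is visible.
    public static string Hex(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    // Canonical decomposition (NFD).
    public static string Nfd(string s) =>
        Hex(s.Normalize(NormalizationForm.FormD));

    // Compatibility decomposition (NFKD).
    public static string Nfkd(string s) =>
        Hex(s.Normalize(NormalizationForm.FormKD));
}
```

`NormalizationDemo.Nfd("\u00E1")` ("á") gives `U+0061 U+0301` under either form. The ligature "ﬁ" (`"\uFB01"`) is left alone by `Nfd`, while `Nfkd` turns it into `U+0066 U+0069` ("fi"), which is the kind of extra case NFKD catches.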

dan

Since no one has ever bothered to post the code to do this, here it is:

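    using System.Text;
    using System.Text.RegularExpressions;

    // \p{Mn} (Unicode category "Mark, Nonspacing"):
    //   a character intended to be combined with another
    //   character without taking up extra space
    //   (e.g. accents, umlauts, etc.).
    private static readonly Regex nonSpacingMarkRegex =
        new Regex(@"\p{Mn}", RegexOptions.Compiled);

    public static string RemoveDiacritics(string text)
    {
        if (text == null)
            return string.Empty;

        // Form D splits each accented character into base character + marks.
        var normalizedText =
            text.Normalize(NormalizationForm.FormD);

        // Drop the marks, then recompose so that characters which form D
        // also split (e.g. Hangul syllables) are put back together.
        return nonSpacingMarkRegex.Replace(normalizedText, string.Empty)
            .Normalize(NormalizationForm.FormC);
    }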

Note: a common reason for doing this is integrating with a third-party system that accepts only ASCII while your data is in Unicode. Your options are basically to drop the accented characters entirely, or to strip only the accents and keep the base letters, which preserves as much of the original input as possible. This is obviously not a perfect solution, but it is far better than simply removing every character above ASCII 127.

It might also be worthwhile to step back and consider why you want to do this. If you are trying to remove character differences you consider insignificant, you should look at the Unicode collation algorithm. This is the standard way to disregard differences such as case or diacritics when comparing strings for searching or sorting.
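
As a sketch of what comparing without removing anything can look like in .NET: the helper below is hypothetical (`AccentInsensitive.EqualsIgnoringAccents` is not a framework method), and it relies on the platform's culture-aware comparison via `CompareInfo.Compare` rather than on a direct Unicode collation algorithm API, so results may vary slightly with the runtime's globalization data.

```csharp
using System;
using System.Globalization;

public static class AccentInsensitive
{
    // Compares two strings while ignoring case and non-spacing marks (diacritics).
    // Defaults to the invariant culture; pass a CultureInfo when the rules of a
    // specific language matter.
    public static bool EqualsIgnoringAccents(string a, string b, CultureInfo culture = null)
    {
        CompareInfo compareInfo = (culture ?? CultureInfo.InvariantCulture).CompareInfo;
        return compareInfo.Compare(
            a, b, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0;
    }
}
```

With this, `EqualsIgnoringAccents("résumé", "RESUME")` should report a match while the stored text keeps its accents. The same `CompareOptions` flags can be passed to `CompareInfo.IndexOf` for searching.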

If you plan to display the modified text, consider your audience. What you can safely filter away is locale-sensitive. In US English, "Igloo" = "igloo" and "resume" = "résumé", but in Turkish the lowercase of I is ı (dotless), and in French cote means quote, côté means side, and côte means coast. So the collation language determines which differences are significant.

If removing diacritics is the right solution for your application, it is safest to produce your own table to which you explicitly add the characters you want to convert.
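
A minimal sketch of such a table follows. The entries and the names `DiacriticTable` and `Apply` are illustrative assumptions, not a recommended mapping; the point is that only characters you list are ever changed.

```csharp
using System.Collections.Generic;
using System.Text;

public static class DiacriticTable
{
    // Hypothetical, deliberately small table: add only the characters
    // you have explicitly decided to convert.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['\u00E1'] = "a",  // á
        ['\u00E9'] = "e",  // é
        ['\u00F4'] = "o",  // ô
        ['\u00FC'] = "u",  // ü
        ['\u00F1'] = "n",  // ñ
        ['\u00F8'] = "o",  // ø (has no Unicode decomposition)
        ['\u00DF'] = "ss", // ß (one character becomes two)
    };

    public static string Apply(string text)
    {
        // Compose first so that "e" + combining accent is looked up as "é".
        string composed = text.Normalize(NormalizationForm.FormC);
        var sb = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            // Characters that are not in the table pass through unchanged.
            sb.Append(Map.TryGetValue(c, out var replacement) ? replacement : c.ToString());
        }
        return sb.ToString();
    }
}
```

A table also covers cases that decomposition cannot, such as ø and ß, at the cost of having to maintain the list yourself.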

A general, automated approach could be devised using Unicode decomposition. With this, you can decompose a character with diacritics into the base character and the "combining" characters (the diacritic marks) attached to it. Filter out anything that is a combining character, and you should be left with the "non-diacritic" characters.

The lack of discrimination in the automated method, however, could have some unexpected effects. I'd recommend a lot of testing on a representative body of text.
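
To see both the approach and its blind spots, here is a small sketch. `DecompositionFilter.StripMarks` is a made-up name, and the filter drops only category Mn (non-spacing marks), which is an assumption about what counts as a diacritic.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class DecompositionFilter
{
    // Decomposes to form D and drops every non-spacing mark (Unicode category Mn).
    public static string StripMarks(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var kept = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray());
    }
}
```

`StripMarks("côté")` and `StripMarks("côte")` both return "cote", so distinct words collapse into one. In the other direction, `StripMarks("Łódź")` returns "Łodz": Ł, like ø and ß, has no decomposition and passes through untouched. These are the kinds of effects worth checking against representative text.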

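For a simple example, to remove diacritics from a string, decompose it and then drop the combining marks. Normalizing to form D on its own only separates the marks from their base characters; it does not remove them:

    // Requires: using System.Text; using System.Text.RegularExpressions;
    string newString = Regex.Replace(myDiacriticsString.Normalize(NormalizationForm.FormD), @"\p{Mn}", string.Empty);
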
happytrails

My site inputs data from external sources which have many strange characters. I wrote the following C# function to replace accented characters and strip out non-US keyboard characters using Regex:

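    using System.Text;
    using System.Text.RegularExpressions;

    internal static string SanitizeString(string source)
    {
        // Decompose accents, then drop every run of characters that is not on a
        // US keyboard. The hyphen is escaped so ")-_" is not read as a range.
        return Regex.Replace(
            source.Normalize(NormalizationForm.FormD),
            @"[^A-Za-z 0-9\.,\?'""!@#\$%\^&\*\(\)\-_=\+;:<>\/\\\|\}\{\[\]`~]+",
            string.Empty).Trim();
    }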

Hope it helps.
