What does .NET's String.Normalize do?

被撕碎了的回忆 2020-12-01 07:01

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

4 Answers
  •  孤城傲影
    2020-12-01 07:51

    One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

    For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalising both strings to the same form first lets the comparison succeed.
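
    As a minimal sketch of that comparison (the class name NormalizationDemo and the helper EqualsNormalized are made up for illustration, they are not part of .NET), the following builds both representations of "à" and compares them before and after normalisation:

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // "à" as one precomposed code point: U+00E0 (decimal 224).
    public const string Precomposed = "\u00E0";

    // "à" as "a" (U+0061, decimal 97) followed by the combining grave accent (U+0300).
    public const string Decomposed = "a\u0300";

    // Hypothetical helper: ordinal comparison after bringing both inputs to the same form.
    public static bool EqualsNormalized(
        string left, string right, NormalizationForm form = NormalizationForm.FormC)
    {
        return string.Equals(left.Normalize(form), right.Normalize(form), StringComparison.Ordinal);
    }

    public static void Main()
    {
        // The raw strings differ char by char.
        Console.WriteLine(string.Equals(Precomposed, Decomposed, StringComparison.Ordinal)); // False

        // After normalisation they are identical.
        Console.WriteLine(EqualsNormalized(Precomposed, Decomposed)); // True

        // Form C composes to one char, form D decomposes to two.
        Console.WriteLine(Decomposed.Normalize(NormalizationForm.FormC).Length);  // 1
        Console.WriteLine(Precomposed.Normalize(NormalizationForm.FormD).Length); // 2
    }
}
```

    Either form works for the comparison as long as both sides use the same one; form C is simply the default of the parameterless String.Normalize().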

    A useful side effect is that normalisation makes it easy to write a "remove accents" method:

    using System.Globalization;
    using System.Linq;
    using System.Text;

    public static string RemoveAccents(string input)
    {
        // Form D splits each accented letter into a base letter plus combining marks.
        var decomposed = input.Normalize(NormalizationForm.FormD);

        // Drop the combining marks (category NonSpacingMark) and build a new string
        // from the remaining chars.
        return new string(decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
    }
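
    A short usage sketch of the same idea, wrapped in a hypothetical RemoveAccentsDemo class so that it compiles on its own. The final recomposition to form C is an addition for this sketch, not part of the method above. Note that letters without a canonical decomposition, such as "ø", pass through unchanged:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class RemoveAccentsDemo
{
    // Same approach as RemoveAccents above, plus a final recomposition to form C
    // (an assumption of this sketch) so any remaining sequences are composed again.
    public static string RemoveAccents(string input)
    {
        var kept = input
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return string.Concat(kept).Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // "crème brûlée", written with escapes to avoid source-encoding issues.
        Console.WriteLine(RemoveAccents("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee

        // "ø" (U+00F8) has no decomposition, so it is not turned into "o".
        Console.WriteLine(RemoveAccents("S\u00F8ren") == "S\u00F8ren"); // True
    }
}
```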
    
