How to convert UTF-8 combining character sequences into single UTF-8 characters in Ruby?

野趣味 2021-01-01 18:25

Some characters such as the Unicode Character 'LATIN SMALL LETTER C WITH CARON' can be encoded as 0xC4 0x8D, but can also be represented with the two code poi

3 answers
  •  暖寄归人
    2021-01-01 19:05

    Generally, you use Unicode Normalization to do this.

    UnicodeUtils.nfkc from the unicode_utils gem (https://github.com/lang/unicode_utils) should give you the specific behavior you're asking for. Unicode Normalization Form KC (NFKC) applies a compatibility decomposition and then recomposes the string into precomposed characters where they exist, which is basically what your example describes. You may also get close to what you want with Normalization Form C (NFC), which uses a canonical decomposition instead.
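As a minimal sketch of the composition step, the example below uses String#unicode_normalize, which is built into Ruby 2.2 and later. This is an assumption on my part, not the method named in the answer above: the answer uses UnicodeUtils.nfkc from the unicode_utils gem, which targets Ruby 1.9. The variable names are illustrative only.

```ruby
# "c" followed by U+030C COMBINING CARON: two code points for one visible character.
decomposed = "c\u030C"

# NFC recomposes the pair into the precomposed U+010D
# (LATIN SMALL LETTER C WITH CARON).
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.codepoints.length                               # 2
puts composed.codepoints.length                                 # 1
puts composed.bytes.map { |b| format("0x%02X", b) }.join(" ")   # 0xC4 0x8D
puts composed == "\u010D"                                       # true
```

Passing :nfkc instead of :nfc gives the same result for this particular string, since no compatibility characters are involved.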

    The question "How to replace the Unicode gem on Ruby 1.9?" has additional details.

    In Ruby 1.8.7, you'd need to run gem install unicode (the Unicode gem), which provides a similar function.

    Edited to add: the main reason you'll probably want Normalization Form KC rather than just Normalization Form C is that ligatures (characters that are joined together for historical or typographical reasons) are first decomposed into their individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching.
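The difference between the two forms can be seen on a ligature. This is a hedged sketch that again assumes Ruby 2.2 or later and its built-in String#unicode_normalize rather than the gems discussed above. The sample character U+FB01 (LATIN SMALL LIGATURE FI) is my own choice for illustration.

```ruby
# U+FB01 LATIN SMALL LIGATURE FI is a single code point.
ligature = "\uFB01"

# NFC uses only canonical mappings, so the ligature is left untouched.
nfc = ligature.unicode_normalize(:nfc)

# NFKC also applies compatibility mappings, splitting it into "f" + "i".
nfkc = ligature.unicode_normalize(:nfkc)

puts nfc == ligature   # true
puts nfkc              # fi
puts nfkc.length       # 2
```

With NFKC, a search for "fi" would match text that was originally typed with the ligature. With NFC it would not.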
