Unicode normalization (form C) in R: how to convert all accented characters to their single-code-point form?

长发绾君心 2020-12-14 10:48

In Unicode, an accented letter can be represented in two ways: as the accented letter itself (a single character), or as the bare letter combined with the accent. For example, é (+U00E

1 answer
  • 2020-12-14 11:19

    OK, it turns out that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written; in particular, I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

    It provides the Unicode normalization functions I was looking for (form C here):

    > library(stringi)
    > stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
    [1] TRUE
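
    To see what the normalization actually changes, here is a minimal sketch that applies it to a whole character vector. It assumes stringi is installed; the variable names `x` and `x_nfc` are only illustrative, and `stri_length()` and `stri_trans_isnfc()` are used here purely as checks.

    ```r
    library(stringi)

    # Two spellings of "é": precomposed (U+00E9) and decomposed (U+0065 U+0301).
    x <- c("\u00e9", "\u0065\u0301")

    # Number of code points in each string: the decomposed form has two.
    stri_length(x)

    # Is each string already in normalization form C?
    stri_trans_isnfc(x)

    # Normalize the whole vector to form C (the function is vectorized).
    x_nfc <- stri_trans_nfc(x)

    # Both elements now consist of the same single code point.
    stri_length(x_nfc)
    x_nfc[1] == x_nfc[2]
    ```

    Normalizing once, right after reading the data, should make later uses of `==`, `match()` or `merge()` behave consistently on such strings.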
    

    It also contains a smart comparison function that takes these normalization issues into account, which lessens the pain of having to think about them:

    > stri_compare('\u00e9', '\u0065\u0301')
    [1] 0
    # 0 means the two strings are equal;
    # otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical order.
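
    As a rough illustration of the difference between comparing code points and comparing with normalization in mind, the sketch below puts `stri_compare()` next to two other stringi comparison functions, `stri_cmp_eq()` and `stri_cmp_equiv()`. It assumes stringi is installed; the variable names are hypothetical.

    ```r
    library(stringi)

    precomposed <- "\u00e9"        # é as one code point
    decomposed  <- "\u0065\u0301"  # e followed by a combining acute accent

    # Code-point equality: the two strings are made of different code points.
    same_code_points <- stri_cmp_eq(precomposed, decomposed)

    # Canonical equivalence: the two strings represent the same text.
    canonically_equal <- stri_cmp_equiv(precomposed, decomposed)

    # Collation-based comparison, as in the call above: 0 means equal.
    collation_result <- stri_compare(precomposed, decomposed)

    print(c(same_code_points, canonically_equal))
    print(collation_result)
    ```

    If only a yes/no answer is needed, `stri_cmp_equiv()` may be more convenient than testing whether `stri_compare()` returned 0.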
    

    Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!
