Unicode normalization (form C) in R: convert all accented characters into their single-code-point form?

后端未结


In Unicode, letters with accents can be represented in two ways: as the accented letter itself, or as the combination of the bare letter plus the accent. For example, é (+U00E

1 Answer

无人共我

2020-12-14 11:19
OK, it turns out that a package has been developed to enhance and simplify string manipulation in R (finally!). It is called stringi and looks very promising. Its documentation is very well written; in particular, I find the pages on encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, which is what I was looking for (here, form C):
```
> library(stringi)
> # precomposed é versus e followed by a combining acute accent
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
```
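To make the effect of form C more visible, here is a minimal sketch that normalizes a small character vector and inspects it before and after. The sample strings and variable names are only illustrative, and it assumes stringi is installed; `stri_length()` counts code points and `stri_trans_isnfc()` checks whether a string is already in NFC.

```r
library(stringi)  # assumed to be installed, e.g. install.packages("stringi")

# Two canonically equivalent spellings of "café" (illustrative data):
# precomposed é (U+00E9) versus e (U+0065) + combining acute accent (U+0301).
composed   <- "caf\u00e9"
decomposed <- "cafe\u0301"

# Before normalization the decomposed spelling has one more code point.
len_before <- stri_length(c(composed, decomposed))        # 4 and 5

# stri_trans_isnfc() reports which strings are already in form C.
already_nfc <- stri_trans_isnfc(c(composed, decomposed))  # TRUE and FALSE

# After normalization both strings are made of the same code points.
normalized <- stri_trans_nfc(c(composed, decomposed))
len_after  <- stri_length(normalized)                     # 4 and 4
same_after <- normalized[1] == normalized[2]              # TRUE

print(len_before)
print(already_nfc)
print(len_after)
print(same_after)
```

Since `stri_trans_nfc()` is vectorized, the same call should work on a whole column of text, for example before using the strings as keys in a merge.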
It also provides a smart comparison function that takes these normalization issues into account, which lessens the pain of having to think about them:
```
> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# 0 means the two strings are considered equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical (collation) order.
```
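As a rough illustration of the difference between a plain code-point comparison and a normalization-aware one, the sketch below uses `stri_cmp_eq()` and `stri_cmp_equiv()`, which I believe are the stringi counterparts for exact and canonical-equivalence tests. The variable names are made up for the example, and the exact results may depend on the stringi and ICU versions installed.

```r
library(stringi)  # assumed to be installed

precomposed <- "\u00e9"   # é as a single code point
combining   <- "e\u0301"  # e followed by a combining acute accent

# Exact test: are the two strings made of the very same code points?
exact <- stri_cmp_eq(precomposed, combining)

# Canonical equivalence: do they represent the same text?
equivalent <- stri_cmp_equiv(precomposed, combining)

# stri_compare() is vectorized and gives the ordering as -1, 0 or 1.
ordering <- stri_compare(c("a", "b", "a"), c("b", "a", "a"))

print(exact)       # expected FALSE: the code points differ
print(equivalent)  # expected TRUE: the strings are canonically equivalent
print(ordering)    # -1  1  0
```

If only a yes/no answer is needed, `stri_cmp_equiv()` reads a little more directly than testing `stri_compare(...) == 0`.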
Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!