Decapitalize UTF-8 special characters in R

大城市里の小女人 posted on 2021-02-17 04:10:06

Question


After I scraped a list of names, I have the following name in R:

DAPHN\303\211 DE MEULEMEESTER

If I use the function tolower, all the letters are set to lowercase, but not the special characters. What is the best way to achieve this?


Answer 1:


The reason is that your locale is C. Non-ASCII special characters and their letter-case classifications are not recognized under that locale. You should be able to get it to work by switching to a UTF-8 locale:

Sys.setlocale(locale='C');
## [1] "C/C/C/C/C/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphn\303\211 de meulemeester"
Sys.setlocale(locale='en_CA.UTF-8');
## [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphné de meulemeester"

en_CA.UTF-8 makes sense for me because I'm in Canada, but if you're in the United States (for example) you'll probably want en_US.UTF-8. Locale names have the form language_COUNTRY.encoding, so for another English-speaking country you should be able to replace the CA/US with your two-letter country code; for another language, replace the en as well (for example fr_FR.UTF-8).
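
If you would rather not change the locale for the whole session, one option is to switch only LC_CTYPE for the duration of a call and restore it afterwards. The sketch below is a minimal illustration of that idea: with_ctype is a hypothetical helper name, not part of base R, and which locale names are available depends on your system.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with a temporary
# LC_CTYPE setting, then restore the previous one.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() warns and returns "" when the locale cannot be set.
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(res, "")) stop("locale not available: ", locale)
  force(expr)  # `expr` is evaluated lazily, so it runs under the new locale
}

# Show the character-type locale currently in use.
print(Sys.getlocale("LC_CTYPE"))

# "C" exists on every system, so it is safe for a demonstration.
print(with_ctype("C", tolower("ABC")))
```

To lowercase the name from the question you would pass a UTF-8 locale instead, for example with_ctype("en_US.UTF-8", tolower(her_name)), assuming that locale is installed on your machine.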




Answer 2:


Without changing your system locale, you can do locale-aware text transformation using the stringi package:

library(stringi)
her_name <- "DAPHN\303\211 DE MEULEMEESTER"
stri_trans_tolower(her_name, locale="en_CA")
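
For reference, here is a small self-contained variant of the snippet above that also checks the result. It assumes the stringi package is installed; the only change is that the name is written with a \u escape, so that R marks the string as UTF-8 regardless of the session locale.

```r
library(stringi)

# \u00c9 is the same character as the bytes \303\211 in the question;
# writing it as a \u escape makes R mark the string as UTF-8.
her_name <- "DAPHN\u00c9 DE MEULEMEESTER"

lowered <- stri_trans_tolower(her_name, locale = "en_CA")
print(lowered)

# Compare against the expected result, also written with an escape.
print(identical(lowered, "daphn\u00e9 de meulemeester"))
```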



Answer 3:


My question was redirected here because it is similar to this one. You can also solve the problem by first replacing the problematic character with one that tolower() handles correctly.

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-tolower(x)
x
[1] "sn. İletİşİm bİlgİlerİnİz guncellenmistir."

The output may not look the same on every computer, so the original post also showed it, and the expected output, as pictures.

When I tried @drammock's suggestion, I saw this:

library(stringi)
x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-stri_trans_tolower(x, locale="tr_TR")
x
[1] "sn. iletişim bilgileriniz guncellenmıstır."

Again, the original post showed the output of @drammock's suggestion as a picture, with the unexpected parts highlighted in yellow: the dotless ı characters in "guncellenmıstır", where the expected output has a dotted i.

In the end, I looked up the Unicode code point of the character that tolower() could not convert (U+0130, İ) and replaced it with a character that tolower() handles correctly (I). Then I applied tolower() and got the expected output. Thank you to everyone.

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-gsub("\u0130","I",x,useBytes = FALSE)
x<-tolower(x)
x
[1] "sn. iletişim bilgileriniz guncellenmistir."
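
The same replace-then-lowercase idea can also be sketched with stringi, so that the result does not depend on the session locale. This is only an illustration under a few assumptions: the stringi package is installed, the input is written with \u escapes (U+0130 is İ, U+015E is Ş) so that it is marked as UTF-8, and a non-Turkish locale ("en_US" here) is the right choice because the text uses a plain ASCII I where a dotted i is wanted.

```r
library(stringi)

# Same sentence as above, with the non-ASCII letters written as escapes.
x <- "Sn. \u0130LET\u0130\u015e\u0130M B\u0130LG\u0130LER\u0130N\u0130Z GUNCELLENMISTIR."

# Step 1: replace every dotted capital I (U+0130) with a plain ASCII "I".
x_plain <- stri_replace_all_fixed(x, "\u0130", "I")

# Step 2: lowercase with non-Turkish rules, so that "I" becomes "i".
y <- stri_trans_tolower(x_plain, locale = "en_US")
print(y)
```

With locale="tr_TR" in the second step, every I would become a dotless ı instead, which is correct Turkish casing but not what this input needs.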



Source: https://stackoverflow.com/questions/29692198/decapitalize-utf-8-special-characters-in-r
