Question
After I scraped a list of names, I have the following name in R:
DAPHN\303\211 DE MEULEMEESTER
If I use the function tolower, the ASCII letters are converted to lowercase, but the special character É (stored above as the UTF-8 bytes \303\211) is left unchanged. What is the best way to lowercase the special characters as well?
Answer 1:
The reason is that your locale is C. Non-ASCII special characters and their letter-case classifications are not recognized under that locale. You should be able to get it to work by switching to a UTF-8 locale:
# Under the C locale, the non-ASCII character is left unchanged
Sys.setlocale(locale = 'C')
## [1] "C/C/C/C/C/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER')
## [1] "daphn\303\211 de meulemeester"
# Under a UTF-8 locale, the character is lowercased as well
Sys.setlocale(locale = 'en_CA.UTF-8')
## [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER')
## [1] "daphné de meulemeester"
en_CA.UTF-8 makes sense for me because I'm in Canada, but if you're in the United States (for example) you'll probably want en_US.UTF-8. Locale names follow the pattern language_COUNTRY.encoding, so you can usually substitute your own two-letter language and country codes (for example fr_FR.UTF-8), provided that locale is installed on your system. On Unix-like systems, running locale -a in a shell lists the installed locales.
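If changing every locale category for the whole session feels heavy-handed, one option is to switch only LC_CTYPE (the category that governs character classification and case conversion) for the duration of the call and then restore it. The following is a minimal sketch under stated assumptions: lower_in_locale is a hypothetical helper name, not a base R function, and en_US.UTF-8 is assumed to be installed. Locale names are platform-dependent, so the result for non-ASCII characters may still vary between systems.

```r
# Hypothetical helper (not part of base R): lower-case `x` while LC_CTYPE
# is temporarily set to `ctype`, then restore the previous setting.
lower_in_locale <- function(x, ctype = "en_US.UTF-8") {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original LC_CTYPE when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (and warns) when the request cannot be honored.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", ctype))
  if (identical(new, "")) {
    warning("locale '", ctype, "' is not available; using '", old, "'")
  }
  tolower(x)
}

# "\u00c9" is É, written as an escape so the example does not depend on
# the encoding of the source file.
lower_in_locale("DAPHN\u00c9 DE MEULEMEESTER")
```

Because the previous setting is restored in on.exit, the rest of the session keeps whatever locale it had before the call.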
Answer 2:
Without changing your system locale, you can do locale-aware text transformation using the stringi package:
library(stringi)
her_name <- "DAPHN\303\211 DE MEULEMEESTER"
stri_trans_tolower(her_name, locale="en_CA")
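One detail that may matter here: a literal written with octal byte escapes carries no encoding mark, so how it is interpreted can depend on the session's native encoding. A hedged sketch, assuming the bytes really are UTF-8 (as they are for É), is to declare the encoding explicitly before calling stringi. The variable name lowered is introduced only for illustration.

```r
library(stringi)

her_name <- "DAPHN\303\211 DE MEULEMEESTER"
# The octal escapes are raw UTF-8 bytes; declare that explicitly so the
# result does not depend on the session's native encoding.
Encoding(her_name) <- "UTF-8"

lowered <- stri_trans_tolower(her_name, locale = "en_CA")
# Compare against the expected string, written with a \u escape for é.
identical(lowered, "daphn\u00e9 de meulemeester")
```

Setting the encoding mark does not change the bytes; it only tells R (and stringi) how to read them.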
Answer 3:
My question was redirected here because it is similar to this one. You can also solve the problem by replacing the problematic character with one that tolower() handles correctly.
x <- "Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR."
x <- tolower(x)
x
[1] "sn. İletİşİm bİlgİlerİnİz guncellenmistir."
The output above may not look the same on every computer (the original post also showed it as a picture).
The expected output is "sn. iletişim bilgileriniz guncellenmistir."
When I tried @drammock's suggestion, I got this:
library(stringi)
x <- "Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR."
x <- stri_trans_tolower(x, locale = "tr_TR")
x
[1] "sn. iletişim bilgileriniz guncellenmıstır."
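The dotless ı characters in "guncellenmıstır" are not the expected output (they were highlighted in yellow in a picture in the original post). They appear because the Turkish locale lowercases a plain "I" to "ı" rather than "i".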
In the end, I looked up the Unicode code point of the character that tolower() could not convert (U+0130, İ) and replaced it with a plain "I", which tolower() handles correctly. Calling tolower() afterwards gave the expected output. Thank you to everyone.
x <- "Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR."
x <- gsub("\u0130", "I", x, useBytes = FALSE)
x <- tolower(x)
x
[1] "sn. iletişim bilgileriniz guncellenmistir."
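Since base tolower() depends on the session locale, the same two-step idea can also be sketched with stringi so that the result should not vary between machines. This is an illustrative variant that combines the replacement trick with the stringi function from Answer 2, not the method used in the answer above. It assumes every "I" in the input should become "i", which fits text that mixes Turkish İ with ASCII "I" as in this example.

```r
library(stringi)

# "Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.", written with \u escapes
# (\u0130 is İ, \u015e is Ş) so the example is independent of file encoding.
x <- "Sn. \u0130LET\u0130\u015e\u0130M B\u0130LG\u0130LER\u0130N\u0130Z GUNCELLENMISTIR."

# Step 1: map the dotted capital İ (U+0130) to a plain ASCII "I".
x <- stri_replace_all_fixed(x, "\u0130", "I")
# Step 2: lower-case with a non-Turkish locale, so "I" becomes "i" and not "ı".
x <- stri_trans_tolower(x, locale = "en")
x
```

The trade-off is that a genuine Turkish capital I (which should lowercase to ı) is also turned into "i", so this only suits input where that distinction is not needed.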
Source: https://stackoverflow.com/questions/29692198/decapitalize-utf-8-special-characters-in-r