问题
I got a data.table base. I got a term column in this data.table
class(base$term)
[1] character
length(base$term)
[1] 27486
I'm able to remove accents from a string. I'm able to remove accents from a vector of string.
iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"
But for some reason, it does not work when I apply the very same function on my term column
base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"
Does anybody know what is going on here?
回答1:
Ok the way to solve the problem :
Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"
Thanks to @nicola
回答2:
It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore stringi is consistent across operating systems and inconv
is not.
library(stringi)
base <- data.table(terme = c("Millésime",
"boulangère",
"üéâäàåçêëèïîì"))
base[, terme := stri_trans_general(str = terme,
id = "Latin-ASCII")]
> base
terme
1: Millesime
2: boulangere
3: ueaaaaceeeiii
回答3:
You can apply this function
rm_accent <- function(str,pattern="all") {
if(!is.character(str))
str <- as.character(str)
pattern <- unique(pattern)
if(any(pattern=="Ç"))
pattern[pattern=="Ç"] <- "ç"
symbols <- c(
acute = "áéíóúÁÉÍÓÚýÝ",
grave = "àèìòùÀÈÌÒÙ",
circunflex = "âêîôûÂÊÎÔÛ",
tilde = "ãõÃÕñÑ",
umlaut = "äëïöüÄËÏÖÜÿ",
cedil = "çÇ"
)
nudeSymbols <- c(
acute = "aeiouAEIOUyY",
grave = "aeiouAEIOU",
circunflex = "aeiouAEIOU",
tilde = "aoAOnN",
umlaut = "aeiouAEIOUy",
cedil = "cC"
)
accentTypes <- c("´","`","^","~","¨","ç")
if(any(c("all","al","a","todos","t","to","tod","todo")%in%pattern)) # opcao retirar todos
return(chartr(paste(symbols, collapse=""), paste(nudeSymbols, collapse=""), str))
for(i in which(accentTypes%in%pattern))
str <- chartr(symbols[i],nudeSymbols[i], str)
return(str)
}
来源:https://stackoverflow.com/questions/39148759/remove-accents-from-a-dataframe-column-in-r