Remove accents from a dataframe column in R

旧城冷巷雨未停 提交于 2019-12-06 23:17:08

问题


I got a data.table base. I got a term column in this data.table

class(base$term)
[1] character
length(base$term)
[1] 27486

I'm able to remove accents from a string. I'm able to remove accents from a vector of string.

iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"

But for some reason, it does not work when I apply the very same function on my term column

base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"

Does anybody know what is going on here?


回答1:


Ok the way to solve the problem :

Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"

Thanks to @nicola




回答2:


It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore stringi is consistent across operating systems and inconv is not.

library(stringi)

base <- data.table(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base[, terme := stri_trans_general(str = terme, 
                                   id = "Latin-ASCII")]

> base
           terme
1:     Millesime
2:    boulangere
3: ueaaaaceeeiii



回答3:


You can apply this function

    rm_accent <- function(str,pattern="all") {
   if(!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  if(any(pattern=="Ç"))
    pattern[pattern=="Ç"] <- "ç"

  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  accentTypes <- c("´","`","^","~","¨","ç")

  if(any(c("all","al","a","todos","t","to","tod","todo")%in%pattern)) # opcao retirar todos
    return(chartr(paste(symbols, collapse=""), paste(nudeSymbols, collapse=""), str))

  for(i in which(accentTypes%in%pattern))
    str <- chartr(symbols[i],nudeSymbols[i], str) 

  return(str)
}


来源:https://stackoverflow.com/questions/39148759/remove-accents-from-a-dataframe-column-in-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!