replace words in R data.frames (Text Mining)

拟墨画扇 提交于 2019-12-03 21:40:30

gsub is vectorised, so you don't need the loop.

a[[1]] <- gsub( " kd | kdin " , " kunde " , a[[1]])

is quicker.


Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.

Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to search, because you'll avoid the text manipulation except for the first time.

a1 <- c("lorem ipsum li ld ee wö wo di dd","la kdin di da dogs chicken","kd good i need some help")
x <- strsplit(a1, " ",fixed=TRUE) # fixed option avoids regexes which will  be slower

replfxn <- function(vec,word.in,word.out) {
  vec[vec %in% word.in] <- word.out
  vec
}

word.in <- "kdin"
word.out <- "kunde"

replfxn(x[[2]],word.in,word.out)

lapply(x,replfxn,word.in=word.in,word.out=word.out)
[[1]]
[1] "lorem" "ipsum" "li"    "ld"    "ee"    "wö"    "wo"    "di"    "dd"   

[[2]]
[1] "la"      "kunde"   "di"      "da"      "dogs"    "chicken"

[[3]]
[1] "kd"   "good" "i"    "need" "some" "help"

For a large number of words to search over, I'd guess this is faster than regexes. It's also more amenable to data-code separation, since it lends itself to writing a merge or similar function to read in the dictionary from a file rather than embedding it in code.

If you really need it back in the original format (as a space-separated character vector), you can apply a paste to the result.

And here are timing results. I stand corrected: looks like gsub is faster!

library(microbenchmark)
microbenchmark(
  gsub( word.in , word.out , a1) ,
  lapply(x,replfxn,word.in=word.in,word.out=word.out) ,
  times = 1000
  )

                                                        expr    min     lq
1                                gsub(word.in, word.out, a1)  42772  44484
2 lapply(x, replfxn, word.in = word.in, word.out = word.out) 102653 106075
  median       uq    max
1  47905  48761.0 691193
2 109496 111635.5 970065
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!