Replace words in R data frames (text mining)

Question

I'm working on a text mining solution with SQL and R.

First I import data into R from my SQL query, and then I do some data mining on it.

Here is what I have:

rawData = sqlQuery(dwhConnect,sqlString) 
a = data.frame(rawData$ENNOTE_NEU)

If I run

a[[1]][1:3]

I see the structure:

[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help

Now I want to do some data cleaning with my own dictionary. For example, I want to replace li with lorem ipsum, and both kd and kdin with kunde.

My problem is how to do that for the whole data frame.

 for(i in 1:(nrow(a)))
    {
        a[[1]][i]=gsub( " kd | kdin " , " kunde " ,a[[1]][i])
        a[[1]][i]=gsub( " li " , " lorem ipsum " ,a[[1]][i])
...
    }

This works, but it is slow for a lot of data.

Is there a better way to do that?

Cheers, The Captain

Answer 1:

gsub is vectorised, so you don't need the loop.

a[[1]] <- gsub( " kd | kdin " , " kunde " , a[[1]])

is quicker.
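If there are several dictionary entries, one option is to loop over the entries rather than over the rows, so each gsub call still processes the whole column at once. The following is only a sketch: the data frame, its column name `note`, and the `patterns` and `replacements` vectors are made up for illustration, with the patterns copied from the question.

```r
# Hypothetical stand-in for the column read from SQL
a <- data.frame(note = c("lorem ipsum li ld ee wo di dd",
                         "la kdin di da dogs chicken",
                         "x kd good i need some help"),
                stringsAsFactors = FALSE)

# One pattern per dictionary entry, with its replacement at the same index
patterns     <- c(" kd | kdin ", " li ")
replacements <- c(" kunde ", " lorem ipsum ")

# Loop over the (few) dictionary entries, not over the (many) rows
for (i in seq_along(patterns)) {
  a[[1]] <- gsub(patterns[i], replacements[i], a[[1]])
}

a[[1]]
```

The loop now runs once per dictionary entry, which is usually far fewer iterations than once per row.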

Also, are you sure you want spaces inside your regexes? With them, you won't match words at the start or end of a line.
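One way around the start-of-line and end-of-line problem is to use the word-boundary anchor `\\b` instead of literal spaces. This is a sketch on the sample strings from the question, not something tested on the real data; how `\\b` treats non-ASCII characters such as ö can depend on the locale and regex engine, so it is worth checking on your own text.

```r
a1 <- c("lorem ipsum li ld ee wö wo di dd",
        "la kdin di da dogs chicken",
        "kd good i need some help")

# \\b matches at a word boundary, so "kd" at the very start of a string
# is matched as well, and the surrounding spaces are left untouched
cleaned <- gsub("\\b(kd|kdin)\\b", "kunde", a1)
cleaned <- gsub("\\bli\\b", "lorem ipsum", cleaned)

cleaned
```

With the space-delimited pattern from the question, the leading kd in the third string would be left unchanged.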

Answer 2:

Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to look up, because the text only has to be split into words once; after that, each replacement is a plain vector comparison.

a1 <- c("lorem ipsum li ld ee wö wo di dd","la kdin di da dogs chicken","kd good i need some help")
x <- strsplit(a1, " ", fixed = TRUE) # fixed = TRUE avoids regexes, which would be slower

replfxn <- function(vec,word.in,word.out) {
  vec[vec %in% word.in] <- word.out
  vec
}

word.in <- "kdin"
word.out <- "kunde"

replfxn(x[[2]],word.in,word.out)

lapply(x,replfxn,word.in=word.in,word.out=word.out)
[[1]]
[1] "lorem" "ipsum" "li"    "ld"    "ee"    "wö"    "wo"    "di"    "dd"   

[[2]]
[1] "la"      "kunde"   "di"      "da"      "dogs"    "chicken"

[[3]]
[1] "kd"   "good" "i"    "need" "some" "help"

For a large number of words to search over, I'd guess this is faster than regexes. It's also more amenable to data-code separation, since it lends itself to writing a merge or similar function to read in the dictionary from a file rather than embedding it in code.
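As a sketch of that data-code separation, the dictionary can be held in a named character vector and looked up by name. The names `dict` and `replace_from_dict` are made up here; in practice the vector could be built from a file (for example a two-column table of words and replacements) instead of being written inline.

```r
# Hypothetical dictionary: names are the words to find, values the replacements
dict <- c(li = "lorem ipsum", kd = "kunde", kdin = "kunde")

# Replace every token that appears in the dictionary, leave the rest alone
replace_from_dict <- function(vec, dict) {
  hit <- vec %in% names(dict)
  vec[hit] <- unname(dict[vec[hit]])
  vec
}

a1 <- c("la kdin di da dogs chicken", "kd good i need some help")
tokens <- strsplit(a1, " ", fixed = TRUE)

cleaned <- lapply(tokens, replace_from_dict, dict = dict)
cleaned
```

All dictionary entries are handled in a single pass over each token vector, so adding words to the dictionary does not add more passes over the text.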

If you really need it back in the original format (as a space-separated character vector), you can apply a paste to the result.
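A minimal sketch of that last step, reusing replfxn from above on two of the sample strings; the name `rejoined` is only illustrative.

```r
replfxn <- function(vec, word.in, word.out) {
  vec[vec %in% word.in] <- word.out
  vec
}

a1 <- c("la kdin di da dogs chicken", "kd good i need some help")
x <- strsplit(a1, " ", fixed = TRUE)

# word.in may hold several words, because %in% checks against the whole set
replaced <- lapply(x, replfxn, word.in = c("kd", "kdin"), word.out = "kunde")

# Collapse each token vector back into one space-separated string
rejoined <- vapply(replaced, paste, character(1), collapse = " ")
rejoined
```

The result is a character vector of the same length as a1, so it can be assigned straight back into the data frame column.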

And here are the timing results. I stand corrected: it looks like gsub is faster, at least for a single search word on this small sample.

library(microbenchmark)
microbenchmark(
  gsub( word.in , word.out , a1) ,
  lapply(x,replfxn,word.in=word.in,word.out=word.out) ,
  times = 1000
  )

                                                        expr    min     lq median       uq    max
1                                gsub(word.in, word.out, a1)  42772  44484  47905  48761.0 691193
2 lapply(x, replfxn, word.in = word.in, word.out = word.out) 102653 106075 109496 111635.5 970065

Source: https://stackoverflow.com/questions/6845443/replace-words-in-r-data-frames-text-mining

Tags

replace

dataframe

gsub