Stemming with R Text Analysis

花落未央 2020-12-08 08:34

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations.

Let's say I hav

3 answers
  •  佛祖请我去吃肉
    2020-12-08 09:05

    We could set up a list of synonyms and replace those values. For example:

    synonyms <- list(
        list(word="account", syns=c("acount", "accounnt"))
    )
    

    This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.

    raw <- c("accounts", "account", "accounting", "acounting",
             "acount", "acounts", "accounnt")
    

    And now let's define a transformation function that will replace the words in our list with the primary synonym.

    library(tm)
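    # Custom tm transformation: replace every listed variant with its primary word.
    replaceSynonyms <- content_transformer(function(x, syn = NULL) {
        # Run one gsub per synonym entry, feeding each result into the next.
        Reduce(function(a, b) {
            gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a)
        }, syn, x)
    })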
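
    To make the pattern concrete, here is a small base-R sketch of what gets built for a single entry. The names entry, pattern and replaced are only illustrative and are not part of tm. Note that the \\b word boundaries mean only whole words are replaced, which is why "acounts" is left alone unless it has been stemmed first.

```r
# One synonym entry, in the same shape as the elements of the synonyms list.
entry <- list(word = "account", syns = c("acount", "accounnt"))

# Build the alternation pattern the same way the transformation does.
pattern <- paste0("\\b(", paste(entry$syns, collapse = "|"), ")\\b")
print(pattern)

# Whole-word matches are replaced; "acounts" is not, because of the trailing "s".
replaced <- gsub(pattern, entry$word, c("acount", "accounnt", "acounts"))
print(replaced)
```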
    

    Here we use the content_transformer function to define a custom transformation. Internally it just runs a gsub for each entry in the list to replace the listed variants. We can then use this on a corpus:

    tm <- Corpus(VectorSource(raw))
    tm <- tm_map(tm, stemDocument)
    tm <- tm_map(tm, replaceSynonyms, synonyms)
    inspect(tm)
    

    We can see that all of these values are transformed into "account", as desired. To add other synonyms, just add more lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
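
    As a rough sketch of extending the list, the example below adds a second entry and applies the same Reduce/gsub logic to a plain character vector, so it runs without tm. The "payment" entry, its misspellings, and the helper name replace_plain are hypothetical and chosen only for illustration.

```r
# Hypothetical second entry ("payment") added alongside the original one.
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt")),
    list(word = "payment", syns = c("paymnt", "payement"))
)

# Same Reduce/gsub logic as replaceSynonyms, but on a plain character vector.
replace_plain <- function(x, syn) {
    Reduce(function(a, b) {
        gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a)
    }, syn, x)
}

# Already-stemmed example words; "invoice" has no entry and stays unchanged.
stems <- c("account", "acount", "accounnt", "paymnt", "payement", "invoice")
result <- replace_plain(stems, synonyms)
print(result)
```

    Each entry is applied in turn, so the order of the sub-lists only matters if one entry's primary word appears among another entry's variants.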
