Stemming with R Text Analysis

花落未央 2020-12-08 08:34

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations.

Let's say I hav

3 answers

佛祖请我去吃肉 (original poster)

2020-12-08 09:05
We could set up a list of synonyms and replace those values. For example:
```
synonyms <- list(
    list(word="account", syns=c("acount", "accounnt"))
)
```
This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.
```
raw<-c("accounts", "account", "accounting", "acounting", 
     "acount", "acounts", "accounnt")
```
And now let's define a transformation function that will replace the words in our list with the primary synonym.
```
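library(tm)
# Wrap a plain character-vector function so tm_map can apply it to a corpus.
replaceSynonyms <- content_transformer(function(x, syn=NULL) { 
    # For each entry b in syn, replace whole-word matches of b$syns with b$word.
    Reduce(function(a,b) {
        gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)   
})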
```
Here we use the content_transformer function to define a custom transformation. It simply runs one gsub per entry in the synonym list, replacing each listed variant with its primary word. We can then use this on a corpus:
```
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
```
and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
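To illustrate adding another entry, here is a minimal base-R sketch of the same Reduce/gsub logic without the tm wrapper. The second entry ("balance" and its misspellings), the helper name `replace_plain`, and the sample tokens are made up for illustration and are not part of the original answer.

```r
# Hypothetical extension: the "balance" entry is added purely as an example.
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt")),
    list(word = "balance", syns = c("balanse", "ballance"))
)

# Same logic as the transformer above, applied to a plain character vector.
# replace_plain is an illustrative name, not a tm function.
replace_plain <- function(x, syn) {
    Reduce(function(a, b) {
        # Build a whole-word alternation such as "\\b(acount|accounnt)\\b".
        pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
        gsub(pattern, b$word, a)
    }, syn, x)
}

# Made-up tokens standing in for already-stemmed text.
stemmed <- c("account", "acount", "accounnt", "ballance", "balanse sheet")
result <- replace_plain(stemmed, synonyms)
print(result)
```

Because each sub-list is handled in turn by Reduce, the order of entries only matters if one entry's primary word appears among another entry's variants.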