Text-mining with the tm-package - word stemming

Submitted by 梦想与她 on 2019-11-27 12:30:42

I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.

Note that I got frustrated with combining mgsub and tm_map because it kept throwing an error, so I just used lapply instead.

texts <- c("i am member of the XYZ association",
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer")

library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))

library(qdap)
# Step 2: list to retain and identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")

# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)

# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)

inspect(corpus)       #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)

# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  
inspect(corpus.final)

Basically it works by:

  1. subbing in a unique identifier key for each supplied "NO STEM" word (the mgsub)
  2. then stemming (using stemDocument)
  3. next reversing it, subbing the "NO STEM" words back in for the identifier keys (the mgsub)
  4. last, completing the stems (stemCompletion)

Here's the output:

## >     inspect(corpus.final)
## A corpus with 4 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $`1`
## i am member of the XYZ associate
## 
## $`2`
##  for our open associate position
## 
## $`3`
## xyz memorial lecture takes place on wednesday
## 
## $`4`
## vote for the most popular lecturer
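
The key swapping in steps 3 and 5 can be seen in isolation with base R alone. This is only a sketch: swap_fixed is a hypothetical helper standing in for qdap's mgsub, and no stemming is done here.

```r
# Words to protect and their placeholder keys (same scheme as Step 2)
retain <- c("lecturer", "lecture")
keys <- paste(seq_along(retain), "SPECIAL_WORD", sep = "_")

# Hypothetical stand-in for qdap::mgsub: literal replacements applied in order.
# "lecturer" is listed before "lecture" so the longer word is replaced first.
swap_fixed <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

text <- "vote for the most popular lecturer"
masked <- swap_fixed(text, retain, keys)      # keys replace the retained words
restored <- swap_fixed(masked, keys, retain)  # keys are swapped back

masked
restored
```

In the full pipeline, stemDocument runs between the two swaps, so the approach only works if the stemmer leaves the keys unchanged.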

You can also use the SnowballC package for stemming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.

You just need to call the function wordStem, passing the vector of words to be stemmed and the language you are dealing with. To find the exact language string to use, call getStemLanguages, which returns all the possible options.
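
A minimal sketch of that call, assuming SnowballC is installed. The word vector is only illustrative, taken from the example texts earlier in this thread.

```r
library(SnowballC)

# Languages accepted by the `language` argument of wordStem()
getStemLanguages()

# Illustrative input: words from the example texts earlier in this thread
words <- c("lecture", "lecturer", "association", "associate")

# Returns a character vector of stems, one per input word
stems <- wordStem(words, language = "english")
stems
```

With the English stemmer, lecture and lecturer collapse to the same stem, which is why words like these have to be shielded before stemming if you want to keep them apart.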

Kind Regards
