multiple results of one variable when applying tm method “stemCompletion”

回眸只為那壹抹淺笑 submitted on 2019-12-01 08:51:20
Ben

There seems to be something odd about the stemCompletion function: it's not obvious how to use it in tm version 0.6. There is a nice workaround here that I've used for this answer.

First, make the CSV file that you have:

dat <- read.csv2( text = 
                  "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)
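Before building the corpus, it can help to check what read.csv2 actually produces from text like this. The sketch below parses one sample row in base R; sample_rows is a name introduced here for illustration only. It shows that the header "Text A" is turned into the syntactic name Text.A, and that passing stringsAsFactors = FALSE keeps the text columns as character (before R 4.0 the default was TRUE, which is why explicit as.character() conversions are common in older code).

```r
# One-row sample in the same semicolon-separated layout as the data above
sample_rows <- read.csv2(
  text = "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management",
  stringsAsFactors = FALSE  # keep text as character, whatever the R version
)

names(sample_rows)       # "ID" "Text.A" "Text.B"
class(sample_rows[, 2])  # "character"
```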

Read it in, transform to a corpus, and stem the words:

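library(tm)         # Corpus, DataframeSource, tm_map, stemDocument
library(SnowballC)  # stemmer used by stemDocument

data <- read.csv2("Test.csv")
# make sure the text columns are character, not factor
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)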

Have a look to see that it's worked:

inspect(corpus)

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<<PlainTextDocument (metadata: 7)>>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<<PlainTextDocument (metadata: 7)>>
3
The third titl
Knowledg play an import rule in organ

Here's the nice workaround to get stemCompletion working:

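stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split into tokens, complete each stem against the unstemmed corpus, then rejoin
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}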

Inspect the output to see if the stems were completed ok:

lapply(corpus, stemCompletion_mod)

[[1]]
<<PlainTextDocument (metadata: 7)>>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<<PlainTextDocument (metadata: 7)>>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<<PlainTextDocument (metadata: 7)>>
3 The third title Knowledge plays an important rule in organizations

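Success, for the most part: every stem is completed except "veri" (the stem of "very"), which has no matching completion in the dictionary and so comes back as NA.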
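The NA in the second document is worth a closer look. Stemming turns "very" into "veri", and no word in the unstemmed text starts with that prefix, so there is nothing to complete it with. The sketch below is a simplified base-R stand-in for prefix-based completion, not tm's own implementation; complete_shortest and its fallback argument are hypothetical names introduced here. It shows how an unmatched stem yields NA and how one might keep the stem instead.

```r
# Simplified stand-in for prefix-based stem completion (not tm's own code).
# complete_shortest() and `fallback` are hypothetical names for illustration.
complete_shortest <- function(stems, dictionary, fallback = TRUE) {
  vapply(stems, function(s) {
    # dictionary words that start with the stem (case-sensitive)
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) {
      # no match: keep the stem itself, or report NA
      return(if (fallback) s else NA_character_)
    }
    # among the candidates, pick the shortest word
    hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

dictionary <- c("are", "very", "important", "Learning")
stems <- c("are", "veri", "import", "Learn")

complete_shortest(stems, dictionary, fallback = FALSE)
# "are" NA "important" "Learning"

complete_shortest(stems, dictionary)
# "are" "veri" "important" "Learning"
```

Falling back to the stem keeps the token count unchanged and avoids literal "NA" strings in the rebuilt text, at the cost of leaving a few stems uncompleted.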
