How is the correct use of stemDocument?

微笑、不失礼 提交于 2019-12-11 11:58:45

问题


I have already read this and this questions, but I still didn't understand the use of stemDocument in tm_map. Let's follow this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt",
                                    load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it does work! But if I use:

> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

it doesn't work. Why so?


回答1:


Unfortunately you stumbled on a bug. stemDocument works if you pass on the language when you do:

stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"

But when using this in tm_map, the function starts of with stemDocument.PlainTextDocument. In this function the language of the corpus is checked against the language you supply in the function. This works correctly. But at the end of this function everything is passed on to the function stemDocument.character, but without the language component. In stemDocument.character the default language is specified as English. So within the tm_map call (or the DocumentTermMatrix) the language you supply with it will revert back to English and the stemming doesn't work correctly.

A workaround could be using the package quanteda:

library(quanteda)
my_dfm <- dfm(x = c("poder", "pode"))
my_dfm <- dfm_wordstem(my_dfm, language = "pt")

my_dfm

Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
       features
docs    pod
  text1   1
  text2   1

Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.



来源:https://stackoverflow.com/questions/54197636/how-is-the-correct-use-of-stemdocument

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!