Why isn't stemDocument stemming?

Submitted by 放肆的年华 on 2019-12-08 04:52:59

Question


I am using the 'tm' package in R to create a term document matrix using stemmed terms. The process is completing, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it.

Here is the script for the process, which uses a couple of online news stories as the sandbox:

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result: terms that occur at least 4 times
findFreqTerms(news.tdm, 4)

Here is the output that last line produces:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"  

In the first line of output, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would trim the "s" off plurals like "devices". Am I wrong about that? If not, why isn't this code producing the desired result here?


Answer 1:


Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter, which is why the terms in your matrix were never stemmed.)
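As a minimal sketch of that fix, the snippet below builds a term-document matrix with stemming = TRUE. It assumes the 'tm' and 'SnowballC' packages are installed, and it uses two made-up sentences (the docs vector is hypothetical, not from the question) in place of the downloaded news stories so that it runs without network access.

```r
library(tm)  # stemming in 'tm' also needs the 'SnowballC' package installed

# Hypothetical stand-in text; the question used extracted news stories instead
docs <- c("The company makes devices. Other companies make devices too.",
          "Devices from one company compete with devices from other companies.")
corpus <- VCorpus(VectorSource(docs))

# Same style of call as in the question, but with the 'stemming' option
tdm <- TermDocumentMatrix(corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))   # 'stemming', not 'stemDocument'

# Collect the terms that ended up in the matrix and show them
stemmed.terms <- Terms(tdm)
print(stemmed.terms)
```

With this option set, "company" and "companies" should collapse into a single stem (such as "compani") and no longer appear as separate terms. Passing stemming = stemDocument should behave the same way, since it supplies the stemming function explicitly.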



Source: https://stackoverflow.com/questions/31438688/why-isnt-stemdocument-stemming
