snowball

stemDocument in tm package not working on past-tense words

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:37:37
I have a file 'check_text.txt' that contains " said say says make made ". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but I only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument,
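The behaviour described above is what a suffix-stripping stemmer would be expected to do: "says" loses its "s", but irregular forms such as "said" and "made" share no strippable suffix with "say" and "make". A minimal sketch of one common workaround, mapping irregular forms through a lookup table before stemming, is shown here in Python rather than R. The names IRREGULAR_FORMS, strip_plural_s and normalize are hypothetical, the table is deliberately tiny, and the suffix rule is a toy stand-in, not the Porter algorithm or tm's stemDocument.

```python
# Hypothetical lookup table of irregular forms; a real one would be far larger
# (for example, taken from a lemmatization dictionary).
IRREGULAR_FORMS = {
    "said": "say",
    "made": "make",
}


def strip_plural_s(word):
    """Toy suffix rule standing in for a real stemmer (not Porter or Porter2)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, split on whitespace, map irregular forms, then strip a final 's'."""
    tokens = text.lower().split()
    return " ".join(strip_plural_s(IRREGULAR_FORMS.get(t, t)) for t in tokens)


print(normalize(" said say says make made "))  # prints: say say say make make
```

The lookup step runs first so that the suffix rule only ever sees regular forms; words missing from the table pass through unchanged.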

Is there a Java implementation of the Porter2 stemmer

痞子三分冷 submitted on 2019-11-27 21:31:30
Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here: http://tartarus.org/~martin/PorterStemmer/java.txt but on http://tartarus.org/~martin/PorterStemmer/ the author mentions that Porter is a bit outdated and recommends using Porter2, available at http://snowball.tartarus.org/algorithms/english/stemmer.html However, my problem is that this Porter2 is written in Snowball (I had never heard of it before, so I don't know anything about it). What I am exactly looking for is a Java

Stemming algorithm that produces real words

纵然是瞬间 submitted on 2019-11-27 16:53:52
I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities. I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way): http://tartarus.org/~martin/PorterStemmer/php.txt This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun". I've tried "Snowball" (suggested within another Stack Overflow thread). http://snowball.tartarus.org/demo.php For my example
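One way to keep "real" words while still merging duplicates is to use the stem only as a grouping key and then show one of the original words from each group. The sketch below illustrates that idea in Python rather than PHP. The names SUFFIXES, toy_stem and dedupe_tags are hypothetical, and toy_stem is a deliberately crude stand-in chosen so that "community" and "communities" both reduce to "commun"; it is not the Porter or Snowball algorithm.

```python
from collections import defaultdict

# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ities", "ity", "ies", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dedupe_tags(words):
    """Group words by stem and return one real word per group."""
    groups = defaultdict(list)
    for word in words:
        lowered = word.lower()
        groups[toy_stem(lowered)].append(lowered)
    # Represent each group by its shortest form, ties broken alphabetically.
    return [min(forms, key=lambda w: (len(w), w)) for forms in groups.values()]


print(dedupe_tags(["Community", "Communities", "tags", "tag"]))
# prints: ['community', 'tag']
```

Picking the shortest form is only one possible rule; the most frequent form in the source text would work the same way with a different key function.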
