snowball

Snowball Stemming: defining Regions

眉间皱痕 submitted on 2020-01-03 21:09:32
Question: I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows: R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel. http://snowball.tartarus.org/texts/r1r2.html Examples are b e a u t i f u l |<- …
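These definitions can be illustrated with a small sketch. The following is a rough R approximation (not the Snowball reference implementation) that treats a, e, i, o, u, y as vowels throughout, whereas the real English stemmer handles "y" context-sensitively:

# Minimal sketch of the R1/R2 definitions under a fixed vowel set.
r1_region <- function(word, vowels = "aeiouy") {
  # first position where a vowel is immediately followed by a non-vowel
  m <- regexpr(sprintf("[%s][^%s]", vowels, vowels), word)
  if (m == -1) return("")        # null region at the end of the word
  substring(word, m + 2)         # everything after that non-vowel
}
r2_region <- function(word) r1_region(r1_region(word))

r1_region("beautiful")   # "iful"
r2_region("beautiful")   # "ul"
r1_region("beau")        # ""  (null region)

The same pair of calls also reproduces the other examples on the linked page (animadversion, sprinkled, eucharist).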

stemDocument in tm package not working on past-tense words

别来无恙 submitted on 2019-12-18 09:13:45
Question: I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!
filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- …
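stemDocument wraps the Snowball stemmer, which only strips suffixes by rule, so irregular forms such as "said" and "made" are left untouched: mapping them to "say" and "make" is lemmatization rather than stemming. One way to get that in R is the textstem package; a hedged sketch, assuming its default lemma dictionary covers these irregular verbs:

library(tm)
library(textstem)   # lemmatize_strings() looks words up in a lemma dictionary

stemDocument("said say says make made")        # "said say say make made"  (stemming only)
lemmatize_strings("said say says make made")   # expected: "say say say make make"

Whether this is necessary depends on the task: for retrieval-style applications plain stemming is often enough, while lemmatizing irregular verbs matters more when all surface forms must be counted as one token.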

Is there a Java implementation of the Porter2 stemmer?

帅比萌擦擦* submitted on 2019-12-17 16:18:09
Question: Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here: http://tartarus.org/~martin/PorterStemmer/java.txt but on http://tartarus.org/~martin/PorterStemmer/ the author mentions that Porter is a bit outdated and recommends using Porter2, available at http://snowball.tartarus.org/algorithms/english/stemmer.html However, the problem for me is that this Porter2 is written in Snowball (I …

What is the correct use of stemDocument?

微笑、不失礼 submitted on 2019-12-11 11:58:45
Question: I have already read this and this question, but I still didn't understand the use of stemDocument in tm_map. Let's follow this example:
q17 <- VCorpus(VectorSource(x = c("poder", "pode")), readerControl = list(language = "pt", load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
it does work! But if I use:
> q17 <- tm_map(q17, …
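The question is cut off here, but a common pitfall with this example is calling tm_map(q17, stemDocument) without a language, so the stemmer is never told the documents are Portuguese. tm_map forwards extra arguments to the transformation, so passing the language explicitly is the first thing to try; a minimal sketch, not a guaranteed diagnosis of the truncated problem:

library(tm)
q17 <- VCorpus(VectorSource(c("poder", "pode")),
               readerControl = list(language = "pt"))
# Extra arguments to tm_map() are passed on to stemDocument():
q17 <- tm_map(q17, stemDocument, language = "portuguese")
lapply(q17, content)   # expected: "pod" and "pod"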

Making a wordcloud, but with combined words?

空扰寡人 submitted on 2019-12-04 16:55:41
Question: I am trying to make a word cloud of publication keywords, for example: educational data mining; collaborative learning; computer science; etc. My current code is as follows:
KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year==2012)))
KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)
# added tolower
KeywordsCorpus <- tm_map(KeywordsCorpus, tolower)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeWords, stopwords("english"))
# moved stripWhitespace
KeywordsCorpus <- tm_map(KeywordsCorpus, …
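The code above is truncated, but the usual way to keep multi-word keywords together is to skip per-word tokenisation altogether: split the keyword field on ";", count the whole phrases, and pass them directly to wordcloud(). A rough sketch along those lines, reusing the Words data frame and column names shown in the question:

library(wordcloud)

keywords <- subset(Words$Author.Keywords, Words$Year == 2012)
# Split on ";" so "collaborative learning" stays a single unit, then tidy up
phrases  <- trimws(unlist(strsplit(tolower(keywords), ";", fixed = TRUE)))
phrases  <- phrases[phrases != ""]
freq     <- sort(table(phrases), decreasing = TRUE)

wordcloud(names(freq), as.numeric(freq), min.freq = 2, random.order = FALSE)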

SnowballStemmer for Russian words list

我怕爱的太早我们不能终老 submitted on 2019-12-04 09:02:40
Question: I know how to apply SnowballStemmer to a single word (in my case, a Russian one):
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("russian")
stemmer.stem("Василий")
'Васил'
How can I do the same if I have a list of words like ['Василий', 'Геннадий', 'Виталий']? My approach using a for loop doesn't seem to work :(
l = [stemmer.stem(word) for word in l]
Answer 1: Your variable l is not pre-defined, causing the name error. See my last two lines for the fix.
>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("russian")
>>> …

Lucene Standard Analyzer vs Snowball

隐身守侯 submitted on 2019-12-02 20:30:14
Question: Just getting started with Lucene.Net. I indexed 100,000 rows using the standard analyzer, ran some test queries, and noticed that plural queries don't return results if the original term was singular. I understand the snowball analyzer adds stemming support, which sounds nice. However, I'm wondering if there are any drawbacks to going with snowball over standard? Am I losing anything by going with it? Are there any other analyzers out there to consider?
Answer 1: Yes, by using a stemmer such as Snowball, you are losing information about the original form of your text. Sometimes this will be useful, sometimes not. For …
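The answer is cut off above, but the trade-off it describes is easy to demonstrate outside Lucene: a stemmer collapses distinct words onto one index term, which helps recall (singular queries match plurals) at the cost of precision and of the original surface forms. A small illustration with the SnowballC package in R, which exposes the same family of Snowball stemmers the Lucene snowball analyzer is built on (exact Lucene behaviour may differ):

library(SnowballC)
# Distinct words collapse to one indexed term once stemmed, so a query
# for any of them will match documents containing the others.
wordStem(c("universe", "university", "universal"), language = "english")
# expected: "univers" "univers" "univers"
wordStem(c("query", "queries"), language = "english")
# expected: "queri" "queri"   -- this is what fixes the singular/plural mismatch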