stemming

stemDocument in tm package not working on past-tense words

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:37:37
I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument,
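Suffix-stripping stemmers like the one behind stemDocument cannot map irregular forms such as "said" or "made" to their base verbs; that requires lemmatization with a lookup table (as in WordNet-based lemmatizers). A minimal Python sketch of the idea, where IRREGULAR is a tiny illustrative table rather than a real linguistic resource:

```python
# Minimal sketch: a stemmer cannot undo irregular past tenses,
# but a lemma lookup table can. IRREGULAR is a tiny illustrative
# table, not a complete resource.
IRREGULAR = {"said": "say", "made": "make", "went": "go"}

def lemmatize_verb(word):
    """Return a base verb form: check the irregular table first,
    then fall back to stripping regular '-ies'/'-ed'/'-s' suffixes."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("ed") and len(w) > 3:
        return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

print([lemmatize_verb(w) for w in "said say says make made".split()])
# -> ['say', 'say', 'say', 'make', 'make']
```

This is exactly why real pipelines that need base forms use a lemmatizer with an exception list instead of (or after) a stemmer.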

Lucene Hebrew analyzer

梦想与她 submitted on 2019-11-29 16:02:10
Question: Does anybody know whether one exists? I've been googling this for months... Thanks

Answer 1: Update: HebMorph. Out of curiosity sparked by your question, I contacted Itamar Syn-Hershko, who was active on the Lucene mailing lists about a year ago when he was working on a Hebrew analyzer for Lucene. I asked him if he had completed his analyzer. Here are some relevant bits from his response: To make a long story short, no, I didn't. There is no decent free / open-source Hebrew analyzer for Lucene, that I can

Import WordNet In NLTK

依然范特西╮ submitted on 2019-11-29 13:09:36
Question: I want to import the WordNet dictionary, but when I import Dictionary from wordnet I see this error:

for l in open(WNSEARCHDIR+'/lexnames').readlines():
IOError: [Errno 2] No such file or directory: 'C:\\Program Files\\WordNet\\2.0\\dict/lexnames'

I installed WordNet 2.1 in this directory, but I can't import it. Please help me solve this problem.

import nltk
from nltk import *
from nltk.corpus import wordnet
from wordnet import Dictionary
print '-----------------------------------------'
print

Effects of Stemming on the term frequency?

若如初见. submitted on 2019-11-29 08:54:27
Question: How are term frequency (TF) and inverse document frequency (IDF) affected by stop-word removal and stemming? Thanks!

Answer 1: tf is term frequency. idf is inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Stemming groups all words that derive from the same stem (e.g. played, play, ...); this grouping will increase the occurrence of this stem
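The definitions in the answer can be made concrete with a few lines of code. A minimal Python sketch (document contents and names are illustrative):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # term frequency: raw count of the term in one document
    return Counter(doc_tokens)[term]

def idf(term, docs):
    # inverse document frequency: log(N / df), where df is the
    # number of documents containing the term
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n_docs / df)

# Toy corpus: note that stemming "played" -> "play" beforehand
# would raise both tf and df for the stem "play".
docs = [["play", "play", "ball"], ["play", "game"], ["ball", "game"]]
print(tf("play", docs[0]), round(idf("play", docs), 3))
```

Stop-word removal works in the opposite direction: dropping "the", "of", etc. removes terms whose idf is near zero anyway, shrinking the vocabulary rather than reweighting it.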

MySQL fulltext with stems

随声附和 submitted on 2019-11-29 02:36:13
I am building a little search function for my site. I am taking my user's query, stemming the keywords, and then running a fulltext MySQL search against the stemmed keywords. The problem is that MySQL is treating the stems as literal. Here is the process that is happening:

1. the user searches for a word like "baseballs"
2. my stemming algorithm (Porter Stemmer) turns "baseballs" into "basebal"
3. fulltext does not find anything matching "basebal", even though there SHOULD be matches for "baseball" and "baseballs"

How do I do the equivalent of LIKE 'basebal%' with fulltext?

EDIT: Here is my current query:
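MySQL's boolean fulltext mode supports a trailing `*` as a prefix wildcard, i.e. `MATCH(col) AGAINST('basebal*' IN BOOLEAN MODE)` matches "baseball" and "baseballs". A small Python sketch that turns a list of stemmed keywords into such a query (table and column names are hypothetical):

```python
def to_boolean_mode_terms(stems):
    """Append MySQL boolean-mode prefix wildcards to stemmed terms,
    so the stem 'basebal' matches 'baseball' and 'baseballs'."""
    return " ".join(stem + "*" for stem in stems)

stems = ["basebal", "bat"]
against = to_boolean_mode_terms(stems)
print(against)  # basebal* bat*

# Pass the term string as a bound parameter, never via string
# concatenation, to avoid SQL injection:
query = ("SELECT * FROM articles "
         "WHERE MATCH(body) AGAINST(%s IN BOOLEAN MODE)")
```

Note that boolean mode skips relevance ordering unless you add an explicit MATCH score to the SELECT list, so check the result ranking after switching modes.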

Can you programmatically detect pluralizations of English words, and derive the singular form?

我是研究僧i submitted on 2019-11-29 00:21:34
Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible. Some examples:

Examples -> Example: a simple 's' suffix
Glitches -> Glitch: an 'es' suffix, as opposed to the above
Countries -> Country: an 'ies' suffix
Sheep -> Sheep: no change; a possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list. Suggestions of libraries in language X are fine, as long as they are open-source (i.e., so that someone can examine them to determine how to do it in language Y). It really depends on
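The suffix rules listed in the question translate directly into code. A minimal Python sketch; the invariant set is a tiny illustrative stand-in, since truly irregular nouns (men, teeth, mice, ...) cannot be handled without some lookup table:

```python
def singularize(word):
    """Heuristic singularization using suffix rules only; irregular
    and invariant nouns need a lookup table in practice."""
    invariant = {"sheep", "fish", "series"}  # illustrative fallback set
    w = word.lower()
    if w in invariant:
        return word                      # Sheep -> Sheep
    if w.endswith("ies") and len(w) > 3:
        return word[:-3] + "y"           # Countries -> Country
    if w.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]                 # Glitches -> Glitch
    if w.endswith("s") and not w.endswith("ss"):
        return word[:-1]                 # Examples -> Example
    return word

print([singularize(w) for w in ["Examples", "Glitches", "Countries", "Sheep"]])
```

The rule ordering matters: the more specific suffixes ('ies', 'ches') must be tested before the bare 's' rule, or "Countries" would wrongly become "Countrie".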

Tokenizer, Stop Word Removal, Stemming in Java

时间秒杀一切 submitted on 2019-11-28 17:05:20
I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words, and stems the rest, for use in an IR system. For example, given "The big fat cat, said 'your funniest guy i know' to the kangaroo...":

- the tokenizer would remove the punctuation and return an ArrayList of words
- the stop-word remover would remove words like "the", "to", etc.
- the stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'

Many thanks in advance.

jitter: AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop
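The three-stage pipeline described above (tokenize, drop stop words, stem) is easy to sketch. A minimal Python version with a toy suffix-stripping stemmer standing in for a real one such as Porter; the stop-word set is an illustrative subset:

```python
import re

STOP_WORDS = {"the", "to", "a", "i", "said", "your"}  # illustrative subset

def tokenize(text):
    # strip punctuation by keeping only lowercase alphabetic runs
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    if token.endswith("iest") and len(token) > 6:
        return token[:-4] + "y"          # funniest -> funny
    for suffix in ("est", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The big fat cat, said 'your funniest guy i know' to the kangaroo..."
tokens = remove_stop_words(tokenize(text))
print([stem(t) for t in tokens])
```

In Java, the same pipeline falls out of chaining Lucene's TokenStream filters (tokenizer, StopFilter, PorterStemFilter) inside one Analyzer, which is what the answer is pointing at.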

Stemming some plurals with wordnet lemmatizer doesn't work

萝らか妹 submitted on 2019-11-28 10:29:42
Question: Hi, I have a problem with nltk (2.0.4): I'm trying to stem the words 'men' and 'teeth', but it doesn't seem to work. Here's my code:

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
words_raw = "men teeth"
words = nltk.word_tokenize(words_raw)
for word in words:
    print 'WordNet Lemmatizer NOUN: ' + lmtzr.lemmatize(word, wn.NOUN)
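WordNet's lemmatizer handles irregular plurals like 'men' and 'teeth' through WordNet's noun exception list (the noun.exc file), so recent NLTK versions do return 'man' and 'tooth'; behaviour in old releases such as 2.0.4 may differ. The mechanism can be sketched without NLTK, using a tiny illustrative subset of that exception table:

```python
# Tiny illustrative subset of WordNet's noun exception list
# (noun.exc); the real file maps thousands of irregular forms.
NOUN_EXCEPTIONS = {"men": "man", "teeth": "tooth", "feet": "foot"}

def lemmatize_noun(word):
    """Check the irregular-noun exception table first, then fall back
    to regular '-s' stripping, mirroring how WordNet lookup works."""
    if word in NOUN_EXCEPTIONS:
        return NOUN_EXCEPTIONS[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print([lemmatize_noun(w) for w in "men teeth".split()])  # ['man', 'tooth']
```

If the real lemmatizer returns the input unchanged, the usual culprits are an outdated NLTK/WordNet data install or a missing part-of-speech argument (the default POS is noun, but verbs need pos='v').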

Stemming with R Text Analysis

大兔子大兔子 submitted on 2019-11-28 05:32:42
I am doing a lot of analysis with the TM package. One of my biggest problems is related to stemming and stemming-like transformations. Let's say I have several accounting-related terms (I am aware of the spelling issues). After stemming we have:

accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt

Result: 3 terms (account, acount, accounnt), where I would have liked 1 (account), as all of these relate to the same term.

1) Correcting the spelling is a possibility, but I have never attempted that in R. Is that even
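One way to merge misspelled stems like 'acount' and 'accounnt' into 'account' is fuzzy string matching against the stems already seen. A minimal Python sketch using the standard library's difflib (the 0.8 similarity cutoff is an illustrative choice, not a recommended value):

```python
import difflib

def canonicalize(stems, cutoff=0.8):
    """Map each stem to the closest previously seen canonical form
    (difflib similarity >= cutoff); a crude stand-in for real
    dictionary-based spell correction."""
    canon = []     # canonical forms accepted so far
    mapping = {}   # stem -> canonical form
    for s in stems:
        match = difflib.get_close_matches(s, canon, n=1, cutoff=cutoff)
        if match:
            mapping[s] = match[0]
        else:
            canon.append(s)
            mapping[s] = s
    return mapping

stems = ["account", "acount", "acount", "accounnt"]
print(sorted(set(canonicalize(stems).values())))  # ['account']
```

This is order-sensitive (the first spelling seen becomes the canonical one), so in practice you would sort stems by corpus frequency first so the most common spelling wins.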

Java library for keyword extraction from input text

若如初见. submitted on 2019-11-27 18:00:56
I'm looking for a Java library to extract keywords from a block of text. The process should be as follows: stop-word cleaning -> stemming -> searching for keywords based on English linguistic statistics - meaning that if a word appears more often in the text than it does in general English, in terms of probability, then it's a keyword candidate. Is there a library that performs this task?

sp00m: Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since that is the one I know best. Besides the /lucene-core-x.x.x.jar, don't forget to add
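The "appears more often than in general English" test is a frequency-ratio score against a background corpus. A minimal Python sketch; the background probabilities here are made-up illustrative values, where a real system would derive them from a large reference corpus:

```python
from collections import Counter

# Illustrative background word probabilities; a real system would
# estimate these from a large English reference corpus.
BACKGROUND = {"the": 0.07, "of": 0.03, "search": 1e-4, "lucene": 1e-6}

def keyword_scores(tokens, background, default=1e-6):
    """Score each token by the ratio of its observed frequency in the
    text to its background English frequency; a high ratio marks a
    keyword candidate."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: (c / total) / background.get(t, default)
            for t, c in counts.items()}

tokens = ["lucene", "search", "the", "the", "lucene"]
scores = keyword_scores(tokens, BACKGROUND)
print(max(scores, key=scores.get))  # lucene
```

Running this after the stop-word and stemming stages keeps the counts from being dominated by function words and inflectional variants, which is exactly the pipeline order the question describes.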