stemming

stemDocument in tm package not working on past-tense words

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:37:37
I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument,
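Suffix-stripping stemmers like the one behind stemDocument cannot map irregular forms such as "said" or "made" to their base verbs; that requires lemmatization with a lookup table (as in WordNet-based lemmatizers). A minimal Python sketch of the idea, where IRREGULAR is a tiny illustrative table rather than a real linguistic resource:

```python
# Minimal sketch: a stemmer cannot undo irregular past tenses,
# but a lemma lookup table can. IRREGULAR is a tiny illustrative
# table, not a complete resource.
IRREGULAR = {"said": "say", "made": "make", "went": "go"}

def lemmatize_verb(word):
    """Return a base verb form: check the irregular table first,
    then fall back to stripping regular '-ies'/'-ed'/'-s' suffixes."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("ed") and len(w) > 3:
        return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

print([lemmatize_verb(w) for w in "said say says make made".split()])
# -> ['say', 'say', 'say', 'make', 'make']
```

This is exactly why real pipelines that need base forms use a lemmatizer with an exception list instead of (or after) a stemmer.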

Lucene Hebrew analyzer

梦想与她 submitted on 2019-11-29 16:02:10
Question: Does anybody know whether one exists? I've been googling this for months... Thanks

Answer 1: Update: HebMorph. Out of curiosity sparked by your question, I contacted Itamar Syn-Hershko, who was active on the Lucene mailing lists about a year ago when he was working on a Hebrew analyzer for Lucene. I asked him if he had completed his analyzer. Here are some relevant bits from his response: To make a long story short, no, I didn't. There is no decent free / open-source Hebrew analyzer for Lucene, that I can

Import WordNet In NLTK

依然范特西╮ submitted on 2019-11-29 13:09:36
Question: I want to import the WordNet dictionary, but when I import Dictionary from wordnet I see this error:

for l in open(WNSEARCHDIR+'/lexnames').readlines():
IOError: [Errno 2] No such file or directory: 'C:\\Program Files\\WordNet\\2.0\\dict/lexnames'

I installed WordNet 2.1 in this directory, but I can't import it. Please help me solve this problem.

import nltk
from nltk import *
from nltk.corpus import wordnet
from wordnet import Dictionary
print '-----------------------------------------'
print

Effects of Stemming on the term frequency?

若如初见. submitted on 2019-11-29 08:54:27
Question: How are term frequency (TF) and inverse document frequency (IDF) affected by stop-word removal and stemming? Thanks!

Answer 1: tf is term frequency. idf is inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Stemming groups all words that derive from the same stem (e.g. played, play, ...); this grouping will increase the occurrence of this stem
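The definitions in the answer can be made concrete with a few lines of code. A minimal Python sketch (document contents and names are illustrative):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # term frequency: raw count of the term in one document
    return Counter(doc_tokens)[term]

def idf(term, docs):
    # inverse document frequency: log(N / df), where df is the
    # number of documents containing the term
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n_docs / df)

# Toy corpus: note that stemming "played" -> "play" beforehand
# would raise both tf and df for the stem "play".
docs = [["play", "play", "ball"], ["play", "game"], ["ball", "game"]]
print(tf("play", docs[0]), round(idf("play", docs), 3))
```

Stop-word removal works in the opposite direction: dropping "the", "of", etc. removes terms whose idf is near zero anyway, shrinking the vocabulary rather than reweighting it.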

MySQL fulltext with stems

随声附和 submitted on 2019-11-29 02:36:13
I am building a little search function for my site. I am taking my user's query, stemming the keywords, and then running a fulltext MySQL search against the stemmed keywords. The problem is that MySQL is treating the stems as literal. Here is the process that is happening:

1. the user searches for a word like "baseballs"
2. my stemming algorithm (Porter Stemmer) turns "baseballs" into "basebal"
3. fulltext does not find anything matching "basebal", even though there SHOULD be matches for "baseball" and "baseballs"

How do I do the equivalent of LIKE 'basebal%' with fulltext?

EDIT: Here is my current query:
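MySQL's boolean fulltext mode supports a trailing `*` as a prefix wildcard, i.e. `MATCH(col) AGAINST('basebal*' IN BOOLEAN MODE)` matches "baseball" and "baseballs". A small Python sketch that turns a list of stemmed keywords into such a query (table and column names are hypothetical):

```python
def to_boolean_mode_terms(stems):
    """Append MySQL boolean-mode prefix wildcards to stemmed terms,
    so the stem 'basebal' matches 'baseball' and 'baseballs'."""
    return " ".join(stem + "*" for stem in stems)

stems = ["basebal", "bat"]
against = to_boolean_mode_terms(stems)
print(against)  # basebal* bat*

# Pass the term string as a bound parameter, never via string
# concatenation, to avoid SQL injection:
query = ("SELECT * FROM articles "
         "WHERE MATCH(body) AGAINST(%s IN BOOLEAN MODE)")
```

Note that boolean mode skips relevance ordering unless you add an explicit MATCH score to the SELECT list, so check the result ranking after switching modes.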

Can you programmatically detect pluralizations of English words, and derive the singular form?

我是研究僧i submitted on 2019-11-29 00:21:34
Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible. Some examples:

Examples -> Example: a simple 's' suffix
Glitches -> Glitch: an 'es' suffix, as opposed to the above
Countries -> Country: an 'ies' suffix
Sheep -> Sheep: no change; a possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list. Suggestions of libraries in language X are fine, as long as they are open-source (i.e., so that someone can examine them to determine how to do it in language Y). It really depends on
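The suffix rules listed in the question translate directly into code. A minimal Python sketch; the invariant set is a tiny illustrative stand-in, since truly irregular nouns (men, teeth, mice, ...) cannot be handled without some lookup table:

```python
def singularize(word):
    """Heuristic singularization using suffix rules only; irregular
    and invariant nouns need a lookup table in practice."""
    invariant = {"sheep", "fish", "series"}  # illustrative fallback set
    w = word.lower()
    if w in invariant:
        return word                      # Sheep -> Sheep
    if w.endswith("ies") and len(w) > 3:
        return word[:-3] + "y"           # Countries -> Country
    if w.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]                 # Glitches -> Glitch
    if w.endswith("s") and not w.endswith("ss"):
        return word[:-1]                 # Examples -> Example
    return word

print([singularize(w) for w in ["Examples", "Glitches", "Countries", "Sheep"]])
```

The rule ordering matters: the more specific suffixes ('ies', 'ches') must be tested before the bare 's' rule, or "Countries" would wrongly become "Countrie".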

Tokenizer, Stop Word Removal, Stemming in Java

时间秒杀一切 submitted on 2019-11-28 17:05:20
I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words, and stems the rest, for use in an IR system. For example, given "The big fat cat, said 'your funniest guy i know' to the kangaroo...":

- the tokenizer would remove the punctuation and return an ArrayList of words
- the stop-word remover would remove words like "the", "to", etc.
- the stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'

Many thanks in advance.

jitter: AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop
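The three-stage pipeline described above (tokenize, drop stop words, stem) is easy to sketch. A minimal Python version with a toy suffix-stripping stemmer standing in for a real one such as Porter; the stop-word set is an illustrative subset:

```python
import re

STOP_WORDS = {"the", "to", "a", "i", "said", "your"}  # illustrative subset

def tokenize(text):
    # strip punctuation by keeping only lowercase alphabetic runs
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    if token.endswith("iest") and len(token) > 6:
        return token[:-4] + "y"          # funniest -> funny
    for suffix in ("est", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The big fat cat, said 'your funniest guy i know' to the kangaroo..."
tokens = remove_stop_words(tokenize(text))
print([stem(t) for t in tokens])
```

In Java, the same pipeline falls out of chaining Lucene's TokenStream filters (tokenizer, StopFilter, PorterStemFilter) inside one Analyzer, which is what the answer is pointing at.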

Stemming some plurals with wordnet lemmatizer doesn't work

萝らか妹 submitted on 2019-11-28 10:29:42
Question: Hi, I have a problem with nltk (2.0.4): I'm trying to stem the words 'men' and 'teeth', but it doesn't seem to work. Here's my code:

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
words_raw = "men teeth"
words = nltk.word_tokenize(words_raw)
for word in words:
    print 'WordNet Lemmatizer NOUN: ' + lmtzr.lemmatize(word, wn.NOUN)
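WordNet's lemmatizer handles irregular plurals like 'men' and 'teeth' through WordNet's noun exception list (the noun.exc file), so recent NLTK versions do return 'man' and 'tooth'; behaviour in old releases such as 2.0.4 may differ. The mechanism can be sketched without NLTK, using a tiny illustrative subset of that exception table:

```python
# Tiny illustrative subset of WordNet's noun exception list
# (noun.exc); the real file maps thousands of irregular forms.
NOUN_EXCEPTIONS = {"men": "man", "teeth": "tooth", "feet": "foot"}

def lemmatize_noun(word):
    """Check the irregular-noun exception table first, then fall back
    to regular '-s' stripping, mirroring how WordNet lookup works."""
    if word in NOUN_EXCEPTIONS:
        return NOUN_EXCEPTIONS[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print([lemmatize_noun(w) for w in "men teeth".split()])  # ['man', 'tooth']
```

If the real lemmatizer returns the input unchanged, the usual culprits are an outdated NLTK/WordNet data install or a missing part-of-speech argument (the default POS is noun, but verbs need pos='v').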

Stemming with R Text Analysis

大兔子大兔子 submitted on 2019-11-28 05:32:42
I am doing a lot of analysis with the TM package. One of my biggest problems is related to stemming and stemming-like transformations. Let's say I have several accounting-related terms (I am aware of the spelling issues). After stemming we have:

accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt

Result: 3 terms (account, acount, accounnt), where I would have liked 1 (account), as all of these relate to the same term.

1) Correcting the spelling is a possibility, but I have never attempted that in R. Is that even
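One way to merge misspelled stems like 'acount' and 'accounnt' into 'account' is fuzzy string matching against the stems already seen. A minimal Python sketch using the standard library's difflib (the 0.8 similarity cutoff is an illustrative choice, not a recommended value):

```python
import difflib

def canonicalize(stems, cutoff=0.8):
    """Map each stem to the closest previously seen canonical form
    (difflib similarity >= cutoff); a crude stand-in for real
    dictionary-based spell correction."""
    canon = []     # canonical forms accepted so far
    mapping = {}   # stem -> canonical form
    for s in stems:
        match = difflib.get_close_matches(s, canon, n=1, cutoff=cutoff)
        if match:
            mapping[s] = match[0]
        else:
            canon.append(s)
            mapping[s] = s
    return mapping

stems = ["account", "acount", "acount", "accounnt"]
print(sorted(set(canonicalize(stems).values())))  # ['account']
```

This is order-sensitive (the first spelling seen becomes the canonical one), so in practice you would sort stems by corpus frequency first so the most common spelling wins.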

Java library for keyword extraction from input text

若如初见. submitted on 2019-11-27 18:00:56
I'm looking for a Java library to extract keywords from a block of text. The process should be as follows: stop-word cleaning -> stemming -> searching for keywords based on English linguistic statistics - meaning that if a word appears more often in the text than it does in general English, in terms of probability, then it's a keyword candidate. Is there a library that performs this task?

sp00m: Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since that is the one I know best. Besides the /lucene-core-x.x.x.jar, don't forget to add
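The "appears more often than in general English" test is a frequency-ratio score against a background corpus. A minimal Python sketch; the background probabilities here are made-up illustrative values, where a real system would derive them from a large reference corpus:

```python
from collections import Counter

# Illustrative background word probabilities; a real system would
# estimate these from a large English reference corpus.
BACKGROUND = {"the": 0.07, "of": 0.03, "search": 1e-4, "lucene": 1e-6}

def keyword_scores(tokens, background, default=1e-6):
    """Score each token by the ratio of its observed frequency in the
    text to its background English frequency; a high ratio marks a
    keyword candidate."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: (c / total) / background.get(t, default)
            for t, c in counts.items()}

tokens = ["lucene", "search", "the", "the", "lucene"]
scores = keyword_scores(tokens, BACKGROUND)
print(max(scores, key=scores.get))  # lucene
```

Running this after the stop-word and stemming stages keeps the counts from being dominated by function words and inflectional variants, which is exactly the pipeline order the question describes.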