stemming

Why does the number of stems output by the NLTK stemmer differ from the expected output?

限于喜欢 submitted on 2020-07-23 06:41:07
Question: I have to perform stemming on a text. The tasks are as follows:
1. Tokenize all the words given in tc. A word should contain only letters, digits, or underscores. Store the tokenized list of words in tw.
2. Convert all the words to lowercase. Store the result in the variable tw.
3. Remove all the stop words from the unique set of tw. Store the result in the variable fw.
4. Stem each word present in fw with PorterStemmer, and store the result in the list psw.
Below is my code: import re import
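The four steps described in this question can be sketched as follows. This is only an illustration under stated assumptions: the sample text, the stop-word set, and `toy_stem` are hypothetical stand-ins (with NLTK installed, `nltk.stem.PorterStemmer().stem` could be passed as `stem` instead). The excerpt does not show why the asker's count differed, but the sketch shows one common source of a mismatch: distinct words can collapse to the same stem, so `len(psw)` and `len(set(psw))` differ.

```python
import re

# Hypothetical sample text and stop-word list, for illustration only.
tc = "The cats are running. The Cat runs, and the cat_1 ran."
STOP_WORDS = {"the", "are", "and"}

def toy_stem(word):
    """Toy stand-in for a real stemmer: strips a couple of suffixes."""
    for suffix in ("ning", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def pipeline(text, stem=toy_stem, stop_words=STOP_WORDS):
    # Step 1: \w+ matches runs of letters, digits, and underscores.
    tw = re.findall(r"\w+", text)
    # Step 2: lowercase every token.
    tw = [w.lower() for w in tw]
    # Step 3: unique words minus stop words (sorted for a stable order).
    fw = sorted(set(tw) - stop_words)
    # Step 4: stem each remaining word; duplicates may appear here.
    psw = [stem(w) for w in fw]
    return tw, fw, psw

tw, fw, psw = pipeline(tc)
print(len(tw), len(fw), len(psw), len(set(psw)))
```

Note that the order of the steps matters: taking the unique set before lowercasing would keep "The" and "the" as separate entries and change every later count.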

R text mining: grouping similar words using stemDocuments in tm package

半世苍凉 submitted on 2020-04-18 06:10:15
Question: I am doing text mining on around 30,000 tweets. To make the results more reliable, I want to convert "synonyms" to a single word. For example, some users write "girl", some write "girls", and some write "gal". Similarly, "give" and "gave" mean the same thing, as do "come" and "came". Some users use short forms like "plz", "pls", etc. Also, "stemdocument" from the tm package is not working properly: it converts dance to danc and table to tabl. Is there any other good package for stemming? I want to
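A suffix-stripping stemmer cannot relate "gal" to "girl" or "plz" to "please", because those pairs do not differ by a suffix. One simple option is a hand-maintained normalization table applied to the tokens. The sketch below is written in Python rather than R purely for illustration, and the mapping and the function name `normalize_tweet` are hypothetical.

```python
import re

# Hypothetical, hand-maintained map from variants to a canonical form.
NORMALIZE = {
    "gal": "girl",
    "girls": "girl",
    "gave": "give",
    "came": "come",
    "plz": "please",
    "pls": "please",
}

def normalize_tweet(text, mapping=NORMALIZE):
    """Lowercase, tokenize, and replace known variants with their canonical form."""
    tokens = re.findall(r"[a-z0-9_']+", text.lower())
    return [mapping.get(token, token) for token in tokens]

print(normalize_tweet("Plz come, the gal gave girls a table"))
```

Because unknown words pass through unchanged, "table" stays "table" instead of being cut to "tabl".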

English lemmatizer databases?

二次信任 submitted on 2020-01-31 18:07:10
Question: Do you know of any sufficiently large lemmatizer database that returns the correct result for the following sample words?
geese: goose
plantes: //not found
WordNet's morphological analyzer is not sufficient, since it gives the following incorrect results:
geese: //not found
plantes: plant
Answer 1: MorphAdorner seems to be better at this, but it still returns an incorrect result for "plantes":
plantes: plante
geese: goose
Maybe you'd like to use MorphAdorner to do the lemmatization, and then check its results against
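The behaviour the question asks for (an irregular plural resolved, a non-word rejected) can be sketched by combining an exception list with suffix rules whose output is accepted only if it appears in a known vocabulary. This is a toy illustration, not how WordNet or MorphAdorner is implemented; `EXCEPTIONS`, `VOCAB`, `RULES`, and `lemmatize_noun` are hypothetical names with tiny sample data.

```python
# Hypothetical sample data; a real system would load far larger lists.
EXCEPTIONS = {"geese": "goose", "mice": "mouse"}
VOCAB = {"goose", "mouse", "plant", "box", "berry"}
# (suffix, replacement) pairs, tried in order.
RULES = [("ies", "y"), ("ses", "s"), ("xes", "x"), ("zes", "z"),
         ("ches", "ch"), ("shes", "sh"), ("s", "")]

def lemmatize_noun(word):
    """Return the lemma, or None when no candidate is a known word."""
    word = word.lower()
    if word in EXCEPTIONS:          # irregular forms first
        return EXCEPTIONS[word]
    if word in VOCAB:               # already a base form
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB:  # accept only attested lemmas
                return candidate
    return None

for w in ("geese", "plantes", "plants", "boxes"):
    print(w, "->", lemmatize_noun(w))
```

With these rules "plantes" only yields the candidate "plante", which is not in the vocabulary, so it is reported as not found.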

stemDocument R text mining

牧云@^-^@ submitted on 2020-01-15 05:44:07
Question: My data is a txt file and looks as follows:

words      number_doc
overwiew   1
client     1
store      1
marge      1
price      2
stock      2
economics  2

The document numbers are sorted (from smallest to largest). Now I want, for each document, all the words that belong to it. Currently they are in a column, but I want all the words in a textDocument (from the tm package, because it is necessary for some functions in that package). I did this as follows: data <- read.table("poging.txt", header = TRUE)
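The regrouping the question describes (one text per document number) does not depend on R. As a rough sketch in Python, used here only for illustration, the rows can be collected per document and joined into one string each; `SAMPLE` and `words_per_document` are hypothetical names, and the sample reproduces the data shown in the question.

```python
from collections import defaultdict

# The sample rows from the question, whitespace-separated with a header line.
SAMPLE = """words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2
"""

def words_per_document(text):
    """Map each document number to a single space-joined string of its words."""
    docs = defaultdict(list)
    for line in text.strip().splitlines()[1:]:  # skip the header row
        word, doc_id = line.split()
        docs[int(doc_id)].append(word)
    return {doc_id: " ".join(words) for doc_id, words in sorted(docs.items())}

print(words_per_document(SAMPLE))
```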

How to configure stemming in Solr?

不问归期 submitted on 2020-01-10 19:35:32
Question: I added "American" to the Solr index. When I search for "America", there are no results. How should schema.xml be configured to get results? Current configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

Getting the closest noun from a stemmed word

限于喜欢 submitted on 2020-01-05 10:09:46
Question: Short version: If I have a stemmed word, say 'comput' for 'computing' or 'sugari' for 'sugary', is there a way to construct its closest noun form, that is, 'computer' or 'sugar' respectively? Longer version: I'm using Python and NLTK's WordNet to perform a few semantic similarity tasks on a bunch of words. I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results. Understanding the inaccuracies involved, I wanted to convert a word from its
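One rough way to approach the short version of this question is fuzzy matching of the stem against a list of known nouns. The sketch below uses `difflib.get_close_matches` from the Python standard library; the noun list and the function name `closest_noun` are hypothetical, and in practice the candidates could come from a lexical resource such as WordNet's noun lemmas. String similarity is only a heuristic and can pick a wrong noun.

```python
import difflib

# Hypothetical candidate nouns; a real list would be much larger.
NOUNS = ["computer", "computation", "sugar", "table", "dance"]

def closest_noun(stem, nouns=NOUNS, cutoff=0.6):
    """Return the noun most similar to the stem, or None if nothing is close enough."""
    matches = difflib.get_close_matches(stem, nouns, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_noun("comput"))
print(closest_noun("sugari"))
```

Prefix matching alone would not work for 'sugari', since 'sugar' does not start with the stem, which is why a similarity ratio is used here.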

Word Base/Stem Dictionary

﹥>﹥吖頭↗ submitted on 2020-01-04 02:35:07
Question: It seems my Google-fu is failing me. Does anyone know of a freely available word base dictionary that contains only the bases of words? So, for something like strawberries, it would have strawberry, but it would NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but just a text file of mappings or anything that could be read in would be helpful. Answer 1: This is called lemmatization, and what you call the "base of a word" is
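The "text file of mappings" the question mentions could be consumed with a few lines of code. The sketch below assumes a hypothetical tab-separated format (inflected form, then lemma, with `#` comment lines); it is shown in Python for illustration, and the function names and sample lines are invented, not taken from any particular dictionary.

```python
def load_lemma_map(lines):
    """Build a lookup table from 'inflected<TAB>lemma' lines."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        inflected, lemma = line.split("\t")
        mapping[inflected.lower()] = lemma
    return mapping

def base_form(word, mapping):
    """Return the mapped lemma, or the word itself when it is not listed."""
    return mapping.get(word.lower(), word)

# Hypothetical sample content; a real file would be read with open(...).
SAMPLE_LINES = ["# inflected<TAB>lemma", "strawberries\tstrawberry", "geese\tgoose"]
lemma_map = load_lemma_map(SAMPLE_LINES)
print(base_form("Strawberries", lemma_map))
```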

Snowball Stemming: defining Regions

眉间皱痕 submitted on 2020-01-03 21:09:32
Question: I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows:
R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.
http://snowball.tartarus.org/texts/r1r2.html
Examples are b e a u t i f u l |<-
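The two definitions quoted in this question translate fairly directly into code. The following is a minimal sketch, assuming the vowels are a, e, i, o, u, and y (the vowel set is language-specific in Snowball); the function names are hypothetical and this is not the Snowball reference implementation.

```python
VOWELS = set("aeiouy")  # assumed vowel set; it differs between languages

def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, searching from start.

    Returns len(word) (the null region) when there is no such non-vowel.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r1_r2(word):
    """Return the R1 and R2 regions of a word as strings."""
    r1 = region_start(word)
    r2 = region_start(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:]

for w in ("beautiful", "beauty", "beau", "animadversion"):
    print(w, r1_r2(w))
```

For "beautiful", the first non-vowel following a vowel is the "t", so R1 is "iful"; inside R1 the "f" follows the vowel "i", so R2 is "ul".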