stemming | 易学教程

Why is the number of stem from NLTK Stemmer outputs different from expected output?

Question: I have to perform stemming on a text. The tasks are as follows: Tokenize all the words given in tc. A word should contain only letters, digits, or underscores. Store the tokenized list of words in tw. Convert all the words to lowercase. Store the result in the variable tw. Remove all the stop words from the unique set of tw. Store the result in the variable fw. Stem each word present in fw with PorterStemmer, and store the result in the list psw. Below is my code: import re import
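The four steps listed in the question can be sketched as follows. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up stop-word list and `toy_stem` is a hypothetical stand-in for a real stemmer, so the sketch runs without NLTK. In practice one would likely pass `nltk.stem.PorterStemmer().stem` as the `stem` argument and use NLTK's own stop-word list.

```python
import re

# Assumption: a tiny illustrative stop-word list, not NLTK's.
STOP_WORDS = {"the", "a", "is", "of", "and", "in"}

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_pipeline(tc, stem=toy_stem, stop_words=STOP_WORDS):
    tw = re.findall(r"\w+", tc)          # runs of letters, digits, underscores
    tw = [w.lower() for w in tw]         # lowercase every token
    # Unique words minus stop words; sorted only to make the order repeatable.
    fw = [w for w in sorted(set(tw)) if w not in stop_words]
    psw = [stem(w) for w in fw]          # one stem per filtered word
    return tw, fw, psw

tw, fw, psw = stem_pipeline("The cats walked in the garden and the Cat is walking")
print(len(tw), fw, psw)
```

Note that `psw` has one entry per word in `fw`, so it can contain repeated stems when different words reduce to the same stem. Whether duplicates are removed before or after lowercasing also changes the counts, which may be one source of a mismatch with an expected output.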

R text mining: grouping similar words using stemDocuments in tm package

Question: I am text mining around 30,000 tweets. To make the results more reliable, I want to map "synonyms" to a single word. For example, some users write "girl", some "girls", and some "gal". Similarly, "give" and "gave" mean the same thing, as do "come" and "came". Some users use short forms such as "plz" and "pls". Also, stemDocument from the tm package is not working properly: it converts "dance" to "danc" and "table" to "tabl". Is there any other good package for stemming? I want to

English lemmatizer databases?

Question: Do you know of a sufficiently large lemmatizer database that returns the correct results for the following sample words? geese: goose plantes: //not found WordNet's morphological analyzer is not sufficient, since it gives the following incorrect results: geese: //not found plantes: plant Answer 1: MorphAdorner seems to be better at this, but it still returns an incorrect result for "plantes": plantes: plante geese: goose Maybe you'd like to use MorphAdorner to do the lemmatization, and then check its results against
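The behaviour the question asks for (a correct lemma for a known irregular form, and "not found" for a non-word) can be illustrated with a plain lookup table. This is a minimal sketch, assuming a hypothetical hand-built table `LEMMAS`; it does not reproduce WordNet or MorphAdorner, and a real database would be far larger.

```python
# Assumption: a hypothetical, hand-built table of inflected form -> lemma.
LEMMAS = {"geese": "goose", "plants": "plant"}

def lemmatize(word, table=LEMMAS):
    """Return the lemma of a known form, or None if the form is not in the table."""
    return table.get(word.lower())

for w in ("geese", "plantes"):
    print(w + ":", lemmatize(w) or "//not found")
```

Because unknown forms are rejected rather than guessed, "plantes" is reported as not found instead of being reduced to "plant" or "plante".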

stemDocument R text mining

Question: My data is a txt file that looks as follows: words number_doc overwiew 1 client 1 store 1 marge 1 price 2 stock 2 economics 2 The document numbers are sorted in ascending order. For each document, I want all the words that belong to it. Right now they are in a column, but I want all the words in a textDocument (from the tm package, because it is necessary for some functions in that package). I did this as follows: data <- read.table("poging.txt", header = TRUE)

How to configure stemming in Solr?

Question: I added "American" to the Solr index. When I search for "America", there are no results. How should schema.xml be configured to get results? Current configuration: <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory" /> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

Getting the closest noun from a stemmed word

Question: Short version: If I have a stemmed word, say 'comput' for 'computing' or 'sugari' for 'sugary', is there a way to construct its closest noun form, that is, 'computer' or 'sugar' respectively? Longer version: I'm using Python, NLTK, and WordNet to perform a few semantic similarity tasks on a bunch of words. I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results. Understanding the inaccuracies involved, I wanted to convert a word from its

Word Base/Stem Dictionary

Question: It seems my Google-fu is failing me. Does anyone know of a freely available word base dictionary that contains just the bases of words? So, for something like strawberries, it would have strawberry. But it should NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but just a text file of mappings or anything that could be read in would be helpful. Answer 1: This is called lemmatization, and what you call the "base of a word" is

Snowball Stemming: defining Regions

Question: I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, that are defined as follows: R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel. http://snowball.tartarus.org/texts/r1r2.html Examples are b e a u t i f u l |<-
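The two definitions quoted above can be sketched directly in code. This is a minimal illustration, assuming the vowel set a, e, i, o, u, y used by the English Snowball stemmer; the helper names `region_start` and `r1_r2` are made up here, and the sketch ignores the special cases that the full English stemmer adds on top of the basic definition.

```python
VOWELS = set("aeiouy")  # assumption: vowel set of the English Snowball stemmer

def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, scanning
    from `start`; len(word) (the null region) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r1_r2(word):
    """Return the R1 and R2 regions of `word` as strings."""
    r1 = region_start(word)        # R1: computed over the whole word
    r2 = region_start(word, r1)    # R2: same rule, applied inside R1
    return word[r1:], word[r2:]

for w in ("beautiful", "beauty", "beau", "sprinkled"):
    print(w, r1_r2(w))
```

For "beautiful", the first non-vowel after a vowel is "t", so R1 is "iful"; inside R1 the first non-vowel after a vowel is "f", so R2 is "ul". For "beau" there is no non-vowel after a vowel, so both regions are empty.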