stemming | 易学教程

MySQL fulltext with stems

Question: I am building a little search function for my site. I take the user's query, stem the keywords, and then run a fulltext MySQL search against the stemmed keywords. The problem is that MySQL treats the stems as literal words. Here is what happens: (1) the user searches for a word like "baseballs"; (2) my stemming algorithm (Porter Stemmer) turns "baseballs" into "basebal"; (3) fulltext does not find anything matching "basebal", even though there SHOULD be matches for "baseball" and
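One common way around the literal-match problem is to search in BOOLEAN MODE and append the `*` truncation operator to each stem, so that "basebal*" matches "baseball" and "baseballs". The sketch below only builds the search string; the function name, the `articles` table, and the `body` column are hypothetical, and the stems are assumed to come from whatever stemmer is already in use.

```python
import re


def to_boolean_prefix_query(stems):
    """Build a MySQL BOOLEAN MODE search string of prefix terms from stems."""
    terms = []
    for stem in stems:
        # Strip anything that is not a word character, so characters that
        # act as boolean-mode operators (+, -, ", *, etc.) cannot slip in.
        cleaned = re.sub(r"[^\w]", "", stem)
        if cleaned:
            # A trailing * asks MySQL to match any word starting with the stem.
            terms.append(cleaned + "*")
    return " ".join(terms)


# Hypothetical table and column names; pass the search string as a bound
# parameter rather than concatenating it into the SQL text.
SQL = "SELECT id FROM articles WHERE MATCH(body) AGAINST (%s IN BOOLEAN MODE)"

search_string = to_boolean_prefix_query(["basebal", "glove"])
print(search_string)
```

Terms without a leading `+` are optional in boolean mode, so this behaves like an OR search over the prefixes. Very short stems may still be ignored because of the server's minimum word length setting.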

Stemming algorithm that produces real words

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities. I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way): http://tartarus.org/~martin/PorterStemmer/php.txt This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun". I've tried "Snowball" (suggested within another Stack Overflow thread). http://snowball.tartarus.org/demo.php For my example
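One way to get "real" words out of a stemmer is to use the stem only as a grouping key and then represent each group by one of the original word forms. The sketch below is in Python rather than PHP and assumes the shortest surface form is an acceptable tag. `toy_stem` is a hypothetical stand-in for a real Porter or Snowball stemmer, not an implementation of either.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    word = word.lower()
    for suffix in ("ies", "y", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def canonical_tags(words, stem=toy_stem):
    """Group words by stem and return one real word per group."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word.lower())
    # Pick the shortest form in each group; break ties alphabetically.
    return sorted(min(forms, key=lambda f: (len(f), f)) for forms in groups.values())


print(canonical_tags(["Community", "Communities", "tag", "tags"]))
```

The stem itself ("communit" here, "commun" with Porter) never reaches the output, so it does not matter that it is not a dictionary word.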

Can you programmatically detect pluralizations of English words, and derive the singular form?

Question: Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible. Some examples:
  Examples -> Example (a simple 's' suffix)
  Glitches -> Glitch ('es' suffix, as opposed to above)
  Countries -> Country ('ies' suffix)
  Sheep -> Sheep (no change: possible fallback for indeterminate values)
Or, this seems to be a fairly exhaustive list. Suggestions of libraries in language x are fine, as long as they are open
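A rough rule-based sketch covering just the four example cases above, with no lookup table. The function name and the rule order are assumptions for illustration. Irregular plurals (mice, cacti) and many edge cases (quizzes, buses) are not handled, and that gap is where dictionary-backed inflection libraries usually come in.

```python
def singularize(word):
    """Derive a singular form with a few suffix rules; unknown shapes pass through."""
    lower = word.lower()
    # countries -> country
    if lower.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # glitches -> glitch, boxes -> box, classes -> class
    if lower.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]
    # examples -> example, but leave words like "glass", "cactus", "basis" alone
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return word[:-1]
    # Fallback for indeterminate values: return the word unchanged.
    return word


for plural in ("Examples", "Glitches", "Countries", "Sheep"):
    print(plural, "->", singularize(plural))
```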

Stemming English words with Lucene

I'm processing some English texts in a Java application, and I need to stem them. For example, from the text "amenities/amenity" I need to get "amenit". The function looks like: String stemTerm(String term){ ... } I've found the Lucene Analyzer, but it looks way too complicated for what I need. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/PorterStemFilter.html Is there a way to use it to stem words without building an Analyzer? I don't understand all the Analyzer business... EDIT: I actually need stemming + lemmatization. Can Lucene do this? import org.apache.lucene

What is the best stemming method in Python?

Question: I tried all the nltk methods for stemming, but they give me weird results with some words. Examples: it often cuts the end of words when it shouldn't (poodle => poodl, article => articl), or it doesn't stem very well (easily and easy are not stemmed to the same word; leaves, grows, fairly are not stemmed). Do you know other stemming libs in Python, or a good dictionary? Thank you. Answer 1: Python implementations of the Porter, Porter2, Paice-Husk, and Lovins stemming algorithms for English are available in

TreeTagger installation successful but cannot open .par file

Does anyone know how to resolve this file reading error in TreeTagger, a common Natural Language Processing tool used to POS tag, lemmatize and chunk sentences?
  alvas@ikoma:~/treetagger$ echo 'Hello world!' | cmd/tree-tagger-english
  reading parameters ...
  ERROR: Can't open for reading: /home/alvas/treetagger/lib/english.par
  aborted.
I didn't encounter any of the possible installation problems hinted at on http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/installation-hints.txt . I've followed the instructions on the webpage and it's installed properly ( http://www.ims.uni-stuttgart.de

Stemming with R Text Analysis

Question: I am doing a lot of analysis with the TM package. One of my biggest problems is related to stemming and stemming-like transformations. Let's say I have several accounting-related terms (I am aware of the spelling issues). After stemming we have:
  accounts -> account
  account -> account
  accounting -> account
  acounting -> acount
  acount -> acount
  acounts -> acount
  accounnt -> accounnt
Result: 3 terms (account, acount, accounnt) where I would have liked 1 (account), as all these relate to the same
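Stemming alone cannot fix misspellings, so one possible follow-up step is to map each stemmed term to its closest match in a list of known-good terms. The sketch below is in Python rather than R and uses the standard library's `difflib`. The function name, the vocabulary, and the 0.85 cutoff are assumptions for illustration, and the cutoff would need tuning on real data.

```python
from difflib import get_close_matches


def merge_misspellings(terms, vocabulary, cutoff=0.85):
    """Map each term to its closest vocabulary entry, or to itself if none is close."""
    merged = {}
    for term in terms:
        # n=1 keeps only the best candidate at or above the similarity cutoff.
        match = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        merged[term] = match[0] if match else term
    return merged


stemmed = ["account", "acount", "accounnt"]
print(merge_misspellings(stemmed, vocabulary=["account"]))
```

With the three stemmed terms from the example, all of them collapse onto "account", while terms with no close vocabulary entry are left as they are.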

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones. My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right. See also: "Stemming algorithm that produces real words"; "Stemming - code examples or open source projects?" theycallmemorty: If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet. Note that if you are using this lemmatizer for the first time, you must download the corpus before using it. This can be done by: >>> import
