Question
I need a good Python module for stemming text documents in the pre-processing stage.
I found this one:
http://pypi.python.org/pypi/PyStemmer/1.0.1
but I cannot find the documentation at the link provided.
If anyone knows where to find the documentation, or knows of any other good stemming algorithm, please help.
Answer 1:
You may want to try NLTK:
>>> from nltk import PorterStemmer
>>> PorterStemmer().stem('complications')
Answer 2:
The Python stemming package has implementations of various stemming algorithms, including Porter, Porter2, Paice-Husk, and Lovins: http://pypi.python.org/pypi/stemming/1.0
>>> from stemming.porter2 import stem
>>> stem("factionally")
faction
Answer 3:
All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results such as:
In [3]: from nltk.stem.porter import *
In [4]: stemmer = PorterStemmer()
In [5]: stemmer.stem('identified')
Out[5]: u'identifi'
In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'
To get the root words correctly, one needs a dictionary-based stemmer such as the Hunspell stemmer. A Python binding for it is available. Example code:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
Answer 4:
The gensim package for topic modelling comes with a Porter Stemmer algorithm:
>>> from gensim import parsing
>>> parsing.stem_text("trying writing nonsense")
'try write nonsens'
The PorterStemmer is the only stemming option implemented in gensim.
As a side note: I can imagine (without further references) that most text-mining-related modules have their own implementations of simple pre-processing procedures like Porter's stemming, white-space removal and stop-word removal.
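To make those pre-processing steps concrete, here is a minimal standard-library sketch. It is not taken from gensim or any other package: `preprocess`, `STOP_WORDS` and the `stem` parameter are names invented for this example, and the stop-word list is deliberately tiny. Any stemming function (for instance one from the libraries mentioned in this thread) could be passed in as `stem`.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "to"})


def preprocess(text, stem=lambda word: word, stop_words=STOP_WORDS):
    """Normalize whitespace, tokenize, drop stop words, then stem.

    `stem` is any callable mapping a word to its stem; the default
    leaves words unchanged.
    """
    # White-space removal: collapse runs of spaces/tabs/newlines, trim ends.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    # Tokenize on runs of ASCII letters and digits (punctuation is dropped).
    tokens = re.findall(r"[a-z0-9]+", normalized)
    # Stop-word removal followed by stemming.
    return [stem(token) for token in tokens if token not in stop_words]


if __name__ == "__main__":
    print(preprocess("  The  cats\tand the dog "))
```

Keeping the stemmer as a parameter means the same pipeline works whichever stemming library is eventually chosen.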
Answer 5:
PyStemmer is a Python interface to the Snowball stemming library.
Documentation can be found here:
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart_python3.txt
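For a quick impression of how PyStemmer is used, here is a small sketch. The helper `stem_document` is a name made up for this example; it only assumes an object with a `stemWords` method, which is what PyStemmer's `Stemmer.Stemmer` provides. The demo at the bottom runs only if PyStemmer is installed.

```python
def stem_document(text, stemmer):
    """Lower-case and whitespace-tokenize `text`, then stem every token.

    `stemmer` is any object exposing stemWords(list_of_str), such as
    the Stemmer.Stemmer objects from the PyStemmer package.
    """
    tokens = text.lower().split()
    return stemmer.stemWords(tokens)


if __name__ == "__main__":
    try:
        import Stemmer  # provided by the PyStemmer package
    except ImportError:
        print("PyStemmer is not installed; try: pip install PyStemmer")
    else:
        # Stemmer.algorithms() lists the available Snowball languages.
        english = Stemmer.Stemmer("english")
        print(stem_document("Stemming text documents", english))
```

Creating the `Stemmer.Stemmer` object once and reusing it for the whole document avoids rebuilding it for every word.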
Source: https://stackoverflow.com/questions/10369393/need-a-python-module-for-stemming-of-text-documents