Stemming algorithm that produces real words

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I need some help now stemming the resulting word

3 Answers

没有蜡笔的小新

2020-12-04 06:50
The core issue here is that stemming algorithms operate purely on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to combine the stemmer's output with some form of lookup function that converts the stems back to real words. I can basically see two potential ways to do this:
1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)
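As a rough illustration of option 2, here is a minimal Python sketch that picks the candidate word closest to the stem by Levenshtein edit distance. The function names `edit_distance` and `closest_word` are made up for this example, and edit distance is only one possible similarity measure; ties are broken in favour of the shorter word, which is an assumption rather than anything the approach requires.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_word(stem, candidates):
    """Return the candidate with the smallest edit distance to the stem.
    Ties go to the shorter word, then alphabetical order (an assumption)."""
    return min(candidates, key=lambda w: (edit_distance(stem, w), len(w), w))


print(closest_word("communiti", ["communities", "community"]))  # community
```

"communiti" is one substitution away from "community" but two insertions away from "communities", so the shorter form wins here. A ratio-based measure could rank them the other way round, so the choice of metric matters.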
Personally, I think the way I would do it would be a dynamic form of #1, building up a custom dictionary database by recording every word examined along with what it stemmed to and then assuming that the most common word is the one that should be used. (e.g., If my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general and building it based on the stemmer input will provide results customized to your texts, with the primary drawback being the space required, which is generally not an issue these days.
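A minimal Python sketch of that dynamic dictionary might look like the following. `toy_stem` is a deliberately crude stand-in written only for this example (in practice you would pass in a real stemmer such as a Porter implementation), and `build_stem_dictionary` is likewise a hypothetical name.

```python
import re
from collections import Counter, defaultdict


def toy_stem(word):
    """Crude placeholder stemmer for demonstration only; not Porter."""
    for suffix, repl in (("ies", "i"), ("y", "i"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + repl
    return word


def build_stem_dictionary(text, stem=toy_stem):
    """Map each stem to the most frequent word that reduced to it."""
    counts = defaultdict(Counter)
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[stem(word)][word] += 1
    # For each stem, keep the word seen most often in this text.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


text = "Communities grow. A community thrives. Communities share."
mapping = build_stem_dictionary(text)
print(mapping["communiti"])  # communities (seen twice vs. once)
```

Because the mapping is built from your own texts, it reflects whichever form your corpus prefers. A persistent version would store the counts in a database rather than in memory.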

小蘑菇

2020-12-04 07:14
If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and about exceptional word forms like "written". It maps an input word form to its lemma, which is guaranteed to be a "real" word.
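To make the idea concrete, here is a toy Python sketch of how such a tool can combine an exception table with suffix rules checked against a vocabulary. This is not how morpha or any particular lemmatizer is implemented; the tables `EXCEPTIONS`, `SUFFIX_RULES` and `VOCABULARY` are tiny made-up samples, and real lemmatizers use far larger rule sets and lexicons.

```python
# Irregular forms that no suffix rule can handle (sample entries only).
EXCEPTIONS = {"written": "write", "went": "go", "mice": "mouse"}

# (suffix, replacement) pairs, tried in order (sample entries only).
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""),
                ("ed", ""), ("es", ""), ("s", "")]

# Known lemmas; a candidate is accepted only if it appears here.
VOCABULARY = {"community", "walk", "write", "go", "mouse", "box"}


def lemmatize(word, vocabulary=VOCABULARY):
    """Return the lemma of a word, or the lowercased word if none is found."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    if w in vocabulary:
        return w
    for suffix, repl in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[:-len(suffix)] + repl
            if candidate in vocabulary:  # only ever return real words
                return candidate
    return w


for form in ("Communities", "walked", "written"):
    print(form, "->", lemmatize(form))
```

The vocabulary check is what separates this from a stemmer: a rule is applied only when its result is a known word.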

There are many lemmatizers for English, though I've only used morpha. Morpha is just a big lex file which you can compile into an executable. Usage example:
```
$ cat test.txt 
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community
```
You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html

悲&欢浪女

2020-12-04 07:15

Hey, I don't know if this is too late, but the only PHP stemming script I found that produces real words is phpMorphy (http://phpmorphy.sourceforge.net/). It took me ages to find it. The other stemmers I came across have to be compiled, and even then they only implement the Porter algorithm, which produces stems, not lemmas (e.g. community -> communiti). phpMorphy works well, is easy to install and initialize, and has English, Russian, German, Ukrainian and Estonian dictionaries. It also comes with a script that you can use to compile other dictionaries. The documentation is in Russian, but it should be easy to follow if you put it through Google Translate.
