Stemming algorithm that produces real words

刺人心 2020-12-04 06:48

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I need some help now stemming the resulting word

3 answers
  •  小蘑菇 (original poster)
    2020-12-04 07:14

    If I understand correctly, what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool that knows about endings like -ies and -ed, as well as irregular word forms like written. It maps an input word form to its lemma, which is guaranteed to be a "real" word.
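To make the idea concrete, here is a minimal sketch of what a lemmatizer does: look the word up in a table of irregular forms, and otherwise apply suffix rules. Everything in it (`EXCEPTIONS`, `SUFFIX_RULES`, `MIN_STEM`, `lemmatize`) is a hypothetical name chosen for illustration, and the handful of rules is far too crude for real text. It is not how morpha or any other real lemmatizer is implemented.

```python
# Toy lemmatizer: an exception table plus a few suffix rules.
# Illustrative only; real lemmatizers use much larger tables and rule sets.

# Irregular forms that no suffix rule can handle (sample entries, assumed).
EXCEPTIONS = {
    "written": "write",
    "wrote": "write",
    "went": "go",
    "children": "child",
}

# (suffix, replacement) pairs, tried in order; longer suffixes come first.
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # communities -> community
    ("ed", ""),      # walked -> walk
    ("s", ""),       # tags -> tag
]

# Refuse to strip a suffix if fewer than this many letters would remain.
MIN_STEM = 3


def lemmatize(word: str) -> str:
    """Return a lowercase lemma for `word` using the toy tables above."""
    lower = word.lower()

    # 1. Irregular forms win over any rule.
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]

    # 2. Otherwise apply the first suffix rule that matches.
    for suffix, replacement in SUFFIX_RULES:
        # Do not turn "class" into "clas".
        if suffix == "s" and lower.endswith("ss"):
            continue
        stem = lower[: len(lower) - len(suffix)]
        if lower.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem + replacement

    # 3. No rule applied: the word is treated as its own lemma.
    return lower


if __name__ == "__main__":
    for w in ["Community", "Communities", "written", "tags"]:
        print(w, "->", lemmatize(w))
    # Community -> community
    # Communities -> community
    # written -> write
    # tags -> tag
```

Unlike a stemmer, which might reduce both "Community" and "Communities" to a non-word such as "communiti", the exception table and the replacement half of each rule are what keep the output a dictionary word. A sketch this small will still get many words wrong, which is why an existing lemmatizer is the practical choice.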

    There are many lemmatizers for English, though I've only used morpha. Morpha is essentially one big lex file that you compile into an executable. Usage example:

    $ cat test.txt 
    Community
    Communities
    $ cat test.txt | ./morpha -uc
    Community
    Community
    

    You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
