Stemming algorithm that produces real words

刺人心 2020-12-04 06:48

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I need some help now stemming the resulting word

3 answers
  •  没有蜡笔的小新
    2020-12-04 06:50

    The core issue here is that stemming algorithms work purely from the language's spelling rules, stripping and rewriting suffixes with no actual understanding of the language they're working with. To produce real words, you'll probably have to combine the stemmer's output with some form of lookup that converts the stems back to real words. I can see two potential ways to do this:

    1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
    2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)
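
    As a rough sketch of the second option, the comparison could be done with an edit distance. Everything named here (edit_distance, closest_word) is hypothetical and only for illustration; Levenshtein distance is my assumption for the "most similar" measure, chosen because it rates "community" (one substitution) closer to "communiti" than "communities" (two insertions).

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_word(stem: str, candidates: Iterable[str]) -> str:
    """Pick the candidate nearest to the stem.

    Ties are broken by preferring the shorter word, then alphabetical
    order; that tie-break is an arbitrary choice for this sketch.
    """
    return min(candidates, key=lambda w: (edit_distance(stem, w), len(w), w))


print(closest_word("communiti", ["communities", "community"]))
```

    The candidate list would be the words from your text that the stemmer reduced to that stem, so the function never has to guess at words it has not seen.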

    Personally, I would use a dynamic form of #1: build a custom dictionary by recording every word examined along with the stem it reduced to, then assume that the most common word for each stem is the one to use. (E.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's input gives results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
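
    A minimal sketch of that dynamic dictionary might look like the following. StemDictionary and toy_stem are made-up names; toy_stem is only a stand-in so the example runs on its own, and in practice you would pass in whatever stemmer you already use.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable


def toy_stem(word: str) -> str:
    """Stand-in stemmer for the demo only; not a real stemming algorithm."""
    for suffix, replacement in (("ies", "i"), ("y", "i"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


class StemDictionary:
    """Remembers which source words produced each stem, with counts."""

    def __init__(self, stem: Callable[[str], str]) -> None:
        self._stem = stem
        self._seen: Dict[str, Counter] = defaultdict(Counter)

    def record(self, words: Iterable[str]) -> None:
        """Stem every word and count it under the stem it reduced to."""
        for word in words:
            word = word.lower()
            self._seen[self._stem(word)][word] += 1

    def real_word(self, stem: str) -> str:
        """Return the most frequent word seen for this stem.

        Falls back to the stem itself if it was never recorded. On equal
        counts, the word recorded first wins.
        """
        counts = self._seen.get(stem)
        if not counts:
            return stem
        return counts.most_common(1)[0][0]


tags = StemDictionary(toy_stem)
tags.record("Communities grow when a community welcomes other communities".split())
print(tags.real_word("communiti"))
```

    Because the counts come from your own text, the mapping follows your corpus: here "communities" appears twice and "community" once, so the stem maps back to "communities".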
