Lucene Standard Analyzer vs Snowball


Yes, by using a stemmer such as Snowball you lose information about the original form of your text. Sometimes this will be useful, sometimes not.

For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.
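To make that concrete, here is a rough sketch (assuming a reasonably recent Lucene Java release; analyzer constructors have changed across versions) comparing what StandardAnalyzer and the Porter-stemming EnglishAnalyzer, which is closely related to Snowball's English stemmer, actually put into the index:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmingDemo {

    // Print the terms an analyzer would actually put into the index for some text.
    static void printTerms(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print(term + " ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "The organization organized an organ recital";

        // StandardAnalyzer: tokenizes and lowercases but does not stem, so
        // "organization", "organized" and "organ" remain distinct terms.
        printTerms(new StandardAnalyzer(), text);

        // EnglishAnalyzer (Porter stemming): "organization" and "organized"
        // are both reduced to "organ", so a query for any of them matches
        // documents containing the others.
        printTerms(new EnglishAnalyzer(), text);
    }
}
```

With the stemming analyzer, "organization", "organized" and "organ" all collapse to the same indexed term, which is exactly the recall boost, and the loss of the original form, described above.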

Whether this is appropriate for you depends on your content and on the type of queries you support (for example, are searches very basic, or are your users sophisticated and using search to precisely filter down the results?). You may also want to look into less aggressive stemmers, such as KStem.
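A sketch of wiring KStem into a custom analyzer, assuming a recent Lucene Java release (KStemFilter lives in the analyzers/analysis-common module, and the package for LowerCaseFilter has moved between versions):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class KStemAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        // KStem is dictionary-based and more conservative than Porter/Snowball:
        // it reduces "organizations" to "organization" but does not cut
        // "organization" down to "organ".
        result = new KStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
```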

Skarab

The Snowball analyzer will increase your recall, because it is much more aggressive than the standard analyzer. So you need to evaluate your search results to see whether, for your data, you need to increase recall or precision.
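One way to make that evaluation concrete is to measure precision and recall over a handful of hand-judged queries. A minimal sketch (the document ids and relevance judgments below are invented purely for illustration):

```java
import java.util.Set;

public class SearchEval {
    // Precision = fraction of retrieved docs that are relevant;
    // Recall    = fraction of relevant docs that were retrieved.
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4"); // what the query returned
        Set<String> relevant  = Set.of("d1", "d2", "d5");       // what a human judged relevant
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(retrieved, relevant), recall(retrieved, relevant));
        // precision = 2/4 = 0.50, recall = 2/3 ≈ 0.67
    }
}
```

Running the same judged queries against an index built with each analyzer shows which one actually serves your data better.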

I just finished an analyzer that performs lemmatization. That's similar to stemming, except that it uses context to determine a word's type (noun, verb, etc.) and uses that information to derive the stem. It also keeps the original form of the word in the index. Maybe my library can be of use to you. It requires Lucene Java, though, and I'm not aware of any C#/.NET lemmatizers.
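That library isn't shown here, but if all you need is the "keep the original form in the index" part, stock Lucene can approximate it with KeywordRepeatFilter in front of a stemmer. A sketch, assuming a recent Lucene Java release:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemAndKeepOriginalAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        // Emit each token twice: one copy flagged as a keyword (left untouched
        // by the stemmer) and one copy that the stemmer is allowed to change.
        result = new KeywordRepeatFilter(result);
        result = new PorterStemFilter(result);
        // When stemming did not change the token, drop the identical duplicate.
        result = new RemoveDuplicatesTokenFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
```

With that chain, "organization" is indexed both as "organization" and as its stem "organ" at the same position, so exact-form and stemmed queries can both match.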
