Slovenian stemmer for Sphinx

Submitted by て烟熏妆下的殇ゞ on 2019-11-30 16:00:43

Question


I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.

What I'm trying to achieve is this: when searching for 'jabolka', for example, I also want results for documents containing 'jabolko', 'jabolki', 'jabolk', etc.

I found some references to the existence of a Slovenian stemmer, but I can't find where to download it; it's not even for sale anywhere.

Another option I've come across is the wordforms option in the Sphinx config (http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms), but building my own dictionary would be too difficult, so I'm wondering whether any publicly accessible dictionaries are already available.
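
For reference, a Sphinx wordforms file is plain text with one `source form > destination form` mapping per line. Below is a minimal sketch of producing such lines for the example above. The helper name `make_wordforms` is made up for illustration, and treating 'jabolko' as the normal form is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: print one "form > lemma" line per inflected form,
# which is the line format a Sphinx wordforms file uses.
make_wordforms() {
    lemma=$1
    shift
    for form in "$@"; do
        printf '%s > %s\n' "$form" "$lemma"
    done
}

# Map the inflected forms from the question onto a single normal form.
# Redirect the output to a file to get something the wordforms option can read.
make_wordforms jabolko jabolka jabolki jabolk
```

The resulting file would then be referenced from the index definition with the wordforms option. Doing this by hand only scales to a handful of words, which is why a ready-made dictionary is the real question here.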


In case a Slovenian stemmer is not available, can somebody suggest another approach to achieving similar search results?


Answer 1:


I managed to compile a Slovenian stemmer with the following steps:

  1. Download http://snowball.tartarus.org/dist/snowball_code.tgz (the source code for Snowball) and unpack it.
  2. Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the unpacked project from step 1, in the folder /algorithms/slovene. The file has to be named stem_ISO_8859_2.sbl.
  3. The algorithm is in ISO 8859-2 encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you have to find the Unicode character codes for Slovenian special characters such as ČŠŽĆ).
  4. Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:

    slovene         UTF_8,ISO_8859_2        slovene,sl,slv
    
  5. Edit /GNUmakefile and add slovene (once to the list of languages for UTF-8 and once to ISO_8859_2_algorithms).
  6. Go to the /libstemmer folder and run:

    ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
    ./mkmodules.pl modules_utf8.h src_c  modules_utf8.txt ../mkinc_utf8.mak
    

    This will generate the files needed for compiling later.

  7. Run make (from the root of the unpacked files).
  8. If there were no errors during compilation, you should have a /src_c folder containing the code for the Slovenian stemmer (next to the others):

    stem_UTF_8_slovene.c
    stem_ISO_8859_2_slovene.c
    ...
    
  9. Unpack the latest Sphinx and copy all files from your Snowball project to the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).

  10. Compile Sphinx:

    touch NEWS README AUTHORS ChangeLog
    autoreconf --force --install
    ./configure --with-libstemmer
    make
    make install
    
  11. If all went fine, you should have the Slovene stemmer working in Sphinx; you just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):

    charset_type = utf-8
    morphology = libstemmer_slovene
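
For step 3, one way to look up the character codes is to ask iconv for the byte values of each special character in both encodings. This is only a sketch: the helper name `char_hex` is made up, and it assumes the script is saved as UTF-8 and that iconv and od are installed. Whether the .sbl file needs these codes replaced by hand depends on how it defines the characters, which is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the hex bytes of one character ($1)
# after converting it from UTF-8 to the target encoding ($2).
char_hex() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | od -An -tx1 | tr -d ' \n'
}

for ch in Č č Š š Ž ž Ć ć; do
    # ISO-8859-2 yields the single byte used in the ISO version of the file;
    # UTF-16BE yields the Unicode code point as four hex digits.
    printf '%s  ISO-8859-2: %s  Unicode: U+%s\n' \
        "$ch" "$(char_hex "$ch" ISO-8859-2)" "$(char_hex "$ch" UTF-16BE)"
done
```

For example, č comes out as e8 in ISO-8859-2 and as U+010d in Unicode.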
    

Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.

This Slovene stemmer is not officially released on http://snowball.tartarus.org, but in my tests it works well enough for my project.
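
A quick way to try the stemmer before wiring it into Sphinx might be the stemwords tool from the Snowball source tree. The sketch below rests on several assumptions: that make produced a stemwords binary in the project root, that it accepts `-l slovene` after the steps above, and that it reads words from standard input and writes stems to standard output when no files are given. The helper name `show_stems` is made up.

```shell
#!/bin/sh
# Hypothetical helper: feed words to Snowball's stemwords tool and print
# the resulting stems. $1 is the path to the stemwords binary (assumed to
# have been built by make); the remaining arguments are the words to stem.
show_stems() {
    bin=$1
    shift
    if [ ! -x "$bin" ]; then
        echo "stemwords not found at $bin" >&2
        return 1
    fi
    # One word per line on stdin (assumed default input when -i is omitted).
    printf '%s\n' "$@" | "$bin" -l slovene
}

# Example call, run from the root of the unpacked Snowball project:
#   show_stems ./stemwords jabolka jabolko jabolki jabolk
```

If the four forms from the question come back as the same stem, searches for any of them should match the others once the index is rebuilt.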




Answer 2:


I'm not sure if this will do what you want, but I came across this reference to a tool called spelldump in the Sphinx documentation:

spelldump is one of the helper tools within the Sphinx package.

It is used to extract the contents of a dictionary file that uses ispell or MySpell format, which can help build word lists for wordforms - all of the possible forms are pre-built for you.

http://sphinxsearch.com/docs/current.html#ref-spelldump

It requires "a dictionary file that uses ispell or MySpell" - I found a reference to a Slovenian ispell dictionary file, which might be suitable.
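
As a rough sketch of how that could look, spelldump takes the dictionary file, the affix file and an optional result file as arguments. The wrapper name `dump_wordforms` and the file names in the example call are placeholders, not files that ship with Sphinx:

```shell
#!/bin/sh
# Hypothetical wrapper around spelldump: $1 = dictionary file,
# $2 = affix file, $3 = output file for the generated word list.
dump_wordforms() {
    dic=$1
    aff=$2
    out=$3
    if ! command -v spelldump >/dev/null 2>&1; then
        echo "spelldump is not on PATH" >&2
        return 1
    fi
    if [ ! -r "$dic" ] || [ ! -r "$aff" ]; then
        echo "dictionary or affix file is missing" >&2
        return 1
    fi
    spelldump "$dic" "$aff" "$out"
}

# Example call with placeholder file names:
#   dump_wordforms sl_SI.dic sl_SI.aff wordforms.txt
```

The output file could then be reviewed and pointed to by the wordforms option in the index configuration.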

Good luck!



Source: https://stackoverflow.com/questions/8714040/slovenian-stemmer-for-sphinx
