PostgreSQL full-text search for the Czech language (no default language config)

Question

I am trying to set up full-text search for the Czech language. I am a little confused, because I see cs_cz.affix and cs_cz.dict files inside the tsearch_data folder, but there is no Czech language configuration (it's probably not shipped with Postgres).

So should I create one? Which dictionaries do I have to create and configure? Is there any support for the Czech language at all? Should I use all of the available dictionary types (Synonym Dictionary, Thesaurus Dictionary, Ispell Dictionary, Snowball Dictionary)?

I was able to create a Czech configuration with an Ispell dictionary and it works fine, but I am not sure whether Ispell alone is enough.
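
For reference, an Ispell-only setup of the kind described here might look roughly like the sketch below. It writes the SQL to a file so it can be reviewed before it is run. The names czech_ispell and czech are hypothetical; cs_cz is assumed to match the cs_cz.dict and cs_cz.affix files in tsearch_data, and no stop word file is configured because none is assumed to exist.

```shell
#!/bin/sh
# Sketch only: writes SQL for an Ispell-based Czech configuration to the
# file given as $1. "czech_ispell" and "czech" are hypothetical names.
write_czech_fts_sql() {
    cat > "$1" <<'EOF'
-- Ispell dictionary backed by tsearch_data/cs_cz.dict and cs_cz.affix.
CREATE TEXT SEARCH DICTIONARY czech_ispell (
    TEMPLATE = ispell,
    DictFile = cs_cz,
    AffFile  = cs_cz
);

-- Start from the built-in "simple" configuration, then send word tokens
-- to Ispell first and fall back to "simple" for words Ispell does not know.
CREATE TEXT SEARCH CONFIGURATION czech (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION czech
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH czech_ispell, simple;
EOF
}
```

Calling write_czech_fts_sql czech_fts.sql and then psql -f czech_fts.sql against the target database would create the objects; after that, to_tsvector('czech', ...) can be used to check how Czech text is normalized.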

Thanks a lot. I tried to read https://www.postgresql.org/docs/9.5/static/textsearch.html, but I am still a little confused.

Answer 1:

I have never tried it, but you should be able to create a Czech Snowball stemmer as long as you are ready to compile PostgreSQL from source.

There is an explanation in src/backend/snowball/README:

The files under src/backend/snowball/libstemmer/ and src/include/snowball/libstemmer/ are taken directly from their libstemmer_c distribution, with only some minor adjustments of file inclusions. Note that most of these files are in fact derived files, not master source. The master sources are in the Snowball language, and are available along with the Snowball-to-C compiler from the Snowball project. We choose to include the derived files in the PostgreSQL distribution because most installations will not have the Snowball compiler available.

To update the PostgreSQL sources from a new Snowball libstemmer_c distribution:

1. Copy the *.c files in libstemmer_c/src_c/ to src/backend/snowball/libstemmer with replacement of "../runtime/header.h" by "header.h", for example

    for f in libstemmer_c/src_c/*.c
    do
        sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f`
    done

   (Alternatively, if you rebuild the stemmer files from the master Snowball sources, just omit "-r ../runtime" from the Snowball compiler switches.)

2. Copy the *.c files in libstemmer_c/runtime/ to src/backend/snowball/libstemmer, and edit them to remove direct inclusions of system headers such as <stdio.h> – they should only include "header.h". (This removal avoids portability problems on some platforms where <stdio.h> is sensitive to largefile compilation options.)

3. Copy the *.h files in libstemmer_c/src_c/ and libstemmer_c/runtime/ to src/include/snowball/libstemmer. At this writing the header files do not require any changes.

4. Check whether any stemmer modules have been added or removed. If so, edit the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the stemmer_modules[] table in dict_snowball.c.

5. The various stopword files in stopwords/ must be downloaded individually from pages on the snowball.tartarus.org website. Be careful that these files must be stored in UTF-8 encoding.
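
The copy steps from the README could be scripted roughly as follows. This is only a sketch: import_libstemmer is a hypothetical helper name, the two arguments (the unpacked libstemmer_c directory and the top of the PostgreSQL source tree) are my own convention, and the removal of system headers from the runtime files is only flagged, not automated, because the README describes it as a manual edit.

```shell
#!/bin/sh
# Sketch of the README's copy steps. import_libstemmer is a hypothetical
# helper: $1 = unpacked libstemmer_c directory, $2 = PostgreSQL source tree.
import_libstemmer() {
    libstemmer_dir=$1
    pg_src=$2
    dest_c="$pg_src/src/backend/snowball/libstemmer"
    dest_h="$pg_src/src/include/snowball/libstemmer"
    mkdir -p "$dest_c" "$dest_h"

    # Copy the generated stemmers, rewriting the include path of header.h.
    for f in "$libstemmer_dir"/src_c/*.c; do
        sed 's|\.\./runtime/header\.h|header.h|' "$f" > "$dest_c/$(basename "$f")"
    done

    # Copy the runtime sources and report the ones that still include
    # system headers; those includes have to be removed by hand.
    for f in "$libstemmer_dir"/runtime/*.c; do
        target="$dest_c/$(basename "$f")"
        cp "$f" "$target"
        if grep -q '^#include[[:space:]]*<' "$target"; then
            echo "remove system headers by hand: $target"
        fi
    done

    # Copy all header files unchanged.
    cp "$libstemmer_dir"/src_c/*.h "$libstemmer_dir"/runtime/*.h "$dest_h/"
}
```

It would be called as import_libstemmer /path/to/libstemmer_c /path/to/postgresql, with both paths being placeholders.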

Now, a Czech Snowball stemmer does exist; it was contributed to the Snowball project. There is no stop word file available for it, but I am sure you can either find one or create one yourself.
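
If you build the stop word file yourself, something like the following sketch could normalize a hand-written word list. build_stopwords is a hypothetical helper, and the input is assumed to be one word per line. It refuses input that is not valid UTF-8, since the README insists on that encoding.

```shell
#!/bin/sh
# Sketch only: build_stopwords is a hypothetical helper.
# $1 = raw word list (one word per line), $2 = output .stop file.
build_stopwords() {
    src=$1
    dest=$2
    # Fail early if the input is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$src" > /dev/null || return 1
    # Trim surrounding whitespace, drop empty lines, remove duplicates.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$src" |
        LC_ALL=C sort -u > "$dest"
}
```

The result would go into the tsearch_data directory under the path printed by pg_config --sharedir, for example as czech.stop, so that a dictionary can refer to it with StopWords = czech.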

The real work would be to install Snowball and use the Snowball-to-C compiler to create the C and header files to add to the PostgreSQL source. These files should then remain stable, so it shouldn't be difficult to upgrade to a new PostgreSQL version.
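
Once the generated files are in the source tree, the README's "check whether any stemmer modules have been added" step still applies. A rough way to spot a stemmer that has not been registered yet is sketched below. check_stemmer_registration is a hypothetical helper, and it is only a heuristic: it looks for the object file name in the Makefile and the header name in dict_snowball.c, and it does not inspect the stemmer_modules[] table, which still has to be edited by hand.

```shell
#!/bin/sh
# Heuristic sketch: check_stemmer_registration is a hypothetical helper.
# $1 = top of the PostgreSQL source tree. Returns 1 if anything is missing.
check_stemmer_registration() {
    snowball_dir="$1/src/backend/snowball"
    missing=0
    for f in "$snowball_dir"/libstemmer/stem_*.c; do
        [ -e "$f" ] || continue
        name=$(basename "$f" .c)
        # The Makefile's OBJS list should mention the object file.
        if ! grep -q "$name\.o" "$snowball_dir/Makefile"; then
            echo "not in Makefile OBJS: $name.o"
            missing=1
        fi
        # dict_snowball.c should include the matching header.
        if ! grep -q "$name\.h" "$snowball_dir/dict_snowball.c"; then
            echo "not included in dict_snowball.c: $name.h"
            missing=1
        fi
    done
    return "$missing"
}
```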

If you are willing to do the work, but don't want to patch PostgreSQL and build it from source every time, you could also consider submitting a patch to PostgreSQL. As long as the stemmer works fine, I don't expect that you will meet much resistance there (but the patch submission process is still tedious).

Source: https://stackoverflow.com/questions/42540638/postgresql-fulltext-search-for-czech-language-no-default-language-config

Tags

postgresql

full-text-search