Similar UTF-8 strings for autocomplete field

情歌与酒 2020-12-10 21:51

Background

Users can type in a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters.

2 Answers
  • 2020-12-10 22:23

    You are not using the operator class provided by the pg_trgm module. I would create an index like this:

    -- The index must be built on the same table and expression the queries use.
    CREATE INDEX label_lower_unaccent_trgm_idx
    ON the_table USING gist (lower(unaccent_text(label)) gist_trgm_ops);
    

    Originally, I had a GIN index here, but I later learned that a GiST index is probably even better suited for this kind of query because it can return values sorted by similarity. More details:

    • Postgresql: Matching Patterns between Two Columns
    • Finding similar strings with PostgreSQL quickly
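
    A minimal sketch of that sorted retrieval, assuming a GiST index on the expression lower(unaccent_text(label)) as created above. It relies on pg_trgm's distance operator <-> (one minus the similarity() value), which a GiST index can use to return the nearest matches first. The LIMIT of 10 is an arbitrary choice for an autocomplete list, not something from the question.

    ```sql
    -- Nearest-neighbor search: smallest trigram distance first.
    -- The ORDER BY expression has to match the index expression.
    SELECT label,
           lower(unaccent_text(label)) <-> 'fil' AS distance
    FROM   the_table
    ORDER  BY lower(unaccent_text(label)) <-> 'fil'
    LIMIT  10;  -- arbitrary cap for an autocomplete dropdown
    ```

    Unlike the % operator, this form does not filter by the similarity threshold; it only ranks rows, so weak matches can still appear at the end of the list.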

    Your query has to match the index expression to be able to make use of it.

    SELECT label
    FROM   the_table
    WHERE  lower(unaccent_text(label)) % 'fil'
    ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here
    

    However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect what you really want is this:

    SELECT label
    FROM   the_table
    WHERE  lower(unaccent_text(label)) ~~ '%fil%'
    ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here
    

    This will find all strings containing the search string and list the best matches first, ranked by the similarity() function (the same measure the % operator compares against its threshold).
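
    Since the user input may itself contain accents or capitals, the search term probably needs the same normalization as the column. A sketch under that assumption, using a prepared statement whose name (autocomplete) and LIMIT are purely illustrative:

    ```sql
    -- Hypothetical prepared statement; "autocomplete" is an illustrative name.
    -- $1 is the raw text typed by the user.
    PREPARE autocomplete(text) AS
    SELECT label
    FROM   the_table
    WHERE  lower(unaccent_text(label))
           LIKE ('%' || lower(unaccent_text($1)) || '%')
    ORDER  BY similarity(lower(unaccent_text(label)),
                         lower(unaccent_text($1))) DESC
    LIMIT  10;  -- arbitrary cap

    -- Accented, capitalized input is normalized before matching.
    EXECUTE autocomplete('Filé');
    ```

    Note that % and _ inside the user input act as LIKE wildcards here; escaping them is left out of this sketch.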

    And the juicy part: the expression can use a GIN or GiST index since PostgreSQL 9.1! I quote the manual on the pg_trgm module:

    Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example
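
    One way to check that the LIKE query really uses the trigram index is to look at the plan. This is a sketch for a test session only: on a small table the planner may prefer a sequential scan regardless, so the example temporarily disables sequential scans to make index usage visible.

    ```sql
    -- Testing aid only; do not leave this off in production sessions.
    SET enable_seqscan = off;

    -- The WHERE expression must match the index expression exactly.
    EXPLAIN
    SELECT label
    FROM   the_table
    WHERE  lower(unaccent_text(label)) LIKE '%fil%';

    RESET enable_seqscan;
    ```

    If the plan shows a scan on the trigram index rather than on the_table itself, the expression and operator class line up.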


    If you actually meant to use the % operator:

    Have you tried lowering the threshold for the similarity operator % with set_limit()? For example:

    SELECT set_limit(0.1);
    

    or even lower? The default is 0.3. This is just to see whether it's the threshold that filters out additional matches.
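
    To pick a threshold rather than guess one, it may help to look at the actual similarity scores first. A small sketch, assuming the the_table and unaccent_text definitions used in this thread; show_limit() reports the threshold currently in effect for the session.

    ```sql
    -- Threshold the % operator currently compares against.
    SELECT show_limit();

    -- Score every label against the search term, best first.
    SELECT label,
           similarity(lower(unaccent_text(label)), 'fil') AS sim
    FROM   the_table
    ORDER  BY sim DESC;
    ```

    Any row whose sim falls below the value from show_limit() is dropped by the % operator, so the lowest score you still want to see suggests the value to pass to set_limit().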

  • 2020-12-10 22:28

    A solution for PostgreSQL 9.1:

    -- Install the requisite extensions.
    CREATE EXTENSION pg_trgm;
    CREATE EXTENSION unaccent;
    
    -- Function fixes STABLE vs. IMMUTABLE problem of the unaccent function.
    CREATE OR REPLACE FUNCTION unaccent_text(text)
      RETURNS text AS
    $BODY$
      -- unaccent is STABLE, but indexes must use IMMUTABLE functions.
      SELECT unaccent($1); 
    $BODY$
      LANGUAGE sql IMMUTABLE
      COST 1;
    
    -- Create an unaccented index.
    CREATE INDEX the_table_label_unaccent_idx
    ON the_table USING gin (lower(unaccent_text(label)) gin_trgm_ops);
    
    -- Define the matching threshold.
    SELECT set_limit(0.175);
    
    -- Test the query (matching against the index expression).
    SELECT
      label
    FROM
      the_table
    WHERE
      lower(unaccent_text(label)) % 'fil'
    ORDER BY
      similarity(label, 'fil') DESC 
    

    Returns "filbert", "fish fillet", and "filé powder".

    Alternatively, without calling set_limit(0.175), you can use the double tilde (~~) operator, which is equivalent to LIKE:

    -- Test the query (matching against the index expression).
    SELECT
      label
    FROM
      the_table
    WHERE
      lower(unaccent_text(label)) ~~ '%fil%'
    ORDER BY
      similarity(label, 'fil') DESC 
    

    Also returns "filbert", "fish fillet", and "filé powder".
