Improving performance with a Similarity Postgres fuzzy self join query

空扰寡人 submitted on 2019-12-05 23:26:48
Erwin Brandstetter

Indices

The partial GiST index is good. I would at least test these two additional indices:

A GIN index:

CREATE INDEX ref_name_trgm_gin_idx ON ref_name
USING gin (ref_name gin_trgm_ops)
WHERE ref_name_type = 'E';

This may or may not be used. If you upgrade to Postgres 9.4, chances are much better because there have been major improvements to GIN indexes.

A varchar_pattern_ops index:

CREATE INDEX ref_name_pattern_ops_idx
ON ref_name (ref_name varchar_pattern_ops)
WHERE ref_name_type = 'E';
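
To see whether the planner actually picks one of these indices, compare query plans with EXPLAIN. A minimal sketch, assuming the pg_trgm extension is available and the table and columns are as in the question; the search term 'Alexander' is only a placeholder:

```sql
-- pg_trgm provides gin_trgm_ops, the % operator and the <-> distance operator.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram similarity match. Repeating ref_name_type = 'E' lets the planner
-- consider the partial indices.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ref_name_id, ref_name
FROM   ref_name
WHERE  ref_name_type = 'E'
AND    ref_name % 'Alexander';   -- placeholder search term

-- Left-anchored pattern match, the case the varchar_pattern_ops index targets.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ref_name_id, ref_name
FROM   ref_name
WHERE  ref_name_type = 'E'
AND    ref_name ~~ 'A%';
```

The index names in the plan output show which index, if any, was chosen. An index that never shows up for your real queries only adds write cost.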

Query

The problem at the heart of this query is that you are running into a cross join with O(N²) when checking all rows against all rows. Performance becomes unbearable with a very big number of rows. You seem to be well aware of the dynamic. The defense is to limit possible combinations. You already took a step in that direction by limiting to the same first letter.
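
To get a feeling for the numbers, you can count how many candidate pairs each first letter produces. This is only an illustrative sketch against the table from the question, not part of the solution:

```sql
-- Number of candidate pairs per first letter, counting each unordered pair once.
SELECT left(ref_name, 1)             AS first_letter
     , count(*)                      AS names
     , count(*) * (count(*) - 1) / 2 AS candidate_pairs
FROM   ref_name
WHERE  ref_name_type = 'E'
GROUP  BY 1
ORDER  BY candidate_pairs DESC;
```

A letter with 10,000 names already yields 49,995,000 pairs to compare, which is why restricting the candidates matters more than any single index.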

A very good option here is to build on a special talent of GiST indices: nearest neighbour search. There is a hint in the manual for this query technique:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

A GIN index may still get used in addition to the GiST index. You have to weigh cost and benefit. It may be cheaper overall to stick with one big index in versions before 9.4, but a second index is probably worth it in pg 9.4.
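
For a single search term, the nearest neighbour technique looks roughly like this. A sketch only: the commented index definition is my assumption about the shape of your existing partial GiST index (its name is hypothetical), and 'Alexander' is a placeholder:

```sql
-- Assumed shape of the existing partial GiST index (name is hypothetical):
-- CREATE INDEX ref_name_trgm_gist_idx ON ref_name
-- USING gist (ref_name gist_trgm_ops)
-- WHERE ref_name_type = 'E';

-- Nearest neighbour search: <-> is the trigram distance (1 - similarity),
-- so ordering by it ascending returns the most similar names first.
SELECT ref_name_id
     , ref_name
     , ref_name <-> 'Alexander' AS distance   -- placeholder search term
FROM   ref_name
WHERE  ref_name_type = 'E'
ORDER  BY ref_name <-> 'Alexander'
LIMIT  5;
```

The queries below apply the same ORDER BY ... LIMIT pattern once per row of the outer table instead of once per constant.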

Postgres 9.2

Use correlated subqueries to substitute for the LATERAL join, which does not exist yet in 9.2:

SELECT a.*
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM  (
   SELECT ref_name_id
        , ref_name
        , name_display
        , (SELECT ref_name_id AS match_name_id
           FROM   ref_name b
           WHERE  ref_name_type = 'E'
           AND    ref_name ~~ 'A%'
           AND    ref_name_id > a.ref_name_id
           AND    ref_name % a.ref_name
           ORDER  BY ref_name <-> a.ref_name
           LIMIT  1                                -- max. 1 best match
          )
   FROM   ref_name a
   WHERE  ref_name ~~ 'A%'
   AND    ref_name_type = 'E'
   ) a
JOIN   ref_name b ON b.ref_name_id = a.match_name_id
ORDER  BY 1;

Obviously, this also needs an index on ref_name_id, which should normally be the PK and therefore indexed automatically.

I added two more variants in the SQL Fiddle.

Postgres 9.3+

Use a LATERAL join for matching set to set. Similar to chapter 2a in this related answer:

SELECT a.ref_name_id
     , a.ref_name
     , a.name_display
     , b.ref_name_id  AS match_name_id
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM   ref_name a
,   LATERAL (
   SELECT b.ref_name_id, b.ref_name, b.name_display
   FROM   ref_name b
   WHERE  b.ref_name ~~ 'A%'
   AND    b.ref_name_type = 'E'
   AND    a.ref_name_id < b.ref_name_id
   AND    a.ref_name % b.ref_name  -- also enforce min. similarity
   ORDER  BY a.ref_name <-> b.ref_name
   LIMIT  10                                -- max. 10 best matches
   ) b
WHERE  a.ref_name ~~ 'A%'   -- you can extend the search
AND    a.ref_name_type = 'E'
ORDER  BY 1;
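
The minimum similarity enforced by the % operator is the pg_trgm similarity threshold, which defaults to 0.3. A short sketch of how to inspect and change it for the current session; the value 0.5 is just an example, pick what fits your data:

```sql
SELECT show_limit();     -- current threshold, 0.3 unless changed
SELECT set_limit(0.5);   -- stricter matching for the current session
```

In Postgres 9.6 or later the same threshold is exposed as the setting pg_trgm.similarity_threshold. A higher threshold means fewer rows pass the % filter before sorting by distance.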

SQL Fiddle with all variants compared to your original query on 40k rows modeled after your case.

The queries are 2-5x faster than your original in the fiddle, and I expect them to scale much better with millions of rows. You'll have to test.

Extending the search for matches in b to all rows (while limiting candidates in a to a reasonable number) is rather cheap, too. I added two other variants to the fiddle.

Aside: I ran all tests with text instead of varchar, but that shouldn't make a difference.

