Slow select distinct query on postgres

Backend · Open · 4 answers · 1439 views
自闭症患者 2020-12-30 09:55

I'm doing the following two queries quite frequently on a table that essentially gathers up logging information. Both select distinct values from a huge number of rows, but both are slow.

4 answers
  • 2020-12-30 10:34

    I have the same problem with tables of > 300 million records and an indexed field with only a few distinct values. I couldn't get rid of the seq scan, so I wrote this function to simulate a DISTINCT search using the index when one exists. If your table has a number of distinct values proportional to the total number of records, this function isn't a good fit. It would also have to be adjusted for multi-column distinct values. Warning: this function is wide open to SQL injection and should only be used in a secured environment.

    Explain analyze results:
    Query with normal SELECT DISTINCT: Total runtime: 598310.705 ms
    Query with SELECT small_distinct(...): Total runtime: 1.156 ms

    CREATE OR REPLACE FUNCTION small_distinct(
       tableName varchar, fieldName varchar, sample anyelement = ''::varchar)
       -- Search a few distinct values in a possibly huge table
       -- Parameters: tableName or query expression, fieldName,
       --             sample: any value to specify result type (default is varchar)
       -- Author: T.Husson, 2012-09-17, distribute/use freely
       RETURNS TABLE ( result anyelement ) AS
    $BODY$
    BEGIN
       EXECUTE 'SELECT '||fieldName||' FROM '||tableName||' ORDER BY '||fieldName
          ||' LIMIT 1'  INTO result;
       WHILE result IS NOT NULL LOOP
          RETURN NEXT;
          EXECUTE 'SELECT '||fieldName||' FROM '||tableName
             ||' WHERE '||fieldName||' > $1 ORDER BY ' || fieldName || ' LIMIT 1'
             INTO result USING result;
       END LOOP;
    END;
    $BODY$ LANGUAGE plpgsql VOLATILE;
    

    Call samples:

    SELECT small_distinct('observations','id_source',1);
    SELECT small_distinct('(select * from obs where id_obs > 12345) as temp',
       'date_valid','2000-01-01'::timestamp);
    SELECT small_distinct('addresses','state');
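
    The injection risk the author warns about can be reduced with format() and its %I identifier placeholder, at the cost of the query-expression feature. A hardened sketch (not the author's code; it only accepts a plain table name, via regclass):

    CREATE OR REPLACE FUNCTION small_distinct_safe(
       tbl regclass, fieldName text, sample anyelement = ''::varchar)
       RETURNS TABLE ( result anyelement ) AS
    $BODY$
    BEGIN
       -- %I quote-escapes the column name; regclass safely quotes the table name
       EXECUTE format('SELECT %I FROM %s ORDER BY %I LIMIT 1',
                      fieldName, tbl, fieldName) INTO result;
       WHILE result IS NOT NULL LOOP
          RETURN NEXT;
          EXECUTE format('SELECT %I FROM %s WHERE %I > $1 ORDER BY %I LIMIT 1',
                         fieldName, tbl, fieldName, fieldName)
             INTO result USING result;
       END LOOP;
    END;
    $BODY$ LANGUAGE plpgsql;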
    
  • 2020-12-30 10:40

    On PostgreSQL 9.3, starting from the answer from Denis:

        select bundles.bundle_id
        from bundles
        where exists (
          select 1 from audit_records
          where audit_records.bundle_id = bundles.bundle_id
          );
    

    Just by adding a 'limit 1' to the subquery, I got a 60x speedup (for my use case, with 8 million records, a composite index, and 10k combinations), going from 1800 ms to 30 ms:

        select bundles.bundle_id
        from bundles
        where exists (
          select 1 from audit_records
          where audit_records.bundle_id = bundles.bundle_id limit 1
          );
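
    Note that the speedup relies on an index whose leading column is audit_records.bundle_id; without it, each EXISTS probe has to scan audit_records. A sketch (the index name is illustrative):

        CREATE INDEX IF NOT EXISTS audit_records_bundle_id_idx
            ON audit_records (bundle_id);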
    
  • 2020-12-30 10:41
    BEGIN; 
    CREATE TABLE dist ( x INTEGER NOT NULL ); 
    INSERT INTO dist SELECT random()*50 FROM generate_series( 1, 5000000 ); 
    COMMIT;
    CREATE INDEX dist_x ON dist(x);
    
    
    VACUUM ANALYZE dist;
    EXPLAIN ANALYZE SELECT DISTINCT x FROM dist;
    
    HashAggregate  (cost=84624.00..84624.51 rows=51 width=4) (actual time=1840.141..1840.153 rows=51 loops=1)
       ->  Seq Scan on dist  (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.003..573.819 rows=5000000 loops=1)
     Total runtime: 1848.060 ms
    

    PostgreSQL can't (yet) use an index for DISTINCT by skipping over identical values, but you can do this:

    CREATE OR REPLACE FUNCTION distinct_skip_foo()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql STABLE 
    AS $$
    DECLARE
        _x  INTEGER;
    BEGIN
        _x := min(x) FROM dist;
        WHILE _x IS NOT NULL LOOP
            RETURN NEXT _x;
            _x := min(x) FROM dist WHERE x > _x;
        END LOOP;
    END;
    $$ ;
    
    EXPLAIN ANALYZE SELECT * FROM distinct_skip_foo();
    Function Scan on distinct_skip_foo  (cost=0.00..260.00 rows=1000 width=4) (actual time=1.629..1.635 rows=51 loops=1)
     Total runtime: 1.652 ms
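
    The same skip-scan can also be written without plpgsql, as a recursive CTE against the dist table above (a sketch; requires PostgreSQL 8.4+):

    WITH RECURSIVE t AS (
       (SELECT x FROM dist ORDER BY x LIMIT 1)
       UNION ALL
       SELECT (SELECT x FROM dist WHERE x > t.x ORDER BY x LIMIT 1)
       FROM t WHERE t.x IS NOT NULL
    )
    SELECT x FROM t WHERE x IS NOT NULL;

    Each iteration jumps to the next distinct value via the index, just like the function's min(x) loop.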
    
  • 2020-12-30 10:54

    You're selecting distinct values from the whole table, which automatically leads to a seq scan. You have millions of rows, so it'll necessarily be slow.

    There's a trick to get the distinct values faster, but it only works when the data has a known (and reasonably small) set of possible values. For instance, I take it that your bundle_id references some kind of bundles table, which is smaller. That means you can write:

    select bundles.bundle_id
    from bundles
    where exists (
          select 1 from audit_records
          where audit_records.bundle_id = bundles.bundle_id
          );
    

    This should lead to a nested loop / seq scan on bundles -> index scan on audit_records using the index on bundle_id.
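
    If no such bundles lookup table exists yet, you can derive one once and keep it maintained, paying the seq scan a single time instead of on every query. A hypothetical sketch (table and column names are assumptions):

    CREATE TABLE bundles AS
       SELECT DISTINCT bundle_id FROM audit_records;
    ALTER TABLE bundles ADD PRIMARY KEY (bundle_id);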
