Optimize groupwise maximum query

温柔的废话 2020-11-30 14:21
select * 
from records 
where id in ( select max(id) from records group by option_id )

This query works fine even on millions of rows. However as y

4 answers
  •  醉梦人生
    2020-11-30 14:31

    This assumes relatively few rows in options compared to many rows in records.

    Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest creating one to enforce referential integrity:

    CREATE TABLE options (
      option_id int  PRIMARY KEY
    , option    text UNIQUE NOT NULL
    );
    
    INSERT INTO options
    SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
    FROM   records;
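
    Once options is populated, the foreign key constraint mentioned above could be added roughly like this. A sketch only: the constraint name is an arbitrary choice for illustration, and the statement assumes every records.option_id now has a match in options:

    ALTER TABLE records
      ADD CONSTRAINT records_option_id_fkey   -- hypothetical constraint name
      FOREIGN KEY (option_id) REFERENCES options (option_id);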
    

    Then there is no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).

    SELECT option_id, (SELECT max(id)
                       FROM   records
                       WHERE  option_id = o.option_id) AS max_id
    FROM   options o
    ORDER  BY 1;
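
    If no index on (option_id, id) exists yet, a plain one could be created like this (the index name is just an assumption for illustration):

    CREATE INDEX records_option_id_id_idx   -- hypothetical index name
    ON records (option_id, id);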
    

    This includes options with no match in table records. You get NULL for max_id and you can easily remove such rows in an outer SELECT if needed.
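
    A minimal sketch of such an outer SELECT, assuming the options and records tables from above; the alias sub is only illustrative:

    SELECT *
    FROM  (
       SELECT option_id, (SELECT max(id)
                          FROM   records
                          WHERE  option_id = o.option_id) AS max_id
       FROM   options o
       ) sub
    WHERE  max_id IS NOT NULL   -- drop options without any row in records
    ORDER  BY 1;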

    Or (same result):

    SELECT option_id, (SELECT id
                       FROM   records
                       WHERE  option_id = o.option_id
                       ORDER  BY id DESC NULLS LAST
                       LIMIT  1) AS max_id
    FROM   options o
    ORDER  BY 1;
    

    This may be slightly faster. The subquery uses the sort order DESC NULLS LAST, which matches the behavior of the aggregate function max(): it ignores NULL values. Sorting with just DESC would put NULL first:

    • Why do NULL values come first when ordering DESC in a PostgreSQL query?

    The perfect index for this:

    CREATE INDEX ON records (option_id, id DESC NULLS LAST);
    

    Index sort order doesn't matter much as long as the columns are defined NOT NULL.

    There can still be a sequential scan on the small table options, because that is simply the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
    The big table records is only accessed via a (bitmap) index scan or, if possible, an index-only scan.
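
    To check which scans the planner actually picks on your own data, you could inspect the query plan. A sketch using the first query from above; the plan depends on table statistics and on how current the visibility map is (which index-only scans rely on), so no output is shown here:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT option_id, (SELECT max(id)
                       FROM   records
                       WHERE  option_id = o.option_id) AS max_id
    FROM   options o
    ORDER  BY 1;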

    db<>fiddle here - showing two index-only scans for the simple case
    Old sqlfiddle

    Or use LATERAL joins for a similar effect in Postgres 9.3+:

    • Optimize GROUP BY query to retrieve latest row per user
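
    A sketch of what the LATERAL variant might look like, assuming the same options and records tables; the alias r is illustrative:

    SELECT o.option_id, r.id AS max_id
    FROM   options o
    LEFT   JOIN LATERAL (
       SELECT id
       FROM   records
       WHERE  option_id = o.option_id
       ORDER  BY id DESC NULLS LAST
       LIMIT  1
       ) r ON true
    ORDER  BY 1;

    LEFT JOIN ... ON true keeps options without a match in records (max_id is NULL); CROSS JOIN LATERAL would drop them instead. Unlike a correlated subquery in the SELECT list, the LATERAL subquery can return more than one column, so replacing SELECT id with SELECT * should retrieve the whole latest row per option.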
