Query last N related rows per row

长情又很酷 2020-12-10 21:31

I have the following query which fetches the id of the latest N observations for each station:

SELECT id
FROM (
  SELE         


        
2 Answers
  •  时光取名叫无心
    2020-12-10 22:24

    Assuming at least Postgres 9.3.

    Index

    First, a multicolumn index will help:

    CREATE INDEX observations_special_idx
    ON observations(station_id, created_at DESC, id)
    

    created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.

    Assuming created_at is defined NOT NULL, else consider DESC NULLS LAST in index and query:

    • PostgreSQL sort by datetime asc, null first?
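
    If created_at can be NULL, a minimal sketch of the matching index and query might look like this. The index name, the station value 1 and the limit 10 are made up for illustration:

    ```sql
    -- Hypothetical index name; same columns as the index above,
    -- but with NULLs sorted last in the descending order.
    CREATE INDEX observations_nulls_last_idx
    ON observations (station_id, created_at DESC NULLS LAST, id);

    -- The query has to repeat the same sort order to match the index.
    SELECT o.id
    FROM   observations o
    WHERE  o.station_id = 1                  -- example station (assumption)
    ORDER  BY o.created_at DESC NULLS LAST
    LIMIT  10;                               -- example limit (assumption)
    ```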

    The last column id is only useful if you get an index-only scan out of it, which probably won't work if you add lots of new rows constantly. In this case, remove id from the index.
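
    One way to check whether you actually get index-only scans is to look at the plan for a single station. This is only a sketch: the station value 1, the limit 10 and the second index name are assumptions.

    ```sql
    -- Inspect the plan for one station.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.id
    FROM   observations o
    WHERE  o.station_id = 1                  -- example station (assumption)
    ORDER  BY o.created_at DESC
    LIMIT  10;                               -- example limit (assumption)
    -- Look for "Index Only Scan" and a low "Heap Fetches" count in the output.

    -- If heap fetches stay high, a leaner index without id may serve better
    -- (hypothetical index name):
    CREATE INDEX observations_station_created_idx
    ON observations (station_id, created_at DESC);
    ```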

    Simpler query (still slow)

    Simplify your query, the inner subselect doesn't help:

    SELECT id
    FROM  (
      SELECT station_id, id, created_at
           , row_number() OVER (PARTITION BY station_id
                                ORDER BY created_at DESC) AS rn
      FROM   observations
      ) s
    WHERE  rn <= #{n}  -- your limit here
    ORDER  BY station_id, created_at DESC;
    

    Should be a bit faster, but still slow.

    Fast query

    • Assuming you have relatively few stations and relatively many observations per station.
    • Also assuming station_id is defined as NOT NULL.

    To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres, yet). Related answer:

    • Optimize GROUP BY query to retrieve latest record per user

    If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):

    SELECT o.id
    FROM   stations s
    CROSS  JOIN LATERAL (
       SELECT o.id, o.created_at  -- created_at is needed for the outer ORDER BY
       FROM   observations o
       WHERE  o.station_id = s.station_id  -- lateral reference
       ORDER  BY o.created_at DESC
       LIMIT  #{n}  -- your limit here
       ) o
    ORDER  BY s.station_id, o.created_at DESC;
    

    If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
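
    A rough sketch of creating such a table from existing data could look like the following. The integer type for station_id and the constraint name are assumptions; adjust them to your schema.

    ```sql
    -- Column type is assumed; use whatever type observations.station_id has.
    CREATE TABLE stations (
       station_id int PRIMARY KEY
    );

    -- One-time fill from the existing observations.
    INSERT INTO stations (station_id)
    SELECT DISTINCT station_id
    FROM   observations
    WHERE  station_id IS NOT NULL;

    -- Optional: enforce referential integrity (hypothetical constraint name).
    ALTER TABLE observations
       ADD CONSTRAINT observations_station_id_fkey
       FOREIGN KEY (station_id) REFERENCES stations (station_id);
    ```

    With the foreign key in place, new stations have to be inserted into stations before their first observation arrives.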

    If that's not an option, you can distill such a table on the fly. Simple options would be:

    SELECT DISTINCT station_id FROM observations;
    SELECT station_id FROM observations GROUP BY 1;

    But either would need a sequential scan and be slow. Make Postgres use the above index (or any btree index with station_id as leading column) with a recursive CTE:

    WITH RECURSIVE stations AS (
       (                  -- extra pair of parentheses ...
       SELECT station_id
       FROM   observations
       ORDER  BY station_id
       LIMIT  1
       )                  -- ... is required!
       UNION ALL
       SELECT (SELECT o.station_id
               FROM   observations o
               WHERE  o.station_id > s.station_id
               ORDER  BY o.station_id
               LIMIT  1)
       FROM   stations s
       WHERE  s.station_id IS NOT NULL  -- serves as break condition
       )
    SELECT station_id
    FROM   stations
    WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL
    

    Use that as drop-in replacement for the stations table in the above simple query:

    WITH RECURSIVE stations AS (
       (
       SELECT station_id
       FROM   observations
       ORDER  BY station_id
       LIMIT  1
       )
       UNION ALL
       SELECT (SELECT o.station_id
               FROM   observations o
               WHERE  o.station_id > s.station_id
               ORDER  BY o.station_id
               LIMIT  1)
       FROM   stations s
       WHERE  s.station_id IS NOT NULL
       )
    SELECT o.id
    FROM   stations s
    CROSS  JOIN LATERAL (
       SELECT o.id, o.created_at
       FROM   observations o
       WHERE  o.station_id = s.station_id
       ORDER  BY o.created_at DESC
       LIMIT  #{n}  -- your limit here
       ) o
    WHERE  s.station_id IS NOT NULL
    ORDER  BY s.station_id, o.created_at DESC;
    

    This should still be faster than what you had by orders of magnitude.

    SQL Fiddle here (9.6)
    db<>fiddle here
