Proper way to access latest row for each individual identifier?

轻奢々 2021-01-03 11:14

I have a table core_message in Postgres, with millions of rows that looks like this (simplified):

┌────────────────┬──         


        
5 answers
  •  轮回少年
    2021-01-03 11:54

    You have put existing answers to good use and come up with great solutions in your own answer. Some missing pieces:

    I'm still trying to understand how to properly use his first RECURSIVE solution ...

    You used this query to create the test_boats table with unique mmsi:

    select distinct on (mmsi) mmsi from core_message
    

    For many rows per boat (mmsi), use this faster RECURSIVE solution instead:

    WITH RECURSIVE cte AS (
       (
       SELECT mmsi
       FROM   core_message
       ORDER  BY mmsi
       LIMIT  1
       )
       UNION ALL
       SELECT m.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT mmsi
          FROM   core_message
          WHERE  mmsi > c.mmsi
          ORDER  BY mmsi
          LIMIT  1
          ) m
       )
    TABLE cte;
    

    This hardly gets any slower with more rows per boat, as opposed to DISTINCT ON, which is typically faster with only a few rows per boat. Each needs only an index with mmsi as leading column to be fast.
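
    For comparison, a minimal sketch of the DISTINCT ON variant that returns the whole latest row per boat. It assumes "time" is NOT NULL (otherwise NULLs would sort first in descending order) and orders both columns descending so the existing index on (mmsi, "time") can be read backwards:

    ```sql
    -- Sketch: latest row per boat with DISTINCT ON.
    -- Assumes "time" is NOT NULL; add NULLS LAST handling otherwise.
    SELECT DISTINCT ON (mmsi) *
    FROM   core_message
    ORDER  BY mmsi DESC, "time" DESC;
    ```

    With only a few rows per boat this is a reasonable choice. With many rows per boat it still has to read every row of every boat, which is what the recursive query avoids.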

    If possible, create that boats table and add a FK constraint to it. (That means you have to maintain it.) Then you can go on using the optimal LATERAL query you have in your answer and never miss any boats. (Orphaned boats may be worth tracking / removing in the long run.)
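
    A rough sketch of that setup. The table name boats, the constraint name, and the integer type of mmsi are assumptions for illustration; adjust them to your actual schema. The LATERAL query at the end is one common way to write it, not necessarily identical to the one in your answer:

    ```sql
    -- Hypothetical lookup table: one row per boat.
    -- The integer type is an assumption; use the type of core_message.mmsi.
    CREATE TABLE boats (
       mmsi integer PRIMARY KEY
    );

    -- Seed it once from existing messages.
    -- (The recursive query above can replace DISTINCT for speed.)
    INSERT INTO boats (mmsi)
    SELECT DISTINCT mmsi
    FROM   core_message;

    -- Illustrative constraint name. From now on every message must
    -- reference an existing boat, so boats has to be maintained.
    ALTER TABLE core_message
       ADD CONSTRAINT core_message_mmsi_fkey
       FOREIGN KEY (mmsi) REFERENCES boats (mmsi);

    -- Latest message per boat: one index lookup per row in boats.
    SELECT m.*
    FROM   boats b
    CROSS  JOIN LATERAL (
       SELECT cm.*
       FROM   core_message cm
       WHERE  cm.mmsi = b.mmsi
       ORDER  BY cm."time" DESC
       LIMIT  1
       ) m;
    ```

    CROSS JOIN LATERAL drops boats without any messages. Use LEFT JOIN LATERAL ... ON true if you want to keep them in the result.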

    Else, another iteration of that RECURSIVE query is the next best thing to get whole rows for the latest position of each boat quickly:

    WITH RECURSIVE cte AS (
       (
       SELECT *
       FROM   core_message
       ORDER  BY mmsi DESC, time DESC  -- see below
       LIMIT  1
       )
       UNION ALL
       SELECT m.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT *
          FROM   core_message
          WHERE  mmsi < c.mmsi
          ORDER  BY mmsi DESC, time DESC
          LIMIT  1
          ) m
       )
    TABLE cte;
    

    You have both of these indexes:

    "core_message_uniq_mmsi_time" UNIQUE CONSTRAINT, btree (mmsi, "time")
    "core_messag_mmsi_b36d69_idx" btree (mmsi, "time" DESC)
    

    A UNIQUE constraint is implemented with all columns in default ASC sort order. That cannot be changed. If you don't actually need the constraint, you might replace it with a UNIQUE index, mostly achieving the same. But there you can add any sort order you like. Related:

    • How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
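
    If you did want a different sort order, the swap could look roughly like this. It is a sketch only: it assumes nothing else (such as a foreign key) depends on the constraint, and on a table with millions of rows the index build will take a while and block writes while it runs:

    ```sql
    -- Sketch: replace the UNIQUE constraint with a UNIQUE index
    -- that carries an explicit sort order.
    BEGIN;

    ALTER TABLE core_message
       DROP CONSTRAINT core_message_uniq_mmsi_time;

    CREATE UNIQUE INDEX core_message_uniq_mmsi_time
       ON core_message (mmsi, "time" DESC);

    COMMIT;
    ```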

    But there is no need for the use case at hand. Postgres can scan a b-tree index backwards at practically the same speed. And I see nothing here that would require inverted sort order for the two columns. The additional index core_messag_mmsi_b36d69_idx is expensive dead freight - unless you have other use cases that actually need it. See:

    • Optimizing queries on a range of timestamps (two columns)
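
    One way to check this on your own data, as a sketch: look at the plan for the first step of the recursive query, then drop the extra index only once you are sure nothing else uses it.

    ```sql
    -- Inspect the plan; a backward index scan on
    -- core_message_uniq_mmsi_time is what we hope to see here.
    EXPLAIN
    SELECT *
    FROM   core_message
    ORDER  BY mmsi DESC, "time" DESC
    LIMIT  1;

    -- Only if no other query depends on it:
    DROP INDEX core_messag_mmsi_b36d69_idx;
    ```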

    To best use the index core_message_uniq_mmsi_time from the UNIQUE constraint I step through both columns in descending order. That matters.
