Proper way to access latest row for each individual identifier?

Asked by 轻奢々 on 2021-01-03 11:14

I have a table core_message in Postgres, with millions of rows, that looks like this (simplified):

(table layout truncated in the original; the columns relevant here are mmsi, the vessel identifier, and time)
5 Answers

    陌清茗 (original poster) · 2021-01-03 12:11

    This answer goes in the same direction as the DISTINCT ON answer here, but it also mentions this:

    For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
    For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

    • Optimize GROUP BY query to retrieve latest row per user (a sketch of that technique follows below)
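
    The bullet above is only a link, so here is a minimal sketch of that emulated loose index scan, adapted to this table. It assumes an index on core_message (mmsi, time DESC) and is not the exact query from the linked answer:

    WITH RECURSIVE cte AS (
       (  -- anchor: smallest mmsi with its latest time
       SELECT mmsi, time
       FROM   core_message
       ORDER  BY mmsi, time DESC
       LIMIT  1
       )
       UNION ALL
       SELECT m.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT mmsi, time        -- skip ahead to the next identifier
          FROM   core_message
          WHERE  mmsi > c.mmsi
          ORDER  BY mmsi, time DESC
          LIMIT  1
          ) m
       )
    SELECT * FROM cte;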

    Using this other great answer, I found a way to get that kind of speed by combining a separate table of unique identifiers with LATERAL. With a new table test_boats, I can do something like this:

     CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);
    

    Creating this table takes 40+ seconds, which is pretty similar to the time taken by the other answer here.
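
    A small optional step, not in the original: since test_boats holds exactly one row per identifier, it can carry a primary key, which enforces that uniqueness and gives the planner accurate information:

    -- optional: mmsi is unique by construction, so it can be the primary key
    ALTER TABLE test_boats ADD PRIMARY KEY (mmsi);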

    Then, with the help of LATERAL:

    SELECT a.mmsi, b.time
    FROM test_boats a
    CROSS JOIN LATERAL (
        SELECT b.time
        FROM core_message b
        WHERE a.mmsi = b.mmsi      -- one probe per identifier
        ORDER BY b.time DESC       -- latest row first
        LIMIT 1
    ) b
    LIMIT 10;                      -- sample of 10 identifiers for the timing test
    

    This is blazingly fast: around 1 millisecond.
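
    For the lateral probes to be that fast, core_message needs a matching B-tree index so each probe is a single index descent. The index is not shown in the question, so this one is an assumption:

    -- assumed index (not shown in the original); its order matches the
    -- ORDER BY in the lateral subquery
    CREATE INDEX core_message_mmsi_time_idx ON core_message (mmsi, time DESC);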

    This requires modifying my program's logic and using a slightly more complex query, but I think I can live with that.

    For a fast solution without the need to create a new table, check out @ErwinBrandstetter's answer below.


    UPDATE: I feel this question is not quite answered yet, as it is not clear why the other proposed solutions perform poorly here.

    I tried the benchmark mentioned here. At first, the DISTINCT ON way seems fast enough if you run a request like the one proposed in the benchmark: +/- 30 ms on my computer. But that is only because the request uses an index-only scan. If you include a field that is not in the index (some_column in the case of the benchmark), the performance drops to +/- 100 ms.
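
    A quick way to check which plan a variant gets is EXPLAIN. This is a sketch, assuming the benchmark table is named purchases with the same columns as purchases_more below:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT ON (customer_id) id, customer_id, total
    FROM purchases
    ORDER BY customer_id, total DESC, id;
    -- "Index Only Scan" in the plan means every selected column is covered
    -- by the index; adding some_column to the SELECT list forces heap access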

    That is not a dramatic drop in performance yet, which is why we need a benchmark with a bigger data set, something similar to my case: 40K customers and 8M rows.
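
    As an illustration only (not the actual setup script), such a table could be generated along these lines; the column names and the index are inferred from the queries below:

    CREATE TABLE purchases_more AS
    SELECT g                                    AS id
         , (random() * 40000)::int              AS customer_id  -- ~40K customers
         , round((random() * 1000)::numeric, 2) AS total
         , md5(g::text)                         AS some_column  -- not in the index
    FROM generate_series(1, 8000000) g;

    -- the index implied by the index-only scan on (id, customer_id, total)
    CREATE INDEX ON purchases_more (customer_id, total DESC, id);
    VACUUM ANALYZE purchases_more;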

    Let's try DISTINCT ON again with this new table:

    SELECT DISTINCT ON (customer_id) id, customer_id, total 
    FROM purchases_more 
    ORDER BY customer_id, total DESC, id;
    

    This takes about 1.5 seconds to complete.

    SELECT DISTINCT ON (customer_id) *
    FROM purchases_more 
    ORDER BY customer_id, total DESC, id;
    

    This takes about 35 seconds to complete: selecting all columns defeats the index-only scan, so Postgres has to read the full rows from the heap before sorting them.

    Now, to come back to my first solution above: it uses an index-only scan and a LIMIT, which is part of why it is extremely fast. If I recraft that query to avoid the index-only scan and drop the LIMIT:

    SELECT b.*
    FROM test_boats a
    CROSS JOIN LATERAL (
        SELECT b.*                 -- all columns: rules out an index-only scan
        FROM core_message b
        WHERE a.mmsi = b.mmsi
        ORDER BY b.time DESC
        LIMIT 1
    ) b;
    

    This takes about 500 ms, which is still pretty fast.

    For a more in-depth benchmark of sorts, see my other answer below.
