I have a table core_message in Postgres, with millions of rows that looks like this (simplified):
┌────────────────┬──
This answer seems to go in the way of the DISTINCT ON answer here, however it also mentions this :
For many rows per customer (low cardinality in column
customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
- Optimize GROUP BY query to retrieve latest row per user
Using this other great answer, I find a way to keep the same performance as a distinct table with the use of LATERAL.
By using a new table test_boats I can do something like this :
CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);
This table creation take 40+ seconds which is pretty similar to the time taken by the other answer here.
Then, with the help of LATERAL :
SELECT a.mmsi, b.time
FROM test_boats a
CROSS JOIN LATERAL(
SELECT b.time
FROM core_message b
WHERE a.mmsi = b.mmsi
ORDER BY b.time DESC
LIMIT 1
) b LIMIT 10;
This is blazingly fast, 1+ millisecond.
This will need the modification of my program's logic and the use of a query a bit more complex but I think I can live with that.
For a fast solution without the need to create a new table, check out the answer of @ErwinBrandstetter below
UPDATE: I feel this question is not quite answered yet, as it's not very clear why the other solutions proposed perform poorly here.
I tried the benchmark mentionned here. At first, it would seem that the DISTINCT ON way is fast enough if you do a request like the one proposed in the benchmark : +/- 30ms on my computer.
But this is because that request uses index only scan. If you include a field that is not in the index, some_column in the case of the benchmark, the performance will drop to +/- 100ms.
Not a dramatic drop in performance yet. That is why we need a benchmark with a bigger data set. Something similar to my case : 40K customers and 8M rows. Here
Let's try again the DISTINCT ON with this new table:
SELECT DISTINCT ON (customer_id) id, customer_id, total
FROM purchases_more
ORDER BY customer_id, total DESC, id;
This takes about 1.5 seconds to complete.
SELECT DISTINCT ON (customer_id) *
FROM purchases_more
ORDER BY customer_id, total DESC, id;
This takes about 35 seconds to complete.
Now, to come back to my first solution above. It is using an index only scan and a LIMIT, that's one of the reason why it is extremely fast. If I recraft that query to not use index-only scan and dump the limit :
SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL(
SELECT b.*
FROM core_message b
WHERE a.mmsi = b.mmsi
ORDER BY b.time DESC
LIMIT 1
) b;
This will take about 500ms, which is still pretty fast.
For a more in-depth benchmark of sort, see my other answer below.