How do I efficiently query versioned rows/entities in PostgreSQL?

Question

Background

I have a situation where I store all versions of a given entity in my PostgreSQL database. This is implemented with two tables: one stores the primary key and immutable properties of the entity, and the other stores its mutable properties. Both tables are insert-only (enforced by a trigger).

Example

The concept can easily be illustrated with an entity User, stored in the user and user_details tables:

Table user:

id  timestamp
1   2018-04-10T12:00:00
2   2018-04-10T12:00:00

Table user_details:

id  user_id   username  first_name   last_name     timestamp
1   1         bob       Bob          Socks         2018-04-10T12:00:01
2   1         bob       Bobby        Socks         2018-04-10T12:00:02
3   2         alice     Alice        Jones         2018-04-10T12:00:03
4   1         bob       Bobbers      Socks         2018-04-10T12:00:04
5   2         alice     Alicia       Jones         2018-04-10T12:00:05

Both 'id' columns are defined as serial primary keys (strictly incrementing) and I have created an index on user_details (user_id, id DESC).
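
For reference, here is a minimal sketch of the schema described above. The column types, the NOT NULL constraints, the foreign key and the index name are assumptions for illustration; only the column names, the serial primary keys and the (user_id, id DESC) index come from the description.

```sql
-- Sketch only: types, constraints and the index name are assumed.
-- "user" is a reserved word in PostgreSQL, so it has to be quoted.
CREATE TABLE "user" (
    id          serial PRIMARY KEY,
    "timestamp" timestamp NOT NULL
);

CREATE TABLE user_details (
    id          serial PRIMARY KEY,
    user_id     integer NOT NULL REFERENCES "user" (id),
    username    text NOT NULL,
    first_name  text,
    last_name   text,
    "timestamp" timestamp NOT NULL
);

-- The index mentioned above: newest version first within each user.
CREATE INDEX user_details_user_id_id_desc_idx
    ON user_details (user_id, id DESC);
```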

1 - How do I efficiently query the most recent version of an entity?

Given a user id I need a quick way to fetch the immutable data in user and the most recent entry from user_details. What kind of query would be best suited for this join?

2 - How do I efficiently query versions n and n-1 of an entity?

I generate audit logs for a time interval by first fetching all rows with a timestamp between X and Y. For each inserted row I then fetch its predecessor (same user_id, closest lower id) and produce a diff of the two. The number of rows inserted between X and Y is often high, so I need to fetch the current + previous pairs efficiently. For example, given input user_details(5), I need to select the join of user(2) + user_details(5) and user(2) + user_details(3). What kind of query would be best suited for this join?

Futile attempts

My best results so far have been with these queries:

Query for question 1:

SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
              FROM "user_details" ud
              WHERE u.id = ud.user_id
              ORDER BY id DESC
              LIMIT 1
       ) detail ON TRUE
WHERE u.id IN
      (...);

Query for question 2:

SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
              FROM "user_details" ud
              WHERE u.id = ud.user_id
              AND ud.id IN (...)
              ORDER BY id DESC
              LIMIT 2) ud ON TRUE;

However, both queries end up using nested loops (seen from EXPLAIN ANALYZE) and take a long time to finish when run with a large number of ids (5000+).

Ideas

Can I use the user_details (user_id, id DESC) index in a smart way to first create a CTE of the user_details ids I need and then join user + user_details based on this? Can I create a functional index of some sort? Do I need to maintain a predecessor column in user_details (or another table) to be able to look up relations of this type efficiently?

Thanks!

SQL Fiddle: http://www.sqlfiddle.com/#!17/5f5f0

Solution

Thanks to X and Y for pushing me in the right direction! I ended up using the solution @MichelMilezzi suggested for my first problem and an adaptation of @RadimBača's solution for my second problem:

WITH
cte_1 AS (SELECT id, user_id FROM "user_details" WHERE id IN (8999, 9999)),
cte_2 AS (SELECT cte_1.id, cte_1.user_id, prev.id AS prev_id,
                 row_number() OVER (PARTITION BY cte_1.id, cte_1.user_id ORDER BY prev.id DESC) AS rownum
          FROM "user_details" prev, cte_1
          WHERE prev.user_id = cte_1.user_id AND prev.id < cte_1.id)
SELECT main.*, detail.*, cte_2.id AS __id, (detail.id <> cte_2.id) AS __is_predecessor
FROM "user" main, "user_details" detail, cte_2
WHERE main.id = cte_2.user_id AND cte_2.rownum = 1
  AND (detail.id = cte_2.id OR detail.id = cte_2.prev_id);

Answer 1:

Consider using window functions

SELECT *
FROM "user" u
JOIN
(
    SELECT row_number() OVER (PARTITION BY user_id ORDER BY id DESC) AS rn,
           *
    FROM "user_details" ud
) t ON t.user_id = u.id
WHERE t.rn = 1

This approach also lets you fetch the first N rows per group (t.rn <= N) or only the N-th row per group (t.rn = N).
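
As a rough illustration of the "N rows per group" case, the sketch below returns the two newest versions of every user. It assumes the row number is assigned newest-first (ORDER BY id DESC); the column aliases user_id, detail_id and version_rank are made up for readability and are not part of the original schema.

```sql
-- Sketch only: the two most recent user_details rows per user.
SELECT u.id          AS user_id,
       t.id          AS detail_id,
       t.first_name,
       t.last_name,
       t.rn          AS version_rank   -- 1 = newest, 2 = its predecessor
FROM "user" u
JOIN (
    SELECT ud.*,
           row_number() OVER (PARTITION BY ud.user_id
                              ORDER BY ud.id DESC) AS rn
    FROM "user_details" ud
) t ON t.user_id = u.id
WHERE t.rn <= 2
ORDER BY u.id, t.rn;
```

Because the window is computed over every row of user_details, this is likely to be slow on a large table unless the inner query is first restricted to the users of interest.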

Answer 2:

You could use DISTINCT ON to retrieve the most recent version of each user as follows:

SELECT 
    DISTINCT ON (u.id) 
    *
FROM
    "user" u
    JOIN user_details d ON (u.id = d.user_id)
WHERE
    d.id IN (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000) 
ORDER BY 
    u.id,
    d.id DESC

A quote from the docs:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

To get an older version you can use a window function, as pointed out by @Radim.
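
One possible way to do that, sketched here with lag() rather than row_number(), is to attach the predecessor's id to each version and filter afterwards. The ids (4, 5) are taken from the sample data in the question; the CTE name versions and the aliases current_id and prev_id are assumptions for illustration.

```sql
-- Sketch only: pair each requested user_details row with its predecessor.
WITH versions AS (
    SELECT ud.id,
           ud.user_id,
           -- id of the previous version of the same user, NULL for the first one
           lag(ud.id) OVER (PARTITION BY ud.user_id ORDER BY ud.id) AS prev_id
    FROM "user_details" ud
    -- limit the window to the affected users so the whole table is not ranked
    WHERE ud.user_id IN (SELECT user_id
                         FROM "user_details"
                         WHERE id IN (4, 5))
)
SELECT v.user_id,
       v.id AS current_id,
       v.prev_id
FROM versions v
WHERE v.id IN (4, 5);   -- filter only after lag() has seen all versions
```

The filter on v.id has to sit outside the CTE: if it were applied before lag() ran, the predecessor rows would already be gone and prev_id would always be NULL.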

Source: https://stackoverflow.com/questions/49775834/how-do-i-efficiently-query-versioned-rows-entities-in-postgresql

Tags

sql

postgresql

greatest-n-per-group

audit