问题
Background
I have situation where I store all versions of a given entity in my PostgreSQL database. This is implemented with two tables; one table storing the primary key and immutable properties of the entity and a second table storing the mutable properties of the entity. Both tables are insert-only (enforced by a trigger).
Example
The concept can easily be illustrated with an entity User
, stored in the user
and user_details
tables:
Table user
:
id timestamp
1 2018-04-10T12:00:00
2 2018-04-10T12:00:00
Table user_details
:
id user_id username first_name last_name timestamp
1 1 bob Bob Socks 2018-04-10T12:00:01
2 1 bob Bobby Socks 2018-04-10T12:00:02
3 2 alice Alice Jones 2018-04-10T12:00:03
4 1 bob Bobbers Socks 2018-04-10T12:00:04
5 2 alice Alicia Jones 2018-04-10T12:00:05
Both 'id' columns are defined as serial primary keys (strictly incrementing) and I have created an index on user_details (user_id, id DESC)
.
1 - How do I efficiently query the most recent version of an entity?
Given a user id I need a quick way to fetch the immutable data in user
and the most recent entry from user_details
. What kind of query would be best suited for this join?
2 - How do I efficiently query versions n and n-1 of an entity?
I am generating audit logs for time intervals by first fetching all rows with timestamp
between X and Y and then I fetch the inserted row and its predecessor (same user_id
, closest lower id
) and produce a diff from these. The number of rows inserted between X and Y is often high, so I need to efficiently fetch the current + previous pairs, i.e. given input user_details(5)
, I need to select the join of user(2) + user_details(5)
and user(2) + user_details(3)
. What kind of query would be best suited for this join?
Futile attempts
My best results so far has been with these queries:
Query for question 1:
SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
FROM "user_details" ud
WHERE u.id = ud.user_id
ORDER BY id DESC
LIMIT 1
) detail ON TRUE
WHERE u.id IN
(...);
Query for question 2:
SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
FROM "user_details" ud
WHERE u.id = ud.user_id
AND ud.id IN (...)
ORDER BY id DESC
LIMIT 2) ud ON TRUE;
However, both queries end up using nested loops (seen from EXPLAIN ANALYZE
) and take a long time to finish when run with a large number of ids (5000+).
Ideas
Can I use the user_details (user_id, id DESC)
index in a smart way to first create a CTE of the user_details
ids I need and then join user + user_details
based on this? Can I create a functional index of some sort? Do I need to maintain a predecessor
column in user_details
(or another table) do be able to look up relations of this type efficiently?
Thanks!
SQL Fiddle: http://www.sqlfiddle.com/#!17/5f5f0
Solution
Thanks to X and Y for pushing me in the right direction! I ended up using the solution @MichelMilezzi suggested for my first problem and an adaption of the @RadimBača solution for my second problem:
WITH
cte_1 AS (SELECT id, user_id FROM "user_details" WHERE id IN (8999, 9999)),
cte_2 as (SELECT cte_1.id, cte_1.user_id, prev.id AS prev_id, row_number() OVER (PARTITION BY cte_1.id, cte_1.user_id ORDER BY prev.id DESC) AS rownum FROM "user_details" prev, cte_1 WHERE prev.user_id = cte_1.user_id AND prev.id < cte_1.id)
SELECT main.*, detail.*, cte_2.id AS __id, (detail.id <> cte_2.id) AS __is_predecessor FROM "user" main, "user_details" detail, cte_2
WHERE main.id = cte_2.user_id AND cte_2.rownum = 1 AND (detail.id = cte_2.id OR detail.id = cte_2.prev_id);
回答1:
Consider using window functions
SELECT *
FROM "user" u
JOIN
(
SELECT row_number() over(partition by user_id order by id) rn,
*
FROM "user_details" ud
) t ON t.user_id = u.id
WHERE t.rn = 1
DEMO
This solution allows you to query also all N rows per group or N-th row per group.
回答2:
You could use DISTINCT ON
to retrieve the most recent version of user as follows:
SELECT
DISTINCT ON (u.id)
*
FROM
"user" u
JOIN user_details d ON (u.id = d.user_id)
WHERE
d.id IN (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
ORDER BY
u.id,
d.id DESC
A quote from the docs:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
Sql fiddle here.
To get an older version you can use a window function
as pointed by @Radim.
来源:https://stackoverflow.com/questions/49775834/how-do-i-efficiently-query-versioned-rows-entities-in-postgresql