Question
I need to capture the changes between two datasets based on one or more keys: a historical version and a current version of the same dataset (both share the same schema). For example, given the input tables:
-- Table t_hist
-------
id col
-------
1 A
2 B
3 C
4 D
-- Table t_curr
-------
id col
-------
1 a
2 B
4 d
5 E
Expected result (using id as the comparison key):
-- Table t_change
----------------
id col change
----------------
1 a modified
2 B same
3 C deleted
4 d modified
5 E inserted
A naive approach that I could think of is:
SELECT id, col, change FROM
(
    SELECT t_curr.id, t_curr.col,
           CASE t_curr.col = t_hist.col
               WHEN true THEN 'same'
               ELSE 'modified'
           END as change
    FROM t_curr INNER JOIN t_hist ON t_curr.id = t_hist.id
)
UNION ALL
(
    SELECT t_curr.id, t_curr.col, 'inserted' as change
    FROM t_curr WHERE id NOT IN (SELECT id FROM t_hist)
)
UNION ALL
(
    SELECT t_hist.id, t_hist.col, 'deleted' as change
    FROM t_hist WHERE id NOT IN (SELECT id FROM t_curr)
)
But this approach involves multiple table scans (each dataset is scanned three times). The user may also need to transform or filter the datasets before comparing them, for example fetching only the rows where id > 2 from both sets. In that case this approach becomes even more inefficient. I'm looking for an efficient way to achieve the same result. Thanks in advance.
EDIT
Either dataset may also contain duplicate keys, for example:
-- Table t_curr
-------
id col
-------
1 A
1 B
2 C
-- Table t_hist
-------
id col
-------
1 B
2 C
2 D
-- Table t_change
----------------
id col change
----------------
1 A modified -- 'modified' because the first row for the matching key differs
1 B inserted
2 C same
2 D deleted
In that case my query would not produce the desired output. Thanks to @Gordon Linoff for bringing up this scenario.
Answer 1:
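One possible reading of the pairing rule in the duplicate example above is to number the rows within each id and compare rows that share the same position. The following is only a sketch under assumptions that are not stated in the question: the dialect supports window functions and FULL JOIN, and ordering the duplicates by col is an acceptable way to decide which row is "first". The aliases c, h and rn are made up for illustration.

```sql
-- Sketch only: assumes window functions and FULL JOIN are available,
-- and that ordering duplicates by col is an acceptable pairing rule.
with c as (
    select id, col,
           row_number() over (partition by id order by col) as rn
    from t_curr
),
h as (
    select id, col,
           row_number() over (partition by id order by col) as rn
    from t_hist
)
select coalesce(c.id, h.id)   as id,
       coalesce(c.col, h.col) as col,
       case when h.id is null then 'inserted'   -- position exists only in t_curr
            when c.id is null then 'deleted'    -- position exists only in t_hist
            when c.col = h.col then 'same'
            else 'modified'
       end as change
from c
full join h
  on c.id = h.id and c.rn = h.rn
order by id, col;
```

With the duplicate sample data this pairs (1, A) with (1, B), leaves (1, B) in t_curr and (2, D) in t_hist unmatched, and pairs (2, C) with (2, C). A different ordering rule inside row_number() would pair the rows differently, so the rule has to be chosen deliberately.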
This answers the original version of the question.
You can use a full join, assuming ids are unique in each table:
select id,
       coalesce(c.col, h.col) as col,  -- current value, or the historical one for deleted rows
       (case when h.id is null then 'New'
             when c.id is null then 'Deleted'
             when h.col <> c.col then 'Modified'
             else 'Same'
        end) as change
from t_hist h full join
     t_curr c
     using (id);
Note: This does not take NULL values in the value column(s) into account. That logic can easily be incorporated, but it adds a complication that doesn't seem necessary based on your sample data.
Source: https://stackoverflow.com/questions/61559925/capture-changes-in-2-datasets
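As a rough illustration of the NULL handling mentioned in the note, the comparison can be made NULL-safe. This is a sketch, not a tested solution: it assumes a dialect that supports IS DISTINCT FROM (PostgreSQL does; other engines may need a different NULL-safe operator), and it spells out the join with ON and coalesce rather than USING for portability.

```sql
-- Sketch only: assumes the dialect supports IS DISTINCT FROM and FULL JOIN.
select coalesce(c.id, h.id)   as id,
       coalesce(c.col, h.col) as col,
       (case when h.id is null then 'New'
             when c.id is null then 'Deleted'
             -- treats NULL vs. a value as different, and NULL vs. NULL as equal
             when h.col is distinct from c.col then 'Modified'
             else 'Same'
        end) as change
from t_hist h full join
     t_curr c
     on h.id = c.id;
```

With a plain <> comparison, a row whose col is NULL on one side only would fall through to 'Same', because the comparison evaluates to NULL rather than true.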