Capture changes in 2 datasets

Question

I need to capture the changes between two datasets based on one or more keys: one is a historical version and the other the current version of the same dataset (both share the same schema). For example, for the input tables:

-- Table t_hist
-------
id  col
-------
1   A
2   B
3   C
4   D

-- Table t_curr
-------
id  col
-------
1   a
2   B
4   d
5   E

Expected result (considering id as comparison key):

-- Table t_change
----------------
id  col change
----------------
1   a   modified
2   B   same
3   C   deleted
4   d   modified
5   E   inserted

A naive approach that I could think of is:

SELECT id, col, change FROM
(
    SELECT t_curr.id, t_curr.col, 
        CASE t_curr.col = t_hist.col
        WHEN true THEN 'same'
        ELSE 'modified' 
        END as change
    FROM t_curr INNER JOIN t_hist ON t_curr.id = t_hist.id
) 
UNION ALL
(
    SELECT t_curr.id, t_curr.col, 'inserted' as change
    FROM t_curr WHERE id NOT IN (SELECT id FROM t_hist)
)
UNION ALL
(
    SELECT t_hist.id, t_hist.col, 'deleted' as change
    FROM t_hist WHERE id NOT IN (SELECT id FROM t_curr)
)

But this approach involves multiple table scans (three for each dataset). It is also possible that, before comparing the two datasets, the user needs to apply some transformation or filter to them, say fetching only rows where id > 2 from both sets. In that case this approach would be even more inefficient. I'm looking for an efficient way of achieving the same result. Thanks in advance.

EDIT

It is also possible that either dataset contains duplicate keys, for example:

-- Table t_curr
-------
id  col
-------
1   A
1   B
2   C

-- Table t_hist
-------
id  col
-------
1   B
2   C
2   D

-- Table t_change
----------------
id  col change
----------------
1   A   modified   -- change status is 'modified' as first row for matching key is different
1   B   inserted
2   C   same
2   D   deleted

In such a case my query would not produce the desired output. Thanks @Gordon Linoff for bringing up the scenario.
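One possible reading of the duplicate case is that rows are paired by their position within each id. The sketch below assumes BigQuery standard SQL and assumes that "first row" means the first row when ordered by col; tables have no inherent order, so that ordering is my assumption and not something stated above. The CTE names c and h and the helper column rn are made up for illustration, and the sample rows are inlined so the query can be run on its own.

```sql
-- Inline copies of the duplicate-key sample data shown above.
with t_curr as (
  select 1 as id, 'A' as col union all
  select 1, 'B' union all
  select 2, 'C'
),
t_hist as (
  select 1 as id, 'B' as col union all
  select 2, 'C' union all
  select 2, 'D'
),
-- Number the rows within each id. Ordering by col is an assumption.
c as (
  select id, col, row_number() over (partition by id order by col) as rn
  from t_curr
),
h as (
  select id, col, row_number() over (partition by id order by col) as rn
  from t_hist
)
select id,
       -- Show the current value, or the historical one for deleted rows.
       case when c.id is null then h.col else c.col end as col,
       case when h.id is null then 'inserted'
            when c.id is null then 'deleted'
            when c.col is distinct from h.col then 'modified'
            else 'same'
       end as change
from h full join c using (id, rn)
order by id, rn;
```

With this pairing, the first rows for id 1 (A against B) compare as modified and the unmatched second row B counts as inserted, which lines up with the t_change table above. A different ordering rule would pair the rows differently.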

Answer 1:

This answers the original version of the question.

You can use a full join, assuming ids are unique in each table:

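select id,
       coalesce(c.col, h.col) as col,  -- current value, or the historical one for deleted rows
       (case when h.id is null then 'inserted'
             when c.id is null then 'deleted'
             when h.col <> c.col then 'modified'
             else 'same'
        end) as change
from t_hist h full join
     t_curr c
     using (id);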

Note: This does not take into account NULL values for the value column(s). That logic can easily be incorporated but adds an additional complication that doesn't seem necessary based on your sample data.
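As a rough sketch of how the NULL handling could be folded in, the comparison can use IS DISTINCT FROM, which BigQuery standard SQL supports and which treats two NULLs as equal and a NULL against a value as different. The sample rows below are invented for illustration (id 3 is NULL in t_hist and 'C' in t_curr) and are inlined so the query runs on its own; ids are still assumed to be unique in each table.

```sql
-- Hypothetical sample data with a NULL value, inlined for a self-contained run.
with t_hist as (
  select 1 as id, 'A' as col union all
  select 2, 'B' union all
  select 3, cast(null as string) union all
  select 4, 'D'
),
t_curr as (
  select 1 as id, 'a' as col union all
  select 2, 'B' union all
  select 3, 'C' union all
  select 5, 'E'
)
select id,
       -- Show the current value, or the historical one for deleted rows.
       case when c.id is null then h.col else c.col end as col,
       case when h.id is null then 'inserted'
            when c.id is null then 'deleted'
            -- NULL-safe inequality: NULL vs 'C' counts as a change.
            when c.col is distinct from h.col then 'modified'
            else 'same'
       end as change
from t_hist h full join
     t_curr c
     using (id)
order by id;
```

With a plain <> the row for id 3 would fall through to 'same', because comparing NULL with 'C' yields NULL and not true. If a filter such as id > 2 is needed, it could be applied inside the two CTEs so that each input is still referenced only once in the query.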

Source: https://stackoverflow.com/questions/61559925/capture-changes-in-2-datasets

Tags

sql

google-bigquery