Capture changes between 2 datasets with duplicates

Question

This is a follow-up to the question "Capture changes in 2 datasets". I need to capture the changes between two datasets based on one or more keys: one dataset is the historical version and the other is the current version of the same data (both share the same schema). The datasets can also contain duplicate rows. In the example below, id is the comparison key:

-- Table t_curr
-------
id  col
-------
1   A
1   B
2   C
3   F

-- Table t_hist
-------
id  col
-------
1   B
2   C
2   D
4   G

-- Expected output t_change
----------------
id  col change
----------------
1   A   modified   -- 'modified' because the first row for id=1 differs between the two tables
1   B   inserted
2   C   same
2   D   deleted
3   F   inserted
4   G   deleted

I'm looking for an efficient solution to get the desired output.

EDIT

Explanation: assume the records are fetched from t_curr in the order shown and are ranked within each id:

1/A is the first record and 1/B is the second record for id=1 in t_curr
1/B is the first record for id=1 in t_hist
The first records of both datasets are compared, i.e. 1/A in t_curr is compared with 1/B in t_hist, hence 1/A is marked as modified in t_change
Since the second record for id=1 (1/B) has no counterpart in t_hist, it is marked as inserted
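
The per-id ranking described above can be sketched with row_number(). This is only an illustration: the tables have no column that defines row order, so ordering by col within each id is an assumption that happens to reproduce the order shown for the sample data.

```sql
-- Sketch only: "order by col" is an assumed ordering, not part of the original data model.
with t_curr as (
  select 1 as id, 'A' as col union all
  select 1, 'B' union all
  select 2, 'C' union all
  select 3, 'F'
)
select
  id,
  col,
  row_number() over (partition by id order by col) as rn  -- rank of the row within its id
from t_curr
order by id, rn
```

For the sample data this ranks 1/A as 1 and 1/B as 2 within id=1, while 2/C and 3/F each get rank 1.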

Answer 1:

I was able to do it using a full outer join and row_number(). Query:

with t_hist as (
  select 1 as id, 'B' as col union all
  select 2 as id, 'C' as col union all
  select 2 as id, 'D' as col union all
  select 4 as id, 'G' as col
),
t_curr as (
  -- columns are named id1/col1 so both sides can be selected together without name clashes
  select 1 as id1, 'A' as col1 union all
  select 1 as id1, 'B' as col1 union all
  select 2 as id1, 'C' as col1 union all
  select 3 as id1, 'F' as col1
)

select
  coalesce(id1, id) as id_,
  coalesce(col1, col) as col_,
  case
    when id is null then 'inserted'   -- row exists only in t_curr
    when id1 is null then 'deleted'   -- row exists only in t_hist
    when col = col1 then 'same'
    else 'modified'
  end as change
from (
  select t_curr.*, t_hist.*
  from (
    -- rank rows within each id; there is no ordering column, so the rank order is arbitrary
    select *, row_number() over (partition by id1 order by id1) as r1 from t_curr
  ) t_curr
  full outer join (
    select *, row_number() over (partition by id) as r from t_hist
  ) t_hist
  on id1 = id and r1 = r
)
order by id_
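
Because neither table has a column that defines row order, the ranks assigned by row_number() in the query above are not guaranteed to be stable between runs. One possible variant is sketched below. It assumes that ranking rows by col within each id is acceptable, which the question does not state. The names curr_ranked, hist_ranked and rn are introduced here purely for illustration.

```sql
-- Sketch only: ordering by col inside each id is an assumption made to get a deterministic pairing.
with t_hist as (
  select 1 as id, 'B' as col union all
  select 2, 'C' union all
  select 2, 'D' union all
  select 4, 'G'
),
t_curr as (
  select 1 as id, 'A' as col union all
  select 1, 'B' union all
  select 2, 'C' union all
  select 3, 'F'
),
curr_ranked as (
  select id, col, row_number() over (partition by id order by col) as rn
  from t_curr
),
hist_ranked as (
  select id, col, row_number() over (partition by id order by col) as rn
  from t_hist
)
select
  coalesce(c.id, h.id) as id,
  coalesce(c.col, h.col) as col,
  case
    when h.id is null then 'inserted'  -- no row with this id and rank in t_hist
    when c.id is null then 'deleted'   -- no row with this id and rank in t_curr
    when c.col = h.col then 'same'
    else 'modified'
  end as change
from curr_ranked as c
full outer join hist_ranked as h
  on c.id = h.id and c.rn = h.rn
order by id, col
```

On the sample data this pairing yields the rows listed in the expected t_change output. If the real tables have a load timestamp or sequence column, ordering by that column instead would likely be closer to the "order in which records come" described in the question.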

Source: https://stackoverflow.com/questions/61570412/capture-changes-between-2-datasets-with-duplicates

Tags

sql

google-bigquery