How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark

逝去的感伤 2020-12-11 13:33

We want to implement SCD2 in Spark using a SQL join. I got a reference from GitHub:

https://gist.github.com/rampage644/cc4659edd11d9a288c1b

but it's not very clear.

2 Answers
  •  无人及你
    2020-12-11 13:50

    Here's a detailed implementation of Slowly Changing Dimension Type 2 in Spark (DataFrame and SQL) using an exclusive-join approach.

    This assumes the source sends a complete data file, i.e. old, updated, and new records.

    Steps:

    Load the recent file data into the STG table (a loading sketch is shown after query 1 below).

    Select all the expired records from the HIST table:

    1. select * from HIST_TAB where exp_dt != '2099-12-31'
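
    For context, a minimal PySpark sketch of the load step and query 1. The SparkSession setup, file path, file format, and the assumption that HIST_TAB is an existing catalog table are illustrative, not part of the original approach.

        from pyspark.sql import SparkSession

        # Assumed entry point; enableHiveSupport is only needed if HIST_TAB lives in the Hive metastore.
        spark = SparkSession.builder.appName("scd2").enableHiveSupport().getOrCreate()

        # "Load the recent file data to STG table": read the latest extract (path/format assumed)
        # and expose it to SQL as STG_TAB.
        stg_df = spark.read.option("header", "true").csv("/data/incoming/latest_extract.csv")
        stg_df.createOrReplaceTempView("STG_TAB")

        # Query 1: history rows that are already expired are carried over unchanged.
        expired_df = spark.sql("select * from HIST_TAB where exp_dt != '2099-12-31'")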
    

    Select all the records that are unchanged between STG and HIST using an inner join on the key and a filter on HIST.column = STG.column, restricted to the currently active history rows, as below:

    2. select hist.* from HIST_TAB hist inner join STG_TAB stg on hist.key = stg.key where hist.column = stg.column and hist.exp_dt = '2099-12-31'
    

    Select all the new and changed records from STG_TAB using an exclusive left join with the active HIST_TAB records, and set the effective and expiry dates as below:

    3. select stg.*, current_date() as eff_dt, '2099-12-31' as exp_dt from STG_TAB stg left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist 
    on hist.key = stg.key where hist.key is null or hist.column != stg.column
    

    Select the active records from the HIST table that were updated (or are no longer present in STG) using an exclusive left join with the STG table, and close them by setting their expiry date, as shown below:

    4. select hist.key, hist.column, hist.eff_dt, current_date() as exp_dt from (select * from HIST_TAB where exp_dt = '2099-12-31') hist left join STG_TAB stg 
    on hist.key = stg.key where stg.key is null or hist.column != stg.column
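
    For completeness, here is the same exclusive-join logic from queries 3 and 4 expressed with the DataFrame API. This is only a sketch: it reuses the spark session from the earlier snippet and assumes the placeholder column names key, column, eff_dt, and exp_dt used throughout this answer.

        from pyspark.sql import functions as F

        stg = spark.table("STG_TAB")
        hist_active = spark.table("HIST_TAB").where(F.col("exp_dt") == "2099-12-31")

        # Query 3: new and changed source rows become the new active versions.
        new_and_changed = (
            stg.alias("stg")
               .join(hist_active.alias("hist"), F.col("stg.key") == F.col("hist.key"), "left")
               .where(F.col("hist.key").isNull() | (F.col("stg.column") != F.col("hist.column")))
               .select("stg.*")
               .withColumn("eff_dt", F.current_date())
               .withColumn("exp_dt", F.lit("2099-12-31")))

        # Query 4: active history rows that changed (or disappeared from the source) are closed.
        to_expire = (
            hist_active.alias("hist")
               .join(stg.alias("stg"), F.col("hist.key") == F.col("stg.key"), "left")
               .where(F.col("stg.key").isNull() | (F.col("hist.column") != F.col("stg.column")))
               .select("hist.key", "hist.column", "hist.eff_dt")
               .withColumn("exp_dt", F.current_date()))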
    

    UNION ALL the results of queries 1-4 and INSERT OVERWRITE the result into the HIST table; a PySpark sketch of the full flow follows.
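
    Putting it together, a hedged PySpark sketch of the union and overwrite. The staging table name HIST_TAB_STG and the final swap are assumptions, because Spark can refuse to overwrite a table that the same plan is still reading from.

        # Run the four queries above and rebuild the history table.
        q1 = spark.sql("select * from HIST_TAB where exp_dt != '2099-12-31'")
        q2 = spark.sql("""select hist.* from HIST_TAB hist
                          inner join STG_TAB stg on hist.key = stg.key
                          where hist.column = stg.column and hist.exp_dt = '2099-12-31'""")
        q3 = spark.sql("""select stg.*, current_date() as eff_dt, '2099-12-31' as exp_dt
                          from STG_TAB stg
                          left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist
                            on hist.key = stg.key
                          where hist.key is null or hist.column != stg.column""")
        q4 = spark.sql("""select hist.key, hist.column, hist.eff_dt, current_date() as exp_dt
                          from (select * from HIST_TAB where exp_dt = '2099-12-31') hist
                          left join STG_TAB stg on hist.key = stg.key
                          where stg.key is null or hist.column != stg.column""")

        result = q1.unionByName(q2).unionByName(q3).unionByName(q4)

        # Write to a separate staging table first and swap it in afterwards (e.g. ALTER TABLE ... RENAME),
        # since overwriting HIST_TAB directly while reading from it in the same job can fail.
        result.write.mode("overwrite").saveAsTable("HIST_TAB_STG")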

    A more detailed implementation of SCD Type 2 in Scala and PySpark can be found here:

    https://github.com/sahilbhange/spark-slowly-changing-dimension

    Hope this helps!
