How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark

逝去的感伤 2020-12-11 13:33

We want to implement SCD2 in Spark using a SQL join. I got a reference from GitHub:

https://gist.github.com/rampage644/cc4659edd11d9a288c1b

but it's not very clear.

2 Answers
  •  无人及你
    2020-12-11 13:50

    Here's a detailed implementation of Slowly Changing Dimension Type 2 in Spark (DataFrame and SQL) using an exclusive-join approach.

    This assumes the source sends a complete data file, i.e. old, updated, and new records.

    Steps:

    Load the recent file data into the STG table (a loading sketch is shown after query 1 below).

    Select all the expired records from the HIST table:

    1. select * from HIST_TAB where exp_dt != '2099-12-31'
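
    For context, a minimal PySpark sketch of the load step and query 1. The SparkSession setup, file path, file format, and the assumption that HIST_TAB is an existing catalog table are illustrative, not part of the original approach.

        from pyspark.sql import SparkSession

        # Assumed entry point; enableHiveSupport is only needed if HIST_TAB lives in the Hive metastore.
        spark = SparkSession.builder.appName("scd2").enableHiveSupport().getOrCreate()

        # "Load the recent file data to STG table": read the latest extract (path/format assumed)
        # and expose it to SQL as STG_TAB.
        stg_df = spark.read.option("header", "true").csv("/data/incoming/latest_extract.csv")
        stg_df.createOrReplaceTempView("STG_TAB")

        # Query 1: history rows that are already expired are carried over unchanged.
        expired_df = spark.sql("select * from HIST_TAB where exp_dt != '2099-12-31'")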
    

    Select all the records that are unchanged between STG and HIST using an inner join on the key and a filter on HIST.column = STG.column, restricted to the currently active history rows, as below:

    2. select hist.* from HIST_TAB hist inner join STG_TAB stg on hist.key = stg.key where hist.column = stg.column and hist.exp_dt = '2099-12-31'
    

    Select all the new and changed records from STG_TAB using an exclusive left join with the active HIST_TAB records, and set the effective and expiry dates as below:

    3. select stg.*, current_date() as eff_dt, '2099-12-31' as exp_dt from STG_TAB stg left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist 
    on hist.key = stg.key where hist.key is null or hist.column != stg.column
    

    Select the active records from the HIST table that were updated (or are no longer present in STG) using an exclusive left join with the STG table, and close them by setting their expiry date, as shown below:

    4. select hist.key, hist.column, hist.eff_dt, current_date() as exp_dt from (select * from HIST_TAB where exp_dt = '2099-12-31') hist left join STG_TAB stg 
    on hist.key = stg.key where stg.key is null or hist.column != stg.column
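
    For completeness, here is the same exclusive-join logic from queries 3 and 4 expressed with the DataFrame API. This is only a sketch: it reuses the spark session from the earlier snippet and assumes the placeholder column names key, column, eff_dt, and exp_dt used throughout this answer.

        from pyspark.sql import functions as F

        stg = spark.table("STG_TAB")
        hist_active = spark.table("HIST_TAB").where(F.col("exp_dt") == "2099-12-31")

        # Query 3: new and changed source rows become the new active versions.
        new_and_changed = (
            stg.alias("stg")
               .join(hist_active.alias("hist"), F.col("stg.key") == F.col("hist.key"), "left")
               .where(F.col("hist.key").isNull() | (F.col("stg.column") != F.col("hist.column")))
               .select("stg.*")
               .withColumn("eff_dt", F.current_date())
               .withColumn("exp_dt", F.lit("2099-12-31")))

        # Query 4: active history rows that changed (or disappeared from the source) are closed.
        to_expire = (
            hist_active.alias("hist")
               .join(stg.alias("stg"), F.col("hist.key") == F.col("stg.key"), "left")
               .where(F.col("stg.key").isNull() | (F.col("hist.column") != F.col("stg.column")))
               .select("hist.key", "hist.column", "hist.eff_dt")
               .withColumn("exp_dt", F.current_date()))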
    

    UNION ALL the results of queries 1-4 and INSERT OVERWRITE the result into the HIST table; a PySpark sketch of the full flow follows.
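
    Putting it together, a hedged PySpark sketch of the union and overwrite. The staging table name HIST_TAB_STG and the final swap are assumptions, because Spark can refuse to overwrite a table that the same plan is still reading from.

        # Run the four queries above and rebuild the history table.
        q1 = spark.sql("select * from HIST_TAB where exp_dt != '2099-12-31'")
        q2 = spark.sql("""select hist.* from HIST_TAB hist
                          inner join STG_TAB stg on hist.key = stg.key
                          where hist.column = stg.column and hist.exp_dt = '2099-12-31'""")
        q3 = spark.sql("""select stg.*, current_date() as eff_dt, '2099-12-31' as exp_dt
                          from STG_TAB stg
                          left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist
                            on hist.key = stg.key
                          where hist.key is null or hist.column != stg.column""")
        q4 = spark.sql("""select hist.key, hist.column, hist.eff_dt, current_date() as exp_dt
                          from (select * from HIST_TAB where exp_dt = '2099-12-31') hist
                          left join STG_TAB stg on hist.key = stg.key
                          where stg.key is null or hist.column != stg.column""")

        result = q1.unionByName(q2).unionByName(q3).unionByName(q4)

        # Write to a separate staging table first and swap it in afterwards (e.g. ALTER TABLE ... RENAME),
        # since overwriting HIST_TAB directly while reading from it in the same job can fail.
        result.write.mode("overwrite").saveAsTable("HIST_TAB_STG")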

    A more detailed implementation of SCD Type 2 in Scala and PySpark can be found here:

    https://github.com/sahilbhange/spark-slowly-changing-dimension

    Hope this helps!
