问题
I am using Data factory to create our staging area, the problem is whenever source data changes, we add a new row to staging tables.
For instance, assume we have the following data:
ID Fields created edited
100 ---------- '2017-07-01' '2017-07-05'
this will be stored in our staging tables like this:
ID Fields created edited
100 ---------- '2017-07-01' null
100 ---------- '2017-07-01' '2017-07-05'
Selecting the most recent row is expensive and We don't want that. How do you think we can avoid storing duplicate IDs in staging?
I assume on creating the pipelines, there should be a way to update the data if the ID already exists in staging.
the query format in data factory is like this:
$$Text.Format('select * from <<table>> where <<column>> >= \'{0:yyyy-MM-dd HH:mm}\' AND <<column>> < \'{1:yyyy-MM-dd HH:mm}\'', WindowStart, WindowEnd)
回答1:
Here i think most recent record is considered by edited column ,Try this below code
;WITH CTE (ID,Fields,created,edited)
AS
(
SELECT 100,'------------' ,'2017-07-01', null UNION ALL
SELECT 100,'------------ ','2017-07-01','2017-07-05'
)
SELECT ID,Fields,created,edited FROM
(
SELECT *, ROW_NUMBER()OVER(Partition by ID ORDER BY edited DESC ) RecentRecord FROM CTE
)DT
WHERE DT.RecentRecord=1
Result
ID Fields created edited
------------------------------------------
100 ------------ 2017-07-01 2017-07-05
回答2:
I found a solution to my problem. It goes like this:
I added a temp table into our data warehouse, before inserting data into Staging, first data comes into these temp tables, I remove those which already exist in staging tables then do a fresh insert. In this way, I will always keep the most recent data in the staging area and I won't have to use row_number() functions when joining staging tables to create DW.
I am not sure if it's the best approach, but it's working for me.
来源:https://stackoverflow.com/questions/45032243/keep-only-the-most-recent-row-of-data-in-data-factory