Question
I have a Type 2 dimension in an Impala table with ~500M rows and 102 columns: (C1, C2, ..., C8, ..., C100, EFF_DT, EXP_DT). I need to select only the rows that have a distinct combination of values of (C1, C2, ..., C8). For each selected record, EFF_DT and EXP_DT must be, respectively, the min(eff_dt) and max(exp_dt) of the group to which that record belongs (a group here is defined by a distinct combination of (C1, C2, ..., C8)).
A simple GROUP BY will not solve the problem here, because it ignores the gaps in time within the same group: rows with the same combination that are separated in time by a different combination get merged into a single row.
For the sake of simplicity, here is what is required and what I have tried, assuming that only 2 columns define a group (not 8). Below are an example input, the desired output, and the output produced by a plain GROUP BY.
-- INPUT
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2014-01-22
2   8   2014-01-23  2014-01-23
4   8   2014-01-24  2015-12-31
4   8   2016-01-01  2016-12-31
4   8   2017-01-01  2018-03-15
4   8   2018-03-16  2018-07-24
4   8   2018-07-25  2999-12-31

-- DESIRED OUTPUT
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2014-01-22
2   8   2014-01-23  2014-01-23
4   8   2014-01-24  2999-12-31

-- OUTPUT OF SIMPLE GROUP BY
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2999-12-31
2   8   2014-01-23  2014-01-23
I tried to use a subquery inside the SELECT list to pick max(exp_dt) based on the current row, but that didn't work because Impala does not support it.
Here is the query I tried, which works fine, but not in Impala (because subqueries are not supported inside the SELECT list):
SELECT
    T0.C1,
    T0.C2,
    MIN(T0.EFF_DT) AS MIN_EFF_DT,
    T0.EXP_DT
FROM (
    SELECT
        T1.C1,
        T1.C2,
        T1.EFF_DT,
        (
            SELECT MAX(T2.EXP_DT)
            FROM (select * from TABLE_NAME) T2
            WHERE T2.C1 = T1.C1
              AND T2.C2 = T1.C2
              AND NOT EXISTS (
                  SELECT 1 FROM (select * from TABLE_NAME) T3
                  WHERE T3.EXP_DT < T2.EXP_DT
                    AND T3.EXP_DT > T1.EXP_DT
                    AND (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2)
              )
        ) EXP_DT
    FROM (select * from TABLE_NAME) T1
) T0
GROUP BY
    T0.C1,
    T0.C2,
    T0.EXP_DT
ORDER BY MIN_EFF_DT ASC
Answer 1:
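In all likelihood, the previous solution will work when modified for the id column. Here id stands for a key that identifies each dimension member; if the table has no such column, leave it out of the partition by and group by clauses: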
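select id, c1, c2, min(eff_dt) as min_eff_dt, max(exp_dt) as max_exp_dt
from (select t.*,
             -- position of the row among all rows of the same id
             row_number() over (partition by id order by eff_dt) as seqnum,
             -- position of the row among rows of the same id with the same (c1, c2)
             row_number() over (partition by id, c1, c2 order by eff_dt) as seqnum_1
      from t
     ) t
-- the difference is constant within each consecutive run of the same (c1, c2)
group by id, c1, c2, (seqnum - seqnum_1);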
You should be able to expand the number of columns as you wish.
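As a rough check of this approach against the sample rows from the question, here is a minimal Python sketch that runs the same row_number() difference trick through the standard sqlite3 module. SQLite is only a stand-in for Impala here (an assumption; SQLite 3.25 or later is needed for window functions). The sample data has no id column, so the sketch leaves it out of the partitions. The table name t, the function name find_islands, and the column aliases are illustrative, not part of the original question or answer.

```python
import sqlite3

# Sample rows from the question: (C1, C2, EFF_DT, EXP_DT).
# Dates are ISO strings, so text comparison matches date order.
ROWS = [
    (4, 8, "2013-11-30", "2014-01-22"),
    (2, 8, "2014-01-23", "2014-01-23"),
    (4, 8, "2014-01-24", "2015-12-31"),
    (4, 8, "2016-01-01", "2016-12-31"),
    (4, 8, "2017-01-01", "2018-03-15"),
    (4, 8, "2018-03-16", "2018-07-24"),
    (4, 8, "2018-07-25", "2999-12-31"),
]

# Assumption: no id column, so seqnum numbers the whole table by eff_dt.
ISLANDS_SQL = """
select c1, c2, min(eff_dt) as min_eff_dt, max(exp_dt) as max_exp_dt
from (select t.*,
             row_number() over (order by eff_dt) as seqnum,
             row_number() over (partition by c1, c2 order by eff_dt) as seqnum_1
      from t
     ) numbered
group by c1, c2, (seqnum - seqnum_1)
order by min_eff_dt
"""


def find_islands(rows):
    """Load rows into an in-memory table and collapse consecutive runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (c1 integer, c2 integer, eff_dt text, exp_dt text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.execute(ISLANDS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in find_islands(ROWS):
        print(row)
```

On the sample data, seqnum runs from 1 to 7 over the whole table, while seqnum_1 restarts at 1 for each (c1, c2) pair. The first (4, 8) row gets a difference of 0 and the five later (4, 8) rows all get a difference of 1, so they fall into two separate groups instead of being merged, which is what the plain GROUP BY fails to do.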
Source: https://stackoverflow.com/questions/58940019/how-to-solve-a-gap-and-islands-problem-with-a-high-volume-set-of-data-in-impala