How to solve a gap-and-islands problem with a high volume set of data in Impala

Submitted by 点点圈 on 2020-01-06 05:34:27

Question


I have a Type 2 dimension stored in an Impala table with ~500M rows and 102 columns: (C1, C2, ..., C8, ..., C100, EFF_DT, EXP_DT). I need to select only the rows that have a distinct combination of values for (C1, C2, ..., C8). For each selected record, EFF_DT and EXP_DT must be, respectively, the min(eff_dt) and max(exp_dt) of the group to which that record belongs (a group here is defined by a distinct combination of (C1, C2, ..., C8)).

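A simple GROUP BY will not solve the problem here, because it ignores the time gaps within a group: periods of the same combination that are separated by a different combination get merged into a single row.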

For the sake of simplicity, here is what is required and what I have tried, assuming that only 2 columns define a group (not 8). Here is an example of the input, the desired output, and the output produced by using only GROUP BY:

--INPUT                              --DESIRED OUTPUT                     --OUTPUT of SIMPLE GROUP BY
------------------------------------------------------------------------------------------------------------ 
C1  C2  EFF_DT      EXP_DT           C1   C2  Eff_dt      EXP_DT          C1   C2  EFF_DT       EXP_DT
4   8   2013-11-30  2014-01-22       4    8   2013-11-30  2014-01-22      4    8   2013-11-30   2999-12-31
2   8   2014-01-23  2014-01-23       2    8   2014-01-23  2014-01-23      2    8   2014-01-23   2014-01-23
4   8   2014-01-24  2015-12-31       4    8   2014-01-24  2999-12-31
4   8   2016-01-01  2016-12-31
4   8   2017-01-01  2018-03-15
4   8   2018-03-16  2018-07-24
4   8   2018-07-25  2999-12-31

I tried to use a subquery inside the SELECT list to pick max(exp_dt) based on the current row, but that did not work because Impala does not support subqueries there.

Here is the query I tried. It works fine on other engines, but not in Impala:

SELECT    
     T0.C1,
     T0.C2,
     MIN(T0.EFF_DT) AS MIN_EFF_DT,
     T0.EXP_DT
FROM (
    SELECT 
    T1.C1,
    T1.C2,
    T1.EFF_DT,
    (
        SELECT MAX(T2.EXP_DT)
        FROM (select * from TABLE_NAME ) T2
        WHERE T2.C1 = T1.C1
        AND   T2.C2 = T1.C2
        AND NOT EXISTS (
        SELECT 1 FROM (select * from TABLE_NAME) T3
            WHERE T3.EXP_DT < T2.EXP_DT 
            AND   T3.EXP_DT > T1.EXP_DT
            AND  (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2 )
        )

    ) EXP_DT
    FROM (select * from TABLE_NAME) T1
) T0 
GROUP BY 
T0.C1,
T0.C2,
T0.EXP_DT
ORDER BY MIN_EFF_DT ASC

Answer 1:


In all likelihood, the previous solution will work when modified for the id column:

select id, c1, c2, min(eff_dt), max(exp_dt)
from (select t.*,
             row_number() over (partition by id order by eff_dt) as seqnum,
             row_number() over (partition by id, c1, c2 order by eff_dt) as seqnum_1
      from t
     ) t
group by id, c1, c2, (seqnum - seqnum_1);

You should be able to expand the number of columns as you wish.
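
To see why the difference of the two row numbers identifies each island, here is a minimal sketch that runs the same pattern on the sample rows from the question. It makes several assumptions that are not in the original post: it uses Python's built-in sqlite3 module as a stand-in for Impala (SQLite 3.25 or later is needed for window functions), the table name `t` and the function name `collapse_islands` are hypothetical, and the `id` column is dropped because the sample data has none, so the first ROW_NUMBER() has no PARTITION BY.

```python
import sqlite3

# Sample input rows from the question: (C1, C2, EFF_DT, EXP_DT).
SAMPLE_ROWS = [
    (4, 8, "2013-11-30", "2014-01-22"),
    (2, 8, "2014-01-23", "2014-01-23"),
    (4, 8, "2014-01-24", "2015-12-31"),
    (4, 8, "2016-01-01", "2016-12-31"),
    (4, 8, "2017-01-01", "2018-03-15"),
    (4, 8, "2018-03-16", "2018-07-24"),
    (4, 8, "2018-07-25", "2999-12-31"),
]

# Same pattern as the answer, without the id column (assumption).
# seqnum numbers all rows by date; seqnum_1 numbers rows within each
# (c1, c2) combination. Their difference stays constant for as long as
# the same combination continues without interruption.
ISLANDS_QUERY = """
SELECT c1, c2, MIN(eff_dt) AS min_eff_dt, MAX(exp_dt) AS max_exp_dt
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY eff_dt) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY c1, c2 ORDER BY eff_dt) AS seqnum_1
    FROM t
) AS s
GROUP BY c1, c2, (seqnum - seqnum_1)
ORDER BY min_eff_dt
"""


def collapse_islands(rows):
    """Load rows into an in-memory table and return one row per island."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE t (c1 INTEGER, c2 INTEGER, eff_dt TEXT, exp_dt TEXT)"
        )
        conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
        return conn.execute(ISLANDS_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in collapse_islands(SAMPLE_ROWS):
        print(row)
```

On the sample data, seqnum - seqnum_1 is 0 for the first (4, 8) row, 1 for the (2, 8) row, and 1 for the remaining five (4, 8) rows, so grouping by (c1, c2, difference) yields the three rows of the desired output. On the real table, a ROW_NUMBER() without PARTITION BY orders all rows in a single stream, which is likely to be costly at ~500M rows; that is presumably why the answer partitions by an id column.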



Source: https://stackoverflow.com/questions/58940019/how-to-solve-a-gap-and-islands-problem-with-a-high-volume-set-of-data-in-impala
