How to solve a gaps-and-islands problem with a high-volume data set in Impala

Question

I have a Type 2 dimension stored in an Impala table with ~500M rows and 102 columns: (C1, C2, ..., C8, ..., C100, EFF_DT, EXP_DT). I need to select only the rows that have a distinct combination of values for (C1, C2, ..., C8). For each selected record, EFF_DT and EXP_DT must be, respectively, the min(eff_dt) and max(exp_dt) of the group to which that record belongs (a group here is defined by a distinct combination of (C1, C2, ..., C8)).

A simple GROUP BY will not solve the problem, because it collapses every period of the same group into a single row and so ignores the periods in between that belong to other groups.

For the sake of simplicity, here is what is required and what I have tried, assuming that only 2 columns define a group (not 8). Here is an example of the input, the desired output, and the output of a plain GROUP BY:

--INPUT                              --DESIRED OUTPUT                     --OUTPUT of SIMPLE GROUP BY
------------------------------------------------------------------------------------------------------------ 
C1  C2  EFF_DT      EXP_DT           C1   C2  EFF_DT      EXP_DT          C1   C2  EFF_DT       EXP_DT
4   8   2013-11-30  2014-01-22       4    8   2013-11-30  2014-01-22      4    8   2013-11-30   2999-12-31
2   8   2014-01-23  2014-01-23       2    8   2014-01-23  2014-01-23      2    8   2014-01-23   2014-01-23
4   8   2014-01-24  2015-12-31       4    8   2014-01-24  2999-12-31
4   8   2016-01-01  2016-12-31
4   8   2017-01-01  2018-03-15
4   8   2018-03-16  2018-07-24
4   8   2018-07-25  2999-12-31

I tried to use a subquery inside the SELECT list to pick max(exp_dt) based on the current row, but that did not work because Impala does not support it.

Here is the query I tried. It works fine, but not in Impala, because subqueries are not supported inside the SELECT list:

SELECT    
     T0.C1,
     T0.C2,
     MIN(T0.EFF_DT) AS MIN_EFF_DT,
     T0.EXP_DT
FROM (
    SELECT 
    T1.C1,
    T1.C2,
    T1.EFF_DT,
    (
        SELECT MAX(T2.EXP_DT)
        FROM (select * from TABLE_NAME ) T2
        WHERE T2.C1 = T1.C1
        AND   T2.C2 = T1.C2
        AND NOT EXISTS (
        SELECT 1 FROM (select * from TABLE_NAME) T3
            WHERE T3.EXP_DT < T2.EXP_DT 
            AND   T3.EXP_DT > T1.EXP_DT
            AND  (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2 )
        )

    ) EXP_DT
    FROM (select * from TABLE_NAME) T1
) T0 
GROUP BY 
T0.C1,
T0.C2,
T0.EXP_DT
ORDER BY MIN_EFF_DT ASC

Answer 1:

In all likelihood, the previous solution will work when modified for the id column:

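-- seqnum numbers all rows of an id; seqnum_1 numbers the rows of one (c1, c2) combination within that id.
-- Their difference stays constant across consecutive rows with the same combination, so it identifies each island.
select id, c1, c2, min(eff_dt) as min_eff_dt, max(exp_dt) as max_exp_dt
from (select t.*,
             row_number() over (partition by id order by eff_dt) as seqnum,
             row_number() over (partition by id, c1, c2 order by eff_dt) as seqnum_1
      from t
     ) t
group by id, c1, c2, (seqnum - seqnum_1);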

You should be able to expand the number of columns as you wish.
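For the two-column example in the question, which has no id column, the same row-number difference can be tried with the first window left unpartitioned. The sketch below is only an illustration under stated assumptions: it runs the query in SQLite through Python's sqlite3 module (SQLite 3.25 or later is needed for window functions) as a stand-in for Impala, and the table name t, the aliases min_eff_dt and max_exp_dt, and the helper collapse_islands are hypothetical names, not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (C1, C2, EFF_DT, EXP_DT).
# Dates are kept as ISO strings, so MIN/MAX on text order them chronologically.
ROWS = [
    (4, 8, "2013-11-30", "2014-01-22"),
    (2, 8, "2014-01-23", "2014-01-23"),
    (4, 8, "2014-01-24", "2015-12-31"),
    (4, 8, "2016-01-01", "2016-12-31"),
    (4, 8, "2017-01-01", "2018-03-15"),
    (4, 8, "2018-03-16", "2018-07-24"),
    (4, 8, "2018-07-25", "2999-12-31"),
]

# Assumed adaptation of the answer's query with the id column removed.
# seqnum   : position of the row among all rows, ordered by eff_dt
# seqnum_1 : position of the row within its own (c1, c2) combination
# seqnum - seqnum_1 is constant for consecutive rows of the same combination.
QUERY = """
SELECT c1, c2, MIN(eff_dt) AS min_eff_dt, MAX(exp_dt) AS max_exp_dt
FROM (
    SELECT c1, c2, eff_dt, exp_dt,
           ROW_NUMBER() OVER (ORDER BY eff_dt) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY c1, c2 ORDER BY eff_dt) AS seqnum_1
    FROM t
) s
GROUP BY c1, c2, seqnum - seqnum_1
ORDER BY min_eff_dt
"""


def collapse_islands(rows):
    """Load rows into an in-memory table t and return one row per island."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER, eff_dt TEXT, exp_dt TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in collapse_islands(ROWS):
        print(row)
```

On the sample input this yields the three rows of the desired output. Two caveats apply. Rows are merged when they are adjacent in EFF_DT order, so a real date gap inside a run of the same (C1, C2) values, with no other combination in between, would still be merged. An unpartitioned window also implies a global sort, which may be costly on ~500M rows, so partitioning by a business key such as the id used in the answer is probably preferable where one exists.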

Source: https://stackoverflow.com/questions/58940019/how-to-solve-a-gap-and-islands-problem-with-a-high-volume-set-of-data-in-impala

Tags

sql

impala