Teradata partitioned query … following rows dynamically

孤街醉人 提交于 2021-02-08 08:29:45

问题


I have a table with the following columns and data. Data describes certain customer activity periods

cust_id    s_date       e_date
11111    01.03.2014   31.03.2014
11111    10.04.2014   30.04.2014
11111    01.05.2014   10.05.2014
11111    15.06.2014   31.07.2014
22222    01.04.2014   31.05.2014
22222    01.06.2014   30.06.2014
22222    01.07.2014   15.07.2014

And I want to write a query which gives this result:

cust_id    s_date       e_date
11111    01.03.2014   10.05.2014
11111    15.06.2014   31.07.2014
22222    01.04.2014   15.07.2014

The query result purpose is to "merge" rows into one row when customer IN-activity period is less than 15 days. I can handle with "1 row preceding" but if needed to merge 3 or more rows then it does not work. I run out of ideas how to write this query.

My "half" 1 row preceding query:

SELECT cust_id
     , start_date     as current_period_start_date
     , end_date       as current_period_end_date
     , end_date+15    as current_period_expired_date
     , coalesce(
            min(current_period_expire_date)
           over(partition by cust_id
                    order by start_date
                     rows between 1 preceding and 1 preceding)
               , cast('1900-01-01' as date)) as previous_period_expire_date
     , case 
         when current_period_start_date <= previous_period_expire_date
         then min(current_period_start_date)
             over(partition by cust_id
                      order by start_date
                       rows between 1 preceding and current row)
         else current_period_start_date
       end as new_current_period_start_date

  FROM MY_DB.my_table
     . . .

Also, is it possible to change preceding into dynamical way like this?

... over(partition by ... order by ... rows between X preceding and current row)

回答1:


Gordon's answer can be modified as the basic LAG syntax is easily rewritten:

LAG(col, n) OVER (ORDER BY c) 

is the same as a

MIN(col) OVER (ORDER BY c ROWS BETWEEN n PRECEDING AND n PRECEDING)

The possible default value as a 3rd parameter can be done using a COALESCE(LAG...., default value), only the IGNORE NULLS option is a really tough one.

This results in:

SELECT cust_id, MIN(s_date) AS s_date, MAX(e_date) AS e_date
FROM (SELECT t.*, SUM(GroupStartFlag) OVER (PARTITION BY cust_id ORDER BY s_date ROWS UNBOUNDED PRECEDING) AS grpid
      FROM (SELECT cust_id, s_date, e_date,
                   (CASE WHEN s_date <= MIN(e_date) 
                                        OVER (PARTITION BY cust_id 
                                              ORDER BY s_date
                                              ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) + 15
                         THEN 0
                         ELSE 1
                    END) AS GroupStartFlag
            FROM  vt
           ) t
     ) t
GROUP BY cust_id, grpid;

If you don't need any additional columns (just the cust_id and the dates) you can also utilize a specific TD 13.10 table function for normalizing periods. To include the 15 day difference you might simply subtract/add 15 days:

WITH cte (cust_id, pd)
AS 
 ( SELECT cust_id, PERIOD(s_date-15, e_date) AS pd
   FROM vt
 )
SELECT cust_id,
   BEGIN(pd)+15,
   END(pd),
   cnt
FROM TABLE (TD_NORMALIZE_OVERLAP_MEET
            (NEW VARIANT_TYPE(cte.cust_id)
                ,cte.pd)
        RETURNS (cust_id INTEGER
                ,pd PERIOD(DATE)
                ,cnt INTEGER) --optional: number of rows normalized in one result row
        HASH BY cust_id
        LOCAL ORDER BY cust_id, pd
        ) AS t;

In TD 14.10 there's also a real nice syntax for normalizing periods:

SELECT cust_id, BEGIN (pd)+15, END(pd) 
FROM
 (
   SELECT NORMALIZE
      cust_id, PERIOD(s_date-15, e_date) AS pd
   FROM vt
 ) AS dt

Btw, periods are defined with an inclusive start but exclusive end (i.e. for a gapless period end of the previous period and start of the next got the same value), so you might have to change 15 to 16 to get the desired result.




回答2:


I would approach this using the lag() function. This function can be used to identify each row that begins a new period. Then when this flag is cumulatively summed, it provides a group identifier. Here is what the code looks like:

select cust_id, min(s_date) as s_date, max(e_date) as e_date
from (select t.*, sum(GroupStartFlag) over (partition by cust_id order by s_date rows unbounded preceding) as grpid
      from (select cust_id, s_date, e_date,
                   (case when s_date <= lag(e_date) over (partition by cust_id order by s_date) + 15
                         then 0
                         else 1
                    end) as GroupStartFlag
            from  MY_DB.my_table
           ) t
     ) t
group by cust_id, grpid;

Note: Teradata supports window functions but sometimes has odd requirements on them. I think the above will work directly, but I have no system to test it on.

EDIT:

I'm not sure if Teradata supports the lag() function. You can do the equivalent with a correlated subquery:

select cust_id, min(s_date) as s_date, max(e_date) as e_date
from (select t.*,
             sum(case when s_date <= prev_edate + 15 then 0 else 1 end) over
                 (partition by cust_id order by s_date rows unbounded preceding) as grpid
      from (select cust_id, s_date, e_date,
                   (select max(e_date) 
                    from MY_DB.my_table t2
                    where t2.cust_id = t.cust_id and
                          t2.s_date < t.s_date
                   ) as prev_edate
            from  MY_DB.my_table t
           ) t
     ) t
group by cust_id, grpid;


来源:https://stackoverflow.com/questions/25167738/teradata-partitioned-query-following-rows-dynamically

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!