Get envelope, i.e. overlapping time spans

Question

I have a table with online sessions like this (empty rows are just for better visibility):

ip_address  | start_time       | stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20

10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12

10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11

I need the "envelope" of the online time spans:

ip_address  | full_start_time  | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11

I have this query, which returns the desired result:

WITH t AS 
    -- Determine full time-range of each IP
    (SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time FROM IP_SESSIONS GROUP BY ip_address),
t2 AS
    -- compose ticks
    (SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts
    FROM t
    CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time),
t3 AS 
    -- get all "online" ticks
    (SELECT DISTINCT ip_address, ts
    FROM t2
        JOIN IP_SESSIONS USING (ip_address)
    WHERE ts BETWEEN start_time AND stop_time),
t4 AS
    -- for each "online" tick, look up the previous "online" tick of the same IP
    (SELECT ip_address, ts,
        LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts
    FROM t3),
t5 AS
    -- number the sessions: a new one starts at the first tick or after a gap in the ticks
    (SELECT ip_address, ts,
        SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END)))
            OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no
    FROM t4)
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time
FROM t5
GROUP BY ip_address, session_no
ORDER BY 1,2;

However, I am concerned about performance. The table has hundreds of millions of rows and the time resolution is one millisecond (not one minute as in the example), so CTE t3 is going to be huge. Does anybody have a solution that avoids the self-join and the CONNECT BY?

A single smart analytic function would be great.

Answer 1:

Try this one, too. I tested it as best I could, and I believe it covers all the cases, including coalescing adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are combined into a single interval, 10:15 to 10:40). It should also be quite fast: it needs only two analytic functions and one aggregation, with no self-join and no CONNECT BY.

with m as
        (
         select ip_address, start_time,
                   max(stop_time) over (partition by ip_address order by start_time 
                             rows between unbounded preceding and 1 preceding) as m_time
         from ip_sessions
         union all
         select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address
        ),
     n as
        (
         select ip_address, start_time, m_time 
         from m 
         where start_time > m_time or start_time is null or m_time is null
        ),
     f as
        (
         select ip_address, start_time,
            lead(m_time) over (partition by ip_address order by start_time) as stop_time
         from n
        )
select * from f where start_time is not null
/

Answer 2:

Please test this solution. It works for your examples, but there may be cases I didn't notice. It uses no CONNECT BY and no self-join.

with io as (
  select * from (
    select ip_address, t1, io, sum(io) over (partition by ip_address order by t1) sio
      from (
        select ip_address, start_time t1, 1 io from ip_sessions
        union all 
        select ip_address, stop_time, -1 io from ip_sessions ) )
    where (io = 1 and sio = 1) or (io = -1 and sio = 0) )
select ip_address, t1, t2
  from (
    select io.*, lead(t1) over (partition by ip_address order by t1) as t2 from io)
  where io = 1

Test data:

create table ip_sessions (ip_address varchar2(15), start_time date, stop_time date);
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:00:00', timestamp '2016-04-02 08:12:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:11:00', timestamp '2016-04-02 08:20:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:00:00', timestamp '2016-04-02 09:10:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:05:00', timestamp '2016-04-02 09:08:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:02:00', timestamp '2016-04-02 09:15:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:10:00', timestamp '2016-04-02 09:12:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:05:00', timestamp '2016-04-02 08:07:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:03:00', timestamp '2016-04-02 08:11:00');

Output:

IP_ADDRESS   T1                   T2
-----------  -------------------  -------------------
10.10.10.10  2016-04-02 08:00:00  2016-04-02 08:20:00
10.10.10.10  2016-04-02 09:00:00  2016-04-02 09:15:00
10.66.44.22  2016-04-02 08:03:00  2016-04-02 08:11:00
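
The query above orders the +1/-1 events by time only, so events that share a timestamp are summed together as peers. A possible variant, sketched here and not part of the original answer, makes the order explicit: start events sort before stop events at the same instant (io desc), and ROWS UNBOUNDED PRECEDING advances the sum one row at a time. It assumes the same ip_sessions table and stop_time >= start_time on every row; the names ev, marks, and open_cnt are illustrative only.

```sql
-- Sketch only: same +1/-1 sweep, with a deterministic tie-break.
-- Assumes ip_sessions(ip_address, start_time, stop_time) and stop_time >= start_time.
with ev as (
  -- one +1 event per session start, one -1 event per session stop
  select ip_address, start_time as t1, 1 as io from ip_sessions
  union all
  select ip_address, stop_time, -1 from ip_sessions
),
marks as (
  select ip_address, t1, io,
         -- number of open sessions after this event;
         -- at equal t1, starts (+1) are counted before stops (-1)
         sum(io) over (partition by ip_address
                       order by t1, io desc
                       rows unbounded preceding) as open_cnt
    from ev
)
select ip_address, full_start_time, full_stop_time
  from (
    select ip_address, io,
           t1 as full_start_time,
           -- after filtering, boundaries alternate start/stop,
           -- so the next boundary closes the current island
           lead(t1) over (partition by ip_address
                          order by t1, io desc) as full_stop_time
      from marks
     where (io = 1 and open_cnt = 1)    -- count went 0 -> 1: island starts
        or (io = -1 and open_cnt = 0)   -- count went 1 -> 0: island ends
  )
 where io = 1
 order by ip_address, full_start_time;
```

With this ordering, two sessions that start at the same instant should still open a single island, and a session that starts exactly when another stops should extend the island instead of starting a new one. On the test data it should return the same three rows as the output above.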

Answer 3:

I think using lag() and a cumulative sum is going to have much better performance:

select ip_address, min(start_time) as full_start_time,
       max(stop_time) as full_stop_time
from (select t.*,
             -- grp: running count of rows whose previous stop time lies before their start
             sum(case when prev_et >= start_time then 0 else 1 end) over
                 (partition by ip_address order by start_time) as grp
      from (select s.*,
                   -- prev_et: stop time of the previous session in stop_time order
                   lag(stop_time) over (partition by ip_address order by stop_time) as prev_et
            from ip_sessions s
           ) t
     )
group by grp, ip_address
order by 1, 2;

On the sample data, this query gives the following result, which does not match the desired output:

ip_address  | full_start_time  | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
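
Comparing each session with only one neighbouring row is not enough, because a long earlier session can still be open when a later one starts. A common fix, sketched here and not part of the original answer, compares each start_time with the running maximum of all preceding stop_time values (the same window frame the first answer uses). It assumes the ip_sessions table from the test data and stop_time >= start_time on every row; prev_max_stop is an illustrative name.

```sql
-- Sketch only: gaps-and-islands with a running maximum of earlier stop times.
-- Assumes ip_sessions(ip_address, start_time, stop_time) and stop_time >= start_time.
select ip_address,
       min(start_time) as full_start_time,
       max(stop_time)  as full_stop_time
  from (
    select t.*,
           -- running count of island starts; rows with equal start_time are
           -- peers in the default RANGE window, so they share one group
           sum(case when prev_max_stop >= start_time then 0 else 1 end)
               over (partition by ip_address order by start_time) as grp
      from (
        select s.*,
               -- latest stop_time among the rows that precede this one
               -- in start_time order (NULL for the first row of each IP)
               max(stop_time) over (partition by ip_address
                                    order by start_time
                                    rows between unbounded preceding
                                             and 1 preceding) as prev_max_stop
          from ip_sessions s
      ) t
  )
 group by ip_address, grp
 order by 1, 2;
```

On the sample data this should return the three desired rows: 08:00 to 08:20 and 09:00 to 09:15 for 10.10.10.10, and 08:03 to 08:11 for 10.66.44.22. Because the comparison uses >=, a session that starts exactly when the previous one stops is merged into the same island.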

Source: https://stackoverflow.com/questions/36387048/get-envelope-i-e-overlapping-time-spans

Tags: sql, Oracle, analytics, timespan