Question
I have a table of online sessions like this:
ip_address | start_time | stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
And I need the "envelope" online time spans:
ip_address | full_start_time | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
I have this query, which returns the desired result:
WITH t AS
    -- Determine full time-range of each IP
    (SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time
    FROM IP_SESSIONS
    GROUP BY ip_address),
t2 AS
    -- compose ticks
    (SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts
    FROM t
    CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time),
t3 AS
    -- get all "online" ticks
    (SELECT DISTINCT ip_address, ts
    FROM t2
        JOIN IP_SESSIONS USING (ip_address)
    WHERE ts BETWEEN start_time AND stop_time),
t4 AS
    (SELECT ip_address, ts,
        LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts
    FROM t3),
t5 AS
    (SELECT ip_address, ts,
        SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END)))
            OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no
    FROM t4)
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time
FROM t5
GROUP BY ip_address, session_no
ORDER BY 1,2;
However, I am concerned about performance. The table has hundreds of millions of rows and the time resolution is milliseconds (not one minute as in the example), so CTE t3 is going to be huge. Does anybody have a solution that avoids the self-join and CONNECT BY?
A single smart analytic function would be great.
Answer 1:
Try this one, too. I tested it as best I could, and I believe it covers all the possibilities, including coalescing adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are combined into a single interval, 10:15 to 10:40). It should also be quite fast: it needs only two analytic functions and one GROUP BY, with no self-join and no CONNECT BY.
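with m as
(
    -- m_time: latest stop_time among all earlier sessions of the same IP
    select ip_address, start_time,
           max(stop_time) over (partition by ip_address order by start_time
                                rows between unbounded preceding and 1 preceding) as m_time
    from ip_sessions
    union all
    -- one extra row per IP that carries the overall latest stop_time
    select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address
),
n as
(
    -- keep only the rows that open a new span, plus the extra row
    select ip_address, start_time, m_time
    from m
    where start_time > m_time or start_time is null or m_time is null
),
f as
(
    -- the next kept row's m_time is where the current span ends
    -- (the NULL start_time row sorts last in Oracle's default ascending order)
    select ip_address, start_time,
           lead(m_time) over (partition by ip_address order by start_time) as stop_time
    from n
)
select * from f where start_time is not null
/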
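The coalescing rule this answer relies on (a session opens a new span only when its start is strictly later than every earlier stop) can also be sketched outside the database. The following is a minimal Python illustration of that rule, not a translation of the SQL above. The function name `envelope_spans` and the sample IP `10.0.0.1` are made up for the example, and times are zero-padded strings so that string comparison matches time order.

```python
from itertools import groupby
from operator import itemgetter


def envelope_spans(sessions):
    """Merge overlapping or touching (ip, start, stop) sessions per IP."""
    result = []
    ordered = sorted(sessions)  # by ip, then start, then stop
    for ip, rows in groupby(ordered, key=itemgetter(0)):
        span_start = running_max = None
        for _, start, stop in rows:
            if running_max is None:
                # first session of this IP opens the first span
                span_start, running_max = start, stop
            elif start > running_max:
                # strictly after every earlier stop: close the span, open a new one
                result.append((ip, span_start, running_max))
                span_start, running_max = start, stop
            else:
                # overlaps or touches the current span: extend it if needed
                running_max = max(running_max, stop)
        result.append((ip, span_start, running_max))
    return result


sample = [
    ("10.0.0.1", "10:15", "10:30"),
    ("10.0.0.1", "10:30", "10:40"),  # touches the previous session
    ("10.0.0.1", "11:00", "11:05"),  # separated by a gap
]
print(envelope_spans(sample))
```

With this sample the two touching sessions come out as one span from 10:15 to 10:40, and the 11:00 session stays separate.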
Answer 2:
Please test this solution. It works for your examples, but there may be cases I have not considered. It uses no CONNECT BY and no self-join.
with io as (
    select * from (
        -- sio: running sum of the +1/-1 markers up to and including t1
        select ip_address, t1, io, sum(io) over (partition by ip_address order by t1) sio
        from (
            select ip_address, start_time t1, 1 io from ip_sessions
            union all
            select ip_address, stop_time, -1 io from ip_sessions ) )
    -- keep the start that opens a span and the stop that closes it
    where (io = 1 and sio = 1) or (io = -1 and sio = 0) )
-- pair each opening start with the closing stop that follows it
select ip_address, t1, t2
from (
    select io.*, lead(t1) over (partition by ip_address order by t1) as t2 from io)
where io = 1
Test data:
create table ip_sessions (ip_address varchar2(15), start_time date, stop_time date);
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:00:00', timestamp '2016-04-02 08:12:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:11:00', timestamp '2016-04-02 08:20:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:00:00', timestamp '2016-04-02 09:10:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:05:00', timestamp '2016-04-02 09:08:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:02:00', timestamp '2016-04-02 09:15:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:10:00', timestamp '2016-04-02 09:12:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:05:00', timestamp '2016-04-02 08:07:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:03:00', timestamp '2016-04-02 08:11:00');
Output:
IP_ADDRESS T1 T2
----------- ------------------- -------------------
10.10.10.10 2016-04-02 08:00:00 2016-04-02 08:20:00
10.10.10.10 2016-04-02 09:00:00 2016-04-02 09:15:00
10.66.44.22 2016-04-02 08:03:00 2016-04-02 08:11:00
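The counting idea behind this query (add 1 at every start, subtract 1 at every stop, and treat the stretches where the counter is above zero as spans) can be sketched in plain Python. This is an illustration under assumptions, not a port of the SQL: the name `sweep_spans` is hypothetical, every session is assumed to have stop >= start, and at equal timestamps the sketch processes starts before stops so that touching sessions merge. The SQL above may treat such ties differently, for example when two sessions start a span at exactly the same time, so it is worth testing that case separately.

```python
def sweep_spans(sessions):
    """Merge (ip, start, stop) sessions per IP by sweeping +1/-1 events."""
    events = []
    for ip, start, stop in sessions:
        events.append((ip, start, 0, +1))  # kind 0: starts sort before stops on ties
        events.append((ip, stop, 1, -1))
    events.sort()

    spans = []
    open_count = 0
    span_start = None
    for ip, ts, _kind, delta in events:
        if open_count == 0:
            span_start = ts            # counter leaves zero: a span begins
        open_count += delta
        if open_count == 0:
            spans.append((ip, span_start, ts))  # counter is back at zero: span ends
    return spans


sessions = [
    ("10.10.10.10", "2016-04-02 08:00", "2016-04-02 08:12"),
    ("10.10.10.10", "2016-04-02 08:11", "2016-04-02 08:20"),
    ("10.10.10.10", "2016-04-02 09:00", "2016-04-02 09:10"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:08"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:11"),
    ("10.10.10.10", "2016-04-02 09:02", "2016-04-02 09:15"),
    ("10.10.10.10", "2016-04-02 09:10", "2016-04-02 09:12"),
    ("10.66.44.22", "2016-04-02 08:05", "2016-04-02 08:07"),
    ("10.66.44.22", "2016-04-02 08:03", "2016-04-02 08:11"),
]
for span in sweep_spans(sessions):
    print(span)
```

Because the events are sorted by IP first and each IP's events sum to zero, the counter is back at zero whenever the sweep moves on to the next IP, so no per-IP reset is needed.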
Answer 3:
I think using lag() and a cumulative sum is going to have much better performance:
select ip_address, min(start_time) as full_start_time,
       max(stop_time) as full_stop_time
from (select t.*,
             sum(case when prev_et >= start_time then 0 else 1 end) over
                 (partition by ip_address order by start_time) as grp
      from (select s.*,
                   lag(stop_time) over (partition by ip_address order by stop_time) as prev_et
            from ip_sessions s) t
     ) g
group by grp, ip_address
order by 1, 2;
On the sample data from the question, however, this query returns four rows instead of the expected three:
ip_address | full_start_time | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
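The result above differs from the desired output because lag() ordered by the stop time compares each session with only one neighbouring row, so a long earlier session can be missed. One possible repair, sketched here under assumptions, keeps the cumulative-sum grouping but compares each start_time with the running maximum of stop_time over all earlier sessions of the same IP. The sketch drives the SQL through Python's sqlite3 module so it can be checked end to end (window functions need SQLite 3.25 or later). The CTE names `prev` and `grouped` are made up here, timestamps are stored as text, and the query has not been run against Oracle.

```python
import sqlite3

SAMPLE = [
    ("10.10.10.10", "2016-04-02 08:00", "2016-04-02 08:12"),
    ("10.10.10.10", "2016-04-02 08:11", "2016-04-02 08:20"),
    ("10.10.10.10", "2016-04-02 09:00", "2016-04-02 09:10"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:08"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:11"),
    ("10.10.10.10", "2016-04-02 09:02", "2016-04-02 09:15"),
    ("10.10.10.10", "2016-04-02 09:10", "2016-04-02 09:12"),
    ("10.66.44.22", "2016-04-02 08:05", "2016-04-02 08:07"),
    ("10.66.44.22", "2016-04-02 08:03", "2016-04-02 08:11"),
]

QUERY = """
WITH prev AS (
    -- prev_max: latest stop_time among all earlier sessions of the same IP
    SELECT ip_address, start_time, stop_time,
           MAX(stop_time) OVER (
               PARTITION BY ip_address
               ORDER BY start_time, stop_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max
    FROM ip_sessions
),
grouped AS (
    -- a session starts a new group when it begins after prev_max
    -- (prev_max is NULL for the first session, which also yields 1)
    SELECT ip_address, start_time, stop_time,
           SUM(CASE WHEN start_time <= prev_max THEN 0 ELSE 1 END) OVER (
               PARTITION BY ip_address
               ORDER BY start_time, stop_time) AS grp
    FROM prev
)
SELECT ip_address,
       MIN(start_time) AS full_start_time,
       MAX(stop_time)  AS full_stop_time
FROM grouped
GROUP BY ip_address, grp
ORDER BY 1, 2
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ip_sessions (ip_address TEXT, start_time TEXT, stop_time TEXT)"
)
conn.executemany("INSERT INTO ip_sessions VALUES (?, ?, ?)", SAMPLE)
merged = conn.execute(QUERY).fetchall()
for row in merged:
    print(row)
```

The cumulative sum deliberately uses the default window frame, so rows with identical start and stop times receive the same group number regardless of how the database orders them.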
Source: https://stackoverflow.com/questions/36387048/get-envelope-i-e-overlapping-time-spans