Count of overlapping intervals in BigQuery

醉酒当歌 提交于 2021-02-08 10:33:35

问题


Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?

For example, given the following table:

start_time end_time
         1       10
         2        5
         3        4
         5        6
         7       11
        19       20

I want the following output:

start_time count
         1     1
         2     2
         3     3
         5     3
         7     2
        19     1

On small datasets, I can solve this problem by joining the dataset against itself:

WITH intervals AS (
  SELECT 1 AS start, 10 AS end UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT 
  a.start_time,
  count(*)
FROM 
  intervals a CROSS JOIN intervals b
WHERE
  a.start_time >= b.start_time AND
  a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time

With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?


回答1:


... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?

Try below for BigQuery Standard SQL. No JOINs involved

#standardSQL
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time  

You can test/play with it using below example with dummy data from your question

#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20 
)
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time


来源:https://stackoverflow.com/questions/43327248/count-of-overlapping-intervals-in-bigquery

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!