SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

只愿长相守 提交于 2019-12-10 13:53:00

问题


In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities. I have a database view that unions data from all of these tables giving us the ActivityView that looks like this. As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.

+----------------------+---------------+---------------------+-------------------+
| activity_client_id   | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
|                  112 | Interview     | 2015-06-01 09:00    | 2015-06-01 11:00  |
|                  112 | CV updating   | 2015-06-01 09:30    | 2015-06-01 11:30  |
|                  112 | Course        | 2015-06-02 09:00    | 2015-06-02 16:00  |
|                  112 | Interview     | 2015-06-03 09:00    | 2015-06-03 10:00  |
+----------------------+---------------+---------------------+-------------------+

Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:

+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
|       112 | 2015-05-20          |
+-----------+---------------------+

I need to create a report that will show the following columns:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+

We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible. So this report will tell us how long from sign up does it take each client to achieve this figure.

What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:

In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30. For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.

So the report for our sample client would give the following result:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
|       112 | 2015-05-20          | 2015-06-02                                 |
+-----------+---------------------+--------------------------------------------+

So my question is how can I create the report using a select statement ? I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table. But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.

I am using SQL Server 2005.


回答1:


This is one way to do it:

;WITH CTErn AS (
   SELECT activity_client_id, activity_type,
          activity_start_date, activity_end_date,
          ROW_NUMBER() OVER (PARTITION BY activity_client_id 
                             ORDER BY activity_start_date) AS rn
   FROM activities
),   
CTEdiff AS (
   SELECT c1.activity_client_id, c1.activity_type,
          x.activity_start_date, c1.activity_end_date,
          DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
          ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id 
                             ORDER BY x.activity_start_date) AS seq
   FROM CTErn AS c1
   LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
   CROSS APPLY (SELECT CASE 
                          WHEN c1.activity_start_date < c2.activity_end_date
                             THEN c2.activity_end_date
                          ELSE c1.activity_start_date
                       END) x(activity_start_date)    
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date, 
             hoursOfActivicty               
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id                     
CROSS APPLY (SELECT SUM(diff) / 60.0
             FROM CTEdiff AS c3
             WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq

Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.

Demo here

The first CTE, i.e. CTErn, produces the following output:

client_id   activity_type   start_date          end_date          rn
112         Interview       2015-06-01 09:00    2015-06-01 11:00  1
112         CV updating     2015-06-01 09:30    2015-06-01 11:30  2
112         Course          2015-06-02 09:00    2015-06-02 16:00  3
112         Interview       2015-06-03 09:00    2015-06-03 10:00  4

The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:

client_id activity_type start_date       end_date         diff  seq
112       Interview     2015-06-01 09:00 2015-06-01 11:00 120   1
112       CV updating   2015-06-01 11:00 2015-06-01 11:30 30    2
112       Course        2015-06-02 09:00 2015-06-02 16:00 420   3
112       Interview     2015-06-03 09:00 2015-06-03 10:00 60    4

The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.

The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.




回答2:


See SQL Fiddle here.

with tbl as (
  -- this will generate daily merged ovelaping time
  select distinct
    a.id
    ,(
        select min(x.starttime) 
        from act x 
        where x.id=a.id and ( x.starttime between a.starttime and a.endtime
          or a.starttime between x.starttime and x.endtime )
    ) start1
    ,(
        select max(x.endtime) 
        from act x 
        where x.id=a.id and ( x.endtime between a.starttime and a.endtime
          or a.endtime between x.starttime and x.endtime )
    ) end1
  from act a

), tbl2 as 
(
  -- this will add minute and total minute column
  select 
    * 
    ,datediff(mi,t.start1,t.end1) mi
    ,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
  from tbl t
), tbl3 as 
(
  -- now final query showing starttime and endtime for 5 hours other wise null in case not completed 5(300 minutes) hours
  select 
    t.id
    ,min(t.start1) starttime
    ,min(case when t.totalmi>300 then t.end1 else null end) endtime
  from tbl2 t
  group by t.id
)
-- final result 
select *
from tbl3
where endtime is not null


来源:https://stackoverflow.com/questions/30641259/sql-server-cumulative-sum-on-overlapping-data-getting-date-that-sum-reaches

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!