Aggregate adjacent only records with T-SQL

孤街浪徒 2020-12-17 04:09

I have (simplified for the example) a table with the following data

Row Start       Finish       ID  Amount
--- ---------   ----------   --  ------
  1 2008-         


        
4 Answers
  •  孤街浪徒
    2020-12-17 05:02

    If you read the book "Developing Time-Oriented Database Applications in SQL" by R. T. Snodgrass (the PDF is available from his web site under publications) and get as far as Figure 6.25 on pp. 165-166, you will find the non-trivial SQL that can be used in the current example to group the rows that have the same ID value and continuous time intervals.

    The query development below is close to correct, but a problem that only shows up right at the end has its source in the first SELECT statement. I have not yet tracked down why it gives the incorrect answer. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, it would be a great help!]

    It looks something like:

    -- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
    -- Database Applications in SQL"
    CREATE TABLE Data
    (
        Start   DATE,
        Finish  DATE,
        ID      CHAR(2),
        Amount  INT
    );
    
    INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
    INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
    INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
    INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
    INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
    INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02',  3);
    INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02',  8);
    INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);
    
    SELECT DISTINCT F.ID, F.Start, L.Finish
        FROM Data AS F, Data AS L
        WHERE F.Start < L.Finish
          AND F.ID = L.ID
          -- There are no gaps between F.Finish and L.Start
          AND NOT EXISTS (SELECT *
                            FROM Data AS M
                            WHERE M.ID = F.ID
                            AND F.Finish < M.Start
                            AND M.Start < L.Start
                            AND NOT EXISTS (SELECT *
                                                FROM Data AS T1
                                                WHERE T1.ID = F.ID
                                                  AND T1.Start <  M.Start
                                                  AND M.Start  <= T1.Finish))
          -- Cannot be extended further
          AND NOT EXISTS (SELECT *
                              FROM Data AS T2
                              WHERE T2.ID = F.ID
                                AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                                  OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)));
    

    The output from that query is:

    01  2008-10-01      2008-10-02
    01  2008-10-03      2008-10-05
    02  2008-10-02      2008-10-03
    02  2008-10-06      2008-10-08
    03  2008-10-05      2008-10-06
    03  2008-10-05      2008-11-08
    03  2008-10-08      2008-11-08
    

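    Edited: There's a problem with the penultimate row (03, 2008-10-05 to 2008-11-08). It should not be there, because the two rows for ID 03 leave a gap between 2008-10-06 and 2008-10-08. I'm not clear (yet) where it is coming from.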
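    One possible explanation, offered as a sketch rather than a confirmed diagnosis: the gap test only looks at rows M with `M.Start < L.Start`, so L itself is never examined, and a gap that sits directly in front of L goes unnoticed. For ID 03 no row starts strictly between 2008-10-06 and 2008-10-08, so the pair passes. The script below runs the first query in SQLite through Python's `sqlite3` module (an assumption made for easy testing; the answer itself used Informix), with the comparison operator in the gap test as a parameter so the original `<` can be compared with a hypothetical `<=`. The dates are stored as ISO-8601 text, which compares correctly as strings.

```python
import sqlite3

ROWS = [
    ("2008-10-01", "2008-10-02", "01", 10),
    ("2008-10-02", "2008-10-03", "02", 20),
    ("2008-10-03", "2008-10-04", "01", 38),
    ("2008-10-04", "2008-10-05", "01", 23),
    ("2008-10-05", "2008-10-06", "03", 14),
    ("2008-10-06", "2008-10-07", "02", 3),
    ("2008-10-07", "2008-10-08", "02", 8),
    ("2008-10-08", "2008-11-08", "03", 19),
]

# {op} is the comparison between M.Start and L.Start in the gap test:
# "<" is the query as posted, "<=" is the change being tried out here.
ISLANDS_SQL = """
SELECT DISTINCT F.ID, F.Start, L.Finish
  FROM Data AS F, Data AS L
 WHERE F.Start < L.Finish
   AND F.ID = L.ID
   AND NOT EXISTS (SELECT * FROM Data AS M
                    WHERE M.ID = F.ID
                      AND F.Finish < M.Start
                      AND M.Start {op} L.Start
                      AND NOT EXISTS (SELECT * FROM Data AS T1
                                       WHERE T1.ID = F.ID
                                         AND T1.Start <  M.Start
                                         AND M.Start  <= T1.Finish))
   AND NOT EXISTS (SELECT * FROM Data AS T2
                    WHERE T2.ID = F.ID
                      AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                        OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))
 ORDER BY 1, 2, 3
"""


def islands(op):
    """Run the range query with the given gap-test operator on the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Data (Start TEXT, Finish TEXT, ID TEXT, Amount INTEGER)"
    )
    conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(ISLANDS_SQL.format(op=op)).fetchall()
    conn.close()
    return rows


original = islands("<")   # query as posted
changed = islands("<=")   # gap test also examines L itself
spurious = [row for row in original if row not in changed]
print(spurious)  # [('03', '2008-10-05', '2008-11-08')]
```

    On this sample data SQLite returns the same seven rows with `<`, which suggests the query logic rather than the optimizer is responsible, and six rows with `<=`. This has only been checked against the eight rows above, so treat the `<=` variant as a candidate fix to verify on your own data.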

    Now we need to use that complex query as a derived table in the FROM clause of another SELECT statement, which sums the Amount values for a given ID over the entries that fall within each of the maximal ranges shown above.

    SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
        FROM Data AS D,
             (SELECT DISTINCT F.ID, F.Start, L.Finish
                  FROM Data AS F, Data AS L
                  WHERE F.Start < L.Finish
                    AND F.ID = L.ID
                    -- There are no gaps between F.Finish and L.Start
                    AND NOT EXISTS (SELECT *
                                        FROM Data AS M
                                        WHERE M.ID = F.ID
                                        AND F.Finish < M.Start
                                        AND M.Start < L.Start
                                        AND NOT EXISTS (SELECT *
                                                            FROM Data AS T1
                                                            WHERE T1.ID = F.ID
                                                              AND T1.Start <  M.Start
                                                              AND M.Start  <= T1.Finish))
                      -- Cannot be extended further
                    AND NOT EXISTS (SELECT *
                                        FROM Data AS T2
                                        WHERE T2.ID = F.ID
                                          AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                                            OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))) AS M
        WHERE D.ID = M.ID
          AND M.Start  <= D.Start
          AND M.Finish >= D.Finish
        GROUP BY M.ID, M.Start, M.Finish
        ORDER BY M.ID, M.Start;
    

    This gives:

    ID  Start        Finish       Amount
    01  2008-10-01   2008-10-02   10
    01  2008-10-03   2008-10-05   61
    02  2008-10-02   2008-10-03   20
    02  2008-10-06   2008-10-08   11
    03  2008-10-05   2008-10-06   14
    03  2008-10-05   2008-11-08   33              -- Here be trouble!
    03  2008-10-08   2008-11-08   19
    

    Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:

    SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
        FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
                FROM Data AS D,
                     (SELECT DISTINCT F.ID, F.Start, L.Finish
                          FROM  Data AS F, Data AS L
                          WHERE F.Start < L.Finish
                            AND F.ID = L.ID
                            -- There are no gaps between F.Finish and L.Start
                            AND NOT EXISTS
                                (SELECT *
                                    FROM  Data AS M
                                    WHERE M.ID = F.ID
                                      AND F.Finish < M.Start
                                      AND M.Start < L.Start
                                      AND NOT EXISTS
                                          (SELECT *
                                              FROM Data AS T1
                                              WHERE T1.ID = F.ID
                                                AND T1.Start <  M.Start
                                                AND M.Start  <= T1.Finish))
                              -- Cannot be extended further
                            AND NOT EXISTS
                                (SELECT *
                                    FROM  Data AS T2
                                    WHERE T2.ID = F.ID
                                      AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish) OR
                                           (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))
                     ) AS M
                WHERE D.ID = M.ID
                  AND M.Start  <= D.Start
                  AND M.Finish >= D.Finish
                GROUP BY M.ID, M.Start, M.Finish
              ) AS I
            GROUP BY I.ID
            ORDER BY I.ID;
    
    id     number  amount
    01      2      71
    02      2      31
    03      3      66
    

    Review: Oh! Drat... the entry for 03 has twice the 'amount' that it should have (66 instead of 33) and counts three ranges instead of two. The previous 'Edited' parts indicate where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I'm working with is misbehaving. Nevertheless, there should be an answer closely related to this that will give the correct values.
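    As a sketch of what such a closely related answer might look like, the script below keeps the same three-level structure but writes the gap test as `M.Start <= L.Start`, so that L itself is checked for an uncovered start. That change is an assumption, not something taken from the book, and it has only been checked against the sample rows. The script uses SQLite through Python's `sqlite3` module and common table expressions (the names `Islands` and `PerIsland` are made up here) instead of nested derived tables, purely for readability.

```python
import sqlite3

ROWS = [
    ("2008-10-01", "2008-10-02", "01", 10),
    ("2008-10-02", "2008-10-03", "02", 20),
    ("2008-10-03", "2008-10-04", "01", 38),
    ("2008-10-04", "2008-10-05", "01", 23),
    ("2008-10-05", "2008-10-06", "03", 14),
    ("2008-10-06", "2008-10-07", "02", 3),
    ("2008-10-07", "2008-10-08", "02", 8),
    ("2008-10-08", "2008-11-08", "03", 19),
]

FINAL_SQL = """
WITH Islands AS (
    -- Maximal gap-free ranges per ID; the gap test uses <= (assumed fix).
    SELECT DISTINCT F.ID, F.Start, L.Finish
      FROM Data AS F, Data AS L
     WHERE F.Start < L.Finish
       AND F.ID = L.ID
       AND NOT EXISTS (SELECT * FROM Data AS M
                        WHERE M.ID = F.ID
                          AND F.Finish < M.Start
                          AND M.Start <= L.Start
                          AND NOT EXISTS (SELECT * FROM Data AS T1
                                           WHERE T1.ID = F.ID
                                             AND T1.Start <  M.Start
                                             AND M.Start  <= T1.Finish))
       AND NOT EXISTS (SELECT * FROM Data AS T2
                        WHERE T2.ID = F.ID
                          AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                            OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))
),
PerIsland AS (
    -- Sum the amounts of the rows that fall within each range.
    SELECT I.ID, I.Start, I.Finish, SUM(D.Amount) AS Amount
      FROM Data AS D, Islands AS I
     WHERE D.ID = I.ID
       AND I.Start  <= D.Start
       AND I.Finish >= D.Finish
     GROUP BY I.ID, I.Start, I.Finish
)
SELECT ID, COUNT(*) AS Number, SUM(Amount) AS Amount
  FROM PerIsland
 GROUP BY ID
 ORDER BY ID
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Start TEXT, Finish TEXT, ID TEXT, Amount INTEGER)")
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)", ROWS)
totals = conn.execute(FINAL_SQL).fetchall()
conn.close()

for row in totals:
    print(row)
# ('01', 2, 71)
# ('02', 2, 31)
# ('03', 2, 33)
```

    With the ranges no longer overlapping, every row of Data falls in exactly one range, so the amount for each ID equals the plain sum of its rows (33 for ID 03).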

    For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, it should work on any other reasonably standard-conformant SQL DBMS.
