Merge DATE rows if episodes are in direct succession or overlapping

我寻月下人不归 2021-01-14 07:47

I have a table like this:

ID    BEGIN    END

If there are overlapping episodes for the same ID (like 2000-01-01 - 2001-1

4 Answers
  •  無奈伤痛
    2021-01-14 08:26

    Edit: It is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The window functions alone make the upgrade a worthwhile investment.

    My original answer, as you note, has a major flaw: it returns only one row per id.
    Below is a better solution without that limitation.
    I have tested it against test tables on my system (PostgreSQL 8.4).

    If / when you get a moment I would like to know how it performs on your data.
    I also wrote up an explanation here: https://www.mechanical-meat.com/1/detail

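    WITH RECURSIVE t1_rec ( id, "begin", "end", n ) AS (
        -- Seed: rows whose next episode (same id, ordered by "begin")
        -- starts no later than 2 days after their "end"; n numbers the rows per id.
        SELECT id, "begin", "end", n
          FROM (
            SELECT
                id, "begin", "end",
                CASE
                    WHEN LEAD("begin") OVER (
                    PARTITION BY    id
                    ORDER BY        "begin") <= ("end" + interval '2' day)
                    THEN 1 ELSE 0 END AS cl,
                ROW_NUMBER() OVER (
                    PARTITION BY    id
                    ORDER BY        "begin") AS n
            FROM mytable
        ) s
        WHERE s.cl = 1
      UNION ALL
        -- Recursive step: pull in later episodes of the same id that overlap
        -- the padded period, carrying along the seed row's n.
        SELECT p1.id, p1."begin", p1."end", a.n
          FROM t1_rec a
               JOIN mytable p1 ON p1.id = a.id
           AND p1."begin" > a."begin"
           AND (a."begin",  a."end" + interval '2' day) OVERLAPS
               (p1."begin", p1."end")
    )
    -- Collapse each chain (id, n) to one row. The anti-join discards rows for which
    -- a row with the same id and "end" exists under a smaller n.
    SELECT t1.id, min(t1."begin"), max(t1."end")
      FROM t1_rec t1
           LEFT JOIN t1_rec t2 ON t1.id = t2.id
           AND t2."end" = t1."end"
           AND t2.n < t1.n
     WHERE t2.n IS NULL
     GROUP BY t1.id, t1.n
     ORDER BY t1.id, t1.n;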
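
    To check the query's output on real data, it can help to have an independent reference outside the database. The following is a minimal Python sketch of the merge rule, not part of the original answer. The function name merge_episodes, the max_gap_days parameter, and the sample rows are hypothetical, and it assumes "direct succession" means the next episode begins the day after the previous one ends.

```python
from datetime import date, timedelta


def merge_episodes(rows, max_gap_days=1):
    """Merge (id, begin, end) rows that overlap or follow each other directly.

    max_gap_days=1 treats an episode starting the day after the previous
    one ends as direct succession (an assumption about the intended rule).
    """
    merged = []
    # Sorting by (id, begin, end) lets a single pass compare each row
    # with the most recently merged episode of the same id.
    for id_, begin, end in sorted(rows):
        if merged:
            last_id, last_begin, last_end = merged[-1]
            if last_id == id_ and begin <= last_end + timedelta(days=max_gap_days):
                # Extend the current episode; max() covers nested episodes.
                merged[-1] = (last_id, last_begin, max(last_end, end))
                continue
        merged.append((id_, begin, end))
    return merged


# Hypothetical sample rows, not taken from the question.
sample = [
    (1, date(2000, 1, 1), date(2001, 12, 31)),
    (1, date(2000, 6, 1), date(2002, 6, 30)),   # overlaps the row above
    (2, date(2000, 1, 1), date(2000, 6, 30)),
    (2, date(2000, 7, 1), date(2000, 12, 31)),  # starts the following day
    (2, date(2001, 3, 1), date(2001, 4, 1)),    # separate episode
]
for row in merge_episodes(sample):
    print(row)
```

    Unlike the recursive query, this sketch also returns episodes that have no neighbour at all, so compare the two with that in mind.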
    

    The original (deprecated) answer follows.
    Note: it is limited to one row per id.


    Denis is probably right about using lead() and lag(), but there is yet another way!
    You can also solve this problem using so-called recursive SQL (a recursive common table expression).
    The OVERLAPS operator also comes in handy.

    I have fully tested this solution on my system (PostgreSQL 8.4).
    It works well.

    /* "begin" and "end" are double-quoted because END is a reserved word
       in PostgreSQL */
    WITH RECURSIVE rec_stmt ( id, "begin", "end" ) AS (
        /* seed statement:
               start with only the first begin and end dates for each id
        */
          SELECT id, MIN("begin"), MIN("end")
            FROM mytable seed_stmt
        GROUP BY id

        UNION ALL

        /* iterative (not really recursive) statement:
               append qualifying rows to the result set
        */
          SELECT t1.id, t1."begin", t1."end"
            FROM rec_stmt r
                 JOIN mytable t1 ON t1.id = r.id
             AND t1."begin" > r."end"
             AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS
                 (t1."begin" - INTERVAL '1' DAY, t1."end")
    )
      SELECT MIN("begin"), MAX("end")
        FROM rec_stmt
    GROUP BY id;
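
    The one-day padding in the OVERLAPS clause above matters because PostgreSQL treats each period as the half-open interval start <= time < end, so two episodes in direct succession do not overlap on their own. Here is a small Python sketch of that behaviour. The function periods_overlap and the sample dates are hypothetical, and the model only covers periods where start is strictly before end.

```python
from datetime import date, timedelta


def periods_overlap(s1, e1, s2, e2):
    """Model of (s1, e1) OVERLAPS (s2, e2) for periods with start < end."""
    # Each period is half-open: start <= t < end, so sharing only an
    # endpoint does not count as an overlap.
    return s1 < e2 and s2 < e1


one_day = timedelta(days=1)
# Hypothetical episodes in direct succession (30 June, then 1 July).
r = (date(2000, 1, 1), date(2000, 6, 30))
t1 = (date(2000, 7, 1), date(2000, 12, 31))

# Unpadded periods do not overlap.
print(periods_overlap(r[0], r[1], t1[0], t1[1]))  # False
# Padded as in the query: r."end" + 1 day and t1."begin" - 1 day.
print(periods_overlap(r[0], r[1] + one_day, t1[0] - one_day, t1[1]))  # True
```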
    
