SQL combine multiple identifiers to create a group id for duplicate records

放肆的年华 提交于 2019-12-03 17:33:47

This is truly an interesting problem. Still, I think we are missing a definition of a "group". Since in your example (p,null,null) (row1) and (null,r,null) (row4) share no common identifier and belong to the same group I'll go with this definition for grouping:

A row belongs to a group if it shares at least one identifier with at least one row of this group.

This means we can "chain" rows. This naturally leads to a hierarchical solution:

SQL> SELECT ID, a, b, c, MIN(grp) grp
  2    FROM (SELECT connect_by_root(id) ID,
  3                 connect_by_root(a) a,
  4                 connect_by_root(b) b,
  5                 connect_by_root(c) c,
  6                 ID grp
  7             FROM a
  8           CONNECT BY NOCYCLE(PRIOR a = a
  9                           OR PRIOR b = b
 10                           OR PRIOR c = c))
 11   GROUP BY ID, a, b, c
 12   ORDER BY ID;

        ID A          B          C                 GRP
---------- ---------- ---------- ---------- ----------
         1 p                                         1
         2 p          r                              1
         3 q                                         3
         4            r                              1
         5                       s                   3
         6 q                     s                   3

6 rows selected

You can execute the subquery to understand the construction:

SQL> SELECT connect_by_root(id) ID,
  2         connect_by_root(a) a,
  3         connect_by_root(b) b,
  4         connect_by_root(c) c,
  5         substr(sys_connect_by_path(ID, '->'), 3) path,
  6         ID grp
  7    FROM a
  8  CONNECT BY NOCYCLE(a = PRIOR a
  9                  OR b = PRIOR b
 10                  OR c = PRIOR c);

        ID A          B          C          PATH            GRP
---------- ---------- ---------- ---------- -------- ----------
         1 p                                1                 1
         1 p                                1->2              2
         1 p                                1->2->4           4
         2 p          r                     2                 2
         2 p          r                     2->1              1
         2 p          r                     2->4              4
         3 q                                3                 3
         3 q                                3->6              6
         3 q                                3->6->5           5
         4            r                     4                 4
         4            r                     4->2              2
         4            r                     4->2->1           1
         5                       s          5                 5
         5                       s          5->6              6
         5                       s          5->6->3           3
         6 q                     s          6                 6
         6 q                     s          6->3              3
         6 q                     s          6->5              5

18 rows selected

Use merge instead of loop:

Table a(a,b,c,groupId)

Statement:

   merge into a
   USING (SELECT RANK() OVER(ORDER BY a,b,c) g, ROWID rid FROM a) SOURCE
   ON (a.ROWID = SOURCE.rid)
   WHEN MATCHED THEN UPDATE SET a.GroupId = SOURCE.g

It is same as:

    BEGIN
        FOR x IN ( SELECT RANK() OVER(ORDER BY a,b,c) g, ROWID rid FROM a)
        LOOP
             UPDATE a
                SET GroupId  = x.g
             WHERE a.RowId = x.rid;
        END LOOP;
    END;
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!