SQL grouping interescting/overlapping rows

时间秒杀一切 提交于 2019-12-23 10:21:45

问题


I have the following table in Postgres that has overlapping data in the two columns a_sno and b_sno.

create table data
( a_sno integer not null,  
  b_sno integer not null,
  PRIMARY KEY (a_sno,b_sno)
);

insert into data (a_sno,b_sno) values
  ( 4, 5 )
, ( 5, 4 )
, ( 5, 6 )
, ( 6, 5 )
, ( 6, 7 )
, ( 7, 6 )
, ( 9, 10)
, ( 9, 13)
, (10, 9 )
, (13, 9 )
, (10, 13)
, (13, 10)
, (10, 14)
, (14, 10)
, (13, 14)
, (14, 13)
, (11, 15)
, (15, 11);

As you can see from the first 6 rows data values 4,5,6 and 7 in the two columns intersects/overlaps that need to partitioned to a group. Same goes for rows 7-16 and rows 17-18 which will be labeled as group 2 and 3 respectively.

The resulting output should look like this:

group | value
------+------
1     | 4
1     | 5
1     | 6
1     | 7
2     | 9
2     | 10
2     | 13
2     | 14
3     | 11
3     | 15

回答1:


Assuming that all pairs exists in their mirrored combination as well (4,5) and (5,4). But the following solutions work without mirrored dupes just as well.

Simple case

All connections can be lined up in a single ascending sequence and complications like I added in the fiddle are not possible, we can use this solution without duplicates in the rCTE:

I start by getting minimum a_sno per group, with the minimum associated b_sno:

SELECT row_number() OVER (ORDER BY a_sno) AS grp
     , a_sno, min(b_sno) AS b_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
   SELECT 1 FROM data
   WHERE  b_sno = d.a_sno
   AND    a_sno < b_sno
   )
GROUP  BY a_sno;

This only needs a single query level since a window function can be built on an aggregate:

  • Get the distinct sum of a joined table column

Result:

grp  a_sno  b_sno
1    4      5
2    9      10
3    11     15

I avoid branches and duplicated (multiplicated) rows - potentially much more expensive with long chains. I use ORDER BY b_sno LIMIT 1 in a correlated subquery to make this fly in a recursive CTE.

  • Create a unique index on a non-unique column

Key to performance is a matching index, which is already present provided by the PK constraint PRIMARY KEY (a_sno,b_sno): not the other way round (b_sno, a_sno):

  • Is a composite index also good for queries on the first field?

WITH RECURSIVE t AS (
   SELECT row_number() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, min(b_sno) AS b_sno  -- the smallest one
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   GROUP  BY a_sno
   )

, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp
       , (SELECT b_sno  -- correlated subquery
          FROM   data
          WHERE  a_sno = c.sno
          AND    a_sno < b_sno
          ORDER  BY b_sno
          LIMIT  1)
   FROM   cte  c
   WHERE  c.sno IS NOT NULL
   )
SELECT * FROM cte
WHERE  sno IS NOT NULL   -- eliminate row with NULL
UNION  ALL               -- no duplicates
SELECT grp, a_sno FROM t
ORDER  BY grp, sno;

Less simple case

All nodes can be reached in ascending order with one or more branches from the root (smallest sno).

This time, get all greater sno and de-duplicate nodes that may be visited multiple times with UNION at the end:

WITH RECURSIVE t AS (
   SELECT rank() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, b_sno  -- get all rows for smallest a_sno
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   )   
, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp, d.b_sno
   FROM   cte  c
   JOIN   data d ON d.a_sno = c.sno
                AND d.a_sno < d.b_sno  -- join to all connected rows
   )
SELECT grp, sno FROM cte
UNION                     -- eliminate duplicates
SELECT grp, a_sno FROM t  -- add first rows
ORDER  BY grp, sno;

Unlike the first solution, we don't get a last row with NULL here (caused by the correlated subquery).

Both should perform very well - especially with long chains / many branches. Result as desired:

SQL Fiddle (with added rows to demonstrate difficulty).

Undirected graph

If there are local minima that cannot be reached from the root with ascending traversal, the above solutions won't work. Consider Farhęg's solution in this case.




回答2:


I want to say another way, it may be useful, you can do it in 2 steps:

1. take the max(sno) per each group:

  select q.sno,
         row_number() over(order by q.sno) gn
  from(
    select distinct d.a_sno sno
    from data d
    where not exists (
      select b_sno
      from data
      where b_sno=d.a_sno
      and a_sno>d.a_sno
     )
    )q

result:

sno gn
7   1
14  2
15  3

2. use a recursive cte to find all related members in groups:

with recursive cte(sno,gn,path,cycle)as(
  select q.sno,
         row_number() over(order by q.sno) gn,
         array[q.sno],false
  from(
    select distinct d.a_sno sno
    from data d
    where not exists (
      select b_sno
      from data
      where b_sno=d.a_sno
      and a_sno>d.a_sno
     )
    )q
  union all
  select d.a_sno,c.gn,
         d.a_sno || c.path,
         d.a_sno=any(c.path)
  from data d
  join cte c on d.b_sno=c.sno 
  where not cycle
 )
 select distinct gn,sno from cte
 order by gn,sno

Result:

gn  sno
1   4
1   5
1   6
1   7
2   9
2   10
2   13
2   14
3   11
3   15

here is the demo of what I did.




回答3:


Here is a start that may give some ideas on an approach. The recursive query starts with a_sno of each record and then tries to follow the path of b_sno until it reaches the end or forms a cycle. The path is represented by an array of sno integers.

The unnest function will break the array into rows, so a sno value mapped to the path array such as:

4, {6, 5, 4}

will be transformed to a row for each value in the array:

4, 6
4, 5
4, 4

The array_agg then reverses the operation by aggregating the values back into a path, but getting rid of the duplicates and ordering.

Now each a_sno is associated with a path and the path forms the grouping. dense_rank can be used to map the grouping (cluster) to a numeric.

SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
      ,sno
  FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
             SELECT a_sno, ARRAY[a_sno], false FROM  data
             UNION ALL
             SELECT b_sno, path || b_sno, b_sno = ANY(path)
               FROM data, x
               WHERE a_sno = x.sno
                 AND NOT cycle
         )
         SELECT sno, unnest(path) AS map FROM x ORDER BY 1
       ) y
  GROUP BY sno
  ORDER BY 1, 2

Output:

   cluster    | sno 
--------------+-----
 {4,5,6,7}    |   4
 {4,5,6,7}    |   5
 {4,5,6,7}    |   6
 {4,5,6,7}    |   7
 {9,10,13,14} |   9
 {9,10,13,14} |  10
 {9,10,13,14} |  13
 {9,10,13,14} |  14
 {11,15}      |  11
 {11,15}      |  15
(10 rows)

Wrap it one more time for the ranking:

SELECT dense_rank() OVER(order by cluster) AS rank
      ,sno
  FROM (
    SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
          ,sno
      FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
                 SELECT a_sno, ARRAY[a_sno], false FROM  data
                 UNION ALL
                 SELECT b_sno, path || b_sno, b_sno = ANY(path)
                   FROM data, x
                   WHERE a_sno = x.sno
                     AND NOT cycle
             )
             SELECT sno, unnest(path) AS map FROM x ORDER BY 1
           ) y
      GROUP BY sno
      ORDER BY 1, 2
) z

Output:

 rank | sno 
------+-----
    1 |   4
    1 |   5
    1 |   6
    1 |   7
    2 |   9
    2 |  10
    2 |  13
    2 |  14
    3 |  11
    3 |  15
(10 rows)


来源:https://stackoverflow.com/questions/29734941/sql-grouping-interescting-overlapping-rows

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!