I have a table with ID pairs that are in a transitive relation t, that is, if \"A t B\" AND \"B t C\" then \"A t C\". Sample:
id1 | id2
----+----
  1 |  2
  1 |  5
  4 |  7
  7 |  8
  9 |  1
You can do this in Postgres, although not in every database. Here is the query:
with recursive cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
     )
select id1, id2,
       dense_rank() over (order by grp) as label
from (select id1, id2,
             least(min(id2) over (partition by id1), min(id1) over (partition by id2)) as grp,
             level
      from cte
     ) t
where level = 1;
With the SQL Fiddle here.
You are walking through a tree structure in order to assign the label (cycles might pose problems with this particular version, by the way). In Postgres, you can do this using an explicit recursive CTE. In SQL Server, you can do this with a CTE that is implicitly "recursive" (the keyword is not used). In Oracle, you can do this with connect by.
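For reference, a minimal sketch of the same CTE in SQL Server syntax (an untested illustration, not part of the original answer; the notable difference is just that the RECURSIVE keyword is omitted):
with cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
     )
select id1, id2, level
from cte;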
The recursive CTE gets all pairs that are connected to each other. The main query then assigns the minimum of id1 and id2 to each pair, which identifies the group that the pair belongs to. The final label is produced just by assigning a sequential value to grp.
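To see the labeling step in isolation, here is a tiny standalone illustration (Postgres, with made-up grp values); dense_rank() simply maps each distinct grp to 1, 2, 3, and so on:
select grp, dense_rank() over (order by grp) as label
from (values (2), (2), (7)) as v(grp);
-- grp = 2 gets label 1, grp = 7 gets label 2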
EDIT:
Egor makes a very good point. The above assumes that the ids "descend" to the smaller values. The following version instead uses the highest level for each id for the grouping (which is really what is intended):
with recursive cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
      -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2)
     )
select id1, id2,
       dense_rank() over (order by topvalue) as label
from (select id1, id2,
             first_value(id2) over (partition by id1 order by level desc) as topvalue,
             level
      from cte
     ) t
where level = 1;
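The commented-out line hints at cycle prevention, but that exact form does not work in Postgres (the recursive reference to cte must not appear within a subquery). If cycles are possible, one common workaround, sketched here as an assumption rather than as part of the original answer, is to carry the visited ids in an array and stop when a node repeats:
with recursive cte(id1, id2, level, path) as (
      select id1, id2, 1 as level, array[id1, id2] as path
      from t
      union all
      select t.id1, cte.id2, cte.level + 1, t.id1 || cte.path
      from t join
           cte
           on t.id2 = cte.id1
      where t.id1 <> all(cte.path)
     )
select id1, id2, level
from cte;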
EDIT II:
In response to Egor's second comment. This data is a little problematic with respect to the original problem. The following breaks it into two pieces:
with recursive cte as (
      select id1, id2, id2 as last, id1||','||id2 as grp, 1 as level
      from t
      where id2 not in (select id1 from t)
      union all
      select t.id1, t.id2, cte.last, cte.grp, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
      -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2)
     )
select *
from cte;
But it is not clear whether that is what the original question wanted. It would break the original data into three groups that overlap, because there are three ids in the second column that never appear in the first column. The question here is about commutativity.
Now, with a new demand in 2013, I need to work with 10000 items: using @GordonLinoff's elegant solution (above), 1000 items need 1 second, but 2000 need 1 day... It does not have good performance. The performance problem was also mentioned here.
(This is the best solution, so fast!)
See the original and didactic description. Here the table T1 is the same as in the question text, and a second (temporary) table R is used to process and to show the results:
CREATE TABLE R (
id integer NOT NULL, -- PRIMARY KEY,
label integer NOT NULL DEFAULT 0
);
CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$
DECLARE
   label1   integer;
   label2   integer;
   newlabel integer;
   t        t1%rowtype;
BEGIN
   DELETE FROM R;
   INSERT INTO R(id)
      SELECT DISTINCT unnest(array[id1,id2])
      FROM T1 ORDER BY 1;
   newlabel := 0;
   FOR t IN SELECT * FROM t1
   LOOP -- -- BASIC LABELING: -- --
      SELECT label INTO label1 FROM R WHERE id = t.id1;
      SELECT label INTO label2 FROM R WHERE id = t.id2;
      IF label1 = 0 AND label2 = 0 THEN        -- neither id labeled yet: open a new group
         newlabel := newlabel + 1;
         UPDATE R SET label = newlabel WHERE id IN (t.id1, t.id2);
      ELSIF label1 = 0 AND label2 != 0 THEN    -- only id2 labeled: copy its label to id1
         UPDATE R SET label = label2 WHERE id = t.id1;
      ELSIF label1 != 0 AND label2 = 0 THEN    -- only id1 labeled: copy its label to id2
         UPDATE R SET label = label1 WHERE id = t.id2;
      ELSIF label1 != label2 THEN              -- both labeled differently: merge groups (time consuming)
         UPDATE R SET label = label1 WHERE label = label2;
      END IF;
   END LOOP;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
Preparing and running,
-- same CREATE TABLE T1 (id1 integer, id2 integer);
DELETE FROM T1;
INSERT INTO T1(id1,id2) -- populate the standard input
VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);
-- or SELECT id1, id2 FROM table_with_1000000_items;
SELECT t1r_labeler(); -- run
SELECT * FROM R ORDER BY 2; -- show
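A quick way to inspect the resulting groups (the concrete label numbers below assume the sample input above and the insertion order shown; with another scan order the numbers may swap, but the groups themselves stay the same):
SELECT label, array_agg(id ORDER BY id) AS ids
FROM R
GROUP BY label
ORDER BY label;
-- expected: label 1 -> {1,2,5,9}, label 2 -> {4,7,8}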
Dealing with the worst case
The last condition, when label1 != label2, is the most time-consuming operation. It must be avoided, or can be separated out, in cases of high connectivity, which are the worst ones. To report some kind of alert, you can count the proportion of times that the procedure runs the last condition, and/or you can separate that last update. If you separate it, you can analyse those cases and deal with them a little better.
So, eliminating the last ELSIF and adding, after the first loop, your checks and this second loop:
-- ... first loop and checks here ...
FOR t IN SELECT * FROM t1
LOOP -- -- MERGING LABELS: -- --
   SELECT label INTO label1 FROM R WHERE id = t.id1;
   SELECT label INTO label2 FROM R WHERE id = t.id2;
   IF label1 != 0 AND label2 != 0 AND label1 != label2 THEN
      UPDATE R SET label = label1 WHERE label = label2;
   END IF;
END LOOP;
-- ...
Example of a worst case: a group with more than 1000 (connected) nodes among 10000 nodes, with an average size of "10 per labeled group" (cores) and only a few paths connecting the cores.
This other solution is slower (a brute-force algorithm), but it can be useful when you need direct processing with arrays, do not need such a fast solution, and do not have "worst cases".
As @peter.petrov and @RBarryYoung suggested using a more adequate data structure... I went back to my arrays as the "more adequate data structure". After all, there is a good speed-up (compared with @GordonLinoff's algorithm) with the solution below (!).
The first step is to translate the table t1 of the question text into a temporary one, transgroup1, where we can compute the new process:
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items)
   SELECT array[id1, id2] FROM t1;  -- now suppose t1 is a 10000-item table
then, with these two functions, we can solve the problem:
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items in the concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;
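A quick sanity check of the helper (the element order in the result is not guaranteed):
SELECT array_uunion(array[1,2,3], array[3,4]);
-- returns the distinct union, e.g. {1,2,3,4} in some order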
CREATE FUNCTION transgroup1_loop() RETURNS void AS
$BODY$
DECLARE
   cp_dels integer[];
   i       integer;
   max_i   integer;
BEGIN
   i := 1;
   max_i := 10;  -- or 100 or more, but some bound is needed to be safe
   LOOP
      -- merge each row with every lower-id row whose items overlap,
      -- remembering the merged-away row ids in dels
      UPDATE transgroup1
      SET items = array_uunion(transgroup1.items, t2.items),
          dels  = transgroup1.dels || t2.id
      FROM transgroup1 AS t1, transgroup1 AS t2
      WHERE transgroup1.id = t1.id AND t1.id > t2.id AND t1.items && t2.items;

      cp_dels := array(
         SELECT DISTINCT unnest(dels) FROM transgroup1
      );  -- collects all ids to delete
      EXIT WHEN i > max_i OR coalesce(array_length(cp_dels,1),0) = 0;

      DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels));
      UPDATE transgroup1 SET dels = array[]::integer[];
      i := i + 1;
   END LOOP;
   UPDATE transgroup1  -- only to beautify
   SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
Of course, to run and see the results, you can use
SELECT transgroup1_loop(); -- not 1 day but some hours!
SELECT *, dense_rank() over (ORDER BY id) AS group from transgroup1;
resulting in
 id |   items   | dels | group
----+-----------+------+-------
  4 | {8,7,4}   | {}   | 1
  5 | {9,5,2,1} | {}   | 2
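If per-id labels (like table R in the first solution) are wanted, a hedged sketch is to unnest the arrays of the surviving rows:
SELECT unnest(g.items) AS id, g.label
FROM (SELECT items, dense_rank() over (ORDER BY id) AS label
      FROM transgroup1) AS g
ORDER BY 2, 1;
-- one row per original id, labeled 1, 2, ... per connected group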