I have a table with ID pairs that are in a transitive relation t, that is, if \"A t B\" AND \"B t C\" then \"A t C\". Sample:
id1 | id2
----+----
  1 |  2
  1 |  5
  4 |  7
  7 |  8
  9 |  1
You can do this in Postgres, although not in every database. Here is the query:
with recursive cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
     )
select id1, id2,
       dense_rank() over (order by grp) as label
from (select id1, id2,
             least(min(id2) over (partition by id1), min(id1) over (partition by id2)) as grp,
             level
      from cte
     ) t
where level = 1;
With the SQL Fiddle here.
You are walking through a tree structure in order to assign the label (cycles might pose problems with this particular version, by the way). In Postgres, you can do this using an explicit recursive CTE. In SQL Server, you can do this with a CTE that is implicitly "recursive" (the keyword is not used). In Oracle, you can do this with connect by.
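For reference, a minimal sketch of the same CTE in SQL Server syntax (an untested illustration, not part of the original answer; the notable difference is just that the RECURSIVE keyword is omitted):
with cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
     )
select id1, id2, level
from cte;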
The recursive CTE gets all pairs that are connected to each other. The main query then assigns the minimum of id1 and id2 to each pair, which identifies the group that the pair belongs to. The final label is produced just by assigning a sequential value to grp.
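To see the labeling step in isolation, here is a tiny standalone illustration (Postgres, with made-up grp values); dense_rank() simply maps each distinct grp to 1, 2, 3, and so on:
select grp, dense_rank() over (order by grp) as label
from (values (2), (2), (7)) as v(grp);
-- grp = 2 gets label 1, grp = 7 gets label 2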
EDIT:
Egor makes a very good point. The above assumes that the ids "descend" to the smaller values. The following version instead uses the highest level for each id for the grouping (which is really what is intended):
with recursive cte(id1, id2, level) as (
      select id1, id2, 1 as level
      from t
      union all
      select t.id1, cte.id2, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
      -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2)
     )
select id1, id2,
       dense_rank() over (order by topvalue) as label
from (select id1, id2,
             first_value(id2) over (partition by id1 order by level desc) as topvalue,
             level
      from cte
     ) t
where level = 1;
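The commented-out line hints at cycle prevention, but that exact form does not work in Postgres (the recursive reference to cte must not appear within a subquery). If cycles are possible, one common workaround, sketched here as an assumption rather than as part of the original answer, is to carry the visited ids in an array and stop when a node repeats:
with recursive cte(id1, id2, level, path) as (
      select id1, id2, 1 as level, array[id1, id2] as path
      from t
      union all
      select t.id1, cte.id2, cte.level + 1, t.id1 || cte.path
      from t join
           cte
           on t.id2 = cte.id1
      where t.id1 <> all(cte.path)
     )
select id1, id2, level
from cte;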
EDIT II:
In response to Egor's second comment. This data is a little problematic with respect to the original problem. The following breaks it into two pieces:
with recursive cte as (
      select id1, id2, id2 as last, id1||','||id2 as grp, 1 as level
      from t
      where id2 not in (select id1 from t)
      union all
      select t.id1, t.id2, cte.last, cte.grp, cte.level + 1
      from t join
           cte
           on t.id2 = cte.id1
      -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2)
     )
select *
from cte;
But it is not clear whether that is what the original question wanted. It would break the original data into three groups that overlap, because there are three ids in the second column that never appear in the first column. The question here is about commutativity.
Now, with a new demand in 2013, I need to work with 10000 items: using @GordonLinoff's elegant solution (above), 1000 items need 1 second, but 2000 need 1 day... It does not have good performance. The performance problem was also mentioned here.
(This is the best solution, so fast!)
See the original and didactic description. Here the table T1 is the same as in the question text, and a second (temporary) table R is used to process and to show the results:
CREATE TABLE R (
id integer NOT NULL, -- PRIMARY KEY,
label integer NOT NULL DEFAULT 0
);
CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$
DECLARE
   label1   integer;
   label2   integer;
   newlabel integer;
   t        t1%rowtype;
BEGIN
   DELETE FROM R;
   INSERT INTO R(id)
      SELECT DISTINCT unnest(array[id1,id2])
      FROM T1 ORDER BY 1;
   newlabel := 0;
   FOR t IN SELECT * FROM t1
   LOOP -- -- BASIC LABELING: -- --
      SELECT label INTO label1 FROM R WHERE id = t.id1;
      SELECT label INTO label2 FROM R WHERE id = t.id2;
      IF label1 = 0 AND label2 = 0 THEN        -- neither id labeled yet: open a new group
         newlabel := newlabel + 1;
         UPDATE R SET label = newlabel WHERE id IN (t.id1, t.id2);
      ELSIF label1 = 0 AND label2 != 0 THEN    -- only id2 labeled: copy its label to id1
         UPDATE R SET label = label2 WHERE id = t.id1;
      ELSIF label1 != 0 AND label2 = 0 THEN    -- only id1 labeled: copy its label to id2
         UPDATE R SET label = label1 WHERE id = t.id2;
      ELSIF label1 != label2 THEN              -- both labeled differently: merge groups (time consuming)
         UPDATE R SET label = label1 WHERE label = label2;
      END IF;
   END LOOP;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
Preparing and running,
-- same CREATE TABLE T1 (id1 integer, id2 integer);
DELETE FROM T1;
INSERT INTO T1(id1,id2) -- populate the standard input
VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);
-- or SELECT id1, id2 FROM table_with_1000000_items;
SELECT t1r_labeler(); -- run
SELECT * FROM R ORDER BY 2; -- show
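A quick way to inspect the resulting groups (the concrete label numbers below assume the sample input above and the insertion order shown; with another scan order the numbers may swap, but the groups themselves stay the same):
SELECT label, array_agg(id ORDER BY id) AS ids
FROM R
GROUP BY label
ORDER BY label;
-- expected: label 1 -> {1,2,5,9}, label 2 -> {4,7,8}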
Dealing with the worst case
The last condition, when label1 != label2, is the most time-consuming operation. It must be avoided, or can be separated out, in cases of high connectivity, which are the worst ones. To report some kind of alert, you can count the proportion of times that the procedure runs the last condition, and/or you can separate that last update. If you separate it, you can analyse those cases and deal with them a little better.
So, eliminating the last ELSIF and adding, after the first loop, your checks and this second loop:
-- ... first loop and checks here ...
FOR t IN SELECT * FROM t1
LOOP -- -- MERGING LABELS: -- --
   SELECT label INTO label1 FROM R WHERE id = t.id1;
   SELECT label INTO label2 FROM R WHERE id = t.id2;
   IF label1 != 0 AND label2 != 0 AND label1 != label2 THEN
      UPDATE R SET label = label1 WHERE label = label2;
   END IF;
END LOOP;
-- ...
Example of a worst case: a group with more than 1000 (connected) nodes among 10000 nodes, with an average size of "10 per labeled group" (cores) and only a few paths connecting the cores.
This other solution is slower (a brute-force algorithm), but it can be useful when you need direct processing with arrays, do not need such a fast solution, and do not have "worst cases".
As @peter.petrov and @RBarryYoung suggested using a more adequate data structure... I went back to my arrays as the "more adequate data structure". After all, there is a good speed-up (compared with @GordonLinoff's algorithm) with the solution below (!).
The first step is to translate the table t1 of the question text into a temporary one, transgroup1, where we can compute the new process:
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items)
   SELECT array[id1, id2] FROM t1;  -- now suppose t1 is a 10000-item table
then, with these two functions, we can solve the problem:
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items in the concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;
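A quick sanity check of the helper (the element order in the result is not guaranteed):
SELECT array_uunion(array[1,2,3], array[3,4]);
-- returns the distinct union, e.g. {1,2,3,4} in some order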
CREATE FUNCTION transgroup1_loop() RETURNS void AS
$BODY$
DECLARE
   cp_dels integer[];
   i       integer;
   max_i   integer;
BEGIN
   i := 1;
   max_i := 10;  -- or 100 or more, but some bound is needed to be safe
   LOOP
      -- merge each row with every lower-id row whose items overlap,
      -- remembering the merged-away row ids in dels
      UPDATE transgroup1
      SET items = array_uunion(transgroup1.items, t2.items),
          dels  = transgroup1.dels || t2.id
      FROM transgroup1 AS t1, transgroup1 AS t2
      WHERE transgroup1.id = t1.id AND t1.id > t2.id AND t1.items && t2.items;

      cp_dels := array(
         SELECT DISTINCT unnest(dels) FROM transgroup1
      );  -- collects all ids to delete
      EXIT WHEN i > max_i OR coalesce(array_length(cp_dels,1),0) = 0;

      DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels));
      UPDATE transgroup1 SET dels = array[]::integer[];
      i := i + 1;
   END LOOP;
   UPDATE transgroup1  -- only to beautify
   SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
Of course, to run and see the results, you can use
SELECT transgroup1_loop(); -- not 1 day but some hours!
SELECT *, dense_rank() over (ORDER BY id) AS group from transgroup1;
resulting in
 id |   items   | dels | group
----+-----------+------+-------
  4 | {8,7,4}   | {}   | 1
  5 | {9,5,2,1} | {}   | 2
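If per-id labels (like table R in the first solution) are wanted, a hedged sketch is to unnest the arrays of the surviving rows:
SELECT unnest(g.items) AS id, g.label
FROM (SELECT items, dense_rank() over (ORDER BY id) AS label
      FROM transgroup1) AS g
ORDER BY 2, 1;
-- one row per original id, labeled 1, 2, ... per connected group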