Removing duplicates from multiple self left joins

I am dynamically generating a query like below that creates different combinations of rules by left joining (any number of times) on itself and avoiding rules with some of the same attributes as part of the joins conditions e.g.

SELECT count(*) 
FROM rules AS t1 
LEFT JOIN rules AS t2
 ON t1.id != t2.id
 AND ...
LEFT JOIN rules AS t3
 ON t1.id != t2.id AND t1.id != t3.id AND t2.id != t3.id
 AND ...

I am currently removing duplicates by creating an array of ids from the joined rows then sorting and grouping by them:

SELECT sort(array[t1.id, t2.id, t3.id]) AS ids
...
GROUP BY ids

I would like to know if there is a better way of removing duplicate rows e.g.

t1.ID | t2.ID | t3.ID
---------------------
  A   |   B   |   C
  C   |   B   |   A

Should be

t1.ID | t2.ID | t3.ID
---------------------
  A   |   B   |   C

t1.ID | t2.ID | t3.ID
---------------------
  C   |   B   |   A

But not both.

EDIT: I would like to go from a permutation of rows to a combination rows.

I'd suggest rather than joining on !=, try joining on <=.

You will then have all combinations with t1.id > t2.id, t2.id > t3.id, and so on.

Rows will not be 'duplicates' because they are ordered sets, and any set containing equivalent members would necessarily result in the identical ordered set.

I think you mean you want to go from a permutation of rows to a combination rows?

If so, the select distinct answers are wrong. Select distinct will select distinct permutations. I think you have a pretty good way of doing it. The only thing I can think of, would be to concatenate the rules into a string and the sort it in place. It looks like you are using Postgresql and there is no function that does it in the built-in string functions.

If the amount of symbols were small you might be able to insert them into the array pre-sorted by inserting 'A' in index 1, 'B' into index 2, etc. Which might might the sort quicker ...

You need to get an order into your results in order to filter all duplicates out. This can be achieved by making sure that a<b<c. And once you have an order in your results, you can apply a distinct to the resultset.

` SELECT count(*) FROM rules AS t1

LEFT JOIN rules AS t2 ON t1.id != t2.id AND

LEFT JOIN rules AS t3 ON t1.id != t2.id AND t1.id != t3.id AND t2.id != t3.id ...

t1.id < t2.id and t2.id < t3.id ...

AND ...`

Difficult to understand exactly what you're trying to achieve, but to avoid the A-B-C C-B-A duplication, try this:

SELECT count(*) 
FROM rules AS t1 
LEFT JOIN rules AS t2
 ON t1.id **<** t2.id
 AND ...
LEFT JOIN rules AS t3
 ON t1.id **<** t2.id AND t1.id **<** t3.id AND t2.id **<** t3.id
 AND ...

That way, the answers are always ordered

来源：https://stackoverflow.com/questions/4242715/removing-duplicates-from-multiple-self-left-joins

标签

sql

optimization

duplicates

join

duplicate-removal