SQL Find duplicate with several field (no unique ID) WORK AROUND

I am trying to find duplicated vendors from a database using several fields from vendor table and vendor_address table. The thing is the more inner join I make the less the query is loosing potential results. While I don't have duplicate in vendor ID I'm look to find potential one that are similar.

Here is my query so far:

SELECT 
     o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,SAME_ADDRESS_NB
    ,SAME_POSTAL_NB
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
FROM VENDOR o

JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID

INNER JOIN (
    SELECT vndr_name_shrt_user, COUNT(*) AS SAME_USER_NUM
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA'
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) oc on o.vndr_name_shrt_user = oc.vndr_name_shrt_user

INNER JOIN ( SELECT VENDOR_NAME_SHORT, COUNT(*) AS SAME_SHORT_NAME
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA'
    AND VENDOR_STATUS = 'A'
    GROUP BY VENDOR_NAME_SHORT
    HAVING COUNT(*) > 1
) oc on o.VENDOR_NAME_SHORT = oc.VENDOR_NAME_SHORT

INNER JOIN (SELECT POSTAL, COUNT(*) AS SAME_POSTAL_NB
    FROM vendor_addr 
    WHERE COUNTRY = 'CANADA'
    AND COUNTRY ='CANADA'
    AND POSTAL != ' '
    GROUP BY POSTAL
    HAVING COUNT(*) > 1
) oc on b.POSTAL = oc.POSTAL

INNER JOIN (SELECT ADDRESS1, COUNT(*) AS SAME_ADDRESS_NB
    FROM ps_vendor_addr 
    WHERE COUNTRY = 'CANADA'
    AND COUNTRY ='CANADA'
    AND ADDRESS1 != ' '
    GROUP BY ADDRESS1
    HAVING COUNT(*) > 1
) oc on b.ADDRESS1 = oc.ADDRESS1   
WHERE O.COUNTRY ='CANADA' 
    AND B.COUNTY = 'CANADA';

Lets have some interesting data with chained duplicates on different attributes:

CREATE TABLE data ( ID, A, B, C ) AS
  SELECT 1, 1, 1, 1 FROM DUAL UNION ALL -- Related to #2 on column A
  SELECT 2, 1, 2, 2 FROM DUAL UNION ALL -- Related to #1 on column A, #3 on B & C, #5 on C
  SELECT 3, 2, 2, 2 FROM DUAL UNION ALL -- Related to #2 on columns B & C, #5 on C
  SELECT 4, 3, 3, 3 FROM DUAL UNION ALL -- Related to #5 on column A
  SELECT 5, 3, 4, 2 FROM DUAL UNION ALL -- Related to #2 and #3 on column C, #4 on A
  SELECT 6, 5, 5, 5 FROM DUAL;          -- Unrelated

Now, we can get some relationships using analytic functions (without any joins):

SELECT d.*,
       LEAST(
         FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
         FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
         FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
       ) AS duplicate_of
FROM   data d;

Which gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            2
 4 3 3 3            4
 5 3 4 2            2
 6 5 5 5            6

But that doesn't pick up that #4 is related to #5 which is related to #2 and then to #1...

This can be found with a hierarchical query:

SELECT id, a, b, c,
       CONNECT_BY_ROOT( id ) AS duplicate_of
FROM   data
CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );

But that will give many, many duplicate rows (since it does not know where to start the hierarchy from so will chose each row in turn as the root) - instead you can use the first query to give the hierarchical query a starting point when the ID and DUPLICATE_OF values are the same:

SELECT id, a, b, c,
       CONNECT_BY_ROOT( id ) AS duplicate_of
FROM   (
  SELECT d.*,
         LEAST(
           FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
           FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
           FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
         ) AS duplicate_of
  FROM   data d
)
START WITH id = duplicate_of
CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );

Which gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 1 1 1 1            4
 2 1 2 2            4
 3 2 2 2            4
 4 3 3 3            4
 5 3 4 2            4
 6 5 5 5            6

There are still some rows are duplicated because of the local minima in the search that occurs a #4 ... which can be removed with a simple GROUP BY:

SELECT id, a, b, c,
       MIN( duplicate_of ) AS duplicate_of
FROM   (
  SELECT id, a, b, c,
         CONNECT_BY_ROOT( id ) AS duplicate_of
  FROM   (
    SELECT d.*,
           LEAST(
             FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
           ) AS duplicate_of
    FROM   data d
  )
  START WITH id = duplicate_of
  CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c )
)
GROUP BY id, a, b, c;

Which gives the output:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 6 5 5 5            6

It seems as if your joins are a bit interesting, for more reasons than one. Firstly, you have inner joins, which will eliminate all but those which have all signs of duplications - this is something which you don't want. Additionally, you seem to have the same alias, oc, on all derived tables - that's not really gonna fly here, and you're not going to get very far with that.

Instead of doing it this way, I'd suggest that you have your basic query repeated for each of the duplication signs - as follows (I removed the same_address_nb and same_postal_nb fields, and you'll see why):

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND ...

For each one of these duplication signs, you'll add a nested query to the ellipses shown above as follows - example shown using the duplicate in vndr_name_shrt_user:

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'SAME_USER_NUM' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.vndr_name_shrt_user in 
(
    SELECT 
        vndr_name_shrt_user
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
)

You can UNION ALL these queries together and then see all of your duplicates.

As a side note, you had a check for the country = 'canada' twice in the last three derived table.

UPDATE: showing more than one duplicate flag

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'SAME_USER_NUM' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.vndr_name_shrt_user in 
(
    SELECT 
        vndr_name_shrt_user
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) 

UNION ALL

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'VENDOR_NAME_SHORT' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.VENDOR_NAME_SHORT in 
(
    SELECT 
        VENDOR_NAME_SHORT
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY VENDOR_NAME_SHORT
    HAVING COUNT(*) > 1
)

来源：https://stackoverflow.com/questions/45863324/sql-find-duplicate-with-several-field-no-unique-id-work-around

标签

sql

database

Oracle

inner-join