Search for cross-field duplicates in PostgreSQL and bring back matched pairs

Question

I have a table of contacts. The table contains a mobile_phone column as well as a home_phone column. I'd like to fetch all pairs of duplicate contacts where a pair is two contacts sharing a phone number.

Note that if contact A's mobile_phone matches contact B's home_phone, this is also a duplicate. Here is an example of three contacts that should match.

contact_id|mobile_phone|home_phone|other columns such as email.......|...
-------------------------------------------------------------------------
111       |9748777777  |1112312312|..................................|...
112       |1112312312  |null      |..................................|...
113       |9748777777  |0001112222|..................................|...

Specifically, I would like to bring back a table where each row contains the contact_ids of the two matching contacts. For example,

||contact_id_a|contact_id_b||
||-------------------------||
||   145155   |   145999   ||
||   145158   |   145141   ||

With the help of @Erwin, I was able to write a query close to what I am trying to achieve. It brings back a list of contact_ids of all contacts in the list that share a phone number with other contacts in the list.

SELECT c.contact_id
FROM   contacts c
WHERE  EXISTS (
   SELECT FROM contacts x
   WHERE (  -- outer parentheses needed: AND binds tighter than OR
          (x.data->>'mobile_phone' is not null and x.data->>'mobile_phone' IN (c.data->>'mobile_phone', c.data->>'home_phone'))
       OR (x.data->>'home_phone' is not null and x.data->>'home_phone'   IN (c.data->>'mobile_phone', c.data->>'home_phone'))
         )
   AND x.contact_id <> c.contact_id  -- except self
   );

The output only contains contact_ids like this...

||contact_id||
||----------||
||  2341514 ||
||  345141  ||

I'd like to bring back the contact_ids of matching contacts in a single row as shown above.

Answer 1:

A simple query uses the array overlap operator &&:

SELECT c1.contact_id AS a, c2.contact_id AS b
FROM   contacts c1
JOIN   contacts c2 ON c1.contact_id < c2.contact_id
WHERE  ARRAY [c1.mobile_phone, c1.home_phone] && ARRAY[c2.mobile_phone, c2.home_phone];

The condition c1.contact_id < c2.contact_id excludes a contact matching itself and also mirrored pairs, so (a, b) is returned but (b, a) is not.

But this representation gets out of hand quickly if many contacts share the same number: n contacts sharing one number produce n(n-1)/2 rows.
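
To make the pairing logic and its growth concrete, here is a minimal Python sketch that mirrors the join above outside the database. It is only an illustration, not the query itself: the helper names `phones` and `duplicate_pairs` are made up for this example, the first list reuses the three sample contacts from the question, and the second list is hypothetical data. NULL phone numbers are skipped, the same way a NULL never counts as a match.

```python
from itertools import combinations

# Sample rows from the question: (contact_id, mobile_phone, home_phone).
contacts = [
    (111, "9748777777", "1112312312"),
    (112, "1112312312", None),
    (113, "9748777777", "0001112222"),
]


def phones(row):
    """Return the contact's non-NULL phone numbers as a set."""
    return {p for p in row[1:] if p is not None}


def duplicate_pairs(rows):
    """Yield (id_a, id_b) with id_a < id_b for contacts sharing a number."""
    ordered = sorted(rows, key=lambda r: r[0])  # order by contact_id
    for a, b in combinations(ordered, 2):       # each unordered pair once
        if phones(a) & phones(b):               # any common phone number
            yield a[0], b[0]


pairs = list(duplicate_pairs(contacts))
print(pairs)

# Hypothetical data: ten contacts that all share one number.
many = [(i, "5550000000", None) for i in range(10)]
print(len(list(duplicate_pairs(many))))  # 10 * 9 / 2 pairs
```

With the question's sample, contact 111 pairs with both 112 and 113, while 112 and 113 have no number in common.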

Aside: conditions of an [INNER] JOIN and WHERE conditions burn down to doing exactly the same, as long as no more than join_collapse_limit joins are involved. See:

Count on join of big tables with conditions is slow

Answer 2:

Here is a simplified schema, to keep it short:

# with t(x,p1,p2) as (values(1,1,2),(2,2,null),(3,1,3),(4,2,5))
select array_agg(x), p
from t cross join lateral (values(t.p1),(t.p2)) as pp(p)
group by p;
┌───────────┬──────┐
│ array_agg │  p   │
├───────────┼──────┤
│ {2}       │ ░░░░ │
│ {1,3}     │    1 │
│ {3}       │    3 │
│ {4}       │    5 │
│ {1,2,4}   │    2 │
└───────────┴──────┘

It means: contacts 1 and 3 share phone 1; contacts 1, 2 and 4 share phone 2; phone 3 belongs only to contact 3; contact 4 is the only one with phone 5; and contact 2 has an empty (NULL) phone. You can filter the result for your specific requirements.

You can also use array_agg(distinct x) to exclude duplicates, if any.
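
One possible way to filter the grouped result down to the pairs the question asks for is sketched below in Python, using the same simplified data (x, p1, p2). This is an assumption about what "filter for your requirements" could look like, not part of the query above: keep only phones shared by more than one contact, drop the NULL phone, and expand each group into pairs.

```python
from collections import defaultdict
from itertools import combinations

# Same simplified rows as in the query: (x, p1, p2).
rows = [(1, 1, 2), (2, 2, None), (3, 1, 3), (4, 2, 5)]

# Group contact ids by phone, skipping NULL phones.
by_phone = defaultdict(set)
for x, p1, p2 in rows:
    for p in (p1, p2):
        if p is not None:
            by_phone[p].add(x)

# Keep phones with more than one contact and expand them into pairs.
# The set comprehension removes pairs that share more than one phone.
pairs = sorted({
    pair
    for ids in by_phone.values()
    if len(ids) > 1
    for pair in combinations(sorted(ids), 2)
})
print(pairs)
```

Phone 1 contributes the pair (1, 3), and phone 2 contributes (1, 2), (1, 4) and (2, 4); phones 3 and 5 belong to a single contact and produce nothing.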

Answer 3:

One simple solution is a self-join:

select c1.contact_id contact1, c2.contact_id contact2
from contacts c1
inner join contacts c2
    on c1.contact_id < c2.contact_id
    and (
        least(c1.data->>'mobile_phone', c1.data->>'home_phone') = least(c2.data->>'mobile_phone', c2.data->>'home_phone')
        or greatest(c1.data->>'mobile_phone', c1.data->>'home_phone') = greatest(c2.data->>'mobile_phone', c2.data->>'home_phone')
    )

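This gives you one row per pair of "duplicate" contacts, with the contact that has the smaller id in the first column. Note that it compares only the smaller value with the smaller value and the larger with the larger, so a number that is the smaller value for one contact and the larger value for the other is not matched.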
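
A quick way to check how the least/greatest comparison behaves is to model it in Python on hypothetical data. The sketch below assumes PostgreSQL's documented behaviour that LEAST and GREATEST ignore NULL arguments and return NULL only when every argument is NULL; the function names and the two sample contacts are invented for the illustration, and plain string comparison stands in for the database's text ordering.

```python
def sql_least(*vals):
    """Model of LEAST: ignore NULLs, return None if all are NULL."""
    present = [v for v in vals if v is not None]
    return min(present) if present else None


def sql_greatest(*vals):
    """Model of GREATEST: ignore NULLs, return None if all are NULL."""
    present = [v for v in vals if v is not None]
    return max(present) if present else None


def least_greatest_match(a, b):
    """Join condition of the query: least = least OR greatest = greatest."""
    la, lb = sql_least(*a), sql_least(*b)
    ga, gb = sql_greatest(*a), sql_greatest(*b)
    # NULL = NULL is not true in SQL, so None never matches.
    return (la is not None and la == lb) or (ga is not None and ga == gb)


def shares_number(a, b):
    """Reference check: do the two contacts have any number in common?"""
    return bool({p for p in a if p} & {p for p in b if p})


# Hypothetical contacts as (mobile_phone, home_phone).
a = ("111", "222")
b = ("222", "333")

print(shares_number(a, b))         # they share "222"
print(least_greatest_match(a, b))  # but least/greatest does not pair them
```

Here "222" is the larger value for the first contact and the smaller value for the second, so neither comparison matches even though the number is shared.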

Answer 4:

How about this?

----- setup sample data
CREATE TABLE CUSTOMER (
   ID       INT PRIMARY KEY  NOT NULL,
   HOME     TEXT,
   MOBILE   TEXT    
);

INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (1, '123', NULL);
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (2, '123', '123');
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (3, '124', '123');
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (4, NULL, '222');

----- find matches
WITH cte (ID, PHONE) AS (
 SELECT ID, HOME   FROM CUSTOMER WHERE HOME   <> '' 
 UNION
 SELECT ID, MOBILE FROM CUSTOMER WHERE MOBILE <> ''
)
SELECT DISTINCT c1.id, c2.id 
FROM 
    cte c1
    INNER JOIN cte c2   ON  c1.id < c2.id  AND  c1.PHONE = c2.PHONE
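
To try this approach without a PostgreSQL server, the same statements can be run through Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for PostgreSQL here, the identifiers are lowercased, and the column aliases id_a and id_b plus the ORDER BY are additions made for readable, stable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Same sample data as in the answer.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, home TEXT, mobile TEXT);
INSERT INTO customer (id, home, mobile) VALUES
    (1, '123', NULL),
    (2, '123', '123'),
    (3, '124', '123'),
    (4, NULL,  '222');
""")

# Unpivot both phone columns into (id, phone) rows, then self-join on phone.
# The <> '' filters drop empty strings and NULLs alike.
query = """
WITH cte (id, phone) AS (
    SELECT id, home   FROM customer WHERE home   <> ''
    UNION
    SELECT id, mobile FROM customer WHERE mobile <> ''
)
SELECT DISTINCT c1.id AS id_a, c2.id AS id_b
FROM   cte c1
JOIN   cte c2 ON c1.id < c2.id AND c1.phone = c2.phone
ORDER  BY id_a, id_b
"""

pairs = conn.execute(query).fetchall()
conn.close()
print(pairs)
```

Customers 1, 2 and 3 all carry the number '123' in one column or the other, so each pair among them is returned once; customer 4 has no partner.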

Source: https://stackoverflow.com/questions/63293191/search-for-cross-field-duplicates-in-postgresql-and-bring-back-matched-pairs

Tags

sql

postgresql

duplicates