问题
I have a table of contacts. The table contains a mobile_phone column as well as a home_phone column. I'd like to fetch all pairs of duplicate contacts where a pair is two contacts sharing a phone number.
Note that if contact A's mobile_phone matches contact B's home_phone, this is also a duplicate. Here is an example of three contacts that should match.
contact_id|mobile_phone|home_phone|other columns such as email.......|...
-------------------------------------------------------------------------
111 |9748777777 |1112312312|..................................|...
112 |1112312312 |null |..................................|...
113 |9748777777 |0001112222|..................................|...
Specifically, I would like to bring back a table where each row contains the contact_ids of the two matching contacts. For example,
||contact_id_a|contact_id_b||
||-------------------------||
|| 145155 | 145999 ||
|| 145158 | 145141 ||
With the help of @Erwin here enter link description here I was able to write a query close to what I am trying to achieve brings back a list of contact_ids of all contacts in the list that share a phone number with other contacts in the list.
SELECT c.contact_id
FROM contacts c
WHERE EXISTS (
SELECT FROM contacts x
WHERE (x.data->>'mobile_phone' is not null and x.data->>'mobile_phone' IN (c.data->>'mobile_phone', c.data->>'home_phone'))
OR (x.data->>'home_phone' is not null and x.data->>'home_phone' IN (c.data->>'mobile_phone', c.data->>'home_phone'))
AND x.contact_id <> c.contact_id -- except self
);
The output only contains contact_ids like this...
||contact_id||
--------------
|| 2341514 ||
|| 345141 ||
I'd like to bring back the contact_ids of matching contacts in a single row as shown above.
回答1:
A simple query would be with the ARRAY overlap operator &&:
SELECT c1.contact_id AS a, c2.contact_id AS b
FROM contacts c1
JOIN contacts c2 ON c1.contact_id < c2.contact_id
WHERE ARRAY [c1.mobile_phone, c1.home_phone] && ARRAY[c2.mobile_phone, c2.home_phone];
The condition c1.contact_id < c2.contact_id
excludes self-joins and switched duplicates.
But this representation gets out of hand quickly if many contacts share the same number some way.
Aside: conditions of an [INNER] JOIN
and WHERE
conditions burn down doing exactly the same while no more than join_collapse_limit joins are involved. See:
- Count on join of big tables with conditions is slow
回答2:
There is simplified schema to be shorter:
# with t(x,p1,p2) as (values(1,1,2),(2,2,null),(3,1,3),(4,2,5))
select array_agg(x), p
from t cross join lateral (values(t.p1),(t.p2)) as pp(p)
group by p;
┌───────────┬──────┐
│ array_agg │ p │
├───────────┼──────┤
│ {2} │ ░░░░ │
│ {1,3} │ 1 │
│ {3} │ 3 │
│ {4} │ 5 │
│ {1,2,4} │ 2 │
└───────────┴──────┘
It means: contacts 1 and 3 sharing phone 1, contacts 1,2 and 4 sharing phone 2, phone 3 is related only to contact 3, contact 4 is only one who have phone 5 and contact 2 have an empty phone. You can to filter the result for your specific requirements.
You also can to use array_agg(distinct x)
to exclude duplicates if any.
回答3:
One simple solution is a self-join:
select c1.contact_id contact1, c2.contact_id contact2
from conctacts c1
inner join contacts c2
on c1.contact_id < c2.contact_id
and (
least(c1.data->>'mobile_phone', c1.data->>'home_phone') = least(c2.data->>'mobile_phone', c2.data->>'home_phone')
or greatest(c1.data->>'mobile_phone', c1.data->>'home_phone') = greatest(c2.data->>'mobile_phone', c2.data->>'home_phone')
)
This gives you one row per pair of "duplicate" contact, with the contact that has the smallest id in the first column.
回答4:
How about this?
----- setup sample data
CREATE TABLE CUSTOMER (
ID INT PRIMARY KEY NOT NULL,
HOME TEXT,
MOBILE TEXT
);
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (1, '123', NULL);
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (2, '123', '123');
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (3, '124', '123');
INSERT INTO CUSTOMER (ID, HOME, MOBILE) VALUES (4, NULL, '222');
----- find matches
WITH cte (ID, PHONE) AS (
SELECT ID, HOME FROM CUSTOMER WHERE HOME <> ''
UNION
SELECT ID, MOBILE FROM CUSTOMER WHERE MOBILE <> ''
)
SELECT DISTINCT c1.id, c2.id
FROM
cte c1
INNER JOIN cte c2 ON c1.id < c2.id AND c1.PHONE = c2.PHONE
来源:https://stackoverflow.com/questions/63293191/search-for-cross-field-duplicates-in-postgresql-and-bring-back-matched-pairs