Match slightly different records in a field

Question

I have the table HAVE below. How can I get the results shown in WANT? I'd appreciate any ideas, and I'm open to any fuzzy-matching algorithm out there.

Have

ID Name   
1  Davi  
2  David 
3  DAVID
4  Micheal
5  Michael
6  Oracle
7  Tepper

WANT

ID Name     mtch_ind
1  Davi     1
2  David    1
3  DAVID    1
4  Micheal  2
5  Michael  2
6  Oracle   3
7  Tepper   4

TABLE DDL and record insert

CREATE TABLE HAVE (
  ID INTEGER,
  Name VARCHAR(10)
);

INSERT INTO HAVE VALUES (1, 'Davi');
INSERT INTO HAVE VALUES (2, 'David');
INSERT INTO HAVE VALUES (3, 'DAVID');
INSERT INTO HAVE VALUES (4, 'Micheal');
INSERT INTO HAVE VALUES (5, 'Michael');
INSERT INTO HAVE VALUES (6, 'Oracle');
INSERT INTO HAVE VALUES (7, 'Tepper');

Answer 1:

Here is the algo that I believe should work:

Step-1: Identify the close matches using the Jaro-Winkler similarity with a match threshold of 75%:

select h1.name, h2.name, UTL_MATCH.JARO_WINKLER(h1.name, h2.name) as match_confidence
from have h1
join have h2
  on UTL_MATCH.JARO_WINKLER(h1.name, h2.name) > 0.75 -- 75% match threshold

Step-2: Pick the h2.name where match_confidence is highest, or otherwise keep a single top row for each set of similar records.

Step-3: Perform a DENSE_RANK operation on the new column to arrive at the result you want.

Hope this works. Note: this is my first post on SO, and I don't have access to an Oracle database at the moment.
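
The three steps above can be folded into a single query. What follows is a minimal, untested sketch, assuming the HAVE table from the question. It makes two changes that are not part of the answer: for Step-2 it uses the lowest matching ID as each row's group anchor (a simplification, not the highest-confidence match), and it wraps both names in UPPER so that DAVID and David compare as equal. The name anchor_id is illustrative.

```sql
-- Sketch only: anchor_id is an illustrative name, not from the original answer.
SELECT id,
       name,
       -- Step-3: number the groups in order of their anchor ID
       DENSE_RANK() OVER (ORDER BY anchor_id) AS mtch_ind
FROM (
  -- Step-1 and Step-2: for every row, find all close matches and keep
  -- the lowest matching ID as the group anchor. Each row matches itself,
  -- so anchor_id is never NULL.
  SELECT h1.id,
         h1.name,
         MIN(h2.id) AS anchor_id
  FROM   have h1
  JOIN   have h2
    ON   UTL_MATCH.JARO_WINKLER(UPPER(h1.name), UPPER(h2.name)) > 0.75
  GROUP  BY h1.id, h1.name
)
ORDER BY id;
```

Fuzzy matching is not transitive: if A matches B and B matches C but A does not match C, this sketch puts C in a different group from B. Data with such chains would need an extra step to merge the groups.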

Answer 2:

While this solution is a bit ugly, I came up with this approach. FYI, it's best to first convert uppercase DAVID to David. Hopefully someone finds this useful or comes up with a better solution. Thanks.

with table1 as (
  -- every pair of names scoring above 75, with the pair's IDs put in ascending order
  select row_number() over (order by FIRSTID) as rowno, A.*
  from (
    select t1.NAME,
           t1.ID,
           case when t1.ID > t1.FID then t1.FID else t1.ID end as FIRSTID,      -- lower ID of the pair
           case when t1.ID > t1.FID then t1.ID else t1.FID end as SECONDID,     -- higher ID of the pair
           case when t1.ID > t1.FID then t1.NAME else t1.FNAME end as FIRSTNAME,  -- name of the higher ID
           case when t1.ID > t1.FID then t1.FNAME else t1.NAME end as SECONDNAME, -- name of the lower ID
           -- a row that matches only itself has no near-duplicates
           case when count(*) over (partition by t1.ID) = 1 then 'nodups' else 'dups' end as ID_chk
    from (
      select h1.NAME,
             h1.ID,
             h2.ID as FID,
             h2.NAME as FNAME,
             SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(h1.NAME, h2.NAME) as match1
      from HAVE h1
      join HAVE h2
        on SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(h1.NAME, h2.NAME) > 75
    ) t1
  ) A
)
, no_dups as (
  -- names that matched nothing but themselves
  select * from table1 where ID_chk = 'nodups'
)
, dups as (
  select * from table1 where ID_chk <> 'nodups'
)
, dups_stp1 as (
  -- drop the self-pairs
  select * from dups
  where FIRSTID <> SECONDID
)
, dups_stp2 as (
  -- keep only pairs whose lower ID is never the higher ID of another pair
  select rowno, ID, FIRSTID, SECONDNAME
  from dups_stp1
  where FIRSTID not in (select SECONDID from dups_stp1)
)
select t2.ID, t3.NAME, t2.rnk as mtch_ind
from (
  -- rank by SECONDNAME, the name stored for the lower ID of each pair
  select ID, SECONDNAME as NAME, dense_rank() over (order by SECONDNAME asc) as rnk
  from (
    select distinct ID, FIRSTID, SECONDNAME from dups_stp2
    union all
    select ID, FIRSTID, SECONDNAME from no_dups
  ) t1
) t2
inner join HAVE t3 on t2.ID = t3.ID
;
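
The advice about converting DAVID to David matters because, as far as I know, UTL_MATCH compares characters exactly as given, so differences in letter case count as mismatches. A minimal sketch to check this, assuming the HAVE table from the question; the aliases raw_score and normalized_score are illustrative:

```sql
-- Score every name against 'David' before and after case normalization.
-- raw_score and normalized_score are illustrative aliases.
SELECT id,
       name,
       SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(name, 'David')               AS raw_score,
       SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(UPPER(name), UPPER('David')) AS normalized_score
FROM   have
ORDER  BY id;
```

If the raw score for DAVID falls below the threshold of 75 while the normalized score does not, wrapping both arguments in UPPER (or INITCAP) inside the join condition of the query above would apply the same normalization there.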

Reference: https://www.decisivedata.net/blog/cleaning-messy-data-sql-part-1-fuzzy-matching-names

Source: https://stackoverflow.com/questions/61599416/match-slightly-different-records-in-a-field

Tags

sql

Oracle

fuzzy-comparison