Question
I have the following details of the data:
Table 1: table1 is small, with only a few records.
Table 2: table2 has 50 million rows.
Requirement: I need to match any string column of table1 against table2 (for example, name to name) and get the matching percentage. Note: the column can be any string column, e.g. address, that holds multiple words in a single cell.
Sample data:
create table table1(id int, name varchar(100), address varchar(200));
insert into table1 values(1,'Mario Speedwagon','H No 10 High Street USA');
insert into table1 values(2,'Petey Cruiser Jack','#1 Church Street UK');
insert into table1 values(3,'Anna B Sthesia','#101 No 1 B Block UAE');
insert into table1 values(4,'Paul A Molive','Main Road 12th Cross H No 2 USA');
insert into table1 values(5,'Bob Frapples','H No 20 High Street USA');
create table table2(name varchar(100), address varchar(200), email varchar(100));
insert into table2 values('Speedwagon Mario ','USA, H No 10 High Street','mario@gmail.com');
insert into table2 values('Cruiser Petey Jack','UK #1 Church Street','jack@gmail.com');
insert into table2 values('Sthesia Anna','UAE #101 No 1 B Block','Aanna@gmail.com');
insert into table2 values('Molive Paul','USA Main Road 12th Cross H No 2','APaul@gmail.com');
insert into table2 values('Frapples Bob ','USA H No 20 High Street','BobF@gmail.com');
Expected Result:
tbl1_Name tbl2_Name Percentage
--------------------------------------------------------
Mario Speedwagon Speedwagon Mario 100
Petey Cruiser Jack Cruiser Petey Jack 100
Anna B Sthesia Sthesia Anna around 80+
Paul A Molive Molive Paul around 80+
Bob Frapples Frapples Bob 100
Note: The above is just sample data for illustration; in the actual scenario table1 has a few records and table2 has 50 million.
My Try:
Step 1: As suggested by Shnugo, normalize the data and store it in the same tables.
For table1:
ALTER TABLE table1 ADD Name_Normal VARCHAR(1000);
GO
--00:00:00 (5 row(s) affected)
UPDATE table1
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
For table2:
ALTER TABLE table2 ADD Name_Normal VARCHAR(1000);
GO
--01:59:03 (50000000 row(s) affected)
UPDATE table2
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
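On SQL Server 2017 or later, the XML round-trip above can be replaced with STRING_SPLIT plus STRING_AGG, and the 50-million-row update can be batched to limit transaction-log growth. A sketch under those assumptions (the 100,000 batch size is a tuning guess, and restartability relies on Name_Normal starting out NULL; note this keeps duplicate words, unlike distinct-values in the XQuery version):

```sql
-- Sketch (SQL Server 2017+): split on spaces, lowercase, re-sort alphabetically,
-- and re-join with spaces, matching the XQuery normalization above.
-- Batching keeps each transaction small; batch size is an assumption to tune.
WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) t2
    SET Name_Normal = n.normalized
    FROM table2 AS t2
    CROSS APPLY (
        SELECT STRING_AGG(LOWER(s.value), ' ')
                   WITHIN GROUP (ORDER BY LOWER(s.value)) AS normalized
        FROM STRING_SPLIT(t2.name, ' ') AS s
        WHERE s.value <> ''           -- drop empty fragments from double spaces
    ) AS n
    WHERE t2.Name_Normal IS NULL;     -- only rows not yet normalized

    IF @@ROWCOUNT = 0 BREAK;          -- done when no rows remain
END;
```

This avoids per-row XML construction and parsing, which is usually the dominant cost of the XQuery approach.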
Step 2: Create a percentage-calculation function using Levenshtein distance in Microsoft SQL Server.
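For reference, a Levenshtein-based percentage is typically derived from the raw edit distance like this (dbo.LevenshteinDistance is a hypothetical name standing in for whichever distance implementation you install, such as the one in the linked answer):

```sql
-- Sketch: convert a raw edit distance into a similarity percentage.
-- dbo.LevenshteinDistance is a placeholder for your installed implementation.
DECLARE @a NVARCHAR(1000) = N'anna b sthesia',
        @b NVARCHAR(1000) = N'anna sthesia';

SELECT 100.0 * (1.0 - dbo.LevenshteinDistance(@a, @b)
       / CAST(CASE WHEN LEN(@a) >= LEN(@b) THEN LEN(@a) ELSE LEN(@b) END AS FLOAT))
       AS percentage;
```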
Step 3: Query to get the matching percentage.
--00:00:33 (23456 row(s) affected)
SELECT t1.name AS [tbl1_Name], t.name AS [tbl2_Name],
       dbo.ufn_Levenshtein(t.Name_Normal, t1.Name_Normal) AS percentage
INTO #TempTable
FROM table2 t
INNER JOIN table1 t1
    ON CHARINDEX(SOUNDEX(t.Name_Normal), SOUNDEX(t1.Name_Normal)) > 0;
--00:00:00 (23456 row(s) affected)
SELECT *
FROM #TempTable
WHERE percentage >= 50
ORDER BY percentage DESC;
Conclusion: I'm getting the expected result, but normalizing table2 takes around 2 hours, as noted in the comment on the query above. Any suggestions for optimizing step 1 for table2?
Answer 1:
Have you tried looking into DQS (Data Quality Services)? Depending on your SQL Server version, it ships with the installation media. https://docs.microsoft.com/en-us/sql/data-quality-services/data-matching?view=sql-server-2017
Source: https://stackoverflow.com/questions/54743382/get-matching-string-with-the-percentage