How to find highly similar observations in another dataset using Spark

Question

I have two CSV files. File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:

PID,FNAME,MNAME,LNAME,GENDER,DOB,FNAMELNAMEMNAMEGENDERDOB
S2,66M,J,Rock,F,1995,66MRockJF1995
S3,David,HM,Lee,M,1990,DavidLeeHMM1990
S0,Marc,HM,Robert,M,2000,MarcRobertHMM2000
S1,Marc,MS,Robert,M,2000,MarcRobertMSM2000
S6,Paul,Row,M,2008,PaulRowM2008
S7,Sam,O,Baby,F,2018,SamBabyOF2018

For example, I want to extract the observations in File 2 that are highly similar to MarcHRobertM2000 in File 1. My expected output is:

S0,Marc,HM,Robert,M,2000,MarcRobertHMM2000
S1,Marc,MS,Robert,M,2000,MarcRobertMSM2000

I used the following code:

sqlContext.registerDataFrameAsTable(df2,'table')
query=""" SELECT PID, FNAMELNAMEMNAMEGENDERDOB, similarity(lower(FNAMELNAMEMNAMEGENDERDOB), 'MarcHRobertM2000') as sim
    FROM table
    WHERE sim>0.7 """
df=sqlContext.sql(query)

It looks like the SQL similarity function is not available in sqlContext, and I have no idea how to fix it. In addition, File 2 is big (around 5 GB), so I did not use fuzzywuzzy in Python, and soundex does not give satisfying results. Could you help me? Thank you.

Answer 1:

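You can use Spark SQL's built-in levenshtein function to check similarity. It returns the edit distance between two strings, so smaller values mean more similar strings.

Please refer to the code below: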

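# A column alias cannot be used in the WHERE of the same SELECT, so sim is computed in a subquery.
query=""" SELECT PID, FNAMELNAMEMNAMEGENDERDOB, sim
    FROM (SELECT PID, FNAMELNAMEMNAMEGENDERDOB,
                 levenshtein(FNAMELNAMEMNAMEGENDERDOB, 'MarcHRobertM2000') AS sim
          FROM table) t
    WHERE sim < 4 """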

For further reading, see https://medium.com/@mrpowers/fuzzy-matching-in-spark-with-soundex-and-levenshtein-distance-6749f5af8f28.
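To sanity-check the threshold of 4 before running the query on the full 5 GB file, the edit distance can be reproduced locally on the sample keys from File 2. The following is a minimal pure-Python sketch, not Spark's implementation: the helper names levenshtein and similarity are made up for illustration, and the normalized score (1 minus distance divided by the longer length) is only one assumed way to get something comparable to the 0.7 cutoff in the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


target = "MarcHRobertM2000"

# FNAMELNAMEMNAMEGENDERDOB values copied from the File 2 sample.
candidates = [
    "66MRockJF1995",
    "DavidLeeHMM1990",
    "MarcRobertHMM2000",
    "MarcRobertMSM2000",
    "PaulRowM2008",
    "SamBabyOF2018",
]

# Same filter as the SQL query: keep rows with edit distance below 4.
matches = [c for c in candidates if levenshtein(target, c) < 4]

for c in candidates:
    print(c, levenshtein(target, c), round(similarity(target, c), 2))
print(matches)
```

On this sample, both Marc rows are at distance 3 from the target, so they pass the filter while the other four rows do not. If a ratio-style cutoff is preferred, the same normalization could presumably be written in Spark SQL by dividing levenshtein(...) by greatest(length(...), length(...)), but that variant is an untested assumption here.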

Source: https://stackoverflow.com/questions/58887715/how-to-find-highly-similar-observations-in-another-dataset-using-spark

Tags

pyspark

apache-spark-sql