Calculating the cosine similarity between all the rows of a dataframe in pyspark

失恋的感觉 2020-12-04 23:07

I have a dataset containing workers with their demographic information (age, gender, address, etc.) and their work locations. I created an RDD from the dataset and converted

2 answers
  •  -上瘾入骨i
    2020-12-04 23:36

    Regarding this issue: I am working on a PySpark project where I need cosine similarity, and I can confirm that @MaFF's code is correct. I hesitated at first when I saw it, because it takes the dot product of the L2-normalized vectors, whereas the theory says that cosine similarity is, mathematically, the ratio of the dot product of the vectors to the product of their magnitudes. The two are equivalent: dividing each vector by its L2 norm before taking the dot product gives the same value as dividing the dot product by the product of the norms.

    Here is my adapted code, which gives the same results. I therefore concluded that sklearn calculates TF-IDF in a different way, so if you try to reproduce this exercise using sklearn, you will get a different result.

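    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.ml.feature import HashingTF, IDF

    # Reuse the active session, or create one when run as a standalone script
    spark = SparkSession.builder.getOrCreate()

    d = [{'id': '1', 'office': 'Delhi, Mumbai, Gandhinagar'}, {'id': '2', 'office': 'Delhi, Mandi'}, {'id': '3', 'office': 'Hyderbad, Jaipur'}]
    df_fussion = spark.createDataFrame(d)
    # Split the comma-separated office string into an array of tokens
    df_fussion = df_fussion.withColumn('office', F.split('office', ', '))

    # Term frequencies via feature hashing
    hashingTF = HashingTF(inputCol="office", outputCol="tf")
    tf = hashingTF.transform(df_fussion)

    # Rescale the term frequencies by inverse document frequency
    idf = IDF(inputCol="tf", outputCol="feature").fit(tf)
    data = idf.transform(tf)

    # Declare the return type; a bare @udf would return the value as a string
    @F.udf(returnType=DoubleType())
    def sim_cos(v1, v2):
        # Cosine similarity: dot product over the product of the L2 norms
        p = 2
        denom = float(v1.norm(p) * v2.norm(p))
        # A zero vector has no direction; return 0.0 instead of dividing by zero
        return float(v1.dot(v2)) / denom if denom else 0.0

    # Compare each unordered pair of rows once (i.id < j.id)
    result = data.alias("i").join(data.alias("j"), F.col("i.id") < F.col("j.id"))\
        .select(
            F.col("i.id").alias("i"),
            F.col("j.id").alias("j"),
            sim_cos("i.feature", "j.feature").alias("sim_cosine"))\
        .sort("i", "j")
    result.show()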
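One possible explanation for the sklearn difference mentioned above, sketched in plain Python without Spark or sklearn: the two libraries document different IDF formulas. Spark's `IDF` uses log((n + 1) / (df + 1)), while sklearn's `TfidfTransformer` with its default `smooth_idf=True` adds 1 to that value. The helper names below (`spark_idf`, `sklearn_idf`, `tfidf`, `cosine`) are hypothetical, and the sketch assumes `HashingTF` produces no hash collisions for these few terms.

```python
import math

def spark_idf(n_docs, doc_freq):
    """IDF as documented for pyspark.ml.feature.IDF."""
    return math.log((n_docs + 1) / (doc_freq + 1))

def sklearn_idf(n_docs, doc_freq):
    """IDF as documented for sklearn's TfidfTransformer (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Same three documents as in the answer's code, already tokenized
docs = [
    ["Delhi", "Mumbai", "Gandhinagar"],
    ["Delhi", "Mandi"],
    ["Hyderbad", "Jaipur"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tfidf(doc, idf_fn):
    # Raw term count times the chosen IDF weight
    return {t: doc.count(t) * idf_fn(n_docs, doc_freq[t]) for t in set(doc)}

sim_spark = cosine(tfidf(docs[0], spark_idf), tfidf(docs[1], spark_idf))
sim_sklearn = cosine(tfidf(docs[0], sklearn_idf), tfidf(docs[1], sklearn_idf))
print("Spark-style IDF:  ", sim_spark)
print("sklearn-style IDF:", sim_sklearn)
```

The two similarities for rows 1 and 2 differ because the extra 1 gives the shared term "Delhi" relatively more weight under the sklearn formula. L2 normalization alone would not change a cosine similarity.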
    

    I also want to share some simple tests that I ran on simple vectors, where the results are correct:
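The original tests are not included in this copy of the answer. As a stand-in, and not the author's own test, here is a minimal sketch in plain Python that checks on a few small vectors that the ratio form of cosine similarity and the dot product of L2-normalized vectors agree. The function names and the example vectors are chosen here for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_norm(u):
    return math.sqrt(dot(u, u))

def cosine_ratio(u, v):
    """Textbook form: dot product divided by the product of the magnitudes."""
    return dot(u, v) / (l2_norm(u) * l2_norm(v))

def cosine_normalized(u, v):
    """Alternative form: dot product of the L2-normalized vectors."""
    norm_u, norm_v = l2_norm(u), l2_norm(v)
    u_hat = [a / norm_u for a in u]
    v_hat = [b / norm_v for b in v]
    return dot(u_hat, v_hat)

pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors
    ([1.0, 2.0], [2.0, 4.0]),  # parallel vectors
    ([3.0, 4.0], [4.0, 3.0]),  # general case
]
for u, v in pairs:
    print(u, v, cosine_ratio(u, v), cosine_normalized(u, v))
```

Both forms agree up to floating-point rounding: the orthogonal pair gives 0, the parallel pair gives 1, and the last pair gives 24/25.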

    Kind regards,
