How to calculate the similarity measure of text document?

Submitted by 早过忘川 on 2019-12-13 03:07:39

Question


I have a CSV file that looks like:

idx         messages
112  I have a car and it is blue
114  I have a bike and it is red
115  I don't have any car
117  I don't have any bike

I would like code that reads the file and computes the pairwise similarity between the messages.

I have looked into many posts on this topic, such as 1 2 3 4, but they are either hard for me to understand or not exactly what I want.

Some posts and webpages suggest that "a simple and effective one is Cosine similarity", or recommend the "Universal Sentence Encoder" or "Levenshtein distance".

It would be great if you could provide code that I can run on my side as well. Thanks


Answer 1:


I don't know that calculations like this vectorize particularly well, so a simple loop is fine. At the very least, exploit the fact that your calculation is symmetric and the diagonal is always 100 to cut down on the number of comparisons you perform.

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

df = pd.read_csv('your_file.csv')  # adjust to the path of your CSV file

K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue
        if i == j:
            sim = 100
        else:
            sim = fuzz.ratio(ac, bc) # Use whatever metric you want here
                                     # for comparison of 2 strings.

        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)

Output: df_sim

idx    112    114    115    117
idx                            
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
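
The question also mentions cosine similarity. As a sketch of that approach (the messages from the sample CSV are inlined here so the snippet is self-contained; with a real file you would load them with `pd.read_csv` instead), a TF-IDF plus cosine-similarity version using scikit-learn could look like:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data from the question, inlined for reproducibility.
df = pd.DataFrame({
    "idx": [112, 114, 115, 117],
    "messages": [
        "I have a car and it is blue",
        "I have a bike and it is red",
        "I don't have any car",
        "I don't have any bike",
    ],
})

# Turn each message into a TF-IDF vector, then take pairwise
# cosine similarities; values range from 0 (no shared terms)
# to 1 (identical vectors), with 1.0 on the diagonal.
tfidf = TfidfVectorizer().fit_transform(df["messages"])
sim = cosine_similarity(tfidf)

df_sim = pd.DataFrame(sim, index=df.idx, columns=df.idx)
print(df_sim.round(2))
```

Unlike `fuzz.ratio`, which compares character edits, this compares word usage, so which one fits better depends on whether you care about spelling-level or vocabulary-level similarity.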


Source: https://stackoverflow.com/questions/56509784/how-to-calculate-the-similarity-measure-of-text-document
