How to calculate the similarity measure of text document?

Submitted by 早过忘川 on 2019-12-13 03:07:39

Question


I have a CSV file that looks like:

idx         messages
112  I have a car and it is blue
114  I have a bike and it is red
115  I don't have any car
117  I don't have any bike

I would like code that reads the file and computes the pairwise similarity between the messages.

I have looked into many posts on this topic, such as 1 2 3 4, but they are either hard for me to understand or not exactly what I want.

Some posts and webpages suggest that "a simple and effective one is Cosine similarity", or recommend the "Universal Sentence Encoder" or "Levenshtein distance".

It would be great if you could provide code that I can run on my side as well. Thanks


Answer 1:


I don't know that calculations like this vectorize particularly well, so a simple loop is fine. At the very least, exploit the fact that your calculation is symmetric and the diagonal is always 100 to cut down on the number of comparisons you perform.

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

df = pd.read_csv('your_file.csv')  # adjust to the path of your CSV file

K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue
        if i == j:
            sim = 100
        else:
            sim = fuzz.ratio(ac, bc) # Use whatever metric you want here
                                     # for comparison of 2 strings.

        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)

Output: df_sim

idx    112    114    115    117
idx                            
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
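
The question also mentions cosine similarity. As a sketch of that approach (the messages from the sample CSV are inlined here so the snippet is self-contained; with a real file you would load them with `pd.read_csv` instead), a TF-IDF plus cosine-similarity version using scikit-learn could look like:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data from the question, inlined for reproducibility.
df = pd.DataFrame({
    "idx": [112, 114, 115, 117],
    "messages": [
        "I have a car and it is blue",
        "I have a bike and it is red",
        "I don't have any car",
        "I don't have any bike",
    ],
})

# Turn each message into a TF-IDF vector, then take pairwise
# cosine similarities; values range from 0 (no shared terms)
# to 1 (identical vectors), with 1.0 on the diagonal.
tfidf = TfidfVectorizer().fit_transform(df["messages"])
sim = cosine_similarity(tfidf)

df_sim = pd.DataFrame(sim, index=df.idx, columns=df.idx)
print(df_sim.round(2))
```

Unlike `fuzz.ratio`, which compares character edits, this compares word usage, so which one fits better depends on whether you care about spelling-level or vocabulary-level similarity.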


Source: https://stackoverflow.com/questions/56509784/how-to-calculate-the-similarity-measure-of-text-document
