Question
I am writing an algorithm that checks how similar one string is to another. I am using scikit-learn's cosine similarity.
My code is:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(example_1)
result_cos = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(result_cos[0][1])
Running this code with example_1 prints 0.336096927276. Running it with example_2 prints the same score. The result is identical in both cases because each pair of sentences differs by exactly one word.
What I want is a higher score for example_1, because the differing words "okey" vs "okeu" differ by only one letter, whereas in example_2 the words "okey" vs "crazy" are completely different.
How can my code take into account that in some cases the differing words are not completely different?
Answer 1:
For short strings, Levenshtein distance will probably yield better results than word-based cosine similarity. The function below is adapted from Wikibooks, with the raw edit distance normalized by the length of the longer string. Since this is a distance metric, a smaller score means a closer match.
def levenshtein(s1, s2):
    # Make s1 the longer of the two strings.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    # If the shorter string is empty, the strings are maximally
    # different (normalized distance 1.0), unless both are empty.
    if len(s2) == 0:
        return 1.0 if s1 else 0.0
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    # Normalize by the length of the longer string so the result is in [0, 1].
    return previous_row[-1] / float(len(s1))
example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")
print(levenshtein(*example_1))
print(levenshtein(*example_2))
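If you want to stay with cosine similarity, another option (not from the answer above, just a sketch) is to vectorize on character n-grams instead of whole words, so near-identical words like "okey" and "okeu" still share most of their n-grams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_ngram_similarity(a, b):
    # analyzer="char_wb" builds character 2- and 3-grams within word
    # boundaries, so partial word overlap contributes to the score.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    matrix = vectorizer.fit_transform([a, b])
    return cosine_similarity(matrix[0:1], matrix[1:2])[0][0]

sim_1 = char_ngram_similarity("I am okey", "I am okeu")
sim_2 = char_ngram_similarity("I am okey", "I am crazy")
print(sim_1, sim_2)  # sim_1 comes out noticeably higher than sim_2
```

Here, unlike the word-level version, example_1 scores higher than example_2, which is the behavior asked for in the question.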
Source: https://stackoverflow.com/questions/47728069/sklearn-cosine-similarity-for-strings-python