I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may di
Another algorithm you can consider is the Simon White similarity, which compares the character bigrams (adjacent letter pairs) of the two strings:
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        # Return an empty list so the return type is always a list.
        return []
    s = string.lower()
    return [s[i : i + 2] for i in range(len(s) - 1)]


def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    # Combined number of bigrams in both strings (the denominator).
    union = len(pairs1) + len(pairs2)
    if union == 0:
        return 0.0
    hit_count = 0
    for x in pairs1:
        if x in pairs2:
            hit_count += 1
            # Consume the matched bigram so it cannot be counted twice;
            # otherwise repeated bigrams can push the score above 1.0.
            pairs2.remove(x)
    return (2.0 * hit_count) / union
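
The same measure can also be sketched with `collections.Counter`, which may be easier to reason about when a bigram occurs more than once. This is only a rough equivalent, not a canonical implementation: the names `bigram_counts` and `dice_similarity` and the sample case titles are made up for illustration. A repeated bigram is matched at most as often as it occurs in both strings, which keeps the score between 0 and 1.

```python
from collections import Counter


def bigram_counts(text):
    """Count the character bigrams of a lowercased string (None counts as empty)."""
    s = (text or "").lower()
    return Counter(s[i : i + 2] for i in range(len(s) - 1))


def dice_similarity(str1, str2):
    """Return 2 * shared bigrams / total bigrams, a value in [0.0, 1.0]."""
    c1, c2 = bigram_counts(str1), bigram_counts(str2)
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared bigram,
    # so a repeated bigram is only matched as often as it occurs in both.
    shared = sum((c1 & c2).values())
    return 2.0 * shared / total


# Hypothetical case titles, made up for illustration.
print(dice_similarity("Smith v. Jones", "Smith vs Jones"))
print(dice_similarity("night", "nacht"))  # shares only "ht": 2 * 1 / 8 = 0.25
```

To decide whether two titles "represent the same thing", you would still need to pick a cutoff score yourself, ideally by checking it against a sample of titles you have already matched by hand.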