Question
How do I get the probability of a string being similar to another string in Python?
I want to get a decimal value such as 0.9 (meaning 90%), preferably using only Python and its standard library.
For example:
similar("Apple","Appel") # would have a high probability
similar("Apple","Mango") # would have a lower probability
Answer 1:
There is a built-in: SequenceMatcher from the standard library's difflib module.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
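Note that the comparison is case-sensitive, which is why "Apple" and "Mango" score 0.0 even though one contains "A" and the other "a". If case should not matter, one option is to normalise both strings before comparing. A minimal sketch (similar_ignore_case is only an illustrative name, not something difflib provides):

```python
from difflib import SequenceMatcher

def similar_ignore_case(a, b):
    # casefold() is a stronger lower() intended for caseless matching
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

# Case-sensitive: only "pple" matches, so the score is below 1.0
print(SequenceMatcher(None, "Apple", "apple").ratio())
# Case-insensitive: the strings are treated as identical
print(similar_ignore_case("Apple", "apple"))
```

Whether to fold case (or strip whitespace, accents, etc.) depends on what "similar" should mean for your data.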
Answer 2:
You are probably looking for a string distance metric, i.e. an algorithm that measures how far apart two strings are. Some common ones:
- Hamming distance
- Levenshtein distance
- Damerau–Levenshtein distance
- Jaro–Winkler distance
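To give an idea of what such a metric looks like in code, here is a sketch of Levenshtein distance in plain Python, together with one common way of turning it into a 0..1 score (dividing by the longer length). The function names are my own, and the normalisation is an assumption; other normalisations exist.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    # Normalise by the longer length so the result lies in [0, 1]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))
print(levenshtein_similarity("Apple", "Appel"))
```

Plain Levenshtein counts a swapped pair such as "le" / "el" as two substitutions; Damerau–Levenshtein would count it as a single transposition.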
Answer 3:
Solution #1: Python built-in
Use SequenceMatcher from difflib.
pros: part of the Python standard library, so no extra package is needed.
cons: limited to a single algorithm; many other good string-similarity algorithms exist.
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
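If the real goal is to pick the best match from a list rather than to score one pair, difflib also provides get_close_matches, which ranks candidates using the same ratio. A small sketch; the word and the candidate list are made up for illustration:

```python
from difflib import get_close_matches

candidates = ["Apple", "Mango", "Appel"]

# n is the maximum number of results; cutoff is the minimum ratio (0..1)
matches = get_close_matches("Aple", candidates, n=3, cutoff=0.6)
print(matches)
```

Results come back best match first, and candidates scoring below cutoff are dropped.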
Solution #2: jellyfish library
It is a well-tested library with good algorithm coverage and few known issues.
It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance
pros: easy to use, wide range of supported algorithms, well tested.
cons: not part of the standard library; it has to be installed separately.
Example:
>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> # newer jellyfish releases name this function jaro_similarity
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
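For anyone curious where the Jaro number above comes from, here is a plain-Python sketch of the textbook Jaro similarity. It is not jellyfish's implementation (the library may treat edge cases differently), and jaro_similarity here is just a local function name.

```python
def jaro_similarity(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only if their positions
    # are no further apart than this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are
    # half-transpositions; two of them make one transposition
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3

print(jaro_similarity("jellyfish", "smellyfish"))
```

For 'jellyfish' vs 'smellyfish' there are 8 matching characters and no transpositions, giving (8/9 + 8/10 + 1) / 3, which agrees with the value shown above.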
Answer 4:
FuzzyWuzzy is a package that implements Levenshtein distance in Python, with helper functions for situations where you want two distinct strings to be considered identical. For example:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
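The idea behind the token-sort helper can be approximated with the standard library alone: split on whitespace, sort the tokens, rejoin, then compare. This is only a rough sketch of the idea, not FuzzyWuzzy's actual code (which, among other things, reports scores on a 0..100 scale), and token_sort_similarity is a made-up name.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    # Lowercase, split on whitespace, sort the tokens and rejoin,
    # so that word order no longer affects the comparison
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

s1 = "fuzzy wuzzy was a bear"
s2 = "wuzzy fuzzy was a bear"
print(SequenceMatcher(None, s1, s2).ratio())  # word order matters here
print(token_sort_similarity(s1, s2))          # word order is ignored here
```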
Answer 5:
You can create a function like:
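def similar(w1, w2):
    # Pad the shorter word with spaces so both have the same length
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    if not w1:
        return 1.0  # two empty strings are identical; avoids dividing by zero
    # Fraction of positions where both words have the same character
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))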
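Keep in mind that this compares the strings position by position (much like a normalised Hamming distance), so a single inserted character at the front shifts everything and drives the score to zero. The sketch below is a variant of the same idea using zip_longest instead of space padding (positional_similarity is an illustrative name), just to show that behaviour:

```python
from itertools import zip_longest

def positional_similarity(w1, w2):
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    # zip_longest pads the shorter word with None, which never equals a character
    same = sum(1 for c1, c2 in zip_longest(w1, w2) if c1 == c2)
    return same / longest

print(positional_similarity("Apple", "Appel"))     # 3 of 5 positions agree
print(positional_similarity("abcdef", "xabcdef"))  # one leading insert, nothing lines up
```

If insertions and deletions are expected, an edit-distance or SequenceMatcher-based score is usually a better fit.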
Answer 6:
The distance package includes Levenshtein distance:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
Answer 7:
The built-in SequenceMatcher is very slow on large inputs. Here is how it can be done with diff-match-patch:
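from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)
    # similarity: characters in unchanged chunks (op == 0) over the longer text's length
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length if text_length else 1.0  # both empty: identical
    return sim, diff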
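If you would rather stay within the standard library, some of the cost can sometimes be avoided when you only need to know whether two strings reach a given threshold. SequenceMatcher has quick_ratio() and real_quick_ratio(), which are documented as upper bounds on ratio() and are cheaper to compute, so they can serve as a pre-filter. A sketch, with similar_at_least being a hypothetical helper name; no benchmark against diff-match-patch is implied:

```python
from difflib import SequenceMatcher

def similar_at_least(a, b, threshold):
    # autojunk=False turns off the "popular element" heuristic that
    # SequenceMatcher otherwise applies to sequences of 200+ items
    s = SequenceMatcher(None, a, b, autojunk=False)
    # Both quick variants are upper bounds on ratio(), so if either one
    # is already below the threshold the full computation can be skipped
    if s.real_quick_ratio() < threshold or s.quick_ratio() < threshold:
        return False
    return s.ratio() >= threshold

print(similar_at_least("Apple", "Appel", 0.8))
print(similar_at_least("Apple", "Mango", 0.5))
```

This only helps when most pairs are clearly dissimilar; pairs that pass the cheap checks still pay for the full ratio().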
Source: https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings