What is a simple fuzzy string matching algorithm in Python?

不知归路 asked 2020-12-13 00:55

I'm trying to find some sort of a good, fuzzy string matching algorithm. Direct matching doesn't work for me; unless my strings are 100% similar, the match fails.

7 Answers
  • 2020-12-13 01:27

    Take a look at this Python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context-dependent, but it might help you.

    from fuzzywuzzy import fuzz
    
    s1 = "the quick brown fox"
    s2 = "the quick brown fox jumped over the lazy dog"
    s3 = "the fast fox jumped over the hard-working dog"
    
    # partial_ratio scores the best matching substring, so full containment gives 100
    fuzz.partial_ratio(s1, s2)
    > 100
    
    # token_set_ratio compares token sets, ignoring word order and duplicates
    fuzz.token_set_ratio(s2, s3)
    > 73
    

    SeatGeek website and GitHub repo

  • 2020-12-13 01:28

    I like Drew's answer.

    You can use difflib to find the longest match:

    >>> a = 'The quick brown fox.'
    >>> b = 'The quick brown fox jumped over the lazy dog.'
    >>> import difflib
    >>> s = difflib.SequenceMatcher(None, a, b)
    >>> s.find_longest_match(0,len(a),0,len(b))
    Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
    

    Or pick some minimum matching threshold. Example:

    >>> difflib.SequenceMatcher(None, a, b).ratio()
    0.61538461538461542
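
    For a cutoff-based lookup against a whole list of candidates, difflib also provides get_close_matches, which applies exactly such a threshold (the defaults are n=3, cutoff=0.6):

    >>> import difflib
    >>> difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']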
    
  • 2020-12-13 01:30

    If all you want is to test whether or not all the words in one string match another string, that's a one-liner:

    # Match if every word of b also occurs somewhere in a
    if not [word for word in b.split(' ') if word not in a.split(' ')]:
        print('Match!')
    

    If you want to score them instead of a binary test, why not just do something like:

    ((# of matching words) / (# of words in the bigger string)) * ((# of words in the smaller string) / (# of words in the bigger string))

    If you wanted to, you could get fancier and do a fuzzy match on each string; a sketch of this scoring idea follows.
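
    A minimal sketch of that score, assuming whitespace tokenization; the helper name word_overlap_score is hypothetical:

    def word_overlap_score(a, b):
        # Hypothetical helper implementing the formula above
        words_a, words_b = a.split(), b.split()
        small, big = sorted([words_a, words_b], key=len)
        matching = sum(1 for word in small if word in big)
        return (matching / len(big)) * (len(small) / len(big))

    print(word_overlap_score('The quick brown fox',
                             'The quick brown fox jumped over the lazy dog'))
    # 0.19753086419753085  (= 4/9 * 4/9)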

  • 2020-12-13 01:34

    Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.

    def ld(s1, s2):  # Levenshtein distance between two sequences
        len1 = len(s1) + 1
        len2 = len(s2) + 1
        # lt - Levenshtein table: (len1 x len2) dynamic-programming matrix
        lt = [[0] * len2 for _ in range(len1)]
        lt[0] = list(range(len2))  # distance from the empty prefix of s1
        for i, row in enumerate(lt):
            row[0] = i             # distance from the empty prefix of s2
        for i1 in range(1, len1):
            for i2 in range(1, len2):
                v = 0 if s1[i1-1] == s2[i2-1] else 1
                lt[i1][i2] = min(lt[i1][i2-1] + 1,    # insertion
                                 lt[i1-1][i2] + 1,    # deletion
                                 lt[i1-1][i2-1] + v)  # substitution
        return lt[-1][-1]

    str1 = "The quick brown fox"
    str2 = "The quick brown fox jumped over the lazy dog"

    print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(), str2.split())))
    
  • 2020-12-13 01:35

    You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.

    Levenshtein works by comparing two arrays of chars. There is no reason the same logic could not be applied to two arrays of strings.
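
    As a quick illustration, difflib.SequenceMatcher (which computes a Ratcliff/Obershelp similarity, not Levenshtein distance) already accepts any sequence of hashable items, so word-level comparison needs nothing more than splitting first:

    >>> import difflib
    >>> a = 'The quick brown fox'.split()
    >>> b = 'The quick brown fox jumped over the lazy dog'.split()
    >>> difflib.SequenceMatcher(None, a, b).ratio()
    0.6153846153846154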

  • 2020-12-13 01:35

    I did this some time ago with C#; my previous question is here. There is a starter algorithm there for your interest; you can easily transform it to Python.

    Ideas to use when writing your own algorithm are something like this (a rough sketch follows the list):

    • Have a list with the original "titles" (the words/sentences you want to match against).
    • Each title item should have a minimal match score at the word/sentence level, below which the title is ignored.
    • You should also have a global minimal match percentage for the final result.
    • You should calculate each word-to-word Levenshtein distance.
    • You should increase the total match weight when words occur in the same order ("quick brown" vs. "quick brown" should have a definitively higher weight than "quick brown" vs. "brown quick").
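
    A rough Python sketch of that outline. Every concrete choice here (the thresholds, the order bonus, the helper name match_score, and using difflib's ratio in place of a normalized word-to-word Levenshtein distance) is an illustrative assumption, not taken from the linked C# code:

    import difflib

    WORD_MIN = 0.7      # assumed per-word minimal match score
    GLOBAL_MIN = 0.6    # assumed global minimal match percentage
    ORDER_BONUS = 0.1   # assumed extra weight for words arriving in order

    def match_score(query, title):
        # Hypothetical scorer following the outline above
        q_words, t_words = query.lower().split(), title.lower().split()
        total, last_pos = 0.0, -1
        for q in q_words:
            # Best word-to-word similarity; difflib's ratio stands in for
            # a normalized Levenshtein distance
            best, best_pos = 0.0, -1
            for pos, t in enumerate(t_words):
                r = difflib.SequenceMatcher(None, q, t).ratio()
                if r > best:
                    best, best_pos = r, pos
            if best < WORD_MIN:
                continue                 # word below threshold: ignore it
            if best_pos > last_pos:
                best += ORDER_BONUS      # reward words occurring in order
            last_pos = best_pos
            total += best
        score = total / len(q_words)
        return score if score >= GLOBAL_MIN else 0.0

    print(match_score("quick brown", "the quick brown fox"))  # in order: higher
    print(match_score("brown quick", "the quick brown fox"))  # out of order: lower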