What is a simple fuzzy string matching algorithm in Python?

不知归路 asked 2020-12-13 00:55

I'm trying to find some sort of a good, fuzzy string matching algorithm. Direct matching doesn't work for me; unless my strings are 100% similar, the match fails.

7 Answers
  • 2020-12-13 01:27

    Take a look at this Python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context-dependent, but it might help you.

    from fuzzywuzzy import fuzz
    
    s1 = "the quick brown fox"
    s2 = "the quick brown fox jumped over the lazy dog"
    s3 = "the fast fox jumped over the hard-working dog"
    
    # partial_ratio scores the best matching substring, so full containment gives 100
    fuzz.partial_ratio(s1, s2)
    > 100
    
    # token_set_ratio compares token sets, ignoring word order and duplicates
    fuzz.token_set_ratio(s2, s3)
    > 73
    

    SeatGeek website and GitHub repo

  • 2020-12-13 01:28

    I like Drew's answer.

    You can use difflib to find the longest match:

    >>> a = 'The quick brown fox.'
    >>> b = 'The quick brown fox jumped over the lazy dog.'
    >>> import difflib
    >>> s = difflib.SequenceMatcher(None, a, b)
    >>> s.find_longest_match(0,len(a),0,len(b))
    Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
    

    Or pick some minimum matching threshold. Example:

    >>> difflib.SequenceMatcher(None, a, b).ratio()
    0.61538461538461542
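
    For a cutoff-based lookup against a whole list of candidates, difflib also provides get_close_matches, which applies exactly such a threshold (the defaults are n=3, cutoff=0.6):

    >>> import difflib
    >>> difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']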
    
  • 2020-12-13 01:30

    If all you want is to test whether or not all the words in one string match another string, that's a one-liner:

    # Match if every word of b also occurs somewhere in a
    if not [word for word in b.split(' ') if word not in a.split(' ')]:
        print('Match!')
    

    If you want to score them instead of a binary test, why not just do something like:

    ((# of matching words) / (# of words in the bigger string)) * ((# of words in the smaller string) / (# of words in the bigger string))

    If you wanted to, you could get fancier and do a fuzzy match on each string; a sketch of this scoring idea follows.
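
    A minimal sketch of that score, assuming whitespace tokenization; the helper name word_overlap_score is hypothetical:

    def word_overlap_score(a, b):
        # Hypothetical helper implementing the formula above
        words_a, words_b = a.split(), b.split()
        small, big = sorted([words_a, words_b], key=len)
        matching = sum(1 for word in small if word in big)
        return (matching / len(big)) * (len(small) / len(big))

    print(word_overlap_score('The quick brown fox',
                             'The quick brown fox jumped over the lazy dog'))
    # 0.19753086419753085  (= 4/9 * 4/9)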

  • 2020-12-13 01:34

    Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.

    def ld(s1, s2):  # Levenshtein distance between two sequences
        len1 = len(s1) + 1
        len2 = len(s2) + 1
        # lt - Levenshtein table: (len1 x len2) dynamic-programming matrix
        lt = [[0] * len2 for _ in range(len1)]
        lt[0] = list(range(len2))  # distance from the empty prefix of s1
        for i, row in enumerate(lt):
            row[0] = i             # distance from the empty prefix of s2
        for i1 in range(1, len1):
            for i2 in range(1, len2):
                v = 0 if s1[i1-1] == s2[i2-1] else 1
                lt[i1][i2] = min(lt[i1][i2-1] + 1,    # insertion
                                 lt[i1-1][i2] + 1,    # deletion
                                 lt[i1-1][i2-1] + v)  # substitution
        return lt[-1][-1]

    str1 = "The quick brown fox"
    str2 = "The quick brown fox jumped over the lazy dog"

    print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(), str2.split())))
    
  • 2020-12-13 01:35

    You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.

    Levenshtein works by comparing two arrays of chars. There is no reason the same logic could not be applied to two arrays of strings.
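
    As a quick illustration, difflib.SequenceMatcher (which computes a Ratcliff/Obershelp similarity, not Levenshtein distance) already accepts any sequence of hashable items, so word-level comparison needs nothing more than splitting first:

    >>> import difflib
    >>> a = 'The quick brown fox'.split()
    >>> b = 'The quick brown fox jumped over the lazy dog'.split()
    >>> difflib.SequenceMatcher(None, a, b).ratio()
    0.6153846153846154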

  • 2020-12-13 01:35

    I did this some time ago with C#; my previous question is here. There is a starter algorithm there for your interest; you can easily transform it to Python.

    Ideas to use when writing your own algorithm are something like this (a rough sketch follows the list):

    • Have a list with the original "titles" (the words/sentences you want to match against).
    • Each title item should have a minimal match score at the word/sentence level, below which the title is ignored.
    • You should also have a global minimal match percentage for the final result.
    • You should calculate each word-to-word Levenshtein distance.
    • You should increase the total match weight when words occur in the same order ("quick brown" vs. "quick brown" should have a definitively higher weight than "quick brown" vs. "brown quick").
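
    A rough Python sketch of that outline. Every concrete choice here (the thresholds, the order bonus, the helper name match_score, and using difflib's ratio in place of a normalized word-to-word Levenshtein distance) is an illustrative assumption, not taken from the linked C# code:

    import difflib

    WORD_MIN = 0.7      # assumed per-word minimal match score
    GLOBAL_MIN = 0.6    # assumed global minimal match percentage
    ORDER_BONUS = 0.1   # assumed extra weight for words arriving in order

    def match_score(query, title):
        # Hypothetical scorer following the outline above
        q_words, t_words = query.lower().split(), title.lower().split()
        total, last_pos = 0.0, -1
        for q in q_words:
            # Best word-to-word similarity; difflib's ratio stands in for
            # a normalized Levenshtein distance
            best, best_pos = 0.0, -1
            for pos, t in enumerate(t_words):
                r = difflib.SequenceMatcher(None, q, t).ratio()
                if r > best:
                    best, best_pos = r, pos
            if best < WORD_MIN:
                continue                 # word below threshold: ignore it
            if best_pos > last_pos:
                best += ORDER_BONUS      # reward words occurring in order
            last_pos = best_pos
            total += best
        score = total / len(q_words)
        return score if score >= GLOBAL_MIN else 0.0

    print(match_score("quick brown", "the quick brown fox"))  # in order: higher
    print(match_score("brown quick", "the quick brown fox"))  # out of order: lower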