String similarity metrics in Python

I want to find string similarity between two strings. This page has examples of some of them. Python has an implemnetation of Levenshtein algorithm. Is there a better algorithm, (and hopefully a python library), under these contraints.

I want to do fuzzy matches between strings. eg matches('Hello, All you people', 'hello, all You peopl') should return True
False negatives are acceptable, False positives, except in extremely rare cases are not.
This is done in a non realtime setting, so speed is not (much) of concern.
[Edit] I am comparing multi word strings.

Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?

There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

Hamming distance
Levenshtein distance
Needleman-Wunch distance or Sellers Algorithm
and many more...

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can make this as a function

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

This snippet will calculate the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I was iterating over a tsv in which the strings of interest occupied columns [3] and [4] of the tsv. (pip install python-Levenshtein and pip install distance):

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv","r","utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4]) 
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print diffl, lev, sor, jac

I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account) rather than the difflib stuff for two reasons (1) "fast enough" (dynamic programming algo) and "whoooosh" (bit-bashing) C code is available and (2) well-understood behaviour e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.

Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)) and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X and eye-ball them and insert the correct result and calculate some cost-of-mistakes measure for various levels of X.

N.B. Your ape/apple example has distance 2, so X is 0.6 ... I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty

Is that what you mean?

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

look at http://docs.python.org/library/difflib.html#difflib.get_close_matches

I know this isn't the same but you can adjust the ratio to filter out strings that are not similar enough and return the closest match to the string you are looking for.

Perhaps you would be more interested in semantic similarity metrics.

https://www.google.com/search?client=ubuntu&channel=fs&q=semantic+similarity+string+match&ie=utf-8&oe=utf-8

I realize you said speed is not an issue but if you are processing a lot of the strings for your algorithm the below is very helpful.

def spellcheck(self, sentence):
    #return ' '.join([difflib.get_close_matches(word, wordlist,1 , 0)[0] for word in sentence.split()])
    return ' '.join( [ sorted( { Levenshtein.ratio(x, word):x for x in wordlist }.items(), reverse=True)[0][1] for word in sentence.split() ] )

Its about 20 times faster than difflib.

https://pypi.python.org/pypi/python-Levenshtein/

import Levenshtein

来源：https://stackoverflow.com/questions/1471153/string-similarity-metrics-in-python

标签

python

algorithm

string

levenshtein-distance