Question
I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.
I know I need to start by generating a list of words with an edit distance of one from the invalid word (and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...), that should each output a list of words with an edit distance of 1: inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.
I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.
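A minimal sketch of the process described above, assuming lowercase ASCII words and a dictionary that fits in a set. The names edits1 and within_two_edits are made up for illustration and are not the asker's inserts/deletions/changes methods. It generates every string one edit away, repeats that on each result, and keeps only the strings found in the dictionary, so the dictionary is never looped over:

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one insertion, deletion or change away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    changes = {left + c + right[1:]
               for left, right in splits if right
               for c in alphabet if c != right[0]}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletions | changes | inserts


def within_two_edits(word, dictionary):
    """Return the dictionary words reachable from `word` in at most two edits."""
    valid = set(dictionary)  # set membership is O(1) on average
    one = edits1(word)
    two = {e2 for e1 in one for e2 in edits1(e1)}
    return sorted((one | two) & valid)


dictionary = ["spelling", "spewing", "selling", "apple"]
print(within_two_edits("speling", dictionary))
```

The cost here depends on the length of the word and the size of the alphabet rather than on the size of the dictionary. To keep only words at distance exactly 2, one option is to subtract the words already found in the first round.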
Answer 1:
What you are looking for is called edit distance, and here is a nice explanation on Wikipedia. There are many ways to define a distance between two words; the one you want is called the Levenshtein distance. Here is a dynamic-programming (DP) implementation in Python.
def levenshteinDistance(s1, s2):
    # Keep s1 as the shorter string so each row is as short as possible.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[k] is the distance between s1[:k] and the part of s2 seen so far.
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
A couple more implementations are here.
Answer 2:
Here is my version for Levenshtein distance
def edit_distance(s1, s2):
    m = len(s1) + 1
    n = len(s2) + 1

    # tbl[i, j] is the distance between s1[:i] and s2[:j].
    tbl = {}
    for i in range(m):
        tbl[i, 0] = i
    for j in range(n):
        tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1] + 1, tbl[i-1, j] + 1, tbl[i-1, j-1] + cost)

    # Index the last cell explicitly instead of relying on leftover loop variables.
    return tbl[m-1, n-1]

print(edit_distance("Helloworld", "HalloWorld"))  # 2
Answer 3:
# This calculates the edit distance with a cost of 1 for insertion, deletion
# and substitution, which is the same thing as the Levenshtein distance.
word1 = "rice"
word2 = "ice"
len_1 = len(word1)
len_2 = len(word2)
# The matrix whose last element is the edit distance.
x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
for i in range(0, len_1 + 1):  # initialization of base case values
    x[i][0] = i
for j in range(0, len_2 + 1):
    x[0][j] = j
for i in range(1, len_1 + 1):
    for j in range(1, len_2 + 1):
        if word1[i-1] == word2[j-1]:
            x[i][j] = x[i-1][j-1]
        else:
            x[i][j] = min(x[i][j-1], x[i-1][j], x[i-1][j-1]) + 1
print(x[len_1][len_2])
Answer 4:
The specific algorithm you describe is called Levenshtein distance. A quick Google throws up several Python libraries and recipes to calculate it.
Answer 5:
You need Minimum Edit Distance (MED) for this task.
Following is my version of MED. It is the Levenshtein distance except that a substitution costs 2 rather than 1.
def MED_character(str1, str2):
    len1 = len(str1)
    len2 = len(str2)
    # If either string is empty, the distance is the length of the other one.
    if len1 == 0:
        return len2
    if len2 == 0:
        return len1
    # accumulator[i][j] holds the distance between str1[:i] and str2[:j],
    # so the matrix needs one extra row and column for the empty prefixes.
    accumulator = [[0 for x in range(len2 + 1)] for y in range(len1 + 1)]
    # Initialize the base cases.
    for i in range(len1 + 1):
        accumulator[i][0] = i
    for j in range(len2 + 1):
        accumulator[0][j] = j
    # Take the accumulator and iterate through it row by row.
    for i in range(1, len1 + 1):
        char1 = str1[i - 1]
        for j in range(1, len2 + 1):
            char2 = str2[j - 1]
            cost1 = 0
            if char1 != char2:
                cost1 = 2  # cost for substitution
            accumulator[i][j] = min(accumulator[i-1][j] + 1, accumulator[i][j-1] + 1, accumulator[i-1][j-1] + cost1)
    return accumulator[len1][len2]
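Note that the code in this answer charges 2 for a substitution, whereas the other answers charge 1, so the numbers they return will differ. A small sketch that makes the cost a parameter may help compare the two conventions. The function name edit_distance and the substitution_cost parameter are chosen here for illustration and do not come from any library:

```python
def edit_distance(a, b, substitution_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (0 if ca == cb else substitution_cost)
            current.append(min(previous[j] + 1,      # delete ca
                               current[j - 1] + 1,   # insert cb
                               diagonal))            # match or substitute
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))                       # 3
print(edit_distance("kitten", "sitting", substitution_cost=2))  # 5
```

With a substitution cost of 2, a substitution is never cheaper than a deletion followed by an insertion, so a threshold such as "distance 2" means something different under each convention.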
Answer 6:
difflib in the standard library has various utilities for sequence matching, including the get_close_matches function, which you could use here. It uses an algorithm adapted from Ratcliff and Obershelp.
From the docs:
from difflib import get_close_matches
# Yields ['apple', 'ape']
get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
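As a rough illustration of how the call above can be tuned: get_close_matches also accepts n (maximum number of results, default 3) and cutoff (minimum similarity, default 0.6). The score it ranks by is the SequenceMatcher similarity ratio between 0 and 1, not an edit distance, so it cannot directly express "distance of 2". The word list below is the one from the docs example:

```python
from difflib import SequenceMatcher, get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# Default call: at most n=3 matches with a similarity ratio of at least 0.6.
default_matches = get_close_matches('appel', words)

# n=1 keeps only the best-scoring candidate.
best_match = get_close_matches('appel', words, n=1)

# The underlying score for each candidate, for inspection.
scores = {w: SequenceMatcher(None, w, 'appel').ratio() for w in words}

print(default_matches)
print(best_match)
print(scores)
```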
Answer 7:
Instead of computing the Levenshtein distance against every word in the dictionary, use a BK-tree or a trie; searching these data structures takes less work than comparing the input against each word in turn. Reading up on these topics will give you a detailed description.
This link explains more about spell checking.
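To give an idea of what a BK-tree looks like, here is a minimal sketch, not a tuned implementation. The class BKTree and its methods add and search are hypothetical names written for this example, and the tree uses a plain Levenshtein distance as its metric. Each node stores a word and its children keyed by their distance to that word. During a search, the triangle inequality lets whole subtrees be skipped:

```python
def levenshtein(a, b):
    """Edit distance with a cost of 1 for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


class BKTree:
    """Minimal BK-tree; each node is a (word, {distance: child_node}) tuple."""

    def __init__(self, words=()):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = levenshtein(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Only subtrees whose edge label is within max_distance of d
            # can contain a match (triangle inequality).
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = BKTree(words)
print(tree.search("bok", 2))
```

How many distance computations this saves depends on the dictionary and the tolerance; with a tolerance of 2 on a real word list it is worth measuring against a plain scan.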
Source: https://stackoverflow.com/questions/2460177/edit-distance-in-python