Python: find closest string (from a list) to another string

狂风中的少年 提交于 2019-12-18 10:16:44

问题


Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

How can I find the n words that are the closest to "Hello" and present in the list words ?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this

word = 'Hello'
for i, item in enumerate(words):
    if lower(item) > lower(word):
      ...

but it's very slow in large lists.

UPDATE difflib works but it's very slow also. (words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!


回答1:


Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

Please look at the documentation, because the function returns 3 or less closest matches by default.




回答2:


There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes    
hello - helol  - transpose    
hello - hallo  - replaces    
hello - heallo - inserts    


def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.




回答3:


Create a sorted list of your words and use the bisect module to identify the point in the sorted list where your word would fit according to the sorting order. Based on that position you can give the k nearest neighbours above and below to find the 2k closest words.




回答4:


maybe heap can help you .

you have a heap named Heap that until it's size is less than n , you insert words into the Heap using function close [shows this string is closer than another string or not] .

this method can help you when n is small :)

Heap = []
for word in words:
    if len(Heap)<n:
       Heap.insert(word)
    else
       if close(word,Heap[0]): # it means Heap[0] is the nth farthest word until now
             Heap.pop():
             Heap.insert(word)


来源:https://stackoverflow.com/questions/10018679/python-find-closest-string-from-a-list-to-another-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!