Using Python efficiently to calculate Hamming distances [closed]

Submitted by 半腔热情 on 2019-12-05 07:12:28
Matthew Franglen

The distance package for Python provides a Hamming distance calculator:

import distance

distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")

There is also a Levenshtein package which provides Levenshtein distance calculations. Finally, difflib can provide some simple string comparisons.
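As a quick illustration of the difflib approach: SequenceMatcher from the standard library returns a similarity ratio (higher means more alike) rather than an edit distance, so it is the loosest of the three options.

```python
import difflib

# ratio() returns 2*M/T, where M is the number of matched characters
# and T is the total length of both strings; 1.0 means identical
ratio = difflib.SequenceMatcher(None, "hamming", "hamning").ratio()
```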

There is more information and example code for all of these on this old question.

Your existing code is slow because you recalculate the file hash in the innermost loop, which means every file gets hashed many times. If you calculate each hash once up front, the process becomes much more efficient:

files = ...
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
    (hamming(first[1], second[1]), first, second)
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]  # skip comparing a file with itself
]
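The same precompute-then-compare pattern can be sketched in a self-contained form. Here image_hash is a hypothetical stand-in for pHash.imagehash (any function mapping a file to a fixed-length hash string works), and hamming is a plain character-wise distance:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def image_hash(name):
    # placeholder for pHash.imagehash: fake a 4-char "hash" from the name
    return name[:4]

files = ["cat1.png", "cat2.png", "dog1.png"]

# hash each file exactly once, outside the comparison loop
files_and_hashes = [(f, image_hash(f)) for f in files]

# compare hashes (index 1) and filter on filenames (index 0)
file_comparisons = [
    (hamming(first[1], second[1]), first, second)
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]
]
```

With N files this still performs N*(N-1) comparisons, but only N hash computations.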

This process fundamentally involves O(N^2) comparisons, so to distribute it in a way suitable for a map-reduce job you take the complete set of strings and divide it into B blocks, where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks (a block size of 8), and each worker would handle one pair of blocks. An example of dividing the work follows:

all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
# each call below is an independent task handed to one worker
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
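The division above can be sketched more generally. This is a minimal single-process version, assuming compare_all just computes all pairwise distances between two blocks (in a real deployment each task would be shipped to a separate worker):

```python
from itertools import product

def hamming(a, b):
    """Character-wise Hamming distance between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def compare_all(strings_a, strings_b):
    """The work done by one machine: every pairwise distance between blocks."""
    return [(hamming(a, b), a, b) for a in strings_a for b in strings_b]

all_strings = ["%04d" % i for i in range(16)]  # 16 four-char strings

B = 2  # number of blocks; B**2 tasks, one per worker
size = len(all_strings) // B
blocks = [all_strings[i * size:(i + 1) * size] for i in range(B)]

# every (i, j) block pair is one independent task; together they
# cover the full N x N comparison grid exactly once
tasks = [(blocks[i], blocks[j]) for i, j in product(range(B), repeat=2)]
results = [compare_all(a, b) for a, b in tasks]
```

Together the B^2 tasks reproduce all N^2 comparisons, so the blocking changes only where the work runs, not what is computed.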