Similarity of two Hexadecimal numbers

梦想的初衷 提交于 2019-12-11 00:01:40

问题


I am trying to find similar hashes (hexadecimal hash) using hamming and Levenshtein distance. Lets say two hashes are similar if their hamming distance is less than 10 (number of differing bits).

Hash 1= ffffff (base 16)
Hash 2= fffff0 (base 16)

The hamming distance between two hashes is 4. They are similar. Because,

Hash 1= 11111111 11111111 11111111 (base 2)
Hash 2= 11111111 11111111 11110000 (base 2)

I have 8 million such hashes. I am wondering what will be a suitable data structure for storing the 8 million hashes. I initially tried "Trie" but consider the following scenario,

Hash 1 = 0fabde (00001111 10101011 11011110)
Hash 2 = adcbfe (10101010 11001011 11111110)

The hamming distance is 7. So I cannot do prefix search.

I know that i can use XOR and Integer.bitCount() to get the number of differing bits, but I have one target hash and 8 million hashes to search against i.e Given a hash i have to find all the similar hashes in 8 million hashes that we have in repository.

Is there any way store the hashes effectively so that my search base is reduced?


回答1:


If the hashes are as small as shown, you can index them "directly" - that is, put them in a big array and do just do some math on the index.

It's fairly simple to generate only the indexes that may correspond to hashes that are within the requested hamming distance d, just XOR the key with all masks that contain up to d set bits (see below). Since there are 8 million hashes but only 16 million could exist, about half of the visited indexes are expected to be "useful" ie there will be something there to find.

To generate the masks, you can use the old NextBitPermutation trick, which has been posted on StackOverflow several times before, for example here. For java, just use the logical right shift and replace __builtin_ctz by numberOfTrailingZeros to get (not tested)

int t = v | (v - 1);
int w = (t + 1) | (((~t & -~t) - 1) >>> (Integer.numberOfTrailingZeros(v) + 1));

Here w would be the bit permutation after v.

The global structure would be something like (not tested)

for (int k = 1; k <= d; k++) {
    int diff = (1 << k) - 1;
    while (diff <= 0xFFFFFF) {
        if (hashes[key ^ diff])
            // do something with it
        diff = nextBitPermutation(diff);
    }
}


来源:https://stackoverflow.com/questions/40253731/similarity-of-two-hexadecimal-numbers

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!