MinHash implementation: how to find hash functions for permutations

Submitted by 老子叫甜甜 on 2019-12-23 11:20:51

Question


I have a problem implementing minhashing. On paper and from reading I understand the concept, but my problem is the permutation "trick". Instead of permuting the matrix of sets and values, the suggested implementation is to "pick k (e.g. 100) independent hash functions", and then the algorithm says:

for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r)

The small examples and textbooks I have seen use only two or three hash functions of the form h(x) = (a*x + b) mod p. Those are easy to find, but how is this done in practice? How can I find 100 such independent functions?

In a Java example here, hash values are generated from only one hash function instead of multiple hash functions, independent of the row index. What is the difference? My question is how to find these independent hash functions or, if there is an approach with only one hash function, how to treat its values in the algorithm.


Answer 1:


One simple way is to use a parametric hash family such as tabulation hashing (http://en.wikipedia.org/wiki/Tabulation_hashing).
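As a rough illustration of the tabulation idea, the sketch below builds 100 simple tabulation hash functions for integer row indices. It assumes row indices fit in 32 bits; the names `make_tabulation_hash` and `hash_functions` are hypothetical, and the seeding scheme is only one possible choice.

```python
import random


def make_tabulation_hash(seed, key_bytes=4, out_bits=32):
    """Build one simple tabulation hash for integers of `key_bytes` bytes.

    Each byte position gets its own table of 256 random values;
    the hash is the XOR of one lookup per byte of the key.
    """
    rng = random.Random(seed)
    tables = [
        [rng.getrandbits(out_bits) for _ in range(256)]
        for _ in range(key_bytes)
    ]

    def h(x):
        result = 0
        for i in range(key_bytes):
            byte = (x >> (8 * i)) & 0xFF  # i-th byte of the key
            result ^= tables[i][byte]
        return result

    return h


# One differently seeded function per MinHash "permutation" (assumed k = 100).
hash_functions = [make_tabulation_hash(seed) for seed in range(100)]
```

Each function costs four table lookups per row index, and a different seed yields a different member of the family.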

In the book's example, h(x) = (a*x + b) mod p, each choice of (a, b, p) gives a different hash function. One way to get hash functions that behave independently is to choose a, b and p prime or co-prime, and preferably large. In particular, p should be a prime larger than the number of rows, and a should not be a multiple of p.
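A minimal sketch of this approach, assuming integer row indices smaller than a fixed prime `P` and randomly drawn `(a, b)` pairs. The function names are hypothetical, and the loop runs per column rather than per row, which produces the same signature matrix M as the pseudocode in the question.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumed larger than any row index


def make_hash_params(k, seed=0):
    """Draw k random (a, b) pairs with 1 <= a < P and 0 <= b < P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def minhash_signatures(sets, k=100, seed=0):
    """sets: dict mapping a column name to the set of row indices holding a 1."""
    params = make_hash_params(k, seed)
    # Every hash value is < P, so P can stand in for "infinity".
    sig = {c: [P] * k for c in sets}
    for c, rows in sets.items():
        for r in rows:
            for i, (a, b) in enumerate(params):
                h = (a * r + b) % P
                if h < sig[c][i]:
                    sig[c][i] = h  # M(i, c) := h_i(r)
    return sig


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two columns agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A possible usage: `sig = minhash_signatures({"A": {1, 2, 3}, "B": {2, 3, 4}})` followed by `estimate_jaccard(sig["A"], sig["B"])` gives an estimate of the Jaccard similarity of the two sets.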




Answer 2:


As per iampat's answer, you could use tabulation hashing (http://en.wikipedia.org/wiki/Tabulation_hashing).

Another very efficient option that gives good results is to use a single good-quality hash function (such as FNV-1a) to produce a master hash, and then modify that using 100 different combinations of XOR and bit rotation.

To generate each hash, you take the master hash, rotate its bits by a given distance, then XOR it with a given value. The rotation distance and XOR value are chosen randomly for each of the 100 hash functions. See this discussion for more info.

Some people recommend a multiplication instead of an XOR, in which case you may want to choose prime multipliers.
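The single-hash variant might look roughly like the sketch below. It assumes 64-bit FNV-1a over UTF-8 encoded string tokens and uses the XOR form rather than the multiply form; the helper names are hypothetical. The derived hashes are a heuristic and are not independent in the strict sense.

```python
import random

MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def rotl64(x, r):
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def make_variants(k, seed=0):
    """Pick a random rotation distance and XOR mask for each derived hash."""
    rng = random.Random(seed)
    return [(rng.randrange(1, 64), rng.getrandbits(64)) for _ in range(k)]


def minhash_signature(tokens, variants):
    """Keep, per derived hash, the minimum value seen over all tokens."""
    sig = [MASK64 + 1] * len(variants)  # larger than any 64-bit hash
    for tok in tokens:
        master = fnv1a_64(tok.encode("utf-8"))  # computed once per token
        for i, (rot, mask) in enumerate(variants):
            h = rotl64(master, rot) ^ mask
            if h < sig[i]:
                sig[i] = h
    return sig
```

The expensive hash runs once per element, and each of the 100 derived values costs only a rotation and an XOR.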



Source: https://stackoverflow.com/questions/18976924/minhash-implementation-how-to-find-hash-functions-for-permutations
