How many hash functions are required in a minhash algorithm?

花落未央 2020-12-07 11:29

I am keen to try and implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is the question of just how m

5 answers
  •  刺人心 (OP)
    2020-12-07 12:22

    Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)

    Check out this article for the state of the art practical and theoretical bounds. It has this nice graph (below), explaining why you probably want to use just one 2-independent hash function and save the k smallest values.

    When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2). For example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values, since 1/((1/3) · 0.1^2) = 300.
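
The idea of one hash function plus the k smallest values can be sketched in Python roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`required_k`, `hash64`, `bottom_k_sketch`, `estimate_jaccard`) are hypothetical, and a 64-bit BLAKE2b digest is assumed as a stand-in for the 2-independent hash function mentioned above.

```python
import hashlib
import heapq
import math


def required_k(similarity, rel_error):
    """Number of smallest hash values to keep, from k = 1/(f * eps^2)."""
    return math.ceil(1 / (similarity * rel_error ** 2))


def hash64(item):
    """Single hash function (assumption: BLAKE2b truncated to 64 bits)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(items, k):
    """Keep only the k smallest distinct hash values of a set of items."""
    return heapq.nsmallest(k, {hash64(x) for x in items})


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches.

    The k smallest values of the combined sketches are the k smallest
    hashes of the union; the fraction of them present in both sketches
    estimates |A ∩ B| / |A ∪ B|.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    union_bottom = heapq.nsmallest(k, set_a | set_b)
    if not union_bottom:
        return 0.0
    shared = sum(1 for v in union_bottom if v in set_a and v in set_b)
    return shared / len(union_bottom)


# Example: two sets of shingles whose true Jaccard similarity is 1/3.
k = 300
doc_a = {f"shingle-{i}" for i in range(0, 2000)}
doc_b = {f"shingle-{i}" for i in range(1000, 3000)}
estimate = estimate_jaccard(bottom_k_sketch(doc_a, k), bottom_k_sketch(doc_b, k), k)
```

Each document is hashed only once per shingle, and comparing two documents touches at most 2k stored values, which is what makes this variant cheaper than running many independent hash functions.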
