How many hash functions are required in a minhash algorithm?

花落未央 2020-12-07 11:29

I am keen to try to implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is the question of just how m

5 Answers

刺人心 (OP)

2020-12-07 12:22

Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)

Check out this article for the state-of-the-art practical and theoretical bounds. It has this nice graph (below) explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
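A minimal sketch of that "one hash function, keep the k smallest values" idea, in Python. Everything named here is an assumption for illustration: the helper names (make_hash, bottom_k, estimate_jaccard) are hypothetical, items are assumed to be non-negative integers below P (for example, shingle fingerprints), and the multiply-add hash modulo a prime is used as a common stand-in for a 2-independent family. The estimator is the usual bottom-k one and is not claimed to be the exact procedure from the article.

```python
import heapq
import random

# Assumption: a Mersenne prime modulus; items must be integers in [0, P).
P = (1 << 61) - 1

def make_hash(seed=0):
    """Build one multiply-add hash h(x) = (a*x + b) mod P.
    a = 0 is excluded so the hash is never constant."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def bottom_k(items, k, h):
    """Return the k smallest distinct hash values of the set, in ascending order."""
    return heapq.nsmallest(k, {h(x) for x in items})

def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches made with the same h."""
    # The k smallest values across both sketches are a bottom-k sketch of the union.
    union_k = heapq.nsmallest(k, set(sketch_a) | set(sketch_b))
    in_both = set(sketch_a) & set(sketch_b)
    # Fraction of the union's sample that also lies in the intersection.
    hits = sum(1 for v in union_k if v in in_both)
    return hits / len(union_k)

h = make_hash(seed=42)
doc_a = bottom_k(range(0, 1000), 300, h)
doc_b = bottom_k(range(500, 1500), 300, h)
print(estimate_jaccard(doc_a, doc_b, 300))  # true similarity here is 500/1500
```

Both sketches must be built with the same hash function, otherwise the smallest values are not comparable.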

When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2). For example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values: k = 1/((1/3) × 0.1^2) = 300.
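The arithmetic above can be checked with a few lines of Python. This is only a sketch of the two formulas quoted in this answer; the function names values_to_keep and relative_error are hypothetical, and exact fractions are used so that rounding up does not overshoot because of floating-point noise.

```python
import math
from fractions import Fraction

def values_to_keep(f, eps):
    """k = 1 / (f * eps^2), rounded up to a whole number of values."""
    f, eps = Fraction(f), Fraction(eps)
    return math.ceil(1 / (f * eps ** 2))

def relative_error(f, k):
    """Approximate relative error eps = 1 / sqrt(f * k)."""
    return 1 / math.sqrt(float(f) * k)

# Similarity around 1/3, target relative error 10%.
print(values_to_keep(Fraction(1, 3), Fraction(1, 10)))  # 300
```

Note how k grows as the similarity f shrinks: at f = 1/10 the same 10% error needs 1000 values.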
