Number of buckets in LSH

Submitted by 社会主义新天地 on 2019-12-14 01:26:50

Question


In LSH, you hash slices (bands) of each document's signature into buckets. The idea is that documents that fall into the same bucket are potentially similar, and are therefore candidate nearest neighbors.

For 40,000 documents, what is a reasonable value for the number of buckets?

I currently have number_of_buckets = 40,000 / 4, but I feel it can be reduced further.

Any ideas, please?


Related: How to hash vectors into buckets in Locality Sensitive Hashing (using Jaccard distance)?


Answer 1:


A common starting point is to use sqrt(n) buckets for n documents. You can try doubling and halving that and run some analysis to see what kind of document distribution you get. Naturally, any other exponent can be tried as well, and even K * log(n) if you expect the number of distinct clusters to grow slowly.
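Below is a minimal sketch of that experiment (my illustration, not code from the answer). It fakes band signatures as random tuples, which stand in for real minhash band slices, hashes them into B buckets with Python's built-in hash, and prints the bucket-occupancy distribution for B = sqrt(n) as well as that value halved and doubled:

    import math
    import random
    from collections import Counter

    n = 40_000
    # Placeholder band signatures; in a real pipeline these would be
    # minhash band slices of each document's signature.
    docs = [tuple(random.getrandbits(32) for _ in range(4)) for _ in range(n)]

    def bucket_histogram(num_buckets):
        """Hash each signature into num_buckets buckets; map occupancy -> #buckets."""
        occupancy = Counter(hash(sig) % num_buckets for sig in docs)
        return Counter(occupancy.values())

    base = int(math.sqrt(n))                  # common starting point: sqrt(n)
    for b in (base // 2, base, base * 2):     # halve and double, then compare
        hist = bucket_histogram(b)
        print(f"B={b}: avg docs/bucket={n / b:.1f}, "
              f"smallest occupancies={sorted(hist)[:3]}, largest={sorted(hist)[-3:]}")

If the occupancy histogram is heavily skewed, with a few buckets holding most of the documents, that is a sign the bucket count (or the hash) needs adjusting.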

I don't think this is an exact science yet; it belongs to the same class of problems as choosing the optimal k for k-means clustering.




Answer 2:


I think it should be at least n. If it is less than that, say n/2, then for every band each document will collide with at least one other document on average: with B = n/2 buckets, the expected number of other documents sharing a given document's bucket is (n - 1)/B ≈ 2 per band. So your complexity when calculating the candidate similarities will already be at least O(n).

On the other hand, you have to pass over the buckets once per band, i.e. K times, which is O(K*B), where B is the number of buckets. I believe the latter is faster, because it is just iterating over your data structure (a dictionary of some kind) and counting the number of documents that hashed to each bucket.
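As an illustration of that counting pass (again my sketch, not code from the answer), assume band_tables is a list of K dictionaries, one per band, each mapping a bucket id to the list of document ids hashed there:

    from collections import defaultdict

    def count_candidates(band_tables):
        """One pass over all K band tables: O(K * B) with B buckets per band."""
        candidate_pairs = 0
        for table in band_tables:                   # K bands
            for bucket_docs in table.values():      # at most B non-empty buckets
                m = len(bucket_docs)
                candidate_pairs += m * (m - 1) // 2 # colliding pairs in this bucket
        return candidate_pairs

    # Toy usage: 2 bands, 3 documents, B = 10_007 buckets per band.
    band_tables = [defaultdict(list), defaultdict(list)]
    signatures = {0: [(1, 7), (3, 9)], 1: [(1, 7), (4, 9)], 2: [(2, 7), (3, 9)]}
    for doc_id, bands in signatures.items():
        for k, sig in enumerate(bands):
            band_tables[k][hash(sig) % 10_007].append(doc_id)
    print(count_candidates(band_tables))  # 2: doc 0 pairs with 1 (band 0) and 2 (band 1)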



Source: https://stackoverflow.com/questions/37171834/number-of-buckets-in-lsh
