Question
I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1:
To improve the recall, L hash tables are constructed, and the items lying in the L (L′, L′ < L) hash buckets h_1(q), ···, h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of the buckets is too large to index directly. Thus, only the nonempty buckets are retained by resorting to conventional hashing of the hash codes h_l(x).
I have 3 questions:
- The bold sentence is not clear to me: what does it mean by "resorting to conventional hashing of the hash codes `h_l(x)`"?
- Still about the bold sentence, I'm not sure that I got the problem: I totally understand that `h_l(x)` can be a long code, so the number of possible buckets can be huge. For example, if `h_l(x)` is a binary code and `length` is `h_l(x)`'s length, then we have in total `L*2^length` possible buckets (since we use `L` hash tables)... is that correct?
- Last question: once we find which bucket the query vector `q` belongs to, in order to find the nearest neighbor do we have to use the original vector `q` and the original distance metric? For example, suppose that the original vector `q` has 128 dimensions, `q=[1,0,12,...,14.3]^T`, and our application uses the Euclidean distance. Now suppose that the hashing function used in LSH (supposing `L=1` for simplicity) maps this vector to a 20-dimensional binary space, `y=[0100...11]^T`, in order to decide which bucket to assign `q` to. So `y` is the index of the bucket `B`, which already contains 100 vectors. Now, in order to find the nearest neighbor, we have to compare `q` with all the other 100 128-dimensional vectors using the Euclidean distance. Is this correct?
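For what it's worth, the query flow described in the last question can be sketched as follows. This is a minimal toy implementation, not the survey's exact scheme: it assumes random-hyperplane LSH with `L` tables and 20-bit codes, and the sizes (`n`, `d`, `bits`) are made up for illustration. The key point it shows is step 2: candidates are re-ranked with the *original* Euclidean distance on the original 128-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: n 128-d vectors, L hash tables, 20-bit codes.
n, d, L, bits = 100, 128, 3, 20
data = rng.normal(size=(n, d))

# Random-hyperplane LSH: one projection matrix per table.
planes = [rng.normal(size=(bits, d)) for _ in range(L)]

def code(x, P):
    """Binary code of x under projection matrix P (sign of each projection)."""
    return tuple((P @ x > 0).astype(int))

# Build L hash tables; using dict keys means only nonempty buckets are stored.
tables = [dict() for _ in range(L)]
for i, x in enumerate(data):
    for t, P in zip(tables, planes):
        t.setdefault(code(x, P), []).append(i)

def query(q, k=5):
    # 1) Collect candidate IDs from the bucket q falls into in each table.
    cand = set()
    for t, P in zip(tables, planes):
        cand.update(t.get(code(q, P), []))
    # 2) Re-rank candidates by the ORIGINAL Euclidean distance.
    ranked = sorted(cand, key=lambda i: np.linalg.norm(data[i] - q))
    return ranked[:k]

neighbors = query(data[0])
```

So yes: the codes only select candidates, and the final comparison happens in the original space with the original metric.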
Answer 1:
The approach they use to improve recall constructs more hash tables and essentially stores multiple copies of the ID for each reference item, hence the space cost is larger [4]. If there are a lot of empty buckets, which increases the retrieval cost, a double-hash scheme or a fast search algorithm in the Hamming space can be used to retrieve the hash buckets quickly. I think in this case they are using a double hash function to retrieve the nonempty buckets.
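Here is one way to read that "conventional hashing" step (my interpretation, not the survey's exact construction): a long binary code addresses `2^bits` possible buckets, far too many to allocate directly, so a standard hash function folds the code into a small secondary table, and only the nonempty buckets actually occupy memory. The constants `M` and `P` below are illustrative.

```python
M = 1 << 16          # secondary table size, much smaller than 2**bits
P = 2_147_483_647    # large prime used by the folding hash (illustrative)

def bucket_index(code_bits):
    """Fold a long binary code (tuple of 0/1 bits) into a small table index."""
    h = 0
    for b in code_bits:
        h = (2 * h + b) % P
    return h % M

# Secondary table: slot -> list of (full code, item ids). Keeping the full
# code lets us reject items whose codes merely collided under the fold.
secondary = {}

def insert(code_bits, item_id):
    slot = secondary.setdefault(bucket_index(code_bits), [])
    for stored_code, ids in slot:
        if stored_code == code_bits:
            ids.append(item_id)
            return
    slot.append((code_bits, [item_id]))

def lookup(code_bits):
    for stored_code, ids in secondary.get(bucket_index(code_bits), []):
        if stored_code == code_bits:
            return ids
    return []
```

Memory then scales with the number of stored items rather than with `2^bits`, which matches the O(nL) figure below.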
Number of buckets/memory cells [1][2][3] -> O(nL)
References:
[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf
[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf
[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf
Source: https://stackoverflow.com/questions/37783227/non-empty-buckets-in-lsh