I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblan
This paper might give you some ideas on the two algorithms.
http://jmlr.org/proceedings/papers/v33/shrivastava14.pdf
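
To make the difference between the two concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the random-hyperplane form of SimHash, and the seeded BLAKE2 hashes standing in for MinHash's random permutations are all my own assumptions for illustration. SimHash takes a real-valued vector and keeps one sign bit per random hyperplane; MinHash takes a set and keeps the minimum hash value per hash function, which estimates Jaccard resemblance.

```python
import hashlib
import math
import random


def simhash_signature(vec, planes):
    """One bit per random hyperplane: 1 if vec is on its non-negative side."""
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]


def estimate_cosine(sig_a, sig_b):
    """Bits differ with probability theta/pi, so theta ~ pi * fraction differing."""
    differ = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * differ / len(sig_a))


def _hash64(seed, item):
    # Hypothetical helper: a seeded 64-bit hash used in place of a random permutation.
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash64(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Two sets share a minimum with probability equal to their Jaccard resemblance."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    rng = random.Random(0)
    planes = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(256)]
    u, v = [1.0, 2.0, 3.0], [1.0, 2.0, 2.5]
    print("cosine estimate:", estimate_cosine(simhash_signature(u, planes),
                                              simhash_signature(v, planes)))

    a, b = {"cat", "dog", "fish"}, {"cat", "dog", "bird"}  # true Jaccard = 2/4
    print("jaccard estimate:", estimate_jaccard(minhash_signature(a, 256),
                                                minhash_signature(b, 256)))
```

Both estimates get tighter as you add more bits or hash functions. Which scheme fits depends on your data: SimHash is the natural choice for dense real-valued vectors, MinHash for sets or sparse binary data.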