minhash

Locality-sensitive hashing - Elasticsearch

☆樱花仙子☆ submitted on 2019-12-06 18:39:12
Question: Is there any plugin that allows LSH on Elasticsearch? If so, could you point me to it and explain briefly how to use it? Thanks.

Edit: I found out that ES uses a MinHash plugin. How can I compare documents to one another with it? What would be a good setting for finding duplicates?

Answer 1: There is an Elasticsearch MinHash plugin. You can use it to extract a MinHash value every time you index a document, and then query documents by MinHash later. Install the MinHash plugin: $ $ES_HOME/bin/plugin

Can you suggest a good minhash implementation?

房东的猫 submitted on 2019-12-02 18:11:58
I am looking for an open-source MinHash implementation that I can use in my work. The functionality I need is very simple: given a set as input, the implementation should return its MinHash. A Python or C implementation would be preferred, in case I need to modify it to suit my needs. Any pointers would be of great help. Regards.

You should have a look at the following open-source libraries, in order. All of them are in Python, and show how you can calculate document similarity using LSH/MinHash: lsh; LSHHDC (Locality-Sensitive Hashing based High Dimensional Clustering); MinHash
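
To make the requested functionality concrete, here is a minimal sketch of "set in, MinHash out" using only the Python standard library. It is not taken from any of the libraries listed above. The function names (`minhash_signature`, `estimate_jaccard`), the default of 128 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib


def _hash64(item, seed):
    """Deterministic 64-bit hash of an item, salted with an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Return num_hashes minimum hash values for a non-empty collection."""
    unique_items = set(items)
    if not unique_items:
        raise ValueError("MinHash is undefined for an empty set")
    # One seeded hash function per signature position; keep the minimum.
    return [
        min(_hash64(item, seed) for item in unique_items)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two signatures are only comparable if they were built with the same `num_hashes`. More hash functions give a tighter similarity estimate at the cost of more computation per set.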

String similarity with OR condition in MinHash Spark ML

拥有回忆 submitted on 2019-12-02 13:43:24
Question: I have two datasets. The first is a large reference dataset, and for each record in the second dataset I want to find the best match in the first using the MinHash algorithm.

val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|

String similarity with OR condition in MinHash Spark ML

…衆ロ難τιáo~ submitted on 2019-12-02 06:43:43
I have two datasets. The first is a large reference dataset, and for each record in the second dataset I want to find the best match in the first using the MinHash algorithm.

val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|

Choosing between SimHash and MinHash for a production system

会有一股神秘感。 submitted on 2019-11-30 18:23:24
I'm familiar with the LSH (Locality-Sensitive Hashing) techniques SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance (Jaccard) similarity over binary vectors. But I can't decide which one would be better to use. I am creating a backend system for a website to find near-duplicates of semi-structured text data. For example, each record will have a title, a location, and a brief text description (<500 words). Specific language implementation aside, which algorithm would be best for a greenfield production system?

Simhash is faster (very fast) and
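
For readers comparing the two techniques, the following is a rough sketch of the SimHash side: a 64-bit fingerprint built from token counts and compared by Hamming distance. It is an illustration under stated assumptions, not the implementation discussed in the answer. The function names, the 64-bit width, whitespace-style token lists as input, and BLAKE2b as the per-token hash are all choices made here.

```python
import hashlib
from collections import Counter


def _hash64(token):
    """Deterministic 64-bit hash of a token."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens, bits=64):
    """Return a fingerprint of up to 64 bits; token counts act as weights."""
    totals = [0] * bits
    for token, weight in Counter(tokens).items():
        hashed = _hash64(token)
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (hashed >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two fingerprints suggests near-duplicate records. The cutoff to use is application-specific and would need tuning on real data.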

Choosing between SimHash and MinHash for a production system

空扰寡人 submitted on 2019-11-29 17:54:39
Question: I'm familiar with the LSH (Locality-Sensitive Hashing) techniques SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance (Jaccard) similarity over binary vectors. But I can't decide which one would be better to use. I am creating a backend system for a website to find near-duplicates of semi-structured text data. For example, each record will have a title, a location, and a brief text description (<500 words). Specific language implementation aside,

Generating Random Hash Functions for LSH Minhash Algorithm

寵の児 submitted on 2019-11-29 02:40:25
I'm programming a MinHash algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 in my case) and run any number of integers through them (2000 at the moment). To do that, I've been generating random numbers a, b, and c (in the range 1-2001) for each of the 240 hash functions. My hash function then returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it. Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?

This post was asking
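
One common alternative to drawing a random modulus c per function is to fix a single prime modulus larger than every input and randomize only a and b, as in the Carter-Wegman universal hashing family. The sketch below shows that variant in Python rather than Java. It is an illustration, not the accepted answer's code: the prime 2**31 - 1, the `seed` parameter, and the function names are assumptions.

```python
import random

# Mersenne prime 2**31 - 1; assumed to exceed every input value.
PRIME = 2_147_483_647


def make_hash_functions(count, seed=0):
    """Build `count` functions h(x) = (a*x + b) % PRIME with random a and b."""
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))  # a != 0
        for _ in range(count)
    ]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def signature(values, hash_functions):
    """MinHash signature: the minimum of each hash function over the values."""
    values = list(values)
    if not values:
        raise ValueError("cannot compute a signature of an empty collection")
    return [min(h(x) for x in values) for h in hash_functions]
```

With a prime modulus and a nonzero a, each function maps distinct inputs below the prime to distinct outputs, which avoids the heavy collisions a small random c can cause.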

Generating Random Hash Functions for LSH Minhash Algorithm

六月ゝ 毕业季﹏ submitted on 2019-11-27 16:58:29
Question: I'm programming a MinHash algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 in my case) and run any number of integers through them (2000 at the moment). To do that, I've been generating random numbers a, b, and c (in the range 1-2001) for each of the 240 hash functions. My hash function then returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it. Is this an efficient