minhash

All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

。_饼干妹妹 submitted on 2020-08-09 13:35:23
Question: I run into problems when calling Spark MinHashLSH's approxSimilarityJoin on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for company names. Some of those names refer to the same company, but are (i) misspelled, and/or (ii) include additional names. Performing fuzzy string matching for every combination is not feasible. To reduce the number of fuzzy string matching…
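The blocking idea behind approxSimilarityJoin can be sketched in plain Python: build a MinHash signature per name and keep only pairs whose estimated Jaccard distance is under a threshold. The salted-MD5 hashing scheme and the toy records below are illustrative assumptions, not Spark's implementation:

```python
import hashlib
from itertools import combinations

def minhash_signature(tokens, num_hashes=64):
    """One salted hash per simulated permutation; keep the minimum over the set."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16) for t in tokens)
            for i in range(num_hashes)]

def jaccard_distance(sig_a, sig_b):
    """1 minus the fraction of agreeing positions: estimates Jaccard distance."""
    return 1.0 - sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy "name" records; word tokens stand in for whatever tokenizer runs upstream.
names = {
    1: {"acme", "holdings", "international"},
    2: {"acme", "holdings", "intl"},   # variant spelling of the same company
    3: {"globex", "corporation"},
}
sigs = {nid: minhash_signature(tokens) for nid, tokens in names.items()}

# Self-join: keep pairs whose estimated Jaccard distance is under the threshold.
matches = [(a, b) for a, b in combinations(sigs, 2)
           if jaccard_distance(sigs[a], sigs[b]) < 0.8]
```

The point of the LSH layer in Spark is to avoid the all-pairs loop shown here; on 30 million names the quadratic `combinations` scan is exactly what approxSimilarityJoin is meant to replace with hash-bucket collisions.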

LSH Spark stuck forever at approxSimilarityJoin() function

微笑、不失礼 submitted on 2020-01-03 01:11:11
Question: I am trying to implement LSH in Spark to find the nearest neighbours for each user on a very large dataset containing 50000 rows and ~5000 features per row. Here is the relevant code:

MinHashLSH mh = new MinHashLSH()
    .setNumHashTables(3)
    .setInputCol("features")
    .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);
Dataset<Row> approxSimilarityJoin = model
    .approxSimilarityJoin(dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
approxSimilarityJoin.show();

The job gets stuck at the approxSimilarityJoin() function and never goes beyond it.

k-means using signature matrix generated from minhash

与世无争的帅哥 submitted on 2019-12-23 12:14:42
Question: I have used MinHash on documents and their shingles to generate a signature matrix for these documents. I have verified that the signature matrices are good: comparing Jaccard distances of known similar documents (say, two articles about the same sports team or two articles about the same world event) gives correct readings. My question is: does it make sense to use this signature matrix to perform k-means clustering? I've tried using the signature vectors of documents and calculating the…
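Whether k-means makes sense here hinges on the distance function: agreement between signature columns estimates Jaccard similarity, but Euclidean distance on the raw hash integers (what vanilla k-means optimizes) carries no such meaning. A small demonstration, with a salted-hash scheme and toy token sets chosen purely for illustration:

```python
import hashlib

def signature(tokens, k=64):
    """MinHash signature: one minimum per salted hash function."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16) for t in tokens)
            for i in range(k)]

def agreement(sig_a, sig_b):
    """Fraction of agreeing positions: an unbiased estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

team_a = signature({"lakers", "win", "playoff", "game", "basketball"})
team_b = signature({"lakers", "lose", "playoff", "game", "basketball"})
other = signature({"election", "senate", "vote", "policy"})

# team_a/team_b agree on many positions; team_a/other on essentially none.
# The raw hash integers in each position are effectively random, so the
# Euclidean centroids k-means computes over them are not meaningful.
```

A common workaround, if clustering is the goal, is k-medoids or hierarchical clustering driven by the estimated Jaccard distances rather than k-means over the signature values themselves.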

MinHash implementation: how to find hash functions for permutations

老子叫甜甜 submitted on 2019-12-23 11:20:51
Question: I have a problem implementing MinHashing. On paper and from reading I understand the concept, but my problem is the permutation "trick". Instead of permuting the matrix of sets and values, the suggested implementation is: "pick k (e.g. 100) independent hash functions", and then the algorithm says:

for each row r:
    for each column c:
        if c has 1 in row r:
            for each hash function h_i:
                if h_i(r) < M(i, c):
                    M(i, c) := h_i(r)

In different small examples and teaching book…
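The pseudocode above can be written down almost verbatim: the hash functions stand in for row permutations, and each signature cell keeps the smallest hashed row index among the rows where that column has a 1. A minimal sketch; the universal hash family h_i(r) = (a*r + b) mod p with random a, b is one common choice, not the only one:

```python
import random

def minhash(matrix, k=100, prime=4294967311, seed=42):
    """matrix[r][c] == 1 means set (column) c contains element (row) r.
    Returns a k x num_cols signature matrix M with M[i][c] = min over r of h_i(r)."""
    rng = random.Random(seed)
    num_rows, num_cols = len(matrix), len(matrix[0])
    # k independent hash functions h_i(r) = (a*r + b) % prime
    hashes = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(k)]
    M = [[float("inf")] * num_cols for _ in range(k)]
    for r in range(num_rows):
        for c in range(num_cols):
            if matrix[r][c] == 1:
                for i, (a, b) in enumerate(hashes):
                    hr = (a * r + b) % prime
                    if hr < M[i][c]:
                        M[i][c] = hr
    return M

# Columns 0 and 1 are identical sets, so their signature columns must match.
matrix = [[1, 1, 0],
          [0, 0, 1],
          [1, 1, 1]]
M = minhash(matrix, k=10)
```

The prime just needs to exceed the number of rows so each h_i is close to a random permutation of row indices.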

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

一曲冷凌霜 submitted on 2019-12-22 05:42:13
Question: I am implementing a near-neighbour search application which will find similar documents. So far I have read a good portion of the LSH-related material (the theory behind LSH is somewhat confusing and I am not able to comprehend it 100% yet). My code is able to compute the signature matrix using the MinHash functions (I am close to the end). I also apply the banding strategy to the signature matrix. However, I am not able to understand how to hash the signature vectors (of columns) in a band into…
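Hashing a band into buckets usually just means treating the band's slice of the signature as a composite key: the tuple of values (or any hash of it) indexes into a per-band dictionary, and columns that share a bucket in at least one band become candidate pairs. A sketch under those assumptions:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands, rows_per_band):
    """signatures: {doc_id: signature list of length bands * rows_per_band}.
    Two docs become candidates if they collide in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(doc_id)  # the band tuple itself is the bucket key
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

# "a" and "b" collide in band 0; "b" and "c" collide in band 1.
sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 8, 9, 9]}
pairs = candidate_pairs(sigs, bands=2, rows_per_band=2)
```

With many documents one typically replaces the raw tuple key with `hash(band) % num_buckets` to bound memory; the candidate-pair logic is unchanged.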

Storing the result of Minhash

徘徊边缘 submitted on 2019-12-11 11:54:19
Question: The result is a fixed number of arrays, let's say lists (all of the same length), in Python. One could see it as a matrix too, so in C I would use an array where every cell points to another array. How do I do it in Python? A list where every item is a list, or something else? I thought of a dictionary, but the keys are trivial (1, 2, ..., M), so I am not sure that is the pythonic way to go here. I am not interested in the implementation, I am interested in which approach I should follow,…
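When the keys are just consecutive integers 0..M-1, a list of lists is the idiomatic choice over a dict; a NumPy array would be the next step up if the values are numeric and the shape is fixed, as a signature matrix's is. A quick sketch, with sizes chosen arbitrarily:

```python
num_hashes, num_docs = 4, 3

# List of lists: the Python analogue of a C array of pointers to arrays.
signature = [[0] * num_docs for _ in range(num_hashes)]
signature[2][1] = 97  # row 2 (hash function), column 1 (document)

# Pitfall: [[0] * num_docs] * num_hashes aliases ONE inner list num_hashes
# times, so writing to one row would write to all of them.

# A dict keyed by 0..M-1 adds nothing over a list here; reach for a dict
# only when keys are sparse or non-integer.
```

For large signature matrices, `numpy.zeros((num_hashes, num_docs), dtype=...)` gives the same indexing with far less memory overhead.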

UDF to check for non-zero vectors not working after CountVectorizer through spark-submit

有些话、适合烂在心里 submitted on 2019-12-11 07:24:27
Question: As per this question, I am applying a UDF to filter out empty vectors after CountVectorizer.

val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")
val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val…
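The filter matters because MinHash is undefined for an all-zero vector (an empty set has no minimum). Setting the Spark specifics aside, the check itself is just "does the sparse vector have any non-zero entries". A plain-Python analogue, with dicts standing in for sparse vectors (an illustrative simplification, not the Spark API):

```python
# Sparse vectors as {index: value}; an empty dict is the all-zero vector.
rows = [
    {"id": 1, "features": {0: 2.0, 5: 1.0}},
    {"id": 2, "features": {}},   # e.g. a tokenizer that matched nothing
    {"id": 3, "features": {3: 1.0}},
]

def is_nonzero_vector(features):
    """Analogue of Vector.numNonzeros > 0 for the dict representation."""
    return any(v != 0 for v in features.values())

filtered = [r for r in rows if is_nonzero_vector(r["features"])]
```

Whatever the representation, the empty rows must be dropped before fitting a MinHashLSH model, or the fit/transform will fail on them.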

Node.js / JavaScript MinHash module that outputs a similar hash string for similar text

∥☆過路亽.° submitted on 2019-12-11 02:05:15
Question: I am looking for a Node.js / JavaScript module that applies the MinHash algorithm to a string or larger text, and returns an "identifying" or "characteristic" byte string or hex string for that text. If I apply the algorithm to another similar text string, the hash string should also be similar. Does a module like that already exist? The modules I have examined so far only offered the possibility of comparing texts directly, calculating some kind of Jaccard similarity as a number, directly to the…
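No pointer to a specific package here, but the behaviour described falls out of MinHash directly: emit one byte per hash function (for instance the low byte of each minimum) and similar texts will share most byte positions of the resulting hex string. A Python sketch of the idea; the byte-per-hash encoding and 3-character shingling are my assumptions, not any standard format:

```python
import hashlib

def minhash_hexstring(text, num_hashes=16):
    """Characteristic hex string: two hex chars (one byte) per hash function."""
    shingles = {text[i:i + 3] for i in range(len(text) - 2)}  # 3-char shingles
    out = []
    for i in range(num_hashes):
        m = min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
        out.append(f"{m & 0xFF:02x}")  # keep only the low byte of the minimum
    return "".join(out)

h1 = minhash_hexstring("the quick brown fox jumps")
h2 = minhash_hexstring("the quick brown fox jumped")
h3 = minhash_hexstring("completely different words here")
```

Similarity between two such strings is then the fraction of matching two-character chunks, which approximates the Jaccard similarity of the shingle sets (with a small false-match rate from byte truncation).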

Can you suggest a good minhash implementation?

流过昼夜 submitted on 2019-12-09 04:07:06
Question: I am looking for an open-source MinHash implementation which I can leverage for my work. The functionality I need is very simple: given a set as input, the implementation should return its MinHash. A Python or C implementation would be preferred, in case I need to hack it to work for me. Any pointers would be of great help. Regards. Answer 1: You should have a look at the following open-source libraries, in order. All of them are in Python, and show how you can calculate document…
