Can you suggest a good minhash implementation?

Submitted by 流过昼夜 on 2019-12-09 04:07:06

Question


I am trying to look for a minhash open source implementation which I can leverage for my work.

The functionality I need is very simple, given a set as input, the implementation should return its minhash.

A python or C implementation would be preferred, just in case I need to hack it to work for me.

Any pointers would be of great help.

Regards.


Answer 1:


You should have a look at the following open source libraries, in that order. All of them are in Python and show how you can calculate document similarity using LSH/MinHash:

lsh
LSHHDC : Locality-Sensitive Hashing based High Dimensional Clustering
MinHash




Answer 2:


Take a look at the datasketch library. It supports serialization and merging, and it is implemented in pure Python with no external dependencies. A Go version with the same functionality is also available.




Answer 3:


In case you are interested in studying the minhash algorithm, here is a very simple implementation with some discussion.

To generate a MinHash signature for a set, we create a vector of length N in which all values are set to positive infinity. We also create N functions that take an input integer and permute that value. The i-th function will be solely responsible for updating the i-th value in the vector. Given these values, we can compute the minhash signature of a set by passing each value from the set through each of the N permutation functions. If the output of the i-th permutation function is lower than the i-th value of the vector, we replace that value with the output of the permutation function (this is why the hash is known as the "min" hash). Let's implement this in Python:

from scipy.spatial.distance import cosine
from random import randint
import numpy as np

# specify the length of each minhash vector
N = 128
max_val = (2**32)-1

# create N tuples (a, b) that parameterize the "permutation functions"
# (a * x + b) % prime; the same tuples are used to hash all input sets
perms = [(randint(0, max_val), randint(0, max_val)) for i in range(N)]

def minhash(s, prime=4294967311):
  '''
  Given a set `s`, pass each member of the set through all permutation
  functions, and set the `ith` position of `vec` to the `ith` permutation
  function's output if that output is smaller than `vec[i]`.
  '''
  # initialize a minhash of length N with positive infinity values
  vec = [float('inf') for i in range(N)]

  for val in s:

    # ensure s is composed of integers
    if not isinstance(val, int): val = hash(val)

    # loop over each "permutation function"
    for perm_idx, perm_vals in enumerate(perms):
      a, b = perm_vals

      # pass `val` through the `ith` permutation function
      output = (a * val + b) % prime

      # conditionally update the `ith` value of vec
      if vec[perm_idx] > output:
        vec[perm_idx] = output

  # the returned vector represents the minimum hash of the set s
  return vec

That's all there is to it! To demonstrate how we might use this implementation, let's take just a simple example:

# specify some input sets
data1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets'])
data2 = set(['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents'])

# get the minhash vectors for each input set
vec1 = minhash(data1)
vec2 = minhash(data2)

# divide both vectors by their max values to scale values {0:1}
vec1 = np.array(vec1) / max(vec1)
vec2 = np.array(vec2) / max(vec2)

# measure the similarity between the vectors using cosine similarity
print( ' * similarity:', 1 - cosine(vec1, vec2) )

This prints a similarity of roughly 0.9 for these vectors (the exact value varies from run to run, since the permutation functions are chosen at random).
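As an aside, the more conventional way to compare two MinHash signatures is the fraction of components at which they agree, which is an unbiased estimate of the Jaccard similarity of the underlying sets. A self-contained sketch, re-using the same permutation scheme as above with smaller example sets:

```python
from random import randint

N = 128
max_val = (2**32) - 1
prime = 4294967311
perms = [(randint(0, max_val), randint(0, max_val)) for _ in range(N)]

def minhash(s):
    vec = [float('inf')] * N
    for val in s:
        if not isinstance(val, int):
            val = hash(val)
        for i, (a, b) in enumerate(perms):
            output = (a * val + b) % prime
            if output < vec[i]:
                vec[i] = output
    return vec

data1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure'}
data2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure'}
vec1, vec2 = minhash(data1), minhash(data2)

# fraction of matching components estimates the Jaccard similarity
estimate = sum(1 for a, b in zip(vec1, vec2) if a == b) / N
true_jaccard = len(data1 & data2) / len(data1 | data2)
print(estimate, true_jaccard)
```

With N = 128 permutations the estimate typically lands within a few percentage points of the true Jaccard similarity (5/7 ≈ 0.71 here).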

While we compared just two minhash vectors above, we can compare many sets far more efficiently by using a "Locality Sensitive Hash". To do so, we can build a dictionary that maps each sequence of W consecutive MinHash vector components to a unique identifier for the set from which the MinHash vector was constructed. For example, if W = 4 and we have a set S1 from which we derive a MinHash vector [111, 512, 736, 927, 817, ...], we would add the S1 identifier to each sequence of four consecutive MinHash values in that vector:

d['111-512-736-927'].append('S1')
d['512-736-927-817'].append('S1')
...

Once we do this for all sets, we can examine each key in the dictionary, and if that key has multiple distinct set ids, we have reason to believe those sets are similar. Indeed, the greater the number of keys under which a pair of set ids co-occurs, the greater the similarity between the two sets. Processing our data in this way, we can reduce the complexity of identifying all pairs of similar sets to roughly linear time!
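A minimal sketch of this dictionary-based scheme (the sliding-window variant described above, rather than the disjoint-band LSH more common in the literature; the function and variable names are illustrative):

```python
from collections import defaultdict

def lsh_candidates(signatures, w=4):
    """Map each window of w consecutive signature values to the set ids
    that produced it; any two sets sharing a window become a candidate pair."""
    d = defaultdict(set)
    for set_id, sig in signatures.items():
        for i in range(len(sig) - w + 1):
            key = '-'.join(map(str, sig[i:i + w]))
            d[key].add(set_id)

    # collect every pair of distinct set ids that share at least one key
    pairs = set()
    for ids in d.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

sigs = {
    'S1': [111, 512, 736, 927, 817],
    'S2': [111, 512, 736, 927, 333],  # shares the window 111-512-736-927 with S1
    'S3': [9, 8, 7, 6, 5],            # shares no window with the others
}
print(lsh_candidates(sigs))  # → {('S1', 'S2')}
```

Only S1 and S2 surface as a candidate pair; S3 is never compared against anything, which is where the near-linear speedup comes from.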




Answer 4:


I would suggest this library, especially if you need persistence. Here, you can use redis to store/retrieve all your data.

You have the option to select a redis database, or to simply use built-in in-memory python dictionaries.

Performance using redis, at least when the redis server is running on your local machine, is almost identical to that achieved with standard python dictionaries.

You only need to specify a config dictionary such as

config = {"redis": {"host": 'localhost', "port": '6379', "db": 0}}

and pass it as an argument to the LSHash class constructor.



Source: https://stackoverflow.com/questions/14533420/can-you-suggest-a-good-minhash-implementation
