k-means using signature matrix generated from minhash


Question


I have used MinHash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good: comparing the Jaccard distances of known-similar documents (say, two articles about the same sports team, or two articles about the same world event) gives sensible readings.
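Roughly, the verification looks like this (a sketch using the datasketch library, which is just one possible implementation; word tokens here stand in for the actual shingles):

from datasketch import MinHash

def signature(text, n_perms=128):
    # Build a MinHash signature from a document's tokens.
    m = MinHash(num_perm=n_perms)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

sig_a = signature("first article about the same sports team")
sig_b = signature("second article about the same sports team")

# Estimated Jaccard similarity: the fraction of signature
# positions on which the two documents agree.
print(sig_a.jaccard(sig_b))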

My question is: does it make sense to use this signature matrix to perform k-means clustering?

I've tried taking the signature vectors of the documents and computing the Euclidean distance between them inside the iterative k-means algorithm, and I always get nonsense clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), yet in the end my two clusters are always essentially random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers any matching hash values between two signature matrices.

[Edited to highlight the question]


Answer 1:


TL;DR

Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.

Some explanations

I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix and then using that as a feature matrix X on which you run k-means.

I'm guessing you're doing something like:

# THIS CODE IS AN EXAMPLE OF WHAT *NOT* TO DO -- DON'T IMPLEMENT IT!
import numpy as np
from datasketch import MinHash
from nltk import ngrams
from sklearn.cluster import KMeans
# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)
# MinHash all the strings.
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues
# Compute clusters.
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)

This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hash values) that appear in that sample of text, but those values are arbitrary integers rather than coordinates, so the features end up scattered into the wrong dimensions and Euclidean distance between rows measures nothing meaningful.
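To make the flaw concrete, here is a toy comparison with made-up hash values (not real MinHash output, just an illustration):

import numpy as np

sig_a = np.array([12, 907, 33, 450], dtype=np.uint64)
sig_b = np.array([12, 908, 33, 451], dtype=np.uint64)

# MinHash similarity is the fraction of positions where the
# signatures agree exactly (an estimate of Jaccard): 0.5 here.
jaccard_est = np.mean(sig_a == sig_b)

# Euclidean distance treats the hash values as coordinates, so
# 907 vs 908 looks "close" even though differing hashes carry
# no similarity information at all.
euclidean = np.linalg.norm(sig_a.astype(float) - sig_b.astype(float))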

To turn that into a feature matrix, you'd have to collect all the unique hashes in minhash_values and then create a matrix of shape [n_samples x n_unique_hashes] (where n_unique_hashes is the number of unique features found), setting each entry to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that, as sketched below.
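A minimal sketch of that construction (reusing minhash_values and n_samples from the snippet above, with scipy supplying the sparse matrix):

import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import KMeans

# Map every distinct hash value to a column index.
unique_hashes = np.unique(minhash_values)
column_of = {h: j for j, h in enumerate(unique_hashes)}

# Binary [n_samples x n_unique_hashes] matrix: 1 where the
# sample's signature contains that hash, 0 elsewhere.
X = lil_matrix((n_samples, len(unique_hashes)), dtype=np.uint8)
for i, row in enumerate(minhash_values):
    for h in row:
        X[i, column_of[h]] = 1

clusters = KMeans(n_clusters=2).fit_predict(X.tocsr())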

Alternative way of text clustering

What an unbelievable hassle, though! Fortunately, scikit-learn is there to help: it provides some very easy-to-use and scalable vectorisers, such as HashingVectorizer and TfidfVectorizer.

So your problem is easily solved:

# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

# Get your data
data = get_your_list_of_strings_to_cluster()

# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)

# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)

And there you go. From there:

  • Fine-tune your vectoriser (try TfidfVectorizer too, tweak its input parameters, etc.),
  • Try other clusterers (for example, I find HDBSCAN miles better than k-means: quicker, more robust, more accurate, less tuning; see the sketch after this list).
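As a rough sketch of that second suggestion (assuming the HDBSCAN implementation that ships with scikit-learn 1.3+; the tf-idf matrix is reduced to a dense low-dimensional space with TruncatedSVD first, since density-based clusterers tend to behave better there):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

data = get_your_list_of_strings_to_cluster()

# Tf-idf features, reduced to a dense low-dimensional space (LSA).
X = TfidfVectorizer(stop_words="english").fit_transform(data)
X_reduced = TruncatedSVD(n_components=100).fit_transform(X)

# HDBSCAN picks the number of clusters itself; samples it cannot
# place confidently get the noise label -1.
labels = HDBSCAN(min_cluster_size=25).fit_predict(X_reduced)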

Hope this helps.

Tom



Source: https://stackoverflow.com/questions/46867085/k-means-using-signature-matrix-generated-from-minhash
