Jaccard's distance matrix with tensorflow

Question

I would like to compute a distance matrix using the Jaccard distance, and to do so as fast as possible. I used to use scikit-learn's pairwise_distances function, but scikit-learn does not plan to support GPUs, and there is even a known bug that makes the function slower when run in parallel.

My only constraint is that the resulting distance matrix can then be fed to scikit-learn's DBSCAN clustering algorithm. I was thinking about implementing the computation with TensorFlow but couldn't find a nice and simple way to do it.

PS: I have reasons to precompute the distance matrix instead of letting DBSCAN do it as needed.
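
For context, the downstream step would look roughly like the sketch below. It assumes scikit-learn's DBSCAN with metric="precomputed"; the 4x4 matrix and the eps and min_samples values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical precomputed Jaccard distances for four sets:
# sets 0 and 1 are similar, as are sets 2 and 3.
dist = np.array([
    [0.00, 0.20, 1.00, 1.00],
    [0.20, 0.00, 1.00, 1.00],
    [1.00, 1.00, 0.00, 0.25],
    [1.00, 1.00, 0.25, 0.00],
])

# metric="precomputed" tells DBSCAN that `dist` already holds pairwise
# distances rather than feature vectors. eps and min_samples are
# illustrative values only.
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)
```

Whatever computes the matrix only has to hand back a square (n_samples, n_samples) array of distances for this to work.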

Answer 1:

Hi, I was facing the same problem.

Given that the Jaccard similarity is the ratio of true positives (tp) to the sum of true positives, false negatives (fn) and false positives (fp), I came up with this solution:

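    def jaccard_distance(self):
        # self.target and self.prediction are assumed to be 0/1 float tensors of the same shape.
        # tf.multiply replaces tf.mul, which was removed in TensorFlow 1.0.
        tp = tf.reduce_sum(tf.multiply(self.target, self.prediction), 1)
        fn = tf.reduce_sum(tf.multiply(self.target, 1 - self.prediction), 1)
        fp = tf.reduce_sum(tf.multiply(1 - self.target, self.prediction), 1)
        # One distance per row; NaN when tp + fn + fp == 0.
        return 1 - (tp / (tp + fn + fp))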
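
As a quick sanity check of the formula, here is a plain-Python sketch on two made-up binary vectors. It is an illustration only; the vectors a and b are hypothetical. Note that the TensorFlow function above returns one distance per row of target and prediction, not an all-pairs matrix.

```python
# Two hypothetical binary vectors encoding the sets {0, 1, 3} and {1, 3, 4}.
a = [1, 1, 0, 1, 0]
b = [0, 1, 0, 1, 1]

tp = sum(x * y for x, y in zip(a, b))        # in both: 2
fn = sum(x * (1 - y) for x, y in zip(a, b))  # only in a: 1
fp = sum((1 - x) * y for x, y in zip(a, b))  # only in b: 1
distance = 1 - tp / (tp + fn + fp)

# Same value from the set definition: 1 - |A ∩ B| / |A ∪ B|.
set_a = {i for i, x in enumerate(a) if x}
set_b = {i for i, x in enumerate(b) if x}
assert distance == 1 - len(set_a & set_b) / len(set_a | set_b)
print(distance)  # 0.5
```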

Hope this helps!

Answer 2:

I am not a TensorFlow expert, but here is the solution I came up with. As far as I know, the only ways in TensorFlow to do a computation on all pairs of a list are matrix multiplication and the broadcasting rules. This solution uses both.

Let's assume we have an input boolean matrix with n_samples rows, one per set, and n_features columns, one per possible element. A value of True in the i-th row, j-th column means the i-th set contains element j. This is the same layout that scikit-learn's pairwise_distances expects. We can then proceed as follows.

1. Cast the matrix to numbers, getting 1 for True and 0 for False.
2. Multiply the matrix by its own transpose. This produces a matrix M where each element M[i][j] contains the size of the intersection of the i-th and j-th sets.
3. Compute a vector cardv that contains the cardinality of each set by summing the input matrix along its rows.
4. Make a row vector (cardvrow) and a column vector (cardvcol) from cardv.
5. Compute 1 - M / (cardvrow + cardvcol - M). The broadcasting rules do all the work when adding a row vector and a column vector.
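
The steps above can be sketched in NumPy, which follows the same matrix-multiplication and broadcasting rules. This is a minimal illustration, not the answer's TensorFlow code: the function name jaccard_distance_matrix, the variable inter (standing in for M), and the example data are hypothetical, and treating two empty sets as distance 0 is an assumption of this sketch.

```python
import numpy as np

def jaccard_distance_matrix(sets_bool):
    """All-pairs Jaccard distances for a boolean (n_samples, n_features) array."""
    x = sets_bool.astype(np.float32)           # step 1: True -> 1.0, False -> 0.0
    inter = x @ x.T                            # step 2: inter[i, j] = |A_i ∩ A_j|
    cardv = x.sum(axis=1)                      # step 3: cardv[i] = |A_i|
    cardvrow = cardv[np.newaxis, :]            # step 4: shape (1, n_samples)
    cardvcol = cardv[:, np.newaxis]            #         shape (n_samples, 1)
    union = cardvrow + cardvcol - inter        # broadcast to (n_samples, n_samples)
    # Assumption: two empty sets get distance 0 instead of the NaN that 0 / 0 gives.
    similarity = np.divide(inter, union, out=np.ones_like(inter), where=union > 0)
    return 1.0 - similarity                    # step 5

sets_bool = np.array([
    [True, True, False, True],    # {0, 1, 3}
    [True, False, False, True],   # {0, 3}
    [False, False, True, False],  # {2}
])
print(jaccard_distance_matrix(sets_bool))
```

Sets 0 and 1 share two of three distinct elements, so their distance is 1 - 2/3; set 2 is disjoint from both, so its distance to each is 1.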

This algorithm as a whole seems a bit hackish, but it works and produces results within a reasonable margin of those computed by scikit-learn's pairwise_distances function. A better algorithm would probably make a single pass over every pair of input vectors and compute only half of the matrix, since it is symmetric. Any improvement is welcome.

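import tensorflow as tf

# TensorFlow 1.x graph-mode code (in TF 2.x, tf.placeholder is tf.compat.v1.placeholder).
# N = n_samples and M = n_features are assumed to be defined beforehand.
setsin = tf.placeholder(tf.bool, shape=(N, M))
sets = tf.cast(setsin, tf.float16)  # True -> 1.0, False -> 0.0
# mat[i][j] = size of the intersection of sets i and j
mat = tf.matmul(sets, sets, transpose_b=True, name="Main_matmul")
#mat = tf.cast(mat, tf.float32, name="Upgrade_mat")
#sets = tf.cast(sets, tf.float32, name="Upgrade_sets")
cardinal = tf.reduce_sum(sets, 1, name="Richelieu")  # cardinality of each set
cardinalrow = tf.expand_dims(cardinal, 0)  # shape (1, N)
cardinalcol = tf.expand_dims(cardinal, 1)  # shape (N, 1)

# Broadcasting yields the (N, N) matrix of union sizes in the denominator.
mat = 1 - mat / (cardinalrow + cardinalcol - mat)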

I used the float16 type because it seems much faster than float32. Casting to float32 (the two commented-out casts) should only be needed if the cardinalities are large enough to be inexact in float16, or if the division needs more precision. Even when the casts are needed, it still seems worthwhile to do the matrix multiplication in float16.
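
To make the precision concern concrete: float16 has an 11-bit significand, so it represents every integer exactly only up to 2048. The NumPy sketch below is an illustration of that limit, not a measurement of the TensorFlow code; the values are chosen only to show the rounding.

```python
import numpy as np

# float16 stores integers exactly only up to 2048 (11-bit significand).
print(np.float16(2047))  # 2047.0
print(np.float16(2049))  # 2048.0, the odd value is rounded

# float32 keeps such counts exact (up to 2**24).
print(np.float32(2049))  # 2049.0
```

So a set with more than 2048 elements, or an intersection of that size, may no longer be counted exactly in float16.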

Source: https://stackoverflow.com/questions/43261072/jaccards-distance-matrix-with-tensorflow

Tags: python, tensorflow, distance